Data Profiling in Data Science
Data Profiling
Data profiling is the process of examining data from an existing source and summarizing information about that data. You profile data to determine its accuracy, completeness, and validity. Data profiling is done for several reasons, but it is most commonly part of assessing data quality within a larger project. It is often combined with an ETL (Extract, Transform, Load) process that moves data from one system to another. Done properly, ETL and data profiling together cleanse, enrich, and move quality data to a target location.
For example, you may need to profile data when migrating from a legacy system to a new system. Profiling helps identify data quality issues that need to be handled in code when you move data into the new system. Or you may need to profile data as you move it into a data warehouse for business analytics. When data is moved to a data warehouse, ETL tools are typically used to move it. Profiling helps identify which data quality issues should be fixed in the source and which can be fixed during the ETL process.
Why profile data?
Data profiling lets you answer the following questions about your data:
Is the data complete? Are there blank or null values?
Is the data unique? How many distinct values are there? Is the data duplicated?
Are there abnormal patterns in your data? What is the distribution of patterns in your data?
Are these the patterns you expect?
What range of values exists, and is it expected? What are the maximum, minimum, and average values for a given field? Are these the ranges you expect?
Answering these questions helps ensure that you are maintaining quality data, which, as companies increasingly realize, is the cornerstone of a thriving business.
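As a rough illustration, the completeness, uniqueness, and range questions above can be answered for a single field with a few lines of Python. This is only a sketch: the function name `profile_values`, the sample `ages` list, and the choice to treat `None` and blank strings as missing are assumptions made for the example, not part of any particular profiling tool.

```python
from statistics import mean

def profile_values(values):
    """Summarize completeness, uniqueness, and range for one field."""
    # Assumption of this sketch: None and blank strings count as missing.
    present = [v for v in values if v is not None and str(v).strip() != ""]
    # Only real numbers take part in the min / max / mean summary.
    numeric = [v for v in present
               if isinstance(v, (int, float)) and not isinstance(v, bool)]
    distinct = set(present)
    return {
        "total": len(values),
        "missing": len(values) - len(present),
        "distinct": len(distinct),
        "duplicated": len(present) - len(distinct),
        "min": min(numeric) if numeric else None,
        "max": max(numeric) if numeric else None,
        "mean": mean(numeric) if numeric else None,
    }

# Hypothetical sample field with two missing entries and two repeats.
ages = [34, 29, None, 34, 51, "", 29, 42]
summary = profile_values(ages)
print(summary)
```

Comparing `missing`, `duplicated`, `min`, and `max` against what you expect for the field is the quickest way to spot a quality problem.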
How do you profile data?
Data profiling can be performed in several ways, but roughly three base methods are used to analyze the data.
Column profiling counts the number of times each value appears within each column of a table. This method helps uncover patterns within your data.
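A minimal sketch of column profiling, assuming the table is held as a list of dictionaries: count how often each value occurs in a column, and reduce each value to a coarse pattern (digits become `9`, letters become `A`) so that unusual formats stand out. The helper names `value_counts` and `pattern_of` and the sample rows are hypothetical.

```python
from collections import Counter
import re

def value_counts(rows, column):
    """Count how many times each value appears in one column."""
    return Counter(row.get(column) for row in rows)

def pattern_of(value):
    """Reduce a value to a coarse pattern: digits -> 9, letters -> A."""
    text = re.sub(r"[0-9]", "9", str(value))
    return re.sub(r"[A-Za-z]", "A", text)

# Hypothetical table with one malformed and one truncated zip code.
rows = [
    {"id": 1, "country": "US", "zip": "94105"},
    {"id": 2, "country": "US", "zip": "10001"},
    {"id": 3, "country": "DE", "zip": "1011A"},
    {"id": 4, "country": "US", "zip": "9410"},
]

country_counts = value_counts(rows, "country")
zip_patterns = Counter(pattern_of(row["zip"]) for row in rows)

print(country_counts.most_common())
print(zip_patterns.most_common())
```

Patterns that occur only once or twice in a large column are usually the first candidates for a closer look.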
Cross-column profiling looks across columns to perform key and dependency analysis. Key analysis scans collections of values in a table to locate a potential primary key. Dependency analysis determines the dependent relationships within a data set. Together, these analyses determine the relationships and dependencies within a table.
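Key and dependency analysis can be sketched as two small checks, again assuming rows stored as dictionaries. `candidate_keys` looks for the smallest column combinations whose values are unique in every row, and `determines` tests whether each value in one column always maps to the same value in another. Both function names, the `max_size` limit, and the sample data are assumptions for illustration; real tools use more efficient search strategies.

```python
from itertools import combinations

def candidate_keys(rows, columns, max_size=2):
    """Return minimal column combinations whose values are unique per row."""
    keys = []
    for size in range(1, max_size + 1):
        for combo in combinations(columns, size):
            # Skip combinations that contain an already-found smaller key.
            if any(set(key) <= set(combo) for key in keys):
                continue
            seen = {tuple(row[c] for c in combo) for row in rows}
            if len(seen) == len(rows):
                keys.append(combo)
    return keys

def determines(rows, lhs, rhs):
    """True if every lhs value maps to exactly one rhs value (lhs -> rhs)."""
    mapping = {}
    for row in rows:
        # setdefault stores the first rhs seen for this lhs and returns it.
        if mapping.setdefault(row[lhs], row[rhs]) != row[rhs]:
            return False
    return True

# Hypothetical order-line table.
rows = [
    {"order_id": 1, "line": 1, "zip": "94105", "city": "San Francisco"},
    {"order_id": 1, "line": 2, "zip": "94105", "city": "San Francisco"},
    {"order_id": 2, "line": 1, "zip": "10001", "city": "New York"},
    {"order_id": 3, "line": 1, "zip": "10001", "city": "New York"},
]
columns = ["order_id", "line", "zip", "city"]

keys = candidate_keys(rows, columns)
zip_determines_city = determines(rows, "zip", "city")
line_determines_zip = determines(rows, "line", "zip")
print(keys, zip_determines_city, line_determines_zip)
```

A key or dependency found this way is only a candidate: it holds for the rows examined, and still needs to be confirmed against the intended data model.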
Cross-table profiling looks across tables to identify potential foreign keys. It also attempts to determine the similarities and differences in syntax and data types between tables, to determine which data might be redundant and which could be mapped together.
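One simple way to look for a potential foreign key is to measure how many of the distinct values in a child column also appear in a parent column. The sketch below assumes two small in-memory tables; the names `inclusion_ratio`, `orphan_values`, `customers`, and `orders` are hypothetical.

```python
def distinct_values(rows, column):
    """Distinct non-null values of one column."""
    return {row[column] for row in rows if row[column] is not None}

def inclusion_ratio(child_rows, child_col, parent_rows, parent_col):
    """Fraction of distinct child values that also exist in the parent column."""
    child = distinct_values(child_rows, child_col)
    parent = distinct_values(parent_rows, parent_col)
    if not child:
        return 0.0
    return len(child & parent) / len(child)

def orphan_values(child_rows, child_col, parent_rows, parent_col):
    """Child values with no matching parent value, sorted for stable output."""
    child = distinct_values(child_rows, child_col)
    parent = distinct_values(parent_rows, parent_col)
    return sorted(child - parent)

# Hypothetical parent and child tables.
customers = [{"customer_id": 1}, {"customer_id": 2}, {"customer_id": 3}]
orders = [
    {"order_id": 10, "cust": 1},
    {"order_id": 11, "cust": 3},
    {"order_id": 12, "cust": 3},
    {"order_id": 13, "cust": 7},  # no matching customer
]

ratio = inclusion_ratio(orders, "cust", customers, "customer_id")
orphans = orphan_values(orders, "cust", customers, "customer_id")
print(ratio, orphans)
```

A ratio close to 1.0 suggests that the child column may reference the parent column, and the orphan list shows exactly which values break that relationship.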
Rule validation is sometimes considered the final step in data profiling. It is a proactive step of adding rules that check the correctness and integrity of the data entered into the system.
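Rule validation can be sketched as a set of named checks applied to every row, with each failure recorded for review. The two rules, the deliberately loose email pattern, and the sample rows below are assumptions chosen for the example; real rules would come from the business requirements of the target system.

```python
import re

# Deliberately simple pattern: something@something.something, no spaces.
EMAIL_PATTERN = r"[^@\s]+@[^@\s]+\.[^@\s]+"

# Each rule is a name plus a function that returns True when the row is valid.
RULES = {
    "email has an @ and a domain":
        lambda row: re.fullmatch(EMAIL_PATTERN, row["email"] or "") is not None,
    "age between 0 and 120":
        lambda row: isinstance(row["age"], int) and 0 <= row["age"] <= 120,
}

def validate(rows, rules):
    """Return (row_index, rule_name) for every rule that a row fails."""
    failures = []
    for index, row in enumerate(rows):
        for name, check in rules.items():
            if not check(row):
                failures.append((index, name))
    return failures

# Hypothetical input rows.
rows = [
    {"email": "ana@example.com", "age": 31},
    {"email": "not-an-email", "age": 45},
    {"email": "bo@example.org", "age": -4},
    {"email": None, "age": 200},
]

failures = validate(rows, RULES)
for index, name in failures:
    print(f"row {index} failed: {name}")
```

Keeping the rules in one named collection makes it easy to add a check without touching the validation loop.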
These methods can be performed manually by an analyst, or by a service that automates these queries.
Data profiling challenges
Data profiling is often difficult because of the sheer volume of data you need to profile. This is especially true if you are looking at a legacy system, which might have years of older data with thousands of errors. Experts recommend that you segment your data as part of your profiling process so that you can see the forest for the trees.
If you profile data manually, you need someone with the skill to run numerous queries and sift through the results to gain meaningful insights about your data, which can eat up valuable resources. In addition, you will likely only be able to check a subset of your overall data, because it takes too long to go through the entire data set.
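When only a subset can be checked, a uniform random sample gives a fairer picture than the first few thousand rows. The sketch below uses reservoir sampling (Algorithm R), which draws a fixed-size sample in one pass without loading the whole data set into memory. The `reservoir_sample` helper, the fixed seed, and the generated records are assumptions made for the example.

```python
import random

def reservoir_sample(rows, k, seed=0):
    """Draw k rows uniformly at random from an iterable in a single pass."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    sample = []
    for index, row in enumerate(rows):
        if index < k:
            # Fill the reservoir with the first k rows.
            sample.append(row)
        else:
            # Replace a reservoir slot with probability k / (index + 1).
            slot = rng.randint(0, index)  # randint is inclusive at both ends
            if slot < k:
                sample[slot] = row
    return sample

def records():
    """Hypothetical source: 10,000 records, every tenth one missing a value."""
    for i in range(10_000):
        yield {"id": i, "value": None if i % 10 == 0 else i}

sample = reservoir_sample(records(), k=500)
missing_rate = sum(row["value"] is None for row in sample) / len(sample)
print(len(sample), missing_rate)
```

The missing rate measured on the sample is an estimate of the rate in the full data set, so it should be reported with that caveat.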
Translated from: https://www.includehelp.com/data-science/data-profiling.aspx