
Data Cleaning | Data Science

Published: 2025/3/11


Data Cleaning

Data cleaning is the process of modifying data to ensure that it is correct, accurate, and meaningful. The definition may be simple, but data cleaning is used in many scenarios and covers a large number of activities, all of which aim to improve the quality of your data. Usually, these tasks are accomplished by combining several different operations. This post discusses the most important data cleaning tasks.

Schema Matching and Data Standardization

Often, schema matching is the first task you have to perform. Its aim is to align the attributes coming from new datasets with the ones in your existing database.

Existing Customer Schema (Name, Country, Address, Phone)

Incoming Customer Schema (Country, City, Street, Apt, Phone)

To match these schemas and move ahead with your data matching operation, you have to devise a procedure that converts each tuple in the Incoming Customer Schema to the Existing Customer Schema.
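
A conversion procedure of this kind can be sketched in Python as follows. This is only an illustration under stated assumptions: the class names, the "Street, Apt, City" address layout, and the choice to leave Name empty (the incoming schema has no such attribute) are not taken from the original article.

```python
from typing import NamedTuple, Optional


class IncomingCustomer(NamedTuple):
    country: str
    city: str
    street: str
    apt: str
    phone: str


class ExistingCustomer(NamedTuple):
    name: Optional[str]
    country: str
    address: str
    phone: str


def to_existing(rec: IncomingCustomer) -> ExistingCustomer:
    """Convert one incoming tuple to the existing schema."""
    # Assumption: Address is built as "Street, Apt, City"; empty parts are skipped.
    parts = [rec.street, rec.apt, rec.city]
    address = ", ".join(p.strip() for p in parts if p and p.strip())
    # The incoming schema carries no Name attribute, so it stays unknown (None).
    return ExistingCustomer(name=None, country=rec.country,
                            address=address, phone=rec.phone)


# Hypothetical sample record, for illustration only.
incoming = IncomingCustomer("USA", "Springfield", "12 Main St", "Apt 4", "555-0100")
print(to_existing(incoming))
```

Leaving Name as None makes the gap visible so that it can be filled from another source later, rather than hiding it behind a placeholder value.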

Another scenario we will consider here involves the same two schemas, but assume that the data records about your customers don't contain zip codes. If you need to see how many customers there are for a particular code, it is essential to have the correct zip values.

The same principles apply when you have to maintain your product catalog database. You should ensure that all dimensions of a product are expressed in the same units and that these values are not missing. If not, search queries will return incorrect results. The task that ensures all values use the same convention is called data standardization. This is the task you should perform before other data cleaning activities such as data matching and data deduplication. These are by no means trivial activities and, often, it isn't practical to perform them manually.
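
As a minimal sketch of standardization, the snippet below converts product weights to a single unit and flags missing values. The field names, the unit labels, and the choice of grams as the target unit are assumptions made for this example.

```python
from typing import Optional

# Conversion factors to one target unit (grams). The unit labels are assumed.
TO_GRAMS = {"g": 1.0, "kg": 1000.0, "lb": 453.59237, "oz": 28.349523125}


def weight_in_grams(value: Optional[float], unit: Optional[str]) -> Optional[float]:
    """Return the weight in grams, or None when it cannot be standardized."""
    if value is None or unit is None:
        return None  # missing value: flag it instead of guessing
    factor = TO_GRAMS.get(unit.strip().lower())
    if factor is None:
        return None  # unknown unit label
    return round(value * factor, 2)


# Hypothetical catalog rows, for illustration only.
products = [
    {"name": "Camera body", "weight": 0.75, "unit": "kg"},
    {"name": "Lens", "weight": 355, "unit": "g"},
    {"name": "Strap", "weight": None, "unit": "g"},
]
for p in products:
    p["weight_g"] = weight_in_grams(p["weight"], p["unit"])

# Products whose weight is missing or could not be converted.
missing = [p["name"] for p in products if p["weight_g"] is None]
print(missing)
```

Once every row carries the same unit, a search such as "weight under 500 g" compares like with like, and the `missing` list shows which rows still need attention.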

Data Matching

The aim of record matching is to match each record from one dataset with the records from another dataset. Generally, you have to perform this operation when you import new data. That way, you ensure the new datasets don't introduce duplicate entities.

Consider a scenario where you have to import a new set of customer records into your business database. You should check whether the same customer is represented in both the incoming batch and the existing database, and keep only one record. Unfortunately, because of typing mistakes or representational errors, the same record in the two data sources could appear to be different. As a result, it might not match on the important attributes, such as phone, address, and name.

The difficulty often increases for fields, such as a product description, that are a concatenation of more than one attribute. Thus, the goal of record matching is to find pairs of records across the two datasets that correspond to the same entity.
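
One simple way to tolerate typing mistakes is approximate string comparison. The sketch below uses `difflib.SequenceMatcher` from the Python standard library; the field names, the averaging of per-field scores, and the 0.85 threshold are assumptions chosen for illustration, not a recommended configuration.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences don't count."""
    return " ".join(text.lower().split())


def similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] based on matching character blocks."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def is_match(rec_a: dict, rec_b: dict, threshold: float = 0.85) -> bool:
    """Treat two records as the same customer if the average field similarity is high."""
    fields = ("name", "address", "phone")  # assumed field names
    score = sum(similarity(rec_a[f], rec_b[f]) for f in fields) / len(fields)
    return score >= threshold


# Hypothetical records: the second one contains a typo in the name.
existing = {"name": "John Smith", "address": "12 Main Street", "phone": "555-0100"}
incoming = {"name": "Jon Smith", "address": "12 Main Street", "phone": "555-0100"}
other = {"name": "Mary Jones", "address": "98 Oak Avenue", "phone": "555-0199"}

print(is_match(existing, incoming))
print(is_match(existing, other))
```

In practice the threshold has to be tuned on labeled examples, because a value that is too low produces false matches and one that is too high misses true ones.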

The most important challenges you have to address here are:

Identify the criteria that guarantee two records really do correspond to the same real-world entity.

With the huge datasets available today, you need to find the most efficient computation method. This method should be able to determine the matching pairs over large sets of data.
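
The article does not name a specific technique for the efficiency problem. One common option, shown here purely as an assumption, is blocking: records are grouped by a cheap key and only records that share a key are compared, which avoids comparing every incoming record with every existing one. The key used below (country plus first letter of the name) is hypothetical.

```python
from collections import defaultdict
from typing import Iterator, Tuple


def block_key(record: dict) -> Tuple[str, str]:
    """Hypothetical blocking key: country plus first letter of the name."""
    return (record["country"].strip().lower(), record["name"].strip().lower()[:1])


def candidate_pairs(incoming: list, existing: list) -> Iterator[Tuple[dict, dict]]:
    """Yield only the pairs whose records fall into the same block."""
    blocks = defaultdict(list)
    for rec in existing:
        blocks[block_key(rec)].append(rec)
    for rec in incoming:
        for other in blocks.get(block_key(rec), []):
            yield rec, other


# Hypothetical data, for illustration only.
existing = [
    {"name": "John Smith", "country": "US"},
    {"name": "Jane Doe", "country": "US"},
    {"name": "Maria Rossi", "country": "IT"},
]
incoming = [
    {"name": "Jon Smith", "country": "US"},
    {"name": "Mario Rossi", "country": "IT"},
]

pairs = list(candidate_pairs(incoming, existing))
print(len(pairs), "candidate pairs instead of", len(incoming) * len(existing))
```

The trade-off is that a typo inside the blocking key itself puts a record into the wrong block, so the key should be built from attributes that are rarely mistyped.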

Fortunately, a few applications can help you overcome these obstacles. By using its intelligent fuzzy matching engine, our product is designed to find the most true matches and the fewest false matches. Moreover, you can combine these results with the customizable knowledge base library.

Data Deduplication

Data deduplication aims to group the records in a dataset so that each group corresponds to the same real-world entity. For best results, you should perform this process both when you populate the database for the first time and when you add new records. Compared with data matching, deduplication additionally involves grouping the matching records. These groups collectively partition the input dataset.

Consider an example where your database stores records such as:

Nikon D750 Camera
Nikon D750 SLR
Nikon D750 Digital SLR

This set contains different records that refer to the same entity. Thus, you should be able not only to match two of them but to map all three records to the same real-world entity.
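
The grouping step can be sketched with a token-based similarity and a union-find structure, so that records linked through a chain of pairwise matches end up in one group. The Jaccard measure, the 0.5 threshold, and the extra Canon record are assumptions added for this example.

```python
from typing import List


def jaccard(a: str, b: str) -> float:
    """Share of word tokens that two descriptions have in common."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def cluster(records: List[str], threshold: float = 0.5) -> List[List[str]]:
    """Group records so that similar ones (directly or transitively) share a group."""
    parent = list(range(len(records)))

    def find(i: int) -> int:
        # Follow parent links to the group representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if jaccard(records[i], records[j]) >= threshold:
                parent[find(j)] = find(i)  # merge the two groups

    groups = {}
    for i, rec in enumerate(records):
        groups.setdefault(find(i), []).append(rec)
    return list(groups.values())


records = [
    "Nikon D750 Camera",
    "Nikon D750 SLR",
    "Nikon D750 Digital SLR",
    "Canon EOS 90D",  # hypothetical unrelated record
]
for group in cluster(records):
    print(group)
```

Here "Nikon D750 Camera" and "Nikon D750 Digital SLR" fall below the threshold when compared directly, yet both match "Nikon D750 SLR", so all three land in the same group. The resulting groups partition the input: every record belongs to exactly one of them.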

Data Profiling

Since data cleaning is an interactive process, it is essential for you to be able to evaluate the quality of your data. You should be able to do this both before and after the data cleaning process so that you can verify its effectiveness. We call this process data profiling. Its most important goal is to ensure that your values match your expectations.

For example, you may expect a customer name and address to uniquely identify every customer in your database. In that case, the number of unique (name, address) tuples must be as close as possible to the total number of entries in your database.
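
This expectation can be checked with a uniqueness ratio: the number of distinct key tuples divided by the number of rows. The sketch below assumes rows are dictionaries with `name` and `address` fields and that keys are compared case-insensitively; both are choices made for the example.

```python
from typing import Iterable, List


def uniqueness_ratio(rows: List[dict], key_fields: Iterable[str]) -> float:
    """Distinct key tuples divided by total rows; 1.0 means the key is unique."""
    if not rows:
        return 1.0
    fields = tuple(key_fields)
    keys = {tuple(str(row[f]).strip().lower() for f in fields) for row in rows}
    return len(keys) / len(rows)


# Hypothetical customer rows; the last one repeats the first with different casing.
customers = [
    {"name": "John Smith", "address": "12 Main Street"},
    {"name": "Jane Doe", "address": "7 Elm Road"},
    {"name": "Maria Rossi", "address": "3 Via Roma"},
    {"name": "john smith", "address": "12 Main Street"},
]
print(uniqueness_ratio(customers, ["name", "address"]))
```

A ratio well below 1.0 before cleaning and close to 1.0 afterwards is one indication that deduplication had the intended effect.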

However, even though you can obtain parts of this information through a few SQL queries, this approach is inefficient and tedious. Data Profiling/Statistics is an easy-to-use and powerful data profiling module made to help you find patterns in your datasets. In addition, the module can check the quality of your data by examining value counts, types, formats, and completeness. The module provides a complete set of statistical information intended to help you clean your data.
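
To make the four checks concrete, here is a small plain-Python sketch of what a column profile could contain. It does not reproduce the module described above; the function name, the returned keys, and the digit/letter masking used to expose formats are all assumptions for illustration.

```python
import re
from collections import Counter
from typing import Any, Dict, List


def value_format(value: Any) -> str:
    """Mask digits as 9 and letters as A so that values with the same layout collapse."""
    masked = re.sub(r"[0-9]", "9", str(value))
    return re.sub(r"[A-Za-z]", "A", masked)


def profile_column(values: List[Any]) -> Dict[str, Any]:
    """Summarize value counts, types, formats, and completeness for one column."""
    present = [v for v in values if v not in (None, "")]
    return {
        "count": len(values),
        "completeness": len(present) / len(values) if values else 0.0,
        "types": Counter(type(v).__name__ for v in present),
        "formats": Counter(value_format(v) for v in present),
        "top_values": Counter(present).most_common(3),
    }


# Hypothetical phone column with one missing value and one odd format.
phones = ["555-0100", "555-0101", "5550102", None, "555-0100"]
report = profile_column(phones)
for key, value in report.items():
    print(key, value)
```

A format that occurs only once, such as the unhyphenated number here, is a typical candidate for standardization, and a completeness below 1.0 shows how many values still need to be filled in.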

Translated from: https://www.includehelp.com/data-science/data-cleaning.aspx