Exploratory Data Analysis (EDA): Don't Ask How, Ask What
Data Science, Machine Learning
This is part 1 in a series of articles guiding the reader through an entire data science project.
I am a new writer on Medium and would truly appreciate constructive criticism in the comments below.
Overview
[Image: Lost in a Sea of Nothingness (sometimes feels like EDA), from Unsplash]

What is EDA anyway?
EDA or Exploratory Data Analysis is the process of understanding what data we have in our dataset before we start finding solutions to our problem. In other words — it is the act of analyzing the data without biased assumptions in order to effectively preprocess the dataset for modeling.
Why do we do EDA?
我們?yōu)槭裁匆M(jìn)行EDA?
The main reasons we do EDA are to verify the data in the dataset, to check if the data makes sense in the context of the problem, and even sometimes just to learn about the problem we are exploring. Remember:
[Comic: "garbage in, garbage out", from xkcd]

What are the steps in EDA and how should I do each one?
- Descriptive Statistics — get a high-level understanding of your dataset
- Missing values — come to terms with how bad your dataset is
- Distributions and Outliers — and why countries that insist on using different units make our jobs so much harder
- Correlations — and why sometimes even the most obvious patterns still require some investigating
A note on Pandas Profiling
Pandas Profiling is probably the easiest way to do EDA quickly (although there are many other alternatives, such as SweetViz). The downside of Pandas Profiling is that it can be slow, because it produces a very in-depth analysis even when you don't need one.
I will describe below how I used Pandas Profiling to analyze the Diabetes Readmission Dataset on Kaggle (https://www.kaggle.com/friedrichschneider/diabetic-dataset-for-readmission/data).
To see the Pandas Profiling report, simply run the following:
Descriptive Statistics
For this stage I like to look at just a few key points:
- I look at the count to see if I have a significant amount of missing values for each specific feature. If there are many missing values for a certain feature, I might want to discard it.
- I look at the unique values (for categorical features this shows up as NaN in pandas describe, but in Pandas Profiling we can see the distinct count). If a feature has only 1 unique value, it will not help my model, so I discard it.
- I look at the ranges of the values. If the max or min of a feature is significantly different from the mean and from the 75% / 25% quartiles, I might want to look into this further to understand whether these values make sense in their context.
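With plain pandas, the three checks above can be sketched like this (column names are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "payer_code": ["MC", "?", "?", "MC", "HM"],   # categorical feature
    "num_lab_procedures": [35, 41, 59, 12, 44],   # numeric feature
    "constant_col": [1, 1, 1, 1, 1],              # carries no information
})

# Counts, unique values, mean, min/max, and quartiles in one table.
stats = df.describe(include="all")
print(stats)

# A feature with a single unique value cannot help the model, so drop it.
to_drop = [col for col in df.columns if df[col].nunique() == 1]
df = df.drop(columns=to_drop)
```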
Missing Values
Almost every real-world dataset has missing values. There are many ways to deal with them; usually the technique we use depends on the dataset and the context. Sometimes we can make educated guesses and/or impute the values. Instead of going through each method (there are many great Medium articles describing the different methods in depth; see this great article by Jun Wu), I will discuss how, sometimes, even though we are given a value in the data, the value is actually missing, and one particular method that lets us ignore these hidden missing values for the time being.
The diabetes dataset is a great example of missing values hidden within the data. If we look at the descriptive statistics, we see zero missing values, but by simply observing one of the features ("payer_code" in the figure above), we can see that almost half of the samples have the category "?". These are hidden missing values.
What should we do when half the samples have missing values? There is no one right answer (see Jun Wu's article). Many would say to simply exclude a feature with many missing values from your model, as there is no way to accurately impute them.
But there is one method many data scientists miss out on. If you are using a decision-tree-based model (such as a GBM), then the tree can take a missing value as an input. Since all features will be turned into numeric values, we can just encode "?" as an extreme value far outside the range used in the dataset (such as 999,999); this way, at a node, all samples with missing values will split to one side of the tree. If we find after modeling that this value is very important, we can come back to the EDA stage and try to understand (probably with the help of a domain expert) whether there is valuable information in the missing values of this specific feature. Some packages don't even require you to encode missing values; LightGBM, for example, does this split automatically.
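A sketch of that encoding trick (the feature names are illustrative; note that LightGBM can be fed the NaNs directly, so the extreme-value step is only needed for libraries without native missing-value handling):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "payer_code": ["MC", "?", "HM", "?"],
    "num_lab_procedures": [35, 41, "?", 12],
})

# Step 1: make the hidden missing values explicit.
df = df.replace("?", np.nan)

# Step 2: for a tree-based model, push the NaNs to an extreme value far outside
# the real range, so every missing sample splits to the same side of a node.
df["num_lab_procedures"] = pd.to_numeric(df["num_lab_procedures"]).fillna(999_999)
```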
[Image: example of what a split on missing values looks like in the tree]

Duplicate Rows
Duplicate rows sometimes appear in datasets. They are very easy to remove (this is one solution, using the pandas built-in method):
df.drop_duplicates(inplace=True)

There is another type of duplicate row that you need to be wary of. Say you have a dataset on patients. You might have many rows for each patient, each representing taking a medication. These are not duplicates. We will explore how to deal with this kind of duplicate row later in the series, when we explore 'Feature Engineering'.
Distributions and Outliers
The main reason to analyze the distributions and outliers in the dataset is to validate that the data is correct and makes sense. Another good reason to do this is to simplify the dataset.
Validating the Dataset
Let's say we plot a histogram of patient heights and observe the following:
[Image: histogram of patient heights]

Clearly there is some kind of problem with the data. Here we can guess (from the context) that 10% of the data has been measured in feet, and the rest in centimeters. We can then convert the rows where the height is less than 10 from feet to centimeters. Pretty simple. What do we do in a more complicated example, such as the one below?
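The simple case can be fixed with a one-line conversion (the column here is hypothetical; 1 foot = 30.48 cm):

```python
import pandas as pd

heights = pd.Series([175.0, 182.0, 5.9, 168.0, 6.1])  # mixed centimeters and feet

# Nothing under 10 can plausibly be centimeters, so treat those rows as feet.
feet_mask = heights < 10
heights.loc[feet_mask] = heights.loc[feet_mask] * 30.48
```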
[Image: histogram of patient heights]

Here, if we only glance at the dataset and don't check each and every feature, we will miss that some patients' heights are recorded as tall as 6 meters, which doesn't make sense (see the tallest people in the world). To solve this unit error, we must make some decisions about the cutoff: which heights are measured in feet and which in meters. Another option is to check whether there is a correlation between height and country, for example; we might find that all the feet measurements come from the US.
Outliers
Another important thing is to check outliers. We can graph the different features either as box-plots or as a function of another feature (typically the target variable, but not necessarily). There are many statistics to check for outliers in the data, but often in EDA, we can identify them very easily. In the example below, we can immediately identify outliers (random data).
[Image: identifying outliers in visualisations]

It is important to check outliers to understand whether they are errors in the dataset. This is a whole separate topic (see Natasha Sharma's excellent article on it), but a very important one for deciding whether these points are errors and whether to keep them.
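One common rule of thumb is the 1.5 × IQR fence, easy to sketch in pandas (the data here is made up):

```python
import pandas as pd

values = pd.Series([12, 14, 13, 15, 14, 13, 98, 12, 15, -40])

q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1

# Points beyond 1.5 * IQR from the quartiles are flagged as outliers.
outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]
print(outliers.tolist())
```

Whether the flagged points are errors or genuine extremes still has to be judged in context.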
Simplifying the Dataset
Another really important reason to do EDA is that we might want to simplify our dataset, or even just identify where it can be simplified.
Perhaps we can group certain features in our dataset? Take the target variable "Readmission" in the diabetes patient dataset. If we plot the different variables, we find that readmission in under 30 days and in over 30 days generally follow the same distribution across different features. If we merge them, we can balance our dataset and get better predictions.
[Image: dataset evenly distributed between readmitted and not]

If we check the distribution against different features, we find that this still holds; take, for example, the split across genders.
[Image: the distribution between readmitted or not is roughly the same across genders]

We can check this across other features too, but here the conclusion seems to be that the dataset is very balanced, and we can probably combine 'readmitted' in over and under 30 days.
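Merging the two readmission classes is a simple mapping (the labels '<30', '>30', and 'NO' follow the Kaggle file; check the values in your own copy):

```python
import pandas as pd

readmitted = pd.Series(["<30", ">30", "NO", ">30", "<30", "NO"])

# Collapse 'readmitted under 30 days' and 'over 30 days' into one positive class.
target = readmitted.map({"<30": 1, ">30": 1, "NO": 0})
```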
Learn the Dataset
Another very important reason to visualize the distributions in your dataset is to learn what you even have. Take the following population pyramid of patient numbers by age and gender:
[Image: distribution of patient number by age and gender (credit to Lilach Goldshtein)]

Understanding the distribution of age and gender in our dataset is essential in order to make sure we reduce the bias between them as much as possible. Studies have found that many models are extremely biased because they were trained on only one gender or race (often men or white people, for example), so this is an extremely important step in EDA.
Correlations
Often a lot of emphasis in EDA is placed on correlations, and correlations are often really interesting, but not wholly useful on their own (see this article on interpreting basic correlations). A significant area of academic research is how to distinguish causation from correlation (for a brief intro, see this Khan Academy lesson), though domain experts can often verify that a correlation is indeed causation.
There are many ways to plot correlations, and different correlation methods to use. I will focus on three: Phi K, Cramer's V, and one-way analysis.
Phi K Correlation
Phi_K is a new correlation coefficient based on improvements to Pearson's test of independence of two variables (see the documentation for more info). See below the Phi K correlation from Pandas Profiling (one of several available correlation matrices).
[Image: Phi K correlation matrix from Pandas Profiling]

We can very easily identify a correlation between 'Readmitted' (our target variable, the last row/column) and several other features, such as 'Admission Type', 'Discharge Disposition', 'Admission Source', 'Payer Code', and 'Number of Lab Procedures'. This should light a lightbulb for us: we must dig deeper into each one to understand whether it makes sense in the context of the problem (it probably does: if you have more lab procedures, you probably have a more serious condition and so are more likely to be readmitted), and to help confirm conclusions that our model might reach later in the project.
Cramer's V Correlation
Cramer's V is a great statistic for measuring the correlation between two variables. Usually we are most interested in the correlation between a feature and the target variable.
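Cramer's V is not built into pandas, but it is a few lines on top of scipy's chi-squared test (a sketch; correction=False keeps a perfectly associated pair at exactly 1.0):

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(x: pd.Series, y: pd.Series) -> float:
    """Cramer's V between two categorical series: 0 = independent, 1 = perfectly associated."""
    confusion = pd.crosstab(x, y)
    chi2 = chi2_contingency(confusion, correction=False)[0]
    n = confusion.to_numpy().sum()
    min_dim = min(confusion.shape) - 1
    return float(np.sqrt(chi2 / (n * min_dim)))

# Two perfectly associated toy variables should score 1.0.
a = pd.Series(["x", "x", "y", "y"] * 10)
b = pd.Series(["p", "p", "q", "q"] * 10)
print(round(cramers_v(a, b), 3))
```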
Sometimes we can discover other interesting, even surprising, information from correlation diagrams. Take, for example, one fact discovered by Lilach Goldshtein in the diabetes dataset. Let us look at the Cramer's V between 'discharge_disposition_id' (a categorical feature indicating the reason a patient was discharged) and 'readmitted' (our target variable: whether or not the patient was readmitted).
[Image: Cramer's V of the target variable and reason of discharge (credit to Lilach Goldshtein)]

We note here that discharge IDs 11, 19, 20, and 21 have no readmitted patients. STRANGE!
Let’s check what these IDs are:
[Table: discharge disposition descriptions (credit to Lilach Goldshtein)]

These people were never readmitted because, sadly, they passed away.
This is a very obvious note — and we probably didn’t need to dig into data correlations to identify this fact — but such observations are often completely missed. Now a decision needs to be made regarding what to do with these samples — do we include them in the model or not? Probably not, but again that is up to the data scientist. What is important at the EDA stage is that we find these occurrences.
One-Way Analysis
Honestly, if you take only one thing from this entire article, take this. One-way analysis can pick up on many of the different observations I've touched on in this article, in a single graph.
[Image: one-way analysis of two different features]

The graphs above show the percentage of the dataset represented by each range and the median of the target variable in that range (this is not the diabetes dataset, but a dataset used for regression). On the left, we can see that most samples fall in the range 73.5–90.5 and that there is no linear correlation between the feature and the target. On the right-hand side, by contrast, we can see that the feature is directly correlated with the target and that each group contains a good spread of samples.
The groups were chosen using a single Decision Tree to split optimally.
This is a great way to analyze a dataset. We can see the distribution of the samples for a specific feature, we can spot outliers if there are any (none in these examples), and we can identify missing values (either we encode them first as extreme numerical values, as described before, or, for a categorical feature, we will see the label "NaN", or "?" in the diabetes case).
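A rough sketch of how such a plot can be computed, using a shallow decision tree to choose the bin edges as described above (synthetic data, hypothetical column names):

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
feature = rng.uniform(0, 100, 500)
target = feature * 2 + rng.normal(0, 10, 500)  # a feature directly correlated with the target
df = pd.DataFrame({"feature": feature, "target": target})

# Let a single shallow tree pick the optimal split points; its leaves become the bins.
tree = DecisionTreeRegressor(max_leaf_nodes=5, random_state=0)
tree.fit(df[["feature"]], df["target"])
df["bin"] = tree.apply(df[["feature"]])

# For each bin: share of the dataset and the median target, as in the plots above.
summary = df.groupby("bin").agg(
    share=("target", lambda s: len(s) / len(df)),
    median_target=("target", "median"),
)
print(summary)
```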
Conclusion
As you have probably noticed by now, there is no one-size-fits-all for EDA. In this article, I decided not to dive too deep into how to do each part of the analysis (most of it can be done with simple pandas or Pandas Profiling methods), but rather to explain what we can learn from each step and why each step is important.
In real-world datasets there are almost always missing values, errors in the data, unbalanced data, and biased data. EDA is the first step in tackling a data science project to learn what data we have and evaluate its validity.
I would like to thank Lilach Goldshtein for her excellent talk on EDA which inspired this medium article.
Stay tuned for the next steps in a classic data science project
Part 1: Exploratory Data Analysis (EDA) — Don't ask how, ask what.
Part 2: Preparing your Dataset for Modelling — Quickly and Easily
Part 3: Feature Engineering — 10X your model's abilities
Part 4: What is a GBM and how do I tune it?
Part 5: GBM Explainability — What can I actually use SHAP for?
(Hopefully) Part 6: How to actually get a dataset and a sample project
Translated from: https://medium.com/towards-artificial-intelligence/exploratory-data-analysis-eda-dont-ask-how-ask-what-2e29703fb24a