Meet Your New Best Exploratory Data Analysis Friend
Visualization often plays a minimal role in the data science and model-building process, yet Tukey, the creator of Exploratory Data Analysis, specifically advocated for the heavy use of visualization to address the limitations of numerical indicators.
Everyone has heard, and understands, that a picture is worth a thousand words. By the same logic, a visualization of the data is worth at least as much as dozens of statistical metrics, from quartiles to means to standard deviations to mean absolute errors to kurtosis to entropy. Wherever there is an abundance of data, it is best understood when it is visualized.
Exploratory Data Analysis was created to investigate the data, emphasizing visualization because it was more informative. This short article will present one of the most useful tools in visual EDA and how to interpret it.
Seaborn's pairplot is magical: at its simplest, it gives us a rich and informative visual representation of the univariate and bivariate relationships within the data. For instance, consider the two pairplots below, each created with one line of code, sns.pairplot(data) (the second adds hue='species' as a parameter).
There is a great deal to glean about the data here: how well classification is likely to work (how much entropy/overlap there is between classes), the likely results of a feature selection process, the variance of each feature, and, based on these observed attributes, which model may be the best choice. The pairplot is like an unfolding of multidimensional space.
Usually, people stop at the one-liner pairplot, but with a few more lines or even words of code, we can reap even more information and insights.
For one, pairplots can get notoriously large. To select a subset of the variables to display, use the vars parameter, which takes a list of variable names. For instance, sns.pairplot(data, vars=['a', 'b']) shows only the relationships between the two columns 'a' and 'b': aa, ab, ba, and bb. Alternatively, one can specify x_vars and y_vars (each a list) to set the variables for each axis separately.
Setting the vars parameter, as in the first two plots, produces a symmetrical grid of plots:
The third plot sets the y-component to only one variable, 'sepal_length', and the x-component to all the columns of the data. This returns the interactions between that one column and all the other columns. Note that the scatterplot is not an appropriate plot for the first column, where the variable is paired against itself, or for the fifth column, where it is paired against a categorical variable. We'll explore how to deal with this later.
By adding the kind='reg' keyword to your pairplot, you can get linear regression fits for the data. This is a great gauge of the linearity and variance of your data, which can inform decisions about which types of models, both supervised and unsupervised, to choose. Additionally, since pairplots are symmetrical, setting corner=True removes the upper-right half, which is a duplicate, and so a) declutters the plot and b) reduces long loading times.
Regression plot (left), corner plot (right)
The pairplot alone, however, is relatively limited in its ability to display several kinds of relationships between variables easily and intuitively. It is merely an interface to PairGrid, the real generator behind pairplot. Handling visualization through PairGrid directly can yield valuable results.
Grids in seaborn are assigned to a variable, most commonly g (for grid). For instance, we may write g=sns.PairGrid(data). When grids are initialized, they are completely empty, but they will be filled in with visualizations soon. The grid is an efficient and clean way to access and visualize cross-feature aspects of the data.
We can use the map methods to fill in the grid with plots. For instance, calling g.map(sns.scatterplot) fills the grid with scatterplots. We can also pass in the plotting function's parameters: in g.map(sns.kdeplot, shade=True), shade is a parameter of sns.kdeplot (more recent seaborn releases call it fill), but it can be specified in the mapping. Since this is a grid, the data for each cell is already sorted out; we only need to name the type of plot.
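The initialize-then-map pattern might look like the sketch below, on a hypothetical random DataFrame. It passes scatterplot keywords (alpha and s, chosen here for illustration) through g.map instead of the article's kdeplot example, to show that extra keyword arguments are forwarded to the plotting function.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical stand-in data (random values, iris-style column names).
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "sepal_length": rng.normal(5.8, 0.8, 90),
    "sepal_width": rng.normal(3.0, 0.4, 90),
    "petal_length": rng.normal(3.8, 1.7, 90),
})

# A freshly initialized grid has axes but nothing drawn on them yet.
g = sns.PairGrid(data)
drawn_before = sum(len(ax.collections) for ax in g.axes.flat)

# Fill every cell with a scatterplot; alpha and s are forwarded to sns.scatterplot.
g.map(sns.scatterplot, alpha=0.6, s=20)
drawn_after = sum(len(ax.collections) for ax in g.axes.flat)

g.savefig("pairgrid_map.png")
```

Because g.map draws on every cell, the diagonal panels also hold scatterplots of each variable against itself.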
Note that the diagonals are still scatterplots. We can change this by using g.map_offdiag(sns.scatterplot) for plots not on the diagonal and g.map_diag(plt.hist) for plots on the diagonal. Note that we are able to use plotting functions from other libraries, such as matplotlib's plt.hist here.
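A sketch of the diagonal/off-diagonal split, using the same two calls the text names on a hypothetical random DataFrame:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical stand-in data (random values, iris-style column names).
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "sepal_length": rng.normal(5.8, 0.8, 90),
    "sepal_width": rng.normal(3.0, 0.4, 90),
    "petal_length": rng.normal(3.8, 1.7, 90),
})

g = sns.PairGrid(data)
g.map_offdiag(sns.scatterplot)  # bivariate panels: scatterplots
g.map_diag(plt.hist)            # univariate panels: matplotlib histograms

g.savefig("pairgrid_diag_offdiag.png")
```

The histograms are drawn on separate diagonal axes that seaborn overlays on the grid, which is why they can have their own y-scale.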
We can do one better. Since the upper and lower halves show the same pairs of variables, we can use a different plot type in each half with g.map_upper and g.map_lower. In this example, we compare the fits of quadratic and linear regression on the same data by varying the order parameter in seaborn's regression plot, regplot.
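One way this comparison could be written is sketched below. Which triangle gets which order is an arbitrary choice here, ci=None is added only to keep the example fast, and the curved relationship in the hypothetical random data is invented for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical stand-in data with a deliberately curved relationship.
rng = np.random.default_rng(0)
x = rng.uniform(1, 7, 90)
data = pd.DataFrame({
    "petal_length": x,
    "petal_width": 0.05 * x ** 2 + rng.normal(0, 0.15, 90),
    "sepal_length": 4.0 + 0.4 * x + rng.normal(0, 0.3, 90),
})

g = sns.PairGrid(data)
g.map_upper(sns.regplot, order=1, ci=None)  # upper triangle: linear fit
g.map_lower(sns.regplot, order=2, ci=None)  # lower triangle: quadratic fit
g.map_diag(plt.hist)

g.savefig("pairgrid_upper_lower.png")
```

Each pair of variables now appears twice with different fits, although the x and y axes are swapped between the two triangles, so the panels are mirror images rather than direct overlays.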
To specify a hue, we can add the hue='species' parameter to the initialization of the PairGrid. Note that we cannot do something like g.map(sns.scatterplot, hue='species'), because mapping is simply a visualization of the data, not a reprocessing of it. All the data is processed when the grid is initialized, so everything data-related must be specified then.
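Setting the hue at initialization could look like this sketch, again on a hypothetical random DataFrame. The add_legend call is an addition of this sketch; a PairGrid built by hand does not draw a legend on its own.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical stand-in data (random values, iris-style column names).
rng = np.random.default_rng(0)
shift = np.repeat([0.0, 1.0, 2.0], 30)  # pushes the three classes apart
data = pd.DataFrame({
    "sepal_width": 3.0 + rng.normal(0, 0.4, 90),
    "petal_length": 1.5 + 2.0 * shift + rng.normal(0, 0.3, 90),
    "petal_width": 0.3 + 0.8 * shift + rng.normal(0, 0.2, 90),
    "species": np.repeat(["setosa", "versicolor", "virginica"], 30),
})

# The hue column is declared once, when the grid is created.
g = sns.PairGrid(data, hue="species")
g.map_offdiag(sns.scatterplot)
g.map_diag(plt.hist)
g.add_legend()

g.savefig("pairgrid_hue.png")
```

Every mapped function then receives the data already grouped or colored by species, without hue being repeated in each map call.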
Pairgrids are often used to build complex plots, but for the purposes of EDA, the operations covered should be enough.
With a few more lines of code, you’ve been able to maximize the information gained from the pairplot and pairgrids. Here are some tips to take away as much insight as you can from it.
- Look for curvatures and transformations (e.g. Tukey's ladder of powers) that can be used to improve model performance.
- Approach features by how well they work across their entire row or column. For example, petal_width and petal_length separate the classes well along their own axis across all other features. The same cannot be said for sepal_width, where there is much overlap along its axis. This means that it provides less information, which may be good cause to run a feature importance analysis and remove it if it provides a negligible boost in predictive power.
- Find how much the data points vary from a regression fit (you can try different degrees as well) to get a visual understanding of how stable/stationary the data is. If data points vary widely from the fit and/or a fit must have a high degree to fit the data well, methods like standardization or normalization may be helpful.
- Spend a decent amount of time looking at visual bivariate representations of your data, playing around with comparisons and chart types. There are countless operations you can perform on your data, and the purpose of EDA is not to give you answers but to spark your interest in taking a particular action. Data is different every time; no standard procedure fits every dataset.
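To make the first tip concrete, here is a rough sketch of trying several rungs of Tukey's ladder of powers on a curved relationship and keeping the one that looks most linear. The helper names, the set of powers, and the use of absolute Pearson correlation as the linearity score are all assumptions of this sketch, and it only applies to strictly positive values.

```python
import numpy as np

def ladder_transform(y, power):
    """Apply one rung of the ladder: y**power, with log(y) standing in for power 0."""
    y = np.asarray(y, dtype=float)
    return np.log(y) if power == 0 else y ** power

def best_ladder_power(x, y, powers=(-1, -0.5, 0, 0.5, 1, 2)):
    """Score each power by |Pearson r| between x and the transformed y."""
    scores = {
        p: abs(np.corrcoef(x, ladder_transform(y, p))[0, 1])
        for p in powers
    }
    return max(scores, key=scores.get), scores

# Toy example: y grows quadratically with x, so a square root should straighten it.
x = np.linspace(1, 10, 50)
y = x ** 2
best_power, scores = best_ladder_power(x, y)
print(best_power, round(scores[1], 3))
```

In practice the pairplot shows the curvature first, and a check like this only helps confirm which transformation is worth trying before refitting.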
Translated from: https://towardsdatascience.com/meet-your-new-best-exploratory-data-analysis-friend-772a60864227