Exploratory Data Analysis of the IRIS Dataset
Let's explore one of the simplest datasets, the IRIS dataset, which describes three species of iris flowers by their sepal length, sepal width, petal length, and petal width. The dataset consists of 50 samples from each of the three species (Iris setosa, Iris virginica, and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimeters. Our objective is to classify a new flower as belonging to one of the 3 classes given the 4 features.
Download IRIS data from here.
Here I'm importing, in an IPython notebook launched from Anaconda Navigator (download: https://www.anaconda.com/products/individual), the libraries that will be useful in our exploratory data analysis: pandas, matplotlib, numpy, and seaborn.
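A minimal sketch of the import and loading step, under one loud assumption: instead of the downloaded CSV, it uses the copy of the dataset bundled with scikit-learn (`sklearn.datasets.load_iris`) and renames the columns to the `sepal_length`-style names used in this article. The variable name `iris` is only an illustrative choice.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris  # assumption: stands in for the downloaded CSV

raw = load_iris(as_frame=True)
iris = raw.frame.drop(columns="target")
# Rename to the snake_case feature names used throughout this article.
iris.columns = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
# Map the integer class labels (0, 1, 2) to species names.
iris["species"] = raw.target_names[raw.target.to_numpy()]

print(iris.shape)             # number of rows and columns
print(iris.columns.tolist())  # feature names plus the class column
print(iris.head())            # first five rows
```

If you work from the downloaded file instead, `pd.read_csv` on that file would replace the `load_iris` lines; the column names in that file may differ from the ones assumed here.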
[Figure: Exploring the data]
Here, IRIS is a balanced dataset because every class (Setosa, Virginica, and Versicolor) has 50 data points. If the classes had different numbers of data points, it would be an imbalanced dataset.
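The balance check can be sketched with `value_counts()`. This assumes scikit-learn's bundled copy of the data in place of the downloaded file; the names `iris`, `counts`, and `is_balanced` are illustrative.

```python
from sklearn.datasets import load_iris  # assumption: stands in for the downloaded CSV

raw = load_iris(as_frame=True)
iris = raw.frame.drop(columns="target")
iris.columns = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
iris["species"] = raw.target_names[raw.target.to_numpy()]

# Count the data points per class.
counts = iris["species"].value_counts()
print(counts)

# The dataset is balanced when every class has the same count.
is_balanced = counts.nunique() == 1
print("balanced:", is_balanced)
```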
2D Scatter Plot
Using the pandas object we created earlier, we can draw a simple 2D scatter plot by passing the features as the x and y parameters of the pandas plot() method. The Matplotlib show() method then displays the figure.
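A short sketch of that call, assuming the DataFrame is built from scikit-learn's bundled copy of the data and that sepal_length and sepal_width are the two features being plotted:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris  # assumption: stands in for the downloaded CSV

raw = load_iris(as_frame=True)
iris = raw.frame.drop(columns="target")
iris.columns = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
iris["species"] = raw.target_names[raw.target.to_numpy()]

# pandas draws the scatter plot through Matplotlib; x and y are column names.
ax = iris.plot(kind="scatter", x="sepal_length", y="sepal_width")
plt.show()  # display the figure
```

All points share one color here, so the three species cannot be told apart yet.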
[Figure: 2D Scatter Plot]
With Seaborn, however, we can draw a more informative graph by color-coding the points by flower type.
[Figure: 2D Scatter Plot using Seaborn]
In the graph above, notice that the blue Setosa points can be easily separated from the orange Versicolor and green Virginica points by simply drawing a line, but the orange and green points are still hard to separate because they overlap. That is as much information as the sepal_length and sepal_width features alone can give us.
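One way to get a color-coded plot like this is `sns.scatterplot` with the `hue` parameter. This is a sketch, not necessarily the call the original figure was made with, and it assumes scikit-learn's bundled copy of the data.

```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris  # assumption: stands in for the downloaded CSV

raw = load_iris(as_frame=True)
iris = raw.frame.drop(columns="target")
iris.columns = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
iris["species"] = raw.target_names[raw.target.to_numpy()]

# hue="species" assigns one color per flower type and adds a legend.
ax = sns.scatterplot(data=iris, x="sepal_length", y="sepal_width", hue="species")
legend_labels = [text.get_text() for text in ax.get_legend().get_texts()]
plt.show()
```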
2D Scatter Plot: Pair Plot
Seaborn's pair plot can draw a 2D scatter plot for every possible pair of features in one go.
[Figure: Pair Plot by Seaborn]
If we observe the pair plots, we can say that petal_length and petal_width are the most useful features for identifying the flower types. While Setosa is easily linearly separable, Virginica and Versicolor have some overlap. So we can separate them using a line and some "if-else" conditions.
1D Scatter Plot, Histogram, PDF & CDF
[Figure: 1D Scatter Plot of Petal-Length]
As we can see in the graph, it is very hard to make sense of the data because the points overlap a lot. There are better ways to visualize 1D data. With Seaborn, we can plot a Probability Density Function together with a histogram.
Histogram: A histogram is a plot of the frequency counts in each data window (bin) of the feature being plotted (the bars in the graph).
PDF: The Probability Density Function is basically a smoothed histogram. The height of the PDF at a point represents the relative likelihood of values near that point in the data (the bell-shaped curve in the graph). Strictly speaking it is a density, not a probability, so the probability of a range of values is the area under the curve over that range. The PDF is estimated using Kernel Density Estimation. The higher the y value at a given x, the more data points lie near that x value in the dataset.
[Figures: PDF & Histogram of petal_length, petal_width, sepal_length, and sepal_width]
From these graphs, we can observe that a simple model can be built from just one feature with an if..else condition: if petal_length < 2.5, then the flower type is Setosa.
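The one-feature rule can be written out and checked against the labels. This sketch assumes scikit-learn's bundled copy of the data; the function name `is_setosa` is hypothetical.

```python
from sklearn.datasets import load_iris  # assumption: stands in for the downloaded CSV

raw = load_iris(as_frame=True)
iris = raw.frame.drop(columns="target")
iris.columns = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
iris["species"] = raw.target_names[raw.target.to_numpy()]


def is_setosa(petal_length: float) -> bool:
    """Single-feature rule read off the petal_length plot."""
    return petal_length < 2.5


# Apply the rule to every row and compare with the true labels.
predicted = iris["petal_length"].apply(is_setosa)
actual = iris["species"] == "setosa"
accuracy = float((predicted == actual).mean())
print("Setosa vs. rest accuracy:", accuracy)
```

The rule only separates Setosa from the other two species; it says nothing about telling Versicolor and Virginica apart.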
Now, what if we need the percentage of Versicolor points having a petal_length of less than 5? Here the CDF comes to our rescue!
CDF: The Cumulative Distribution Function is the cumulative sum of the PDF. Every point on the CDF curve represents the integral of the PDF up to that point. Below is the histogram of the Yield. Every point on the CDF tells us what percentage of the total points lie at or below that value.
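A sketch of both ideas for the Versicolor petal lengths: a binned PDF whose cumulative sum gives the CDF, and the direct answer to the "less than 5" question. It assumes scikit-learn's bundled copy of the data, and the choice of 10 bins is arbitrary.

```python
import numpy as np
from sklearn.datasets import load_iris  # assumption: stands in for the downloaded CSV

raw = load_iris()
petal_length = raw.data[:, 2]  # column 2 holds petal length (cm)
species = raw.target_names[raw.target]
versicolor = petal_length[species == "versicolor"]

# Binned PDF: fraction of points per bin; the CDF is its running total.
counts, bin_edges = np.histogram(versicolor, bins=10)
pdf = counts / counts.sum()
cdf = np.cumsum(pdf)
print("CDF at the right edge of each bin:", cdf)

# Direct empirical answer: share of Versicolor points below 5 cm.
frac_below_5 = float(np.mean(versicolor < 5))
print(f"{100 * frac_below_5:.0f}% of Versicolor points have petal_length < 5")
```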
To construct a histogram, the first step is to "bin" the range of values (that is, divide the entire range of values into a series of intervals) and then count how many values fall into each interval. The bins are usually specified as consecutive, non-overlapping intervals of a variable. The bins (intervals) must be adjacent and are often (but are not required to be) of equal size (for more information: https://www.datacamp.com/community/tutorials/histograms-matplotlib).
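The binning step can be sketched with `np.histogram`. The bin edges below are an arbitrary illustration of adjacent, unequal-width intervals, and the data comes from scikit-learn's bundled copy of the dataset.

```python
import numpy as np
from sklearn.datasets import load_iris  # assumption: stands in for the downloaded CSV

petal_length = load_iris().data[:, 2]  # column 2 holds petal length (cm)

# Adjacent, non-overlapping bins; the last one is deliberately wider.
edges = [1, 2, 3, 4, 5, 7]
counts, used_edges = np.histogram(petal_length, bins=edges)

for left, right, count in zip(used_edges[:-1], used_edges[1:], counts):
    print(f"[{left}, {right}): {count} values")
```

The second bin comes out empty, which is the same gap between Setosa and the other two species that is visible in the plots. Note that `np.histogram` treats the last bin as closed on the right, unlike the half-open notation printed for the others.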
Now, by plotting the CDFs of petal_length for the different flower types together, we can get an overall picture of the data.
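A sketch of such a combined plot, using the empirical CDF (sorted values against their cumulative share) rather than a binned one. This is one possible construction, assuming scikit-learn's bundled copy of the data.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris  # assumption: stands in for the downloaded CSV

raw = load_iris(as_frame=True)
iris = raw.frame.drop(columns="target")
iris.columns = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
iris["species"] = raw.target_names[raw.target.to_numpy()]

fig, ax = plt.subplots()
for name, group in iris.groupby("species"):
    values = np.sort(group["petal_length"].to_numpy())
    # Share of this species' points at or below each sorted value.
    cdf = np.arange(1, len(values) + 1) / len(values)
    ax.step(values, cdf, where="post", label=name)

ax.set_xlabel("petal_length")
ax.set_ylabel("cumulative share of points")
ax.legend()
plt.show()
```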
Mean, Variance and Standard Deviation
Mean: https://en.wikipedia.org/wiki/Mean
Variance: https://en.wikipedia.org/wiki/Variance
Standard Deviation: https://en.wikipedia.org/wiki/Standard_deviation
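These three statistics can be computed per species with a pandas group-by. A sketch, assuming scikit-learn's bundled copy of the data and petal_length as the feature of interest:

```python
from sklearn.datasets import load_iris  # assumption: stands in for the downloaded CSV

raw = load_iris(as_frame=True)
iris = raw.frame.drop(columns="target")
iris.columns = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
iris["species"] = raw.target_names[raw.target.to_numpy()]

# Mean, variance, and standard deviation of petal_length for each species.
stats = iris.groupby("species")["petal_length"].agg(["mean", "var", "std"])
print(stats)
```

pandas computes the sample variance and standard deviation (dividing by n - 1), whereas `np.var` and `np.std` divide by n unless `ddof=1` is passed, so the two can differ slightly.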
Median, Percentile, Quantile, MAD, IQR
Median: https://en.wikipedia.org/wiki/Median
Percentile: https://en.wikipedia.org/wiki/Percentile
Quantile: https://en.wikipedia.org/wiki/Quantile
MAD: Median Absolute Deviation: https://en.wikipedia.org/wiki/Median_absolute_deviation
IQR: Interquartile Range: https://en.wikipedia.org/wiki/Interquartile_range
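A sketch of these robust statistics for the Versicolor petal lengths, computed with NumPy. The MAD here is the plain median of absolute deviations from the median, without the scaling constant that some libraries apply. The data again comes from scikit-learn's bundled copy.

```python
import numpy as np
from sklearn.datasets import load_iris  # assumption: stands in for the downloaded CSV

raw = load_iris()
species = raw.target_names[raw.target]
versicolor = raw.data[species == "versicolor", 2]  # petal length (cm)

median = np.median(versicolor)
q25, q50, q75 = np.percentile(versicolor, [25, 50, 75])  # quartiles
p10, p90 = np.quantile(versicolor, [0.1, 0.9])           # quantiles use fractions
iqr = q75 - q25                                          # interquartile range
mad = np.median(np.abs(versicolor - median))             # median absolute deviation

print("median:", median)
print("quartiles:", q25, q50, q75)
print("10th and 90th percentiles:", p10, p90)
print("IQR:", iqr, "MAD:", mad)
```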
Box Plots
A box plot with whiskers is another way to visualize the 1D scatter plot more intuitively. The boxes in the graph represent the interquartile range: the first horizontal line from the bottom of the box is the 25th percentile, the middle line is the 50th percentile, and the top line is the 75th percentile. The black lines outside the boxes are called whiskers. What the whiskers represent is not fixed; in some cases the bottom whisker marks the minimum value of the feature and the top whisker the maximum.
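A minimal sketch with `sns.boxplot`, assuming scikit-learn's bundled copy of the data. With Seaborn's default setting (`whis=1.5`), the whiskers reach the most extreme points within 1.5 × IQR of the box, and points beyond that are drawn individually as outliers.

```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris  # assumption: stands in for the downloaded CSV

raw = load_iris(as_frame=True)
iris = raw.frame.drop(columns="target")
iris.columns = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
iris["species"] = raw.target_names[raw.target.to_numpy()]

# One box per species showing the quartiles of petal_length.
ax = sns.boxplot(data=iris, x="species", y="petal_length")
plt.show()
```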
Violin Plots
Seaborn's violin plot combines a PDF and a box plot. In the plot below, for all three colors, the PDFs of petal_length form the sides of each shape, and the box plot is drawn in black in the center.
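A minimal sketch with `sns.violinplot`, under the same assumption that scikit-learn's bundled copy of the data stands in for the downloaded file:

```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris  # assumption: stands in for the downloaded CSV

raw = load_iris(as_frame=True)
iris = raw.frame.drop(columns="target")
iris.columns = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
iris["species"] = raw.target_names[raw.target.to_numpy()]

# Mirrored density on each side, with a small box plot in the center.
ax = sns.violinplot(data=iris, x="species", y="petal_length")
plt.show()
```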
Multivariate Probability Density: Contour Plot
Seaborn provides the jointplot() method for contours. It is called "jointplot" because it shows the contours together with the PDFs on the edges. The darker the region, the higher the probability density of the feature values there.
Translated from: https://medium.com/swlh/exploratory-data-analysis-of-iris-dataset-2ab58e1a5dc6