

Exploratory Data Analysis (EDA): Python


What is Exploratory Data Analysis (EDA)?

If we want to explain EDA in simple terms, it means trying to understand the given data much better, so that we can make some sense out of it.


We can find a more formal definition in Wikipedia.


In statistics, exploratory data analysis is an approach to analyzing data sets to summarize their main characteristics, often with visual methods. A statistical model can be used or not, but primarily EDA is for seeing what the data can tell us beyond the formal modeling or hypothesis testing task.


EDA in Python uses data visualization to draw meaningful patterns and insights. It also involves the preparation of data sets for analysis by removing irregularities in the data.


Based on the results of EDA, companies also make business decisions, which can have repercussions later.


  • If EDA is not done properly, it can hamper the later steps of the machine-learning model-building process.
  • If done well, it can improve the efficacy of everything we do next.

In this article, we’ll cover the following topics:

  • Data Sourcing
  • Data Cleaning
  • Univariate Analysis
  • Bivariate Analysis
  • Multivariate Analysis
1. Data Sourcing

Data Sourcing is the process of finding and loading data into our system. Broadly, there are two ways in which we can find data:

  • Private Data
  • Public Data

Private Data

As the name suggests, private data is provided by private organizations, and it comes with some security and privacy concerns attached. This type of data is used mainly for an organization's internal analysis.

Public Data

This type of data is available to everyone. We can find it on government websites, at public organizations, and so on. Anyone can access this data; we do not need any special permissions or approval.

We can get public data from the following sites:

  • https://data.gov
  • https://data.gov.uk
  • https://data.gov.in
  • https://www.kaggle.com/
  • https://archive.ics.uci.edu/ml/index.php
  • https://github.com/awesomedata/awesome-public-datasets

The very first step of EDA is Data Sourcing; we have seen how to access data and load it into our system. The next step is cleaning the data.
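In practice, loading a sourced file usually comes down to a single pandas call. A minimal sketch (the CSV text below is a hypothetical stand-in for a downloaded public file; `pd.read_csv` also accepts a URL or a local path):

```python
import io
import pandas as pd

# Hypothetical stand-in for a public CSV we have downloaded;
# pd.read_csv accepts a URL, a local file path, or any file-like object.
csv_text = """age,job,salary
30,technician,50000
45,management,80000
"""

df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)            # (2, 3)
print(list(df.columns))    # ['age', 'job', 'salary']
```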

2. Data Cleaning

After completing Data Sourcing, the next step in the EDA process is Data Cleaning. It is very important to get rid of irregularities and clean the data after sourcing it into our system.

Irregularities come in different forms:

  • Missing Values
  • Incorrect Format
  • Incorrect Headers
  • Anomalies/Outliers
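A quick first pass with pandas surfaces most of these irregularities at once. A minimal sketch (the toy frame below is illustrative, not the marketing dataset itself):

```python
import numpy as np
import pandas as pd

# Toy frame exhibiting the irregularities listed above.
df = pd.DataFrame({
    "age": [25.0, 40.0, np.nan, 300.0],              # one missing value, one outlier
    "salary": ["50000", "60000", "70000", "80000"],  # numbers stored as text
})

print(df.isnull().sum())    # locates missing values (age: 1)
print(df.dtypes)            # spots incorrect formats (salary is object)

df["salary"] = pd.to_numeric(df["salary"])  # fix the format
print(df.describe())        # min/max hint at outliers (age max is 300)
```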

To perform the data cleaning, we are using a sample data set, which can be found here.

We are using a Jupyter Notebook for the analysis.

First, let's import the necessary libraries and load the data into our system for analysis.

# Import the useful libraries.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# Read the "Marketing Analysis" data set.
data = pd.read_csv("marketing_analysis.csv")

# Print the data.
data

Now, the data set looks like this:

[Image: Marketing Analysis Dataset]

If we observe the above dataset, there are some discrepancies in the column headers for the first 2 rows. The correct data starts from index number 1, so we have to fix the first two rows.

This is called Fixing the Rows and Columns. Let's ignore the first two rows and load the data again.

# Import the useful libraries.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# Read the file, skipping the first two rows as they are of no use.
data = pd.read_csv("marketing_analysis.csv", skiprows=2)

# Print the head of the data frame.
data.head()

Now the dataset looks like this, and it makes more sense:

[Image: Dataset after fixing the rows and columns]

The following steps should be taken while Fixing Rows and Columns:

  • Delete summary rows and columns in the dataset.
  • Delete header and footer rows on every page.
  • Delete extra rows such as blank rows, page numbers, etc.
  • Merge different columns if it makes for a better understanding of the data.
  • Similarly, split one column into multiple columns based on our requirements or understanding.
  • Add column names; it is very important for the dataset to have column names.
Now, if we observe the above dataset, the customerid column has no importance to our analysis, and the jobedu column contains both the job and education information.

So what we'll do is drop the customerid column, split the jobedu column into two new columns, job and education, and then drop the jobedu column as well.

# Drop the customer id as it is of no use.
data.drop('customerid', axis=1, inplace=True)

# Extract job & education from the "jobedu" column.
data['job'] = data["jobedu"].apply(lambda x: x.split(",")[0])
data['education'] = data["jobedu"].apply(lambda x: x.split(",")[1])

# Drop the "jobedu" column from the dataframe.
data.drop('jobedu', axis=1, inplace=True)

# Print the dataset.
data

Now, the dataset looks like this:

[Image: Dataset after dropping the customerid and jobedu columns and adding the job and education columns]

Missing Values

If there are missing values in the dataset, we need to handle them before doing any statistical analysis.

There are mainly three types of missing values:

  • MCAR (Missing completely at random): these values do not depend on any other feature.
  • MAR (Missing at random): these values may depend on some other features.
  • MNAR (Missing not at random): these values are missing for some specific reason.

Let's see which columns have missing values in the dataset.

# Checking the missing values
data.isnull().sum()

The output will be:

[Image: Null values in the data set]

As we can see, three columns contain missing values. Let's see how to handle them: we can either drop the missing records or impute the values.

Drop the Missing Values

Let's handle the missing values in the age column.

# Drop the records with age missing in the data dataframe.
data = data[~data.age.isnull()].copy()

# Check the missing values in the dataset.
data.isnull().sum()

Let's check the missing values in the dataset now.

[Image: Missing values after handling the age column]

Let's impute values for the missing entries in the month column.

Since the month column is of object type, let's calculate the mode of that column and impute it into the missing values.

# Find the mode of month in the data.
month_mode = data.month.mode()[0]

# Fill the missing values with the mode value of month.
data.month.fillna(month_mode, inplace=True)

# Let's see the null values in the month column.
data.month.isnull().sum()

Now the output is:

# Mode of month is 'may, 2017'
# Null values in month column after imputing with mode: 0

Handling the missing values in the response column: since response is our target column, imputing values into it would affect our analysis. So it is better to drop the records with missing values in the response column.

# Drop the records with response missing in the data.
data = data[~data.response.isnull()].copy()

# Calculate the missing values in each column of the data frame.
data.isnull().sum()

Let's check whether the missing values in the dataset have been handled:

[Image: All the missing values have been handled]

We can also leave the missing values as 'NaN' so that they won't affect the outcome of any statistical analysis.
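This works because pandas statistics skip NaN by default. A small sketch with toy values:

```python
import numpy as np
import pandas as pd

# Pandas aggregations ignore NaN unless told otherwise, so gaps left
# as NaN do not silently distort the computed summaries.
s = pd.Series([10.0, np.nan, 30.0])

print(s.mean())               # 20.0 -- the NaN is skipped, not counted as zero
print(s.mean(skipna=False))   # nan  -- strict behaviour, only on request
```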

Handling Outliers

We have seen how to fix missing values; now let's see how to handle outliers in the dataset.

Outliers are values that lie far beyond the next nearest data points.

There are two types of outliers:

  • Univariate outliers: data points whose values lie beyond the range of expected values based on one variable.
  • Multivariate outliers: while plotting data, some values of one variable may not lie beyond the expected range, but when you plot the data against some other variable, these values may lie far from the expected value.

So, after understanding the causes of these outliers, we can handle them by dropping the records, imputing values, or leaving them as is, whichever makes more sense.
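For univariate outliers, a common rule of thumb (not specific to this article's dataset) flags points more than 1.5 * IQR beyond the quartiles. A minimal sketch with made-up values:

```python
import pandas as pd

# Made-up values; 120 is the obvious outlier.
s = pd.Series([25, 30, 32, 35, 38, 40, 120])

# Interquartile range and the conventional 1.5 * IQR fences.
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = s[(s < lower) | (s > upper)]
print(outliers.tolist())   # [120]
```

Whether to drop, cap, or keep the flagged points still depends on the domain reasoning described above.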

Standardizing Values

To perform data analysis on a set of values, we have to make sure the values in the same column are on the same scale. For example, if the data contains the top speeds of different companies' cars, then the whole column should be either on the meters/sec scale or the miles/sec scale.
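Continuing the car-speed example, here is a sketch of bringing a mixed-unit column onto one scale (the column names and values are illustrative; 1 mph = 0.44704 m/s exactly):

```python
import pandas as pd

# Hypothetical top speeds recorded in mixed units.
cars = pd.DataFrame({
    "speed": [200.0, 120.0],
    "unit":  ["km/h", "mph"],
})

# Convert everything to metres per second before comparing.
factor = {"km/h": 1000 / 3600, "mph": 0.44704}
cars["speed_ms"] = cars["speed"] * cars["unit"].map(factor)

print(cars["speed_ms"].round(2).tolist())   # [55.56, 53.64]
```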

Now that we are clear on how to source and clean the data, let's see how we can analyze it.

3. Univariate Analysis

If we analyze data over a single variable/column from a dataset, it is known as Univariate Analysis.

Categorical Unordered Univariate Analysis:

An unordered variable is a categorical variable with no defined order. Taking our data as an example, the job column in the dataset is divided into many sub-categories such as technician, blue-collar, services, management, etc. There is no weight or measure given to any value in the job column.

Now, let's analyze the job category using plots. Since job is a category, we will plot a bar plot.

# Let's calculate the percentage of each job status category.
data.job.value_counts(normalize=True)

# Plot the bar graph of percentage job categories.
data.job.value_counts(normalize=True).plot.barh()
plt.show()

The output looks like this:

From the above bar plot, we can infer that the data set contains more blue-collar workers than any other category.

Categorical Ordered Univariate Analysis:

Ordered variables are categorical variables that have a natural rank order. Some examples of categorical ordered variables from our dataset are:

  • Month: Jan, Feb, March, …
  • Education: Primary, Secondary, …

Now, let's analyze the education variable from the dataset. Since we've already seen a bar plot, let's see what a pie chart looks like.

# Calculate the percentage of each education category.
data.education.value_counts(normalize=True)

# Plot the pie chart of education categories.
data.education.value_counts(normalize=True).plot.pie()
plt.show()

The output will be:

From the above analysis, we can infer that most of the data set belongs to secondary education, followed by tertiary and then primary. Also, a very small percentage is unknown.

This is how we perform univariate categorical analysis. If the column or variable is numerical, then we analyze it by calculating its mean, median, standard deviation, etc. We can get those values using the describe function.

data.salary.describe()

The output will be:

4. Bivariate Analysis

If we analyze data by taking two variables/columns from a dataset into consideration, it is known as Bivariate Analysis.

a) Numeric-Numeric Analysis:

Analyzing two numeric variables from a dataset is known as numeric-numeric analysis. We can analyze it in three different ways:

  • Scatter Plot
  • Pair Plot
  • Correlation Matrix

Scatter Plot

Let's take the three columns balance, age, and salary from our dataset and see what we can infer by plotting a scatter plot of salary vs balance and of age vs balance.

# Plot the scatter plot of the balance and salary variables.
plt.scatter(data.salary, data.balance)
plt.show()

# Plot the scatter plot of the balance and age variables.
data.plot.scatter(x="age", y="balance")
plt.show()

Now, the scatter plots look like this:

[Image: Scatter plots]

Pair Plot

Now, let's plot pair plots for the three columns we used in the scatter plots. We'll use the seaborn library for plotting pair plots.

# Plot the pair plot of salary, balance and age in the data dataframe.
sns.pairplot(data=data, vars=['salary', 'balance', 'age'])
plt.show()

The pair plot looks like this:

[Image: Pair plots for age, balance, and salary]

Correlation Matrix

Since we cannot use more than two variables as the x-axis and y-axis in scatter and pair plots, it is difficult to see the relation between three numerical variables in a single graph. In those cases, we use a correlation matrix.

# Create a matrix using age, salary, and balance as rows and columns.
data[['age', 'salary', 'balance']].corr()

# Plot the correlation matrix of salary, balance and age.
sns.heatmap(data[['age', 'salary', 'balance']].corr(), annot=True, cmap='Reds')
plt.show()

First, we created a matrix using age, salary, and balance. After that, we plotted a heatmap of the matrix using the seaborn library.

b) Numeric-Categorical Analysis

Analyzing one numeric variable and one categorical variable from a dataset is known as numeric-categorical analysis. We analyze them mainly using the mean, the median, and box plots.

Let's take the salary and response columns from our dataset.

First, check the mean value using groupby:

# Group by the response to find the mean salary for response no & yes separately.
data.groupby('response')['salary'].mean()

The output will be:

[Image: Response and salary using the mean]

There is not much of a difference between the yes and no responses based on salary.

Let's calculate the median:

# Group by the response to find the median salary for response no & yes separately.
data.groupby('response')['salary'].median()

The output will be:

By both the mean and the median, we can say that the yes and no responses remain the same irrespective of the person's salary. But is that truly the case? Let's plot the box plot and check the behavior.

# Plot the box plot of salary for yes & no responses.
sns.boxplot(x=data.response, y=data.salary)
plt.show()

The box plot looks like this:

As we can see, the box plot paints a very different picture compared to the mean and median. The IQR for customers who gave a positive response lies on the higher salary side.

This is how we analyze numeric-categorical variables: we use the mean, the median, and box plots to draw conclusions.

c) Categorical-Categorical Analysis

Since our target variable/column is the response rate, we'll see how the different categories, such as education, marital status, etc., are associated with the response column. Instead of 'yes' and 'no', we will convert them into '1' and '0'; by doing that, we'll get the "response rate".

# Create response_rate of numerical data type where response "yes" = 1, "no" = 0.
data['response_rate'] = np.where(data.response == 'yes', 1, 0)
data.response_rate.value_counts()

The output looks like this:

Let's see how the response rate varies for the different categories of marital status.

# Plot the bar graph of marital status with the average value of response_rate.
data.groupby('marital')['response_rate'].mean().plot.bar()
plt.show()

The graph looks like this:

From the above graph, we can infer that positive responses are more common among single members of the data set. Similarly, we can plot the graphs for loan vs response rate, housing loan vs response rate, etc.

5. Multivariate Analysis

If we analyze data by taking more than two variables/columns from a dataset into consideration, it is known as Multivariate Analysis.

Let's see how education, marital status, and response_rate vary with each other.

First, we'll create a pivot table with the three columns, and after that, we'll create a heatmap.

# Create a pivot table of education vs marital vs response_rate.
result = pd.pivot_table(data=data, index='education', columns='marital', values='response_rate')
print(result)

# Create a heat map of education vs marital vs response_rate.
sns.heatmap(result, annot=True, cmap='RdYlGn', center=0.117)
plt.show()

The pivot table and heatmap look like this:

Based on the heatmap, we can infer that married people with primary education are less likely to respond positively to the survey, while single people with tertiary education are most likely to respond positively.

Similarly, we can plot the graphs for job vs marital vs response, education vs poutcome vs response, etc.

Conclusion

This is how we do Exploratory Data Analysis. EDA helps us look beyond the data: the more we explore the data, the more insights we draw from it. As data analysts, almost 80% of our time will be spent understanding data and solving various business problems through EDA.

Thank you for reading, and happy coding!

Check out my previous articles about Python here:

  • Indexing in Pandas Dataframe using Python
  • Seaborn: Python
  • Pandas: Python
  • Matplotlib: Python
  • NumPy: Python
  • Data Visualization and its Importance: Python
  • Time Complexity and Its Importance in Python
  • Python Recursion or Recursive Function in Python

Translated from: https://towardsdatascience.com/exploratory-data-analysis-eda-python-87178e35b14

