A Hands-On Guide to Feature Engineering in Python
Introduction
In this guide, I will walk through how to use data manipulation to extract features manually.
Manual feature engineering can be exhausting and requires plenty of time, experience, and domain knowledge to develop the right features. There are many automatic feature engineering tools available, such as FeatureTools and AutoFeat. However, manual feature engineering is essential for understanding those advanced tools, and it helps in building a robust and generic model. I will use the home-credit-default-risk dataset available on the Kaggle platform, and only two tables, bureau and bureau_balance, from the main folder. According to the dataset description on the competition page, the tables are the following:
bureau.csv
- This table includes all clients’ previous credits from other financial institutions that were reported to the Credit Bureau.
bureau_balance.csv
- Monthly balances of earlier loans in the Credit Bureau.
- This table has one row for each month of the history of every previous loan reported to the Credit Bureau.
Topics covered in this tutorial
1.讀取和整理數(shù)據(jù) (1. Reading and Munging the data)
I will start by importing some important libraries that will help in understanding the data.
# pandas and numpy for data manipulation
import pandas as pd
import numpy as np

# matplotlib and seaborn for plotting
import matplotlib.pyplot as plt
import seaborn as sns

# Suppress warnings from pandas
import warnings
warnings.filterwarnings('ignore')

plt.style.use('fivethirtyeight')
I will start by analyzing bureau.csv:
# Read in bureau.csv
bureau = pd.read_csv('../input/home-credit-default-risk/bureau.csv')
bureau.head()
This table has 1,716,428 observations and 17 features.
SK_ID_CURR int64
SK_ID_BUREAU int64
CREDIT_ACTIVE object
CREDIT_CURRENCY object
DAYS_CREDIT int64
CREDIT_DAY_OVERDUE int64
DAYS_CREDIT_ENDDATE float64
DAYS_ENDDATE_FACT float64
AMT_CREDIT_MAX_OVERDUE float64
CNT_CREDIT_PROLONG int64
AMT_CREDIT_SUM float64
AMT_CREDIT_SUM_DEBT float64
AMT_CREDIT_SUM_LIMIT float64
AMT_CREDIT_SUM_OVERDUE float64
CREDIT_TYPE object
DAYS_CREDIT_UPDATE int64
AMT_ANNUITY float64
dtype: object
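The shape and dtypes reported above come from standard pandas attributes. A minimal sketch, run here on a small stand-in frame since the Kaggle CSV may not be present locally:

```python
import pandas as pd

# Stand-in for the real table; on Kaggle you would instead run:
# bureau = pd.read_csv('../input/home-credit-default-risk/bureau.csv')
bureau = pd.DataFrame({
    'SK_ID_CURR': [100001, 100001, 100002],
    'SK_ID_BUREAU': [5714462, 5714463, 5714464],
    'CREDIT_ACTIVE': ['Closed', 'Active', 'Active'],
    'AMT_CREDIT_SUM': [91323.0, 225000.0, 464323.5],
})

print(bureau.shape)   # (rows, columns) -> observations and features
print(bureau.dtypes)  # per-column types, like the listing above
```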
We need to get the number of previous loans per client id, which is SK_ID_CURR. We can get that using the pandas aggregation functions groupby and count(), then store the result in a new dataframe after renaming SK_ID_BUREAU to previous_loan_count for readability.
# Group by client id and count previous loans
prev_loan_count = bureau.groupby('SK_ID_CURR', as_index = False)['SK_ID_BUREAU'].count().rename(columns = {'SK_ID_BUREAU': 'previous_loan_count'})
The new prev_loan_count dataframe has only 305,811 observations. Now I will merge prev_loan_count into the train dataset on the client id SK_ID_CURR, then fill the missing values with 0. Finally, I check that the new column has been added using the dtypes attribute.
# Read train.csv and join with the training dataframe
pd.set_option('display.max_columns', None)
train = pd.read_csv('../input/home-credit-default-risk/application_train.csv')
train = train.merge(prev_loan_count, on = 'SK_ID_CURR', how = 'left')

# Fill the missing values with 0
train['previous_loan_count'] = train['previous_loan_count'].fillna(0)
train['previous_loan_count'].dtypes
# dtype('float64')
It is already there!
它已經(jīng)在那里!
2.研究相關(guān)性 (2. Investigate correlation)
The next step is to explore the Pearson correlation coefficient (r-value) between the attributes and the target as a rough gauge of feature importance. It is not a true measure of importance for new variables, but it provides a reference for whether a variable will be helpful to the model or not.
A higher correlation with the dependent variable means that changes in that variable are associated with significant changes in the dependent variable. So, in the next step, I will look at the variables with the highest absolute r-value relative to the dependent variable.
A Kernel Density Estimate (KDE) plot is a good way to describe the relation between the dependent variable and an independent variable.
# Plots the distribution of a variable colored by value of the dependent variable
def kde_target(var_name, df):
    # Calculate the correlation coefficient between the new variable and the target
    corr = df['TARGET'].corr(df[var_name])

    # Calculate medians for repaid vs not repaid
    avg_repaid = df.loc[df['TARGET'] == 0, var_name].median()
    avg_not_repaid = df.loc[df['TARGET'] == 1, var_name].median()

    plt.figure(figsize = (12, 6))

    # Plot the distribution for target == 0 and target == 1
    sns.kdeplot(df.loc[df['TARGET'] == 0, var_name], label = 'TARGET == 0')
    sns.kdeplot(df.loc[df['TARGET'] == 1, var_name], label = 'TARGET == 1')

    # Label the plot
    plt.xlabel(var_name); plt.ylabel('Density'); plt.title('%s Distribution' % var_name)
    plt.legend();

    # Print out the correlation
    print('The correlation between %s and the TARGET is %0.4f' % (var_name, corr))

    # Print out median values
    print('Median value for loan that was not repaid = %0.4f' % avg_not_repaid)
    print('Median value for loan that was repaid =     %0.4f' % avg_repaid)
Then check the distribution of previous_loan_count against TARGET:
kde_target('previous_loan_count', train)

[Figure: KDE plot of previous_loan_count]

It is hard to see any significant correlation between TARGET and previous_loan_count; no meaningful relationship can be detected from the diagram. So, more variables need to be investigated using aggregation functions.
3.匯總數(shù)字列 (3. Aggregate numeric columns)
I will pick the numeric columns grouped by client id, then apply the statistics functions min, max, sum, mean, and count to get summary statistics for each numeric feature.
# Group by the client id and calculate aggregation statistics
bureau_agg = bureau.drop(columns = ['SK_ID_BUREAU']).groupby('SK_ID_CURR', as_index = False).agg(['count', 'mean', 'min', 'max', 'sum']).reset_index()
Create a new name for each column for readability's sake, then merge with the train dataset.
# List of column names
columns = ['SK_ID_CURR']

# Iterate through the variable names
for var in bureau_agg.columns.levels[0]:
    # Skip the id name
    if var != 'SK_ID_CURR':
        # Iterate through the stat names
        for stat in bureau_agg.columns.levels[1][:-1]:
            # Make a new column name for the variable and stat
            columns.append('bureau_%s_%s' % (var, stat))

# Assign the list of column names as the dataframe column names
bureau_agg.columns = columns

# Merge with the train dataset
train = train.merge(bureau_agg, on = 'SK_ID_CURR', how = 'left')
Get the correlation with the TARGET variable, then sort the correlations by absolute value using the sort_values() function.
# Calculate correlation between the variables and the dependent variable,
# then sort the correlations by absolute value (keeping the sign)
new_corrs = train.drop(columns=['TARGET']).corrwith(train['TARGET']).sort_values(key=abs, ascending=False)
new_corrs[:15]

[Output: the 15 strongest correlations with the TARGET variable]
Now check the KDE plot for one of the newly created variables:
kde_target('bureau_DAYS_CREDIT_mean', train)

[Figure: KDE plot of bureau_DAYS_CREDIT_mean and its correlation with the TARGET]

As illustrated, the correlation is again very weak and could be just noise. Furthermore, a larger negative value indicates the previous loan was taken further before the current loan application.
4. Get stats for the bureau_balance
bureau_balance = pd.read_csv('../input/home-credit-default-risk/bureau_balance.csv')
bureau_balance.head()

5. Investigating the categorical variables
The following function iterates over the dataframe, picks the categorical columns, and creates dummy variables for them.
def process_categorical(df, group_var, col_name):
    """Computes counts and normalized counts for each observation
    of `group_var` of each unique category in every categorical variable

    Parameters
    --------
    df : dataframe
        The dataframe to calculate the value counts for.
    group_var : string
        The variable by which to group the dataframe. For each unique
        value of this variable, the final dataframe will have one row
    col_name : string
        Variable added to the front of column names to keep track of columns

    Return
    --------
    categorical : dataframe
        A dataframe with counts and normalized counts of each unique category
        in every categorical variable, with one row for every unique value
        of the `group_var`.
    """
    # Pick the categorical (object-dtype) columns and one-hot encode them
    categorical = pd.get_dummies(df.select_dtypes('O'))

    # Put an id for each row
    categorical[group_var] = df[group_var]

    # Aggregate by the group_var; for dummy columns, 'sum' is the raw
    # count of each category and 'mean' is its normalized count
    categorical = categorical.groupby(group_var).agg(['sum', 'mean'])

    columns_name = []

    # Iterate over the columns in level 0
    for var in categorical.columns.levels[0]:
        # Iterate through level 1 for stats
        for stat in ['count', 'count_norm']:
            # Make a new column name
            columns_name.append('%s_%s_%s' % (col_name, var, stat))

    categorical.columns = columns_name

    return categorical
This function returns the sum and mean statistics for each categorical column.
bureau_counts = process_categorical(bureau, group_var = 'SK_ID_CURR', col_name = 'bureau')

Do the same for bureau_balance:
bureau_balance_counts = process_categorical(df = bureau_balance, group_var = 'SK_ID_BUREAU', col_name = 'bureau_balance')

Now we have the calculations for each loan, but we need to aggregate them for each client. I will merge all the previous dataframes together, then aggregate the statistics again grouped by SK_ID_CURR.
# Dataframe grouped by the loan
bureau_by_loan = bureau_balance_agg.merge(bureau_balance_counts, right_index = True, left_on = 'SK_ID_BUREAU', how = 'outer')

# Merge to include the SK_ID_CURR
bureau_by_loan = bureau[['SK_ID_BUREAU', 'SK_ID_CURR']].merge(bureau_by_loan, on = 'SK_ID_BUREAU', how = 'left')

# Aggregate the stats for each client
bureau_balance_by_client = agg_numeric(bureau_by_loan.drop(columns = ['SK_ID_BUREAU']), group_var = 'SK_ID_CURR', col_name = 'client')
6.將計(jì)算出的特征插入訓(xùn)練數(shù)據(jù)集中 (6. Insert computed feature into train dataset)
original_features = list(train.columns)
print('Original Number of Features: ', len(original_features))

The output is: Original Number of Features: 122
# Merge with the value counts of bureau
train = train.merge(bureau_counts, on = 'SK_ID_CURR', how = 'left')

# Merge with the stats of bureau
train = train.merge(bureau_agg, on = 'SK_ID_CURR', how = 'left')

# Merge with the monthly information grouped by client
train = train.merge(bureau_balance_by_client, on = 'SK_ID_CURR', how = 'left')

new_features = list(train.columns)
print('Number of features using previous loans from other institutions data: ', len(new_features))
Output is: Number of features using previous loans from other institutions data: 333
7.檢查丟失的數(shù)據(jù) (7. Check the missing data)
It is very important to check for missing data in the training set after merging in the new features.
# Function to calculate missing values by column
def missing_percent(df):
    """Computes the count and percentage of missing values
    in each column of a dataframe

    Parameters
    --------
    df : dataframe
        The dataframe to calculate the missing values for.

    Return
    --------
    mis_columns : dataframe
        A dataframe with the missing-value information.
    """
    # Total missing values
    mis_val = df.isnull().sum()

    # Percentage of missing values
    mis_percent = 100 * df.isnull().sum() / len(df)

    # Make a table with the results
    mis_table = pd.concat([mis_val, mis_percent], axis=1)

    # Rename the columns
    mis_columns = mis_table.rename(
        columns = {0 : 'Missing Values', 1 : 'Percent of Total Values'})

    # Sort the table by percentage of missing, descending
    mis_columns = mis_columns[
        mis_columns.iloc[:,1] != 0].sort_values(
        'Percent of Total Values', ascending=False).round(2)

    # Print some summary information
    print("Your selected dataframe has " + str(df.shape[1]) + " columns.\n"
          "There are " + str(mis_columns.shape[0]) +
          " columns that have missing values.")

    # Return the dataframe with missing information
    return mis_columns

train_missing = missing_percent(train)
train_missing.head()

[Output: the columns with the highest missing percentages]
There are quite a number of columns with plenty of missing data. I am going to drop any column that has more than 90% missing data.
missing_vars_train = train_missing.loc[train_missing['Percent of Total Values'] > 90, 'Percent of Total Values']
len(missing_vars_train)
# 0
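No column crosses the 90% threshold here, so nothing is dropped, but the removal step itself is short. A sketch on a toy frame, with a lower threshold so the drop actually fires:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'keep': [1.0, 2.0, 3.0, 4.0],
    'mostly_missing': [np.nan, np.nan, np.nan, 1.0],  # 75% missing
})

threshold = 70  # percent; the text uses 90
percent_missing = 100 * df.isnull().sum() / len(df)
to_drop = percent_missing[percent_missing > threshold].index
df = df.drop(columns = to_drop)
print(list(df.columns))
```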
I will do the same for the test data.
# Read in the test dataframe
test = pd.read_csv('../input/home-credit-default-risk/application_test.csv')

# Merge with the value counts of bureau
test = test.merge(bureau_counts, on = 'SK_ID_CURR', how = 'left')

# Merge with the stats of bureau
test = test.merge(bureau_agg, on = 'SK_ID_CURR', how = 'left')

# Merge with the monthly information grouped by client
test = test.merge(bureau_balance_by_client, on = 'SK_ID_CURR', how = 'left')
Then I will align the train and test datasets and check that their shapes and columns match.
# Create a train target label
train_label = train['TARGET']

# Align both dataframes; this will remove the TARGET column
train, test = train.align(test, join='inner', axis = 1)
train['TARGET'] = train_label

print('Training Data Shape: ', train.shape)
print('Testing Data Shape: ', test.shape)
# Training Data Shape: (307511, 333)
# Testing Data Shape: (48744, 332)
Let’s check the missing percentage on the test set.
test_missing = missing_percent(test)
test_missing.head()
8.相關(guān)性 (8. Correlations)
I will check the correlation of the TARGET variable with the newly created features.
# Calculate correlations across the whole dataframe
corr_train = train.corr()

# Sort the resulting values in descending order
corr_train = corr_train.sort_values('TARGET', ascending = False)

# Show the ten most positive correlations
pd.DataFrame(corr_train['TARGET'].head(10))

[Output: the top 10 features most correlated with the target variable]
As observed from the sample above, the most correlated variables are the ones engineered earlier. However, correlation doesn’t mean causation, which is why we need to assess those correlations and pick the variables that have a deeper influence on the TARGET. To do so, I will stick with the KDE plot.
kde_target('bureau_DAYS_CREDIT_mean', train)

[Figure: KDE plot of bureau_DAYS_CREDIT_mean]

The plot suggests that applicants with a greater number of monthly records per loan tend to repay the new loan. Let’s look at the bureau_CREDIT_ACTIVE_Active_count_norm variable to see whether this holds.
kde_target('bureau_CREDIT_ACTIVE_Active_count_norm', train)

[Figure: KDE plot of bureau_CREDIT_ACTIVE_Active_count_norm]

The correlation here is very weak; we cannot detect any significance.
9. Collinearity
I will set a threshold of 80% to remove variables that are highly correlated with each other.
# Set the threshold
threshold = 0.8

# Empty dictionary to hold correlated variables
above_threshold_vars = {}

# For each column, record the variables that are above the threshold
for col in corr_train:
    above_threshold_vars[col] = list(corr_train.index[corr_train[col] > threshold])

# Track columns to remove and columns already examined
cols_to_remove = []
cols_seen = []
cols_to_remove_pair = []

# Iterate through columns and their correlated columns
for key, value in above_threshold_vars.items():
    # Keep track of columns already examined
    cols_seen.append(key)
    for x in value:
        if x == key:
            continue  # skip self-correlation
        else:
            # Only want to remove one column in each pair
            if x not in cols_seen:
                cols_to_remove.append(x)
                cols_to_remove_pair.append(key)

cols_to_remove = list(set(cols_to_remove))
print('Number of columns to remove: ', len(cols_to_remove))
The output is: Number of columns to remove: 134
輸出為:要?jiǎng)h除的列數(shù):134
Then we can remove those columns from the dataset as a preparation step for model building.
train_corrs_removed = train.drop(columns = cols_to_remove)
test_corrs_removed = test.drop(columns = cols_to_remove)

print('Training Corrs Removed Shape: ', train_corrs_removed.shape)
print('Testing Corrs Removed Shape: ', test_corrs_removed.shape)
# Training Corrs Removed Shape: (307511, 199)
# Testing Corrs Removed Shape: (48744, 198)
Summary
The purpose of this tutorial was to introduce you to several concepts that may seem confusing at first: reading and munging data, aggregating numeric and categorical features, and checking correlations, missing data, and collinearity.
Translated from: https://towardsdatascience.com/hands-on-guide-to-feature-engineering-de793efc785
python進(jìn)階指南
總結(jié)
以上是生活随笔為你收集整理的python进阶指南_Python特性工程动手指南的全部?jī)?nèi)容,希望文章能夠幫你解決所遇到的問(wèn)題。
- 上一篇: 苹果8有128g内存的吗
- 下一篇: python集群_使用Python集群文