Data Mining Workflow (Part 3): Feature Engineering

Published: 2025/4/5

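Data and features determine the upper bound of what machine learning can achieve; models and algorithms only approximate that bound.

Feature engineering is the process of using domain knowledge to create features that allow machine learning algorithms to perform at their best.

Feature engineering workflow. Not every step below is required; adapt the workflow to the business requirements and the format of the data.

- Data understanding (EDA)
- Feature cleaning
- Feature construction
- Feature selection
- Feature dimensionality reduction
- Class imbalance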

Feature Engineering

1. Data understanding (EDA)
1.1 Quick look at the data
1.2 Descriptive statistics
1.3 Normality tests
1.4 Plotting
2. Feature cleaning
2.1 Class imbalance
2.2 Missing values
2.3 Outliers
2.4 Data transformation
2.5 Binning
2.6 Multiple records per person
3. Feature construction (feature generation)
4. Feature selection
4.1 Filter: univariate feature selection
4.2 Wrapper
4.3 Embedded
5. Feature dimensionality reduction
5.1 PCA
5.2 LDA
6. Direct prediction

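Feature Engineering

1. Data understanding (EDA)

The most important outcome of this step is two lists: the names of the categorical variables and the names of the continuous variables. They help in two ways:

1) They make it easy to inspect the distributions. A categorical variable whose values are all 1, or 95% of which are 1, carries almost no information and can be dropped; for continuous variables, check whether the missing rate exceeds 50% and look at the range of values.

2) They simplify the later association tests: chi-square tests for categorical variables, t-tests or analysis of variance for continuous variables.

1.1 Quick look at the data

- head()
- shape
- unique(), nunique()
- Summary statistics: describe()
- Data types: info()
- A pandas_profiling report is not recommended: on large datasets the report may fail to generate, and the plots are large and slow to render.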
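As a rough illustration of the calls listed above, the sketch below runs them on a small made-up DataFrame and then derives the categorical and continuous name lists described at the start of section 1. The column names, the data and the `max_levels` cut-off are assumptions for illustration only; they are not taken from the original dataset.

```python
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({
    "gender": [0, 1, 1, 0, 1, 0],
    "smoker": [1, 1, 1, 1, 1, 1],          # constant column: carries no information
    "age": [34.0, 51.0, 47.0, None, 62.0, 29.0],
    "weight_kg": [61.5, 80.2, 74.0, 55.3, 90.1, 68.4],
})

print(df.head())        # first rows
print(df.shape)         # (rows, columns)
print(df.nunique())     # number of distinct non-missing values per column
print(df.describe())    # count, mean, std and quartiles of numeric columns
df.info()               # dtypes and non-null counts

# Assumed rule of thumb: a column with at most `max_levels` distinct
# values is treated as categorical, everything else as continuous.
max_levels = 2
discrete_list = [c for c in df.columns if df[c].nunique() <= max_levels]
continuous_list = [c for c in df.columns if c not in discrete_list]

# Constant columns can be dropped straight away.
constant_cols = [c for c in df.columns if df[c].nunique() == 1]
print(discrete_list, continuous_list, constant_cols)
```

On real data the cut-off between "categorical" and "continuous" usually needs a manual check, since integer-coded scores or IDs can be misclassified by a simple distinct-value count.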

1.2 Descriptive statistics

    import pandas as pd

    print('----------------Descriptive statistics for all variables----------------------')
    # For every numeric variable: mean, upper and lower quartile, missing rate.
    # df_model and project_path are assumed to be defined earlier.
    rows = []
    for col in df_model.select_dtypes(include='number').columns:
        stat_result = df_model[col].describe()
        miss_rate = 1 - stat_result['count'] / df_model.shape[0]
        rows.append({
            'feature': col,
            'mean': round(stat_result['mean'], 2),
            'upper_quartile': round(stat_result['75%'], 2),   # 75th percentile
            'lower_quartile': round(stat_result['25%'], 2),   # 25th percentile
            'missing_rate': "%.2f%%" % (miss_rate * 100),     # formatted as a percentage
        })
    df_stat = pd.DataFrame(rows)
    # to_excel writes the file directly (ExcelWriter.save() is not available in pandas 2.x)
    df_stat.to_excel(project_path + '/data/v2.0/df_全体变量数据统计.xlsx')

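1.3 Normality tests

Normality tests prepare for the correlation and significance analyses that follow. When the sample is very large, the data are often treated as approximately normal and the test is skipped (strictly speaking, it is the sample mean that becomes approximately normal, by the central limit theorem).

SPSS:
- P-P plot / Q-Q plot
- K-S and S-W tests
- Histogram: Analysis > Descriptive Statistics > Frequencies
Python: see the correlation analysis under feature selection.
Check the skewness and kurtosis of each feature.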
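To make the skewness and kurtosis check concrete, here is a minimal sketch that computes both from their definitions in plain Python. The two sample lists are made up. The formulas are the biased (population) moment estimators, which should correspond to the defaults of `scipy.stats.skew` and `scipy.stats.kurtosis`.

```python
from statistics import fmean

def skewness(values):
    """Population skewness: third central moment divided by sigma^3."""
    m = fmean(values)
    m2 = fmean([(v - m) ** 2 for v in values])
    m3 = fmean([(v - m) ** 3 for v in values])
    return m3 / m2 ** 1.5

def excess_kurtosis(values):
    """Population excess kurtosis: fourth central moment / sigma^4, minus 3."""
    m = fmean(values)
    m2 = fmean([(v - m) ** 2 for v in values])
    m4 = fmean([(v - m) ** 4 for v in values])
    return m4 / m2 ** 2 - 3.0

symmetric = [1, 2, 3, 4, 5]           # made-up, perfectly symmetric
right_skewed = [1, 1, 2, 2, 3, 20]    # made-up, one large value in the right tail

print(skewness(symmetric), excess_kurtosis(symmetric))
print(skewness(right_skewed), excess_kurtosis(right_skewed))
```

A skewness near 0 and an excess kurtosis near 0 are consistent with a normal distribution; a clearly positive skewness, as in the second list, suggests a long right tail that a log or Box-Cox transformation might reduce.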

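1.4 Plotting

Plot the raw data.
Plot simple summary statistics (mean plots, box plots, residual plots).
Plot the correlations between variables.

Commonly used charts:
- Violin plot. An enhanced box plot that also shows how densely the data are distributed around each value.
- Histogram. Shows the distribution of the data.
- Box plot. Shows outliers and makes it easy to compare groups.
- Time-series plot. Shows characteristics such as periodicity and the amplitude of fluctuations.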

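2. Feature cleaning

2.1 Class imbalance

When positive and negative samples are imbalanced, the minority class provides too little information and the model does not learn how to recognize it.

- Delete. A categorical variable whose values are all 1, or 95% of which are 1, carries almost no information and can be dropped.
- Resample.

Oversampling acts on the minority class, undersampling acts on the majority class, and combined sampling acts on both at the same time.

- Oversampling: SMOTE, ADASYN, TabGAN, CTGAN (GitHub)
- Undersampling: Cluster Centroids, Tomek links, Edited Nearest Neighbours, AllKNN, Condensed Nearest Neighbour, NearMiss-1, 2, 3
- Try other evaluation metrics, for example AUC.
- Adjust the decision threshold θ.
- Choose a different model, for example decision trees.

Example:

Original data: no sampling applied (1831 × 21); each record has 21 features. There are 176 positives (9.6122%) and 1655 negatives (90.3878%).

Undersampling: randomly select 176 negatives and merge them with the positives (352 × 21).

Oversampling: repeatedly draw from the positives until there are 1655 of them (duplicates are unavoidable) and merge them with the negatives (3310 × 21).

SMOTE: also an oversampling method. SMOTE finds the nearest neighbours of each positive and synthesizes 1655 - 176 = 1479 new positives, which are merged with the original data (3310 × 21).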
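The row counts in the example can be reproduced with a few lines of plain Python. This is only a sketch of random under- and oversampling: the tuples stand in for real records, and the seed is an arbitrary choice.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Stand-ins for the real rows: only the class sizes from the example matter.
positives = [("pos", i) for i in range(176)]
negatives = [("neg", i) for i in range(1655)]

# Random undersampling: draw 176 negatives without replacement.
under = positives + random.sample(negatives, k=len(positives))

# Random oversampling: draw positives with replacement until they match
# the 1655 negatives (duplicates are unavoidable).
over = random.choices(positives, k=len(negatives)) + negatives

print(len(under), len(over))          # 352 3310
print(len(set(over)) < len(over))     # True: the oversampled set contains duplicates
```

SMOTE differs from the oversampling step here in that it interpolates between neighbouring positives instead of copying them, so it needs numeric feature vectors rather than opaque records.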

Undersampling

    from imblearn.under_sampling import TomekLinks

    # The feature matrix must not contain the target column
    X_train = train_df.drop(['id', 'type', 'label'], axis=1)
    y = train_df['label']
    tl = TomekLinks()
    X_us, y_us = tl.fit_resample(X_train, y)
    print(y_us.value_counts())
    # Class counts reported for the author's data:
    # 0    36069
    # 1     2757

SMOTE

    from imblearn.over_sampling import SMOTE

    smote = SMOTE(k_neighbors=5, random_state=42)
    X_res, y_res = smote.fit_resample(X_train, y)
    print(y_res.value_counts())
    # Class counts reported for the author's data:
    # 0    37243
    # 1    37243

ADASYN

    from imblearn.over_sampling import ADASYN

    adasyn = ADASYN(n_neighbors=5, random_state=42)
    X_res, y_res = adasyn.fit_resample(X_train, y)
    print(y_res.value_counts())
    # Class counts reported for the author's data:
    # 0    37243
    # 1    36690

Combined sampling

    from imblearn.combine import SMOTETomek

    smote_tomek = SMOTETomek(random_state=0)
    X_res, y_res = smote_tomek.fit_resample(X_train, y)
    print(y_res.value_counts())
    # Class counts reported for the author's data:
    # 0    36260
    # 1    36260

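Results:

1) Oversampling (upper-right panel of the original figure) simply repeats the positives, so it over-emphasizes the positives that already exist. If some of those points are mislabelled or noisy, the errors are multiplied as well. The biggest risk is therefore overfitting to the positives.

2) Undersampling (lower-left panel) discards most of the negatives, which weakens the influence of the negatives in the middle region and can produce a heavily biased model. If the data are imbalanced but both classes are large, the effect may be small. Data are valuable and discarding them is wasteful, so a common alternative is to undersample repeatedly, producing 1655/176 ≈ 9 undersampled subsets.

3) SMOTE (lower-right panel) differs clearly from plain oversampling (upper-right panel): instead of repeating positives, it generates new positives in local regions from their k nearest neighbours.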

2.2 Missing values

- Delete. Drop variables whose missing rate exceeds 50%.
- Traditional methods: fill with the mean or the median.
- Machine learning: imputation with a random forest (RF) or XGBoost.

    # Impute missing values with a random forest
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import GridSearchCV

    def missing_value_interpolation(df, missing_list=None):
        df = df.reset_index(drop=True)
        # Default: every column that contains missing values
        if not missing_list:
            missing_list = [c for c in df.columns if df[c].isnull().any()]
        # For each column, train on the rows where it is present and predict
        # the rows where it is missing (needs at least two listed columns)
        for name in missing_list:
            is_missing = df[name].isnull()
            if not is_missing.any():
                continue
            # Predictors: the other listed columns, with their own gaps set to 0
            features = df[missing_list].drop(columns=[name]).fillna(0)
            x, y = features[~is_missing], df.loc[~is_missing, name]
            # Grid search over the number of trees
            grid = GridSearchCV(RandomForestRegressor(),
                                param_grid={'n_estimators': [10, 50, 100, 150, 200]},
                                cv=3)
            grid.fit(x, y)
            # Refit with the best setting and fill the gaps with its predictions
            rfr = RandomForestRegressor(n_estimators=grid.best_params_['n_estimators'])
            rfr.fit(x, y)
            df.loc[is_missing, name] = rfr.predict(features[is_missing])
        return df

- GAN: generate synthetic (fake) samples.
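The first two options in the list (dropping sparse variables and filling with the mean or median) can be sketched in a few lines of pandas. The DataFrame below is made up, and the 50% cut-off simply follows the rule stated above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names are illustrative only.
df = pd.DataFrame({
    "age": [34, 51, np.nan, 62, 29, 45],
    "crp": [np.nan, np.nan, np.nan, 8.1, np.nan, 3.2],   # 4 of 6 values missing
    "weight_kg": [61.5, np.nan, 74.0, 55.3, 90.1, 68.4],
})

# 1) Drop variables whose missing rate exceeds 50%.
missing_rate = df.isnull().mean()
kept = df.loc[:, missing_rate <= 0.5]

# 2) Fill the remaining gaps with each column's median
#    (use kept.mean() instead for mean imputation).
filled = kept.fillna(kept.median())

print(missing_rate.round(2))
print(filled)
```

The median is usually the safer default for skewed laboratory values, because a few extreme results can pull the mean far from the typical value.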

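2.3 Outliers

- Delete. Use box-plot analysis to identify and remove outliers.

    import pandas as pd

    # Filter abnormal values: drop entries more than 100 times the median
    def filter_exce_value(df, feature):
        # Keep only entries that contain a digit (this removes free-text entries)
        df = df[df[feature].astype(str).str.contains(r'\d')].copy()
        # Convert to numbers; anything that still cannot be parsed becomes NaN
        df[feature] = pd.to_numeric(df[feature], errors='coerce')
        # Drop unparsable entries and abnormally large values
        median_value = df[feature].median()
        keep = df[feature].notnull() & (df[feature].abs() < 100 * abs(median_value))
        return df[keep]

- Isolation forest
- Long-tail truncation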
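The box-plot rule and long-tail truncation mentioned in this section can be sketched with NumPy as follows. The values are made up, and the 1st/99th percentile caps are an assumed choice rather than a fixed rule.

```python
import numpy as np

# Made-up measurements with one obviously abnormal value.
values = np.array([4.1, 4.8, 5.0, 5.2, 5.5, 5.9, 6.3, 6.8, 7.0, 250.0])

# Box-plot (IQR) rule: points beyond 1.5 * IQR from the quartiles are outliers.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (values < lower) | (values > upper)
cleaned = values[~is_outlier]            # "delete" strategy

# Long-tail truncation: instead of deleting rows, cap the values at
# chosen percentiles (1st and 99th here, an assumed choice).
lo, hi = np.percentile(values, [1, 99])
truncated = np.clip(values, lo, hi)

print(cleaned)
print(truncated)
```

Deleting keeps the remaining values untouched but loses rows; truncation keeps every row at the cost of distorting the tails, which may matter when the extreme values are clinically meaningful.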

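2.4 Data transformation

Typically used when a continuous variable is not normally distributed.

Most important: if the dependent variable is transformed, remember to map the model's predictions back to the original scale.

Skewness correction (repairing skewed features): the Box-Cox transformation

The Box-Cox transformation transforms the dependent variable so that the transformed values are linearly related to the regressors and the errors are normally distributed, with equal variance and mutually independent components. Because it preserves the regression behaviour of a variable along the time dimension, it can also be used in time-series forecasting.

    from scipy.stats import boxcox

    # original_data must be strictly positive; when no lambda is given,
    # boxcox returns both the transformed data and the fitted lambda
    boxcox_transformed_data, fitted_lambda = boxcox(original_data)

In some cases (small P values in the normality test) simpler transformations fail to normalize the data, so the Box-Cox transformation is preferred. When the P value is > 0.003 either approach works, and a plain square transformation is tried first.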
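The reminder about restoring predictions can be made concrete with the Box-Cox formula itself. The sketch below implements the transform and its inverse in plain Python for a fixed, assumed `lam = 0.5`; in practice the lambda returned by `scipy.stats.boxcox` would be used, and `scipy.special.inv_boxcox` provides the inverse.

```python
import math

def box_cox(y, lam):
    """Box-Cox transform of a strictly positive value y for a given lambda."""
    if y <= 0:
        raise ValueError("Box-Cox requires strictly positive data")
    return math.log(y) if lam == 0 else (y ** lam - 1.0) / lam

def inverse_box_cox(z, lam):
    """Undo box_cox: map a transformed value back to the original scale."""
    return math.exp(z) if lam == 0 else (lam * z + 1.0) ** (1.0 / lam)

lam = 0.5                       # assumed value; normally estimated from the data
targets = [1.0, 4.0, 9.0, 100.0]
transformed = [box_cox(y, lam) for y in targets]
restored = [inverse_box_cox(z, lam) for z in transformed]

print(transformed)   # values on the transformed scale
print(restored)      # back on the original scale
```

A model trained on `transformed` produces predictions on the transformed scale, so `inverse_box_cox` has to be applied to those predictions before they are reported or scored against the original target.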

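Other transformations for non-normal data
- Log transformation
- Square-root transformation
- Reciprocal transformation
- Reciprocal of the square root, arccosine of the square root, power transformation

    import numpy as np

    # Log-transform positive values, keep NaN as NaN, map values <= 0 to 0
    for col in continuous_list:
        df_final_10_1[col] = df_final_10_1[col].apply(
            lambda x: np.log(x) if x > 0 else np.nan if x != x else 0)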

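- Centering. Shift the data so that it is centred on 0 by subtracting the mean of the dataset.
- Standardization (Z-score): (x - mean) / std
- Normalization (min-max): (x - min) / (max - min). Empirically, normalization puts features of different dimensions on a comparable numeric scale, which can noticeably improve the accuracy of classifiers that are sensitive to feature scale.

    import pandas as pd
    from sklearn.preprocessing import MinMaxScaler

    minmax = MinMaxScaler()
    num_data_minmax = minmax.fit_transform(num_data)
    num_data_minmax = pd.DataFrame(num_data_minmax, columns=num_data.columns, index=num_data.index)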
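The three formulas (centering, Z-score, min-max) are short enough to write out by hand, which makes it easy to see what each one does to the same made-up list of values. This sketch uses the population standard deviation, as scikit-learn's `StandardScaler` does; pandas' `.std()` defaults to the sample version, so results can differ slightly.

```python
from statistics import fmean, pstdev

x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # made-up values

mean = fmean(x)          # 5.0
std = pstdev(x)          # population standard deviation, 2.0

centered = [v - mean for v in x]                          # mean becomes 0
z_scores = [(v - mean) / std for v in x]                  # mean 0, std 1
min_max = [(v - min(x)) / (max(x) - min(x)) for v in x]   # range [0, 1]

print(centered)
print(z_scores)
print(min_max)
```

Whichever scaling is chosen, the mean, standard deviation, minimum and maximum should be computed on the training set only and then reused for the test set, so that no information leaks from the test data.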

- Type conversion (astype)
- One-hot encoding (One-Hot Encoder)

    import pandas as pd

    # Some categorical features need one-hot encoding
    hot_features = ['bodyType', 'fuelType', 'gearbox', 'notRepairedDamage']
    cat_data_hot = pd.get_dummies(cat_data, columns=hot_features)

- Label encoding (Label Encoder)
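Label encoding has no example above, so here is a minimal plain-Python sketch of the idea. The category values are made up. Sorting the distinct labels before numbering them mirrors how scikit-learn's `LabelEncoder` orders its `classes_`.

```python
# Made-up categorical column.
fuel_type = ["petrol", "diesel", "electric", "diesel", "petrol"]

# Sort the distinct labels so the mapping is deterministic, then number them.
classes = sorted(set(fuel_type))
to_code = {label: code for code, label in enumerate(classes)}

encoded = [to_code[label] for label in fuel_type]
decoded = [classes[code] for code in encoded]   # inverse mapping

print(to_code)
print(encoded)
```

The integer codes imply an order (diesel < electric < petrol) that does not exist in the data, so label encoding tends to suit tree-based models, while linear models usually work better with one-hot encoding.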

2.5 Binning

Of limited use in medical data mining.

- Equal-frequency binning
- Equal-width binning
- Best-KS binning
- Chi-square binning
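The difference between equal-width and equal-frequency binning shows up clearly on skewed data. The sketch below uses `pandas.cut` and `pandas.qcut` on a made-up series with one large value; the choice of three bins is arbitrary.

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100])   # made-up, one extreme value

# Equal-width: split the value range [1, 100] into 3 intervals of equal length.
equal_width = pd.cut(values, bins=3, labels=False)

# Equal-frequency: choose cut points so each bin holds about the same number of rows.
equal_freq = pd.qcut(values, q=3, labels=False)

print(equal_width.value_counts().sort_index().to_dict())   # {0: 8, 2: 1}
print(equal_freq.value_counts().sort_index().to_dict())    # {0: 3, 1: 3, 2: 3}
```

With equal-width bins the single extreme value pushes almost everything into the first bin, whereas equal-frequency bins stay balanced. Best-KS and chi-square binning go further by choosing cut points with respect to the target variable.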

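2.6 Multiple records per person

3. Feature construction (feature generation)

Feature construction relies on domain knowledge (in medicine, for example, BMI or the creatinine conversion rate). The general principle is to be imaginative and create as many features as possible, without judging in advance which ones will turn out to be useful; breadth comes first.

Medical data mining usually does not need to consider numeric, categorical and time features.

- Numeric features
- Categorical features
- Time features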
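BMI, mentioned above as an example of a knowledge-driven feature, is easy to construct from height and weight. In this sketch the records and column names are made up; the threshold of 30 is the commonly used obesity cut-off.

```python
import pandas as pd

# Hypothetical patient records; column names are illustrative only.
df = pd.DataFrame({
    "height_cm": [170.0, 160.0, 180.0],
    "weight_kg": [65.0, 80.0, 72.0],
})

# BMI = weight (kg) / height (m)^2: a feature that comes from domain
# knowledge rather than from any single raw column.
df["bmi"] = df["weight_kg"] / (df["height_cm"] / 100) ** 2

# A derived categorical feature built on top of the constructed one.
df["obese"] = (df["bmi"] >= 30).astype(int)

print(df)
```

Constructed features like these are candidates only; whether they are kept is decided later, in the feature selection step.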

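4. Feature selection

Filter methods mainly correspond to univariate feature selection; wrapper methods mainly correspond to selecting several features together.

Why select features: for a given learning algorithm it is not known in advance which features are useful, so the relevant ones have to be selected from all available features. In practice the curse of dimensionality is also common. Building the model on a subset of the features can greatly reduce training time and makes the model easier to interpret.

Principles: obtain a feature subset that is as small as possible without significantly reducing classification accuracy or changing the class distribution; the subset should also be stable and adaptable.

4.1 Filter: univariate feature selection

A filter scores each feature by its dispersion or by its correlation with the target, and then selects features using a threshold or a preset number of features to keep.

Advantage: fast; a very popular approach to feature selection.

Disadvantage: there is no feedback. The selection criterion is fixed inside the feature search, so the learning algorithm cannot tell the search which features it needs. In addition, a feature may be judged unimportant on its own even though it becomes important in combination with other features.

Correlation tests. Compute the correlation coefficient between each feature and the output, set a threshold, and keep the features whose absolute coefficient exceeds it.

https://note.youdao.com/s/9HR1GEQG

Significance tests
- t-test
- Chi-square test
- Analysis of variance
- Non-parametric tests

https://note.youdao.com/s/aTVlqmDy

Mutual information
Relief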
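Mutual information is listed above without an example. For two discrete variables it can be computed directly from the joint and marginal frequencies, as in this plain-Python sketch. The three sequences are made up to show the two extreme cases.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )

label       = [0, 0, 0, 0, 1, 1, 1, 1]
informative = [0, 0, 0, 0, 1, 1, 1, 1]   # identical to the label
independent = [0, 1, 0, 1, 0, 1, 0, 1]   # unrelated to the label

scores = {
    "informative": mutual_information(informative, label),
    "independent": mutual_information(independent, label),
}
print(scores)
```

A feature that determines the label completely reaches the label's entropy (ln 2 for a balanced binary label), while an independent feature scores 0. As a filter, features are ranked by this score and the top ones are kept. For continuous features an estimator such as scikit-learn's `mutual_info_classif` is the usual choice.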

Independent-samples t-test and Mann-Whitney U test

    import pandas as pd
    import scipy.stats as st
    from scipy.stats import kstest, shapiro

    discrete_list = ['gender']
    continuous_list = [x for x in df_model.columns if x not in discrete_list]

    # Significance tests between the high-dose and low-dose rivaroxaban groups

    def norm_test(data):
        """Return True if normality is not rejected at the 0.05 level."""
        if len(data) > 30:
            # Compare with a normal distribution that has the sample's mean and std
            # (kstest(data, 'norm') alone would test against N(0, 1))
            _, p = kstest(data, 'norm', args=(data.mean(), data.std()))
        else:
            _, p = shapiro(data)
        return p >= 0.05

    def test2(data_b, data_p):
        """Use the t-test if both groups look normal, otherwise Mann-Whitney U."""
        if norm_test(data_b) and norm_test(data_p):
            method = 'independent-samples t-test'
            stat, p = st.ttest_ind(list(data_b), list(data_p), nan_policy='omit')
        else:
            method = 'Mann-Whitney U test'
            stat, p = st.mannwhitneyu(list(data_b), list(data_p))
        return method, stat, p

    def sig_test(df_high, df_low, fields):
        rows = []
        for field in fields:
            data_high = df_high[field].dropna()
            data_low = df_low[field].dropna()
            # Only test when both groups have at least 10 observations
            if data_high.shape[0] >= 10 and data_low.shape[0] >= 10:
                method, stat, p = test2(data_high, data_low)
                rows.append({
                    'feature': field,
                    'high_dose_mean': round(data_high.mean(), 2),
                    'low_dose_mean': round(data_low.mean(), 2),
                    'test': method,
                    'statistic': stat,
                    'p_value': p,
                    'result': 'significant' if p <= 0.05 else 'not significant',
                })
        return pd.DataFrame(rows)

    # Length of hospital stay (column '住院时长'): high-dose vs low-dose group
    df_inp_time = sig_test(df_lfsb_high, df_lfsb_low, ['住院时长'])
    df_inp_time.to_excel(project_path + r'/data/result/df_高低剂量组住院时长显著性检验.xlsx')

Chi-square test

    import numpy as np
    from scipy.stats import chi2_contingency

    print('----------------------Chi-square test-------------------------')
    # Contingency table: one row per group (0 and 1),
    # one column per level of 'MPA类药物' (MPA-class drug)
    tran_test['MPA类药物'] = tran_test['MPA类药物'].astype('str')
    levels = np.unique(tran_test['MPA类药物'])
    r1 = [tran_test[(tran_test['group'] == 0) & (tran_test['MPA类药物'] == lv)].shape[0] for lv in levels]
    r2 = [tran_test[(tran_test['group'] == 1) & (tran_test['MPA类药物'] == lv)].shape[0] for lv in levels]
    abcd = np.array([r1, r2])
    print(abcd)
    result = chi2_contingency(abcd)   # statistic, p-value, degrees of freedom, expected counts
    print(result)

    # The grouping column is not a model feature
    tran_x_1 = tran_x_1.drop(['group'], axis=1)
    test_x_1 = test_x_1.drop(['group'], axis=1)
    print(tran_x_1.columns)

Pearson correlation (pearsonr)

    from scipy import stats

    r, p = stats.pearsonr(x, y)   # correlation coefficient and two-sided p-value

Spearman correlation (spearmanr)

    import pandas as pd
    from scipy import stats

    # Continuous variables: Spearman rank correlation with the test result
    print('---------Spearman correlation for continuous variables---------')
    rows = []
    for i in continuous_list:
        # Strip '<' and '>' signs (for example '<0.5') and convert to numbers
        cleaned = tdm_7_other_filter[i].astype('str').str.replace(r'[<>]', '', regex=True)
        values = pd.to_numeric(cleaned, errors='coerce')
        mask = values.notnull()
        r, p = stats.spearmanr(values[mask], tdm_7_other_filter.loc[mask, 'test_result'])
        rows.append({'variable': i, 'r': round(r, 2), 'p_value': round(p, 3), 'method': 'Spearman'})

    # Sort by p-value so that significant results (p <= 0.05) come first
    df_spearmanr = pd.DataFrame(rows).sort_values(by=['p_value'], ascending=True).reset_index(drop=True)
    df_spearmanr.to_excel(project_path + '/result/df_12_其他检测指标连续变量的spearmanr相关性检测.xlsx')

4.2 Wrapper

The objective function (usually a prediction score) serves as the evaluation function; each round selects some features and excludes others.

Main method: recursive feature elimination.

Advantages: the feature search is built around the learning algorithm, and the selection criterion follows the algorithm's needs. Because the learner is run on every candidate subset, its inductive bias is taken into account, which helps find the best feature subset for the learning problem itself.

Disadvantages: much slower than filter methods; in practice wrappers are less popular than filters.

Forward stepwise selection

    import os
    import numpy as np
    import pandas as pd
    import xgboost as xgb
    from mlxtend.feature_selection import SequentialFeatureSelector as SFS
    from sklearn.model_selection import train_test_split

    def mkdir(path):
        """Create the directory (and any missing parents) if it does not exist."""
        if not os.path.exists(path):
            os.makedirs(path)

    df = pd.read_excel(project_path + '/data/v2.0/建模用数据集（未插补）20210525-3.xlsx')
    if 'Unnamed: 0' in df.columns:
        df = df.drop(['Unnamed: 0'], axis=1)

    # Continuous variables (column names as they appear in the dataset)
    continuous_list = ['年龄', '身高(cm)', '体重(kg)', 'BMI', '他克莫司频次', '他克莫司单次剂量', '他克莫司日剂量',
                       'C反应蛋白_检测结果', '丙氨酸氨基转移酶_检测结果', '中性粒细胞总数_检测结果', '低密度脂蛋白胆固醇_检测结果',
                       '凝血酶原时间比率_检测结果', '天门冬氨酸氨基转移酶_检测结果', '尿素_检测结果', '尿酸_检测结果',
                       '平均RBC血红蛋白浓度_检测结果', '平均红细胞体积_检测结果', '平均红细胞血红蛋白量_检测结果', '平均血小板容积_检测结果',
                       '总胆固醇_检测结果', '总胆红素_检测结果', '总蛋白_检测结果', '极低密度脂蛋白胆固醇_检测结果', '活化部分凝血活酶时间_检测结果',
                       '淋巴细胞总数_检测结果', '球蛋白_检测结果', '甘油三酯_检测结果', '白/球比值_检测结果', '白细胞计数_检测结果',
                       '白蛋白_检测结果', '直接胆红素_检测结果', '红细胞比积测定_检测结果', '肌酐_检测结果', '葡萄糖_检测结果',
                       '血小板计数_检测结果', '血浆D-二聚体测定_检测结果', '血红蛋白测定_检测结果', '转氨酶比值_检测结果', '间接胆红素_检测结果',
                       '非高密度脂蛋白胆固醇_检测结果', '高密度脂蛋白胆固醇_检测结果', '乳酸脱氢酶_检测结果', '心型肌酸激酶_检测结果',
                       '肌酸激酶_检测结果', '尿白细胞(仪器定量)_检测结果', '尿红细胞(仪器定量)_检测结果', 'TDM检测结果']

    # Log-transform the continuous variables: NaN stays NaN, values <= 0 become 0
    df_final_10_1 = df.copy()
    for col in continuous_list:
        df_final_10_1[col] = df_final_10_1[col].apply(
            lambda x: np.log(x) if x > 0 else np.nan if x != x else 0)

    def model_xy(model):
        """Features are columns 2 to second-to-last; the target is the TDM result."""
        x = model[model.columns[2:-1]]
        y = model['TDM检测结果']
        return x, y

    # A candidate feature subset (defined here but not used below)
    col = ['身高(cm)', '他克莫司日剂量', '其他免疫抑制剂', '低密度脂蛋白胆固醇_检测结果', '平均红细胞体积_检测结果',
           '平均红细胞血红蛋白量_检测结果', '白细胞计数_检测结果', '直接胆红素_检测结果', '红细胞比积测定_检测结果']

    x4, y4 = model_xy(df_final_10_1.copy())
    all_all_results = []
    for j in range(1, 52):   # run forward selection for k_features = 1 .. 51
        train_x, test_x, train_y, test_y = train_test_split(x4, y4, test_size=0.2, random_state=78)
        # Jinyuan's XGBoost model (eta is an alias of learning_rate, so only learning_rate is set)
        sfs = SFS(xgb.XGBRegressor(max_depth=5, learning_rate=0.01, n_estimators=500,
                                   min_child_weight=0.5, gamma=0.5, reg_lambda=10,
                                   subsample=0.5, colsample_bytree=0.8, n_jobs=4,
                                   scale_pos_weight=1),
                  k_features=j, forward=True, floating=False, verbose=2, scoring='r2', cv=3)
        sfs = sfs.fit(train_x, train_y)
        # subsets_ maps each subset size to its feature names and cross-validated R^2
        sfs_result = sfs.subsets_
        print(sfs_result)
        df_sfs_T = pd.DataFrame(sfs_result).T.reset_index(drop=True)
        # Keep the subset with the highest average R^2
        r2_list = list(df_sfs_T['avg_score'])
        r2_max = max(r2_list)
        print(r2_max)
        r2_max_index = r2_list.index(r2_max)
        all_all_results.append(df_sfs_T.iloc[r2_max_index:r2_max_index + 1, :])

    # Collect the best subset of every run and save the results locally
    df_feature_select = pd.concat(all_all_results, axis=0).reset_index(drop=True)
    df_feature_select.to_excel(project_path + '/data/v2.0/df_逐步向前特征测试结果.xlsx')

Backward stepwise selection
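Backward stepwise selection starts from all features and removes one at a time. The sketch below uses scikit-learn's `SequentialFeatureSelector` on synthetic data; the data, the linear model and the target of two features are all assumptions for illustration. With the `mlxtend` selector used for forward selection, the corresponding switch would be `forward=False`.

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Synthetic data (an assumption for illustration): only columns 0 and 2
# influence the target, the other three columns are pure noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.1, size=200)

# Start from all 5 features and repeatedly drop the one whose removal
# hurts the cross-validated R^2 the least, until 2 remain.
selector = SequentialFeatureSelector(
    LinearRegression(),
    n_features_to_select=2,
    direction="backward",
    scoring="r2",
    cv=3,
)
selector.fit(X, y)

selected = np.flatnonzero(selector.get_support()).tolist()
print(selected)   # indices of the features that were kept
```

Backward selection can capture features that are only useful together, but it has to fit the model with nearly all features at the start, so it tends to be slower than forward selection when there are many features.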

4.3 Embedded

Train a model, obtain a weight coefficient for each feature, and select features in order of decreasing coefficient.

After selection, the model can be retrained and tuned again using the selected features and the hyperparameters found during training.

Main idea: with the model fixed, learn which features contribute most to its accuracy; in other words, the features that matter for training are picked out while the model itself is being fitted.

    import pandas as pd
    import xgboost

    # Feature importance scores
    model_boost = xgboost.XGBRegressor()
    model_boost.fit(tran_x, tran_y)
    importance = model_boost.feature_importances_
    print(tran_x.columns)
    print(importance)

    # Rank the features by importance and save the table
    df_importance = pd.DataFrame(data={'feature': tran_x.columns, 'importance': importance})
    df_importance['importance'] = df_importance['importance'].round(3)
    df_importance = df_importance.sort_values(['importance'], ascending=False).reset_index(drop=True)
    df_importance.to_excel(project_path + '/result/df_19_模型重要性评分.xlsx')

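L1 regularization / Lasso regression

L1 regularization adds the l1 norm of the coefficient vector w to the loss function as a penalty. Lasso picks out a few strong features and shrinks the coefficients of the others to 0. It is useful when the number of features has to be reduced, but less helpful for understanding the data.

L2 regularization / Ridge regression

For feature selection, L2 regularization gives a more stable model: unlike L1, its coefficients do not fluctuate with small changes in the data. L2 and L1 therefore offer different value; L2 is more useful for understanding features, because features with strong explanatory power receive non-zero coefficients.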
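The contrast between the two penalties is easiest to see on data where the true relationship is known. This is a minimal sketch with scikit-learn; the synthetic data and the `alpha` values are assumptions chosen for illustration, not tuned settings.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data (an assumption for illustration): columns 0 and 2 drive the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.5).fit(X, y)   # alpha values are arbitrary choices
ridge = Ridge(alpha=1.0).fit(X, y)

print("lasso:", np.round(lasso.coef_, 3))   # noise columns are exactly 0
print("ridge:", np.round(ridge.coef_, 3))   # every column keeps a (small) weight

# Features with a non-zero Lasso coefficient form the selected subset.
lasso_selected = np.flatnonzero(lasso.coef_).tolist()
print(lasso_selected)
```

Lasso acts as a selector because it sets the noise coefficients to exactly zero, at the cost of shrinking the true coefficients as well. Ridge keeps every feature, so it needs a separate threshold if it is used for selection.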

5. Feature dimensionality reduction

5.1 PCA

5.2 LDA
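Neither method is described further here, so the following is only a minimal sketch of how the two are typically applied with scikit-learn, using the bundled iris dataset as a stand-in for real data.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)          # 150 samples, 4 features, 3 classes

# PCA is unsupervised: it keeps the directions of largest variance.
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

# LDA is supervised: it keeps the directions that best separate the classes
# and can produce at most (number of classes - 1) components.
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)

print(X_pca.shape, X_lda.shape)
print(pca.explained_variance_ratio_.sum())   # share of variance kept by 2 components
```

PCA is sensitive to feature scale, so features are normally standardized first when they are measured in different units. LDA requires class labels and is therefore only an option for classification problems.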

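6. Direct prediction

Boosting family: used for gene analysis and mining.

SVM family (with a modified kernel function): suited to data containing missing values and outliers.

Deep learning: aml-net, TabNet, TabTransformer.

- Imbalanced data: work on the loss, for example focal loss, or a teacher-student network (several networks learning from each other).
- Imbalanced data with few samples: the GAN family, or contrastive learning (learn a representation vector for the features, then predict in a downstream task).
- Few samples: traditional machine learning is recommended, with SVM first (it is stable), followed by regularized linear models (L1 regularization is Lasso regression, L2 regularization is Ridge regression). Regularization targets generalization: the model is discouraged from learning overly strong rules and concentrates on the data distribution and the underlying logic, which also has some dimensionality-reduction effect.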
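Focal loss, mentioned above for imbalanced data, rescales cross-entropy so that well-classified samples contribute less. The sketch below writes the binary form for a single sample in plain Python; the defaults `gamma=2.0` and `alpha=0.25` are commonly used values and are assumptions here, as are the two example probabilities.

```python
import math

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss for one sample.

    p: predicted probability of the positive class, y: true label (0 or 1).
    """
    p_t = p if y == 1 else 1.0 - p              # probability given to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class-balancing weight
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

def cross_entropy(p, y):
    """Plain binary cross-entropy for one sample."""
    p_t = p if y == 1 else 1.0 - p
    return -math.log(p_t)

easy, hard = 0.95, 0.10   # predicted probabilities for two positive samples
print(focal_loss(easy, 1), cross_entropy(easy, 1))
print(focal_loss(hard, 1), cross_entropy(hard, 1))
```

The factor `(1 - p_t) ** gamma` is close to 0 for the easy sample and close to 1 for the hard one, so training concentrates on the hard, often minority-class, samples. Setting `gamma=0` and `alpha=1` recovers ordinary cross-entropy.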
