Feature Engineering 4. Feature Selection

Published: 2024/7/5

Contents

- 1. Univariate Feature Selection
- 2. L1 Regularization

Source: https://www.kaggle.com/learn/feature-engineering

Previous post: Feature Engineering 3. Feature Generation

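After the various encodings and feature generation steps, you often end up with hundreds or thousands of features. This can cause two problems:

First, the more features you have, the more likely the model is to overfit.
Second, the more features you have, the longer it takes to train the model and tune its hyperparameters. Using fewer features speeds up prediction, but it can cost some predictive accuracy.

To address these problems, use feature selection techniques to keep only the most informative features for the model.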

1. Univariate Feature Selection

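The simplest and fastest approach is based on univariate statistical tests.

For each feature on its own, measure how strongly the label depends on it.
In the scikit-learn feature selection module, feature_selection.SelectKBest returns the K best features.
For classification problems, the module provides three scoring functions: $\chi^2$, the ANOVA F-value, and the mutual information score.
The F-value measures the linear dependency between a feature and the target. This means that if the relationship is nonlinear, the score may underestimate how strongly the feature and the target are related.
The mutual information score is nonparametric and can capture nonlinear relationships.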
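The difference between the two scores can be seen on a small synthetic example. This is only a sketch under stated assumptions: the data below is made up for illustration (it is not the dataset used in this post), and the names `x_symmetric` and `x_noise` are hypothetical. The label depends on `x_symmetric` through its square, so the class means of that feature are roughly equal and the F-value has little to pick up, while mutual information should still flag it.

```python
import numpy as np
from sklearn.feature_selection import f_classif, mutual_info_classif

rng = np.random.default_rng(0)
n = 2000

# Synthetic data (an assumption for illustration, not the data from this post)
x_symmetric = rng.normal(size=n)  # the label depends on this feature, but not linearly
x_noise = rng.normal(size=n)      # unrelated to the label
y = (x_symmetric ** 2 > 1).astype(int)

X = np.column_stack([x_symmetric, x_noise])

# f_classif returns one F-value and one p-value per feature
f_scores, p_values = f_classif(X, y)
# mutual_info_classif returns one estimated mutual information value per feature
mi_scores = mutual_info_classif(X, y, random_state=0)

for name, f, mi in zip(["x_symmetric", "x_noise"], f_scores, mi_scores):
    print(f"{name:12s} F-value = {f:8.2f}   mutual information = {mi:.3f}")
```

Either function can be passed to SelectKBest as the scoring function; mutual information is usually noticeably slower to compute than the F-value.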

    from sklearn.feature_selection import SelectKBest, f_classif

    # baseline_data is not defined in this post; it is presumably the
    # DataFrame built in the earlier parts of this series
    feature_cols = baseline_data.columns.drop('outcome')

    # Keep the 5 best features
    selector = SelectKBest(f_classif, k=5)  # scoring function, number of features to keep
    X_new = selector.fit_transform(baseline_data[feature_cols],
                                   baseline_data['outcome'])  # features, labels
    X_new

Output:

    array([[2015.,    5.,    9.,   18., 1409.],
           [2017.,   13.,   22.,   31.,  957.],
           [2013.,   13.,   22.,   31.,  739.],
           ...,
           [2010.,   13.,   22.,   31.,  238.],
           [2016.,   13.,   22.,   31., 1100.],
           [2011.,   13.,   22.,   31.,  542.]])

However, the code above makes a serious mistake: the selector is fit on all of the data, which causes data leakage.
We should fit the selector, and therefore choose the features, using the training set only.

    feature_cols = baseline_data.columns.drop('outcome')
    # get_data_splits is not defined in this post; it returns the train,
    # validation and test splits
    train, valid, _ = get_data_splits(baseline_data)

    # Keep 5 features
    selector = SelectKBest(f_classif, k=5)
    # The difference: fit on the training set only
    X_new = selector.fit_transform(train[feature_cols], train['outcome'])
    X_new

Output:

    array([[2.015e+03, 5.000e+00, 9.000e+00, 1.800e+01, 1.409e+03],
           [2.017e+03, 1.300e+01, 2.200e+01, 3.100e+01, 9.570e+02],
           [2.013e+03, 1.300e+01, 2.200e+01, 3.100e+01, 7.390e+02],
           ...,
           [2.011e+03, 1.300e+01, 2.200e+01, 3.100e+01, 5.150e+02],
           [2.015e+03, 1.000e+00, 3.000e+00, 2.000e+00, 1.306e+03],
           [2.013e+03, 1.300e+01, 2.200e+01, 3.100e+01, 1.084e+03]])

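Fitting on different data can lead to different features being selected, although in the two outputs shown here the leading rows are identical and only the set of rows differs.
Next, we need to map the selected values back onto the original columns and drop the features that were not selected.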

    import pandas as pd

    # Get back the features we've kept, zero out all other features
    selected_features = pd.DataFrame(selector.inverse_transform(X_new),
                                     index=train.index,
                                     columns=feature_cols)
    selected_features.head()

The result has 13 columns (goal, hour, day, month, year, category, currency, country, category_currency, category_country, currency_country, count_7_days, time_since_last_project). The eight unselected columns are 0.0 in every row; the five non-zero columns are:

         year  currency  country  currency_country  count_7_days
    0  2015.0       5.0      9.0              18.0        1409.0
    1  2017.0      13.0     22.0              31.0         957.0
    2  2013.0      13.0     22.0              31.0         739.0
    3  2012.0      13.0     22.0              31.0         907.0
    4  2015.0      13.0     22.0              31.0        1429.0

After the inverse transform, every unselected feature is 0.0 in all rows, so those columns need to be dropped.

    # Dropped columns are all 0s, so their variance is 0;
    # keep only the columns with non-zero variance
    selected_columns = selected_features.columns[selected_features.var() != 0]

    # Get the validation set with the selected features
    valid[selected_columns].head()

Output:

            year  currency  country  currency_country  count_7_days
    302896  2015        13       22                31        1534.0
    302897  2013        13       22                31         625.0
    302898  2014         5        9                18         851.0
    302899  2014        13       22                31        1973.0
    302900  2014         5        9                18        2163.0
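A more direct way to get the selected column names is the selector's `get_support()` method, which returns a boolean mask over the input features. This avoids relying on the variance of the zero-filled columns. The sketch below uses a small synthetic DataFrame with hypothetical column names (`signal_a`, `noise_a`, and so on), not the data from this post, and checks that both approaches agree.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
n = 200

# Synthetic data (an assumption for illustration): two features track the label, two do not
y = rng.integers(0, 2, size=n)
X = pd.DataFrame({
    "signal_a": y + rng.normal(scale=0.5, size=n),
    "noise_a": rng.normal(size=n),
    "signal_b": -y + rng.normal(scale=0.5, size=n),
    "noise_b": rng.normal(size=n),
})

selector = SelectKBest(f_classif, k=2).fit(X, y)

# Boolean mask with one entry per input column: True means "kept"
mask = selector.get_support()
selected_columns = X.columns[mask]

# The variance-based approach from the text, for comparison
zero_filled = pd.DataFrame(selector.inverse_transform(selector.transform(X)),
                           index=X.index, columns=X.columns)
via_variance = zero_filled.columns[zero_filled.var() != 0]

print(list(selected_columns))
print(list(via_variance))
```

The same mask can be used to subset the validation set, for example `valid[feature_cols].loc[:, mask]` with the names used in this post.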

2. L1 Regularization

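Univariate methods consider only one feature at a time when deciding what to keep.

Instead, we can select features using all of them at once, by including every feature in a linear model with L1 regularization.

Compared with L2 (Ridge) regression, which penalizes the square of the coefficients, this type of regularization (sometimes called Lasso) penalizes their absolute magnitude.

As the strength of the L1 regularization increases, the coefficients of features that matter less for predicting the target are set to 0.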
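Written out, the two penalties differ only in the term added to the loss. The notation here is an assumption introduced for illustration: $\mathcal{L}(w)$ is the unpenalized loss, $w_j$ are the model coefficients, and $\lambda \ge 0$ is the regularization strength.

```latex
\begin{aligned}
\text{L2 (Ridge):}\quad & \min_{w}\; \mathcal{L}(w) + \lambda \sum_{j} w_j^{2} \\
\text{L1 (Lasso):}\quad & \min_{w}\; \mathcal{L}(w) + \lambda \sum_{j} \lvert w_j \rvert
\end{aligned}
```

Note that scikit-learn's LogisticRegression is parameterized by C, the inverse of the regularization strength, so a smaller C means a stronger penalty.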

For regression problems, you can use sklearn.linear_model.Lasso.
For classification problems, you can use sklearn.linear_model.LogisticRegression.
Both can be combined with sklearn.feature_selection.SelectFromModel to select the features with non-zero coefficients.
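The effect of the regularization strength can be checked by counting non-zero coefficients for a few values of C. This is a minimal sketch on synthetic data from `make_classification` (an assumption for illustration, not the dataset used in this post); the specific C values are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for illustration): 20 features, 4 of them informative
X, y = make_classification(n_samples=500, n_features=20, n_informative=4,
                           random_state=0)
X = StandardScaler().fit_transform(X)

nonzero_counts = {}
for C in [10.0, 0.1, 0.01, 0.001]:
    # Smaller C means stronger L1 regularization; liblinear supports the L1 penalty
    model = LogisticRegression(penalty="l1", solver="liblinear", C=C,
                               random_state=0).fit(X, y)
    nonzero_counts[C] = int(np.count_nonzero(model.coef_))
    print(f"C = {C:<6} non-zero coefficients: {nonzero_counts[C]}")
```

In practice C is a hyperparameter: it is usually chosen by comparing validation scores rather than by targeting a particular number of features.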

    from sklearn.linear_model import LogisticRegression
    from sklearn.feature_selection import SelectFromModel

    train, valid, _ = get_data_splits(baseline_data)
    X, y = train[train.columns.drop("outcome")], train['outcome']

    # Set the regularization parameter C=1.
    # solver="liblinear" is added here because scikit-learn's default
    # lbfgs solver does not support the L1 penalty.
    logistic = LogisticRegression(C=1, penalty="l1", solver="liblinear",
                                  random_state=7).fit(X, y)
    model = SelectFromModel(logistic, prefit=True)
    X_new = model.transform(X)
    X_new

Output:

    array([[1.000e+03, 1.200e+01, 1.100e+01, ..., 1.900e+03, 1.800e+01, 1.409e+03],
           [3.000e+04, 4.000e+00, 2.000e+00, ..., 1.630e+03, 3.100e+01, 9.570e+02],
           [4.500e+04, 0.000e+00, 1.200e+01, ..., 1.630e+03, 3.100e+01, 7.390e+02],
           ...,
           [2.500e+03, 0.000e+00, 3.000e+00, ..., 1.830e+03, 3.100e+01, 5.150e+02],
           [2.600e+03, 2.100e+01, 2.300e+01, ..., 1.036e+03, 2.000e+00, 1.306e+03],
           [2.000e+04, 1.600e+01, 4.000e+00, ..., 9.200e+02, 3.100e+01, 1.084e+03]])

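As with the univariate test, this returns an array containing only the selected features. We convert it back to a DataFrame so that we can recover the names of the selected columns.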

    # Get back the kept features as a DataFrame with dropped columns as all 0s
    selected_features = pd.DataFrame(model.inverse_transform(X_new),
                                     index=X.index,
                                     columns=X.columns)

    # Dropped columns have values of all 0s, keep other columns
    selected_columns = selected_features.columns[selected_features.var() != 0]

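In general, feature selection with L1 regularization is more powerful than univariate tests.
However, with a large amount of data and a large number of features, L1-based selection can also be very slow.
On large datasets, univariate tests will be faster, but predictive performance may be worse.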
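One way to compare the two approaches on your own data, while keeping the selector away from the validation rows, is to put the selector inside a Pipeline and cross-validate the whole thing. The sketch below is an illustration on synthetic data from `make_classification`, not the data from this post; the values k=5 and C=0.1 are arbitrary assumptions, and the scores it prints say nothing about which method wins in general.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for illustration): 30 features, 5 of them informative
X, y = make_classification(n_samples=600, n_features=30, n_informative=5,
                           random_state=0)

# Univariate selection followed by a plain logistic regression
univariate = make_pipeline(
    StandardScaler(),
    SelectKBest(f_classif, k=5),
    LogisticRegression(max_iter=1000),
)

# L1-based selection followed by the same final classifier
l1_based = make_pipeline(
    StandardScaler(),
    SelectFromModel(LogisticRegression(penalty="l1", solver="liblinear", C=0.1)),
    LogisticRegression(max_iter=1000),
)

# The selector is refit on the training part of every fold, so no validation
# rows are used when choosing features
scores = {
    name: cross_val_score(pipe, X, y, cv=5).mean()
    for name, pipe in [("univariate", univariate), ("l1", l1_based)]
}
for name, score in scores.items():
    print(f"{name:10s} mean CV accuracy: {score:.3f}")
```

Wrapping each call in a timer would give a rough idea of the speed difference on a dataset of a given size.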

Finish the course and its exercises to earn a certificate. Keep going! 🚀🚀🚀
