
Feature Selection via Models

Published: 2025/3/21


Feature selection using SelectFromModel

SelectFromModel

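sklearn's feature_selection module includes SelectFromModel, which selects features using the importance scores that a model itself produces. The name describes the behavior exactly: select (features) from model.
SelectFromModel is a general-purpose meta-transformer. Any estimator that exposes a coef_ or feature_importances_ attribute after fitting can serve as its model. Features whose coef_ or feature_importances_ values fall below a preset threshold are considered unimportant and are removed. Besides a numeric threshold, you can pass a string to use a built-in heuristic for finding a suitable threshold. The available heuristics are "mean", "median", and a float multiple of either (for example, "0.1*mean").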
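To make the string heuristics concrete, here is a minimal sketch that resolves each one to a number via the threshold_ attribute. The choice of RandomForestClassifier, the iris dataset, and the fixed random_state are assumptions made for illustration only; they are not part of the original article.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X, y = load_iris(return_X_y=True)

# Map each threshold string to (resolved numeric threshold, features kept).
resolved = {}
for threshold in ("mean", "median", "0.1*mean"):
    # Assumption: a random forest with a fixed seed, so every fit is identical.
    forest = RandomForestClassifier(n_estimators=50, random_state=0)
    selector = SelectFromModel(forest, threshold=threshold).fit(X, y)
    importances = selector.estimator_.feature_importances_
    n_kept = int(selector.get_support().sum())
    resolved[threshold] = (selector.threshold_, n_kept)
    print("%-10s threshold_=%.4f kept=%d" % (threshold, selector.threshold_, n_kept))
```

A feature is kept when its importance is greater than or equal to the resolved threshold, so a scaled-down threshold such as "0.1*mean" never keeps fewer features than "mean" does.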

Depending on the base learner, there are two common ways to choose the estimator.

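The first is L1-based feature selection. Linear models with L1 regularization produce sparse solutions, so when the goal is to reduce dimensionality you can use sklearn's L1-regularized linear models, such as LinearSVC, LogisticRegression, or Lasso. Note that in SVMs and logistic regression the parameter C controls sparsity: the smaller C is, the fewer features are selected. With Lasso, the larger alpha is, the fewer features are selected.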
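The effect of C on sparsity can be checked with a short sketch like the one below. The specific C values, the iris dataset, and the max_iter setting are assumptions chosen for illustration; the counts it prints depend on the data and the solver.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectFromModel
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)

# Map each C value to the number of features SelectFromModel keeps.
counts = {}
for C in (0.01, 0.1, 1.0):
    # penalty="l1" requires dual=False in LinearSVC.
    svc = LinearSVC(C=C, penalty="l1", dual=False, max_iter=10000)
    selector = SelectFromModel(svc).fit(X, y)
    counts[C] = int(selector.get_support().sum())
    print("C=%-5s -> %d features kept" % (C, counts[C]))
```

Because the estimator uses an L1 penalty and no threshold is given, SelectFromModel falls back to its small default threshold of 1e-5, so only features whose coefficients were driven to (nearly) zero are dropped.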

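The second is tree-based feature selection. After training, tree algorithms such as decision trees and random forests report how important each feature is, and we can use these importances to select features.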

Whichever learner you choose, remember that the ultimate criterion is to select the features that work best, not to follow one particular selection method.

Key parameters, attributes, and methods

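threshold: the threshold; string, float, optional, default None
- Accepts "median", "mean", or a scaled form such as "1.25*mean".
- If left as None and the estimator uses an L1 penalty, the threshold is 1e-5; otherwise "mean" is used by default.
prefit: bool, default False. Whether the estimator passed in is already fitted. (Note: with prefit=True, SelectFromModel cannot be used with cross_val_score, GridSearchCV, or similar utilities that clone the estimator.) If False, call fit first and then transform.
threshold_: the threshold value actually used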
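As a hedged illustration of the prefit=False path, the sketch below places an unfitted SelectFromModel inside a Pipeline so that cross_val_score can clone and refit it on each fold, then reads threshold_ and get_support() from the fitted selector. The random forest, the logistic regression final step, and the "median" threshold are assumptions made for this example.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = load_iris(return_X_y=True)

# prefit=False (the default): the selector fits its estimator itself,
# so the whole pipeline can be cloned safely by cross-validation utilities.
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=50, random_state=0),
    threshold="median",
)
pipeline = make_pipeline(selector, LogisticRegression(max_iter=1000))
scores = cross_val_score(pipeline, X, y, cv=5)
print("mean CV accuracy:", scores.mean())

# Fit on the full data to inspect what the selector decided.
pipeline.fit(X, y)
fitted_selector = pipeline.named_steps["selectfrommodel"]
print("threshold_ :", fitted_selector.threshold_)
print("support    :", fitted_selector.get_support())
```

Keeping the selector inside the pipeline also means the features are chosen using only the training part of each fold, which avoids leaking information from the held-out part.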

A simple example:

Feature selection with L1

    from sklearn.svm import LinearSVC
    from sklearn.datasets import load_iris
    from sklearn.feature_selection import SelectFromModel

    # Load the iris dataset.
    iris = load_iris()
    X, y = iris['data'], iris['target']
    print("X has %s features" % X.shape[1])

    # A small C with an L1 penalty drives some coefficients to zero.
    lsvc = LinearSVC(C=0.01, penalty="l1", dual=False).fit(X, y)
    model = SelectFromModel(lsvc, prefit=True)
    X_new = model.transform(X)
    print("X_new has %s features" % X_new.shape[1])

Output:

    X has 4 features
    X_new has 3 features

Tree-based feature selection

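    from sklearn.ensemble import ExtraTreesClassifier

    # Reuses X, y and SelectFromModel from the previous example.
    clf = ExtraTreesClassifier().fit(X, y)
    print("clf.feature_importances_ :", clf.feature_importances_)

    # No threshold given: the mean of the feature importances is used.
    model_2 = SelectFromModel(clf, prefit=True)
    X_new_2 = model_2.transform(X)
    print("X_new_2 has %s features" % X_new_2.shape[1])

    # Explicit numeric threshold.
    model_3 = SelectFromModel(clf, prefit=True, threshold=0.15)
    X_new_3 = model_3.transform(X)
    print("model threshold: %s" % model_3.threshold)
    print("X_new_3 has %s features" % X_new_3.shape[1])

Output (the importances vary between runs because no random_state is set):

    clf.feature_importances_ : [0.14016636 0.06062787 0.47708914 0.32211664]
    X_new_2 has 2 features
    model threshold: 0.15
    X_new_3 has 2 features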

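More examples

Selecting features does not necessarily improve performance. This holds for every feature selection method.

A slightly modified version of the sklearn example (Feature selection using SelectFromModel and LassoCV) makes this easy to see.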

    import matplotlib.pyplot as plt
    import numpy as np

    # Note: load_boston was removed in scikit-learn 1.2, so this example
    # needs an older release to run as written.
    from sklearn.datasets import load_boston
    from sklearn.feature_selection import SelectFromModel
    from sklearn.linear_model import LassoCV
    from sklearn.model_selection import cross_val_score

    # Load the boston dataset.
    boston = load_boston()
    X, y = boston['data'], boston['target']

    # We use the base estimator LassoCV since the L1 norm promotes sparsity of features.
    clf = LassoCV()

    # Start from a threshold of 0.0 so that no feature is dropped initially.
    sfm = SelectFromModel(clf, threshold=0.0)
    sfm.fit(X, y)
    n_features = sfm.transform(X).shape[1]


    def GetCVScore(estimator, X, y):
        # Mean 5-fold cross-validation score of the estimator on (X, y).
        nested_score = cross_val_score(estimator, X=X, y=y, cv=5)
        return nested_score.mean()


    # Raise the threshold until the number of features equals two.
    # Note that the attribute can be set directly instead of repeatedly
    # fitting the metatransformer.
    nested_scores = []
    n_features_list = []
    while n_features > 2:
        sfm.threshold += 0.01
        X_transform = sfm.transform(X)
        n_features = X_transform.shape[1]
        nested_score = GetCVScore(estimator=clf, X=X_transform, y=y)
        nested_scores.append(nested_score)
        n_features_list.append(n_features)
        # print("nested_score: %s" % nested_score)
        # print("n_features: %s" % n_features)
        # print("threshold: %s" % sfm.threshold)

    # Plot the selected two features from X.
    plt.title("Features selected from Boston using SelectFromModel with "
              "threshold %0.3f." % sfm.threshold)
    feature1 = X_transform[:, 0]
    feature2 = X_transform[:, 1]
    plt.plot(feature1, feature2, 'r.')
    plt.xlabel("Feature number 1")
    plt.ylabel("Feature number 2")
    plt.ylim([np.min(feature2), np.max(feature2)])
    plt.show()

    # Compare the score at each reduced feature count with the full feature set.
    plt.scatter(n_features_list, nested_scores, c='b', marker='.', label='Selected')
    plt.scatter(X.shape[1], GetCVScore(estimator=clf, X=X, y=y), c='r', marker='*', label='old feature')
    plt.title("The reduction of features does not necessarily bring up the performance of the model")
    plt.xlabel("number of features")
    plt.ylabel("score of model")
    plt.show()

The first part of the example above shows how to use SelectFromModel together with Lasso. The part I added shows that reducing the number of features does not necessarily improve the performance of the model.

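Feature selection does not necessarily improve performance. When every feature is useful, removing features can only lower the model's performance. Even when not every feature is useful, a low importance score does not mean that the feature hurts the model, because an importance metric does not capture the feature's final effect. In many cases the metric we use is only a reference.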
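Since load_boston is no longer shipped with recent scikit-learn releases, the same comparison can be sketched on another dataset. The version below swaps in load_diabetes, which is an assumption of this sketch and not the dataset used in the article, and it steps the threshold through the fitted coefficient magnitudes instead of using a fixed increment. It skips the plots and only collects the cross-validated R^2 score for each feature count.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LassoCV
from sklearn.model_selection import cross_val_score

# Assumption: load_diabetes stands in for the Boston data of the original example.
X, y = load_diabetes(return_X_y=True)

# Fit once; threshold=0.0 keeps every feature at the start.
sfm = SelectFromModel(LassoCV(cv=5), threshold=0.0).fit(X, y)
importances = np.abs(sfm.estimator_.coef_)

# Score with all features, as the reference point.
baseline = cross_val_score(LassoCV(cv=5), X, y, cv=5).mean()

# Raise the threshold through the observed coefficient magnitudes.
# The attribute is set directly, so the selector is not refitted.
results = {}
for threshold in np.unique(importances):
    sfm.threshold = float(threshold)
    X_selected = sfm.transform(X)
    n_selected = X_selected.shape[1]
    if n_selected < 2:
        break
    results[n_selected] = cross_val_score(LassoCV(cv=5), X_selected, y, cv=5).mean()

print("all %d features: %.4f" % (X.shape[1], baseline))
for n_selected, score in sorted(results.items(), reverse=True):
    print("%2d features: %.4f" % (n_selected, score))
```

Comparing the printed scores against the baseline shows whether dropping low-coefficient features helps or hurts on this particular dataset; the outcome is data-dependent, which is the point made above.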

References

sklearn official documentation: Univariate Feature Selection
sklearn official documentation: Feature selection using SelectFromModel and LassoCV
sklearn ApacheCN official translation

Further reading:

Variance filtering
Univariate feature selection
Recursive feature elimination
Common machine learning datasets
