How to Perform Feature Selection for Regression Problems
1. Introduction
What is feature selection?
Feature selection is the procedure of selecting a subset (some out of all available) of the input variables that are most relevant to the target variable (that we wish to predict).
Target variable here refers to the variable that we wish to predict.
For this article we will assume that we only have numerical input variables and a numerical target, i.e. a regression predictive modeling problem. Under that assumption, we can estimate the relationship between each input variable and the target variable by calculating a metric such as the correlation coefficient.
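As a rough illustration of estimating the relationship between each input variable and the target, the sketch below computes the Pearson correlation of every column with the target using NumPy. The data is synthetic and purely an assumption for demonstration (it is not the dataset used later in the article), and the names `X`, `y` and `correlations` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data (an assumption, not the article's dataset):
# 200 samples, 3 numerical inputs; the target depends mostly on column 0.
X = rng.normal(size=(200, 3))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200)

# Pearson correlation between each input column and the target.
correlations = np.array(
    [np.corrcoef(X[:, i], y)[0, 1] for i in range(X.shape[1])]
)

for i, r in enumerate(correlations):
    print(f"feature {i}: r = {r:+.3f}")
```

Columns with a larger absolute correlation are the ones a correlation-based selector would rank first.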
2. The main numerical feature selection methods
The two best-known feature selection techniques for numerical input data and a numerical target variable are the following:
- Correlation (Pearson, Spearman)
- Mutual Information (MI, normalized MI)
Correlation is a measure of how two variables change together. The most widely used correlation measure is Pearson's correlation, which detects linear relationships between numerical variables; its significance test assumes that each variable is approximately Gaussian.
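To make the difference between the Pearson and Spearman measures concrete, here is a minimal sketch using `scipy.stats.pearsonr` and `scipy.stats.spearmanr`. Spearman's coefficient is computed on ranks, so it captures any monotonic relationship, while Pearson's only captures the linear part. The exponential relationship below is an invented example, not something taken from the article's data.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(0)

# Invented example: y grows monotonically with x, but not linearly.
x = rng.uniform(0.0, 5.0, size=300)
y = np.exp(x)

pearson_r, _ = pearsonr(x, y)       # linear association only
spearman_rho, _ = spearmanr(x, y)   # rank-based, monotonic association

print(f"Pearson r    = {pearson_r:.3f}")
print(f"Spearman rho = {spearman_rho:.3f}")
```

Because the relationship is strictly increasing, the rank-based coefficient is essentially 1, while the Pearson coefficient is noticeably lower.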
In scikit-learn's f_regression() function, the correlation-based score is computed in 2 steps:
The correlation between each regressor and the target is computed, that is, E[(X[:, i] - mean(X[:, i])) * (y - mean(y))] / (std(X[:, i]) * std(y)).
It is converted to an F score then to a p-value.
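The two steps can be sketched by hand and checked against `f_regression`. This is a minimal reconstruction under the assumption that both the inputs and the target are centered (the default `center=True`), in which case the F statistic has 1 and n - 2 degrees of freedom. The synthetic data and the names `F_manual` and `p_manual` are illustrative only.

```python
import numpy as np
from scipy import stats
from sklearn.feature_selection import f_regression

rng = np.random.default_rng(0)

# Synthetic data (an assumption): 150 samples, 4 numerical inputs.
X = rng.normal(size=(150, 4))
y = 2.0 * X[:, 0] - 1.0 * X[:, 2] + rng.normal(size=150)

n = X.shape[0]
dof = n - 2  # degrees of freedom when X and y are both centered

# Step 1: Pearson correlation of each column with the target.
Xc = X - X.mean(axis=0)
yc = y - y.mean()
r = (Xc * yc[:, None]).sum(axis=0) / (
    np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc)
)

# Step 2: convert r to an F score, then to a p-value
# using the survival function of the F(1, dof) distribution.
F_manual = r**2 / (1.0 - r**2) * dof
p_manual = stats.f.sf(F_manual, 1, dof)

# Compare with scikit-learn's implementation.
F_sklearn, p_sklearn = f_regression(X, y)
print(np.allclose(F_manual, F_sklearn), np.allclose(p_manual, p_sklearn))
```

Since the F score is a monotonic function of the squared correlation, ranking features by F score is the same as ranking them by absolute correlation.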
Mutual information originates from the field of information theory. The idea is to apply information gain (typically used in the construction of decision trees) to feature selection. Mutual information is calculated between two variables and measures the reduction in uncertainty for one variable given a known value of the other variable.
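A small experiment can show what "reduction in uncertainty" buys over correlation. In the sketch below (synthetic data, chosen as an assumption for illustration), the target depends on the first feature through a square, so the linear correlation is close to zero, yet knowing the feature still tells us a lot about the target. The second feature is unrelated to the target.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
n = 1000

x_quad = rng.uniform(-1.0, 1.0, size=n)   # target depends on this non-linearly
x_noise = rng.uniform(-1.0, 1.0, size=n)  # unrelated to the target
X = np.column_stack([x_quad, x_noise])
y = x_quad**2 + rng.normal(scale=0.05, size=n)

# Linear correlation misses the symmetric, quadratic dependence.
corr = [np.corrcoef(X[:, i], y)[0, 1] for i in range(X.shape[1])]

# Mutual information (nearest-neighbor estimate) still detects it.
mi = mutual_info_regression(X, y, random_state=0)

for i in range(X.shape[1]):
    print(f"feature {i}: corr = {corr[i]:+.3f}, MI = {mi[i]:.3f}")
```

The estimated mutual information is clearly larger for the first feature than for the unrelated one, even though both correlations are small.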
3. The dataset
We will use the Boston house-prices dataset. This dataset contains information collected by the U.S. Census Service concerning housing in the area of Boston, Massachusetts. It has 13 input variables and one target variable, the median value of owner-occupied homes (MEDV).
4. Python Code & Working Example
Let’s load and split the dataset into training (70%) and test (30%) sets.
# NOTE: load_boston was removed in scikit-learn 1.2, so this import needs scikit-learn < 1.2
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_regression
from sklearn.feature_selection import mutual_info_regression
import matplotlib.pyplot as plt
# load the data
X, y = load_boston(return_X_y=True)
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)
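Readers on scikit-learn 1.2 or later, where `load_boston` is no longer available, can still follow the rest of the walkthrough with stand-in data. The sketch below is only an assumption for that purpose: it uses `make_regression` to generate a synthetic dataset with the same shape as the Boston data (506 samples, 13 inputs) and applies the same 70/30 split. The values are not real housing data, so the scores will differ from those discussed in the article.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

# Hypothetical stand-in: same shape as the Boston data, synthetic values.
X, y = make_regression(
    n_samples=506,
    n_features=13,
    n_informative=4,
    noise=10.0,
    random_state=0,
)

# Same 70% / 30% split as in the article.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0
)

print(X_train.shape, X_test.shape)
```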
We will use the well-known scikit-learn machine learning library.
Case 1: Feature selection using the Correlation metric
For the correlation statistic we will use the f_regression() function. This function can be used in a feature selection strategy, such as selecting the top k most relevant features (largest values) via the SelectKBest class.
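As a sketch of the "top k" strategy, the example below sets `k=2` so that `SelectKBest` actually drops features instead of keeping all of them. The data is synthetic (an assumption for illustration): only columns 0 and 1 influence the target, so those are the ones we would expect to be kept.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic data: 10 inputs, only columns 0 and 1 drive the target.
X = rng.normal(size=(300, 10))
y = 4.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0
)

# Keep only the 2 features with the largest F scores.
selector = SelectKBest(score_func=f_regression, k=2)
selector.fit(X_train, y_train)       # learn scores from training data only
X_train_fs = selector.transform(X_train)
X_test_fs = selector.transform(X_test)

selected = selector.get_support(indices=True)
print("selected feature indices:", selected)
print("reduced shapes:", X_train_fs.shape, X_test_fs.shape)
```

Fitting the selector on the training split only, and then transforming both splits, keeps information from the test set out of the selection step.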
# feature selection
f_selector = SelectKBest(score_func=f_regression, k='all')
# learn relationship from training data
f_selector.fit(X_train, y_train)
# transform train input data
X_train_fs = f_selector.transform(X_train)
# transform test input data
X_test_fs = f_selector.transform(X_test)
# plot the scores for the features
plt.bar(range(len(f_selector.scores_)), f_selector.scores_)
plt.xlabel("feature index")
plt.ylabel("F-value (transformed from the correlation values)")
plt.show()
Reminder: for the correlation statistic, each correlation value is converted to an F score and then to a p-value.
The resulting plot shows that the 6th and 13th features (RM and LSTAT, at indices 5 and 12 on the zero-based x-axis) score far higher than the other features. The y-axis represents the F-values that were estimated from the correlation values.
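Instead of reading the ranking off a bar chart, the scores can also be sorted and printed. The sketch below assumes synthetic data and hypothetical feature names (`x0` to `x4`), and it uses the `scores_` and `pvalues_` attributes that `SelectKBest` exposes after fitting with `f_regression`.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)

feature_names = [f"x{i}" for i in range(5)]  # hypothetical names
X = rng.normal(size=(200, 5))
y = 5.0 * X[:, 3] + 3.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

selector = SelectKBest(score_func=f_regression, k="all").fit(X, y)

# Feature indices sorted from highest to lowest F score.
order = np.argsort(selector.scores_)[::-1]

for idx in order:
    print(
        f"{feature_names[idx]}: "
        f"F = {selector.scores_[idx]:.1f}, p = {selector.pvalues_[idx]:.3g}"
    )
```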
Case 2: Feature selection using the Mutual Information metric
The scikit-learn machine learning library provides an implementation of mutual information for feature selection with numeric input and output variables via the mutual_info_regression() function.
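One practical detail: `mutual_info_regression` uses a randomized nearest-neighbor estimator, so scores can vary slightly between runs unless `random_state` is fixed. `SelectKBest` only passes `X` and `y` to the scoring function, so one way to pin the seed is `functools.partial`. The sketch below uses synthetic data, and the choice of seed is an arbitrary assumption.

```python
from functools import partial

import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_regression

rng = np.random.default_rng(0)

# Synthetic data: the target depends non-linearly on column 0 only.
X = rng.normal(size=(300, 4))
y = np.abs(X[:, 0]) + rng.normal(scale=0.1, size=300)

# Fix the estimator's seed so repeated fits give identical scores.
score_func = partial(mutual_info_regression, random_state=0)

scores_a = SelectKBest(score_func=score_func, k="all").fit(X, y).scores_
scores_b = SelectKBest(score_func=score_func, k="all").fit(X, y).scores_

print("scores:", np.round(scores_a, 3))
print("reproducible:", np.array_equal(scores_a, scores_b))
```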
# feature selection
f_selector = SelectKBest(score_func=mutual_info_regression, k='all')
# learn relationship from training data
f_selector.fit(X_train, y_train)
# transform train input data
X_train_fs = f_selector.transform(X_train)
# transform test input data
X_test_fs = f_selector.transform(X_test)
# plot the scores for the features
plt.bar(range(len(f_selector.scores_)), f_selector.scores_)
plt.xlabel("feature index")
plt.ylabel("Estimated MI value")
plt.show()
Feature importance plot
The y-axis represents the estimated mutual information between each feature and the target variable. Compared to the correlation feature selection method, many more features are scored as relevant. This may be because mutual information also picks up non-linear dependence, or because of statistical noise in the dataset.
5. Conclusion
In this article I have presented two ways to perform feature selection. Feature selection is the procedure of selecting a subset (some out of all available) of the input variables that are most relevant to the target variable. Target variable here refers to the variable that we wish to predict.
Using either the Correlation metric or the Mutual Information metric, we can easily estimate the relationship between each input variable and the target variable.
Correlation vs Mutual Information: compared to the correlation feature selection method, mutual information scores many more features as relevant. This may be because of non-linear dependence that correlation misses, or because of statistical noise in the dataset.
Stay tuned & support this effort
If you liked and found this article useful, follow me to be able to see all my new posts.
Questions? Post them as a comment and I will reply as soon as possible.
Get in touch with me
LinkedIn: https://www.linkedin.com/in/serafeim-loukas/
ResearchGate: https://www.researchgate.net/profile/Serafeim_Loukas
EPFL profile: https://people.epfl.ch/serafeim.loukas
Stack Overflow: https://stackoverflow.com/users/5025009/seralouk
Translated from: https://towardsdatascience.com/how-to-perform-feature-selection-for-regression-problems-c928e527bbfa