

How to Perform Feature Selection for Regression Problems



1. Introduction

What is feature selection?

Feature selection is the procedure of selecting a subset (some out of all available) of the input variables that are most relevant to the target variable (that we wish to predict).

Target variable here refers to the variable that we wish to predict.

For this article we will assume that we only have numerical input variables and a numerical target for regression predictive modeling. Under that assumption, we can easily estimate the relationship between each input variable and the target variable. This relationship can be established by calculating a metric such as the correlation value, for example.

2. The main numerical feature selection methods

The two most widely used feature selection techniques for numerical input data and a numerical target variable are the following:

  • Correlation (Pearson, Spearman)

  • Mutual Information (MI, normalized MI)

Correlation is a measure of how two variables change together. The most widely used correlation measure is Pearson's correlation, which assumes a Gaussian distribution of each variable and detects linear relationships between numerical variables.

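As a quick illustration of the difference: Pearson rewards linear relationships, while Spearman (which works on ranks) gives full credit to any monotonic relationship. A minimal sketch using scipy, on synthetic data that is purely illustrative:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(0)
x = rng.normal(size=200)

# nearly linear relationship: both coefficients are close to 1
y_lin = 3 * x + rng.normal(scale=0.1, size=200)
r_p, _ = pearsonr(x, y_lin)
r_s, _ = spearmanr(x, y_lin)

# monotonic but nonlinear relationship: Spearman stays at 1, Pearson drops
y_exp = np.exp(x)
r_p2, _ = pearsonr(x, y_exp)
r_s2, _ = spearmanr(x, y_exp)
print(r_p, r_s, r_p2, r_s2)
```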
This is done in two steps:

  • The correlation between each regressor and the target is computed, that is, E[(X[:, i] - mean(X[:, i])) * (y - mean(y))] / (std(X[:, i]) * std(y)).

  • It is converted to an F-score and then to a p-value.

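These two steps can be reproduced by hand and checked against scikit-learn's f_regression. A sketch on synthetic data; the dataset and all variable names are illustrative:

```python
import numpy as np
from scipy.stats import f as f_dist
from sklearn.datasets import make_regression
from sklearn.feature_selection import f_regression

X, y = make_regression(n_samples=100, n_features=3, noise=10.0, random_state=0)
n = X.shape[0]

# step 1: Pearson correlation between each column of X and the target
Xc = X - X.mean(axis=0)
yc = y - y.mean()
corr = (Xc * yc[:, None]).sum(axis=0) / (
    np.sqrt((Xc ** 2).sum(axis=0)) * np.sqrt((yc ** 2).sum())
)

# step 2: convert r to an F statistic with (1, n - 2) degrees of freedom,
# then to a p-value via the F distribution's survival function
F_manual = corr ** 2 / (1 - corr ** 2) * (n - 2)
p_manual = f_dist.sf(F_manual, 1, n - 2)

F_sk, p_sk = f_regression(X, y)
print(np.allclose(F_manual, F_sk), np.allclose(p_manual, p_sk))  # True True
```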
Mutual information originates from the field of information theory. The idea is to apply information gain (typically used in the construction of decision trees) in order to perform the feature selection. Mutual information is calculated between two variables and is measured as the reduction in uncertainty about one variable given a known value of the other variable.

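A minimal sketch of this idea with scikit-learn's estimator, on synthetic data (the coefficients are arbitrary): the feature that actually drives the target gets a clearly higher MI score than an unrelated one.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 2))
# the target depends on feature 0 only; feature 1 is pure noise
y = 2 * X[:, 0] + 0.5 * rng.normal(size=500)

mi = mutual_info_regression(X, y, random_state=42)
print(mi[0] > mi[1])  # True: knowing feature 0 strongly reduces uncertainty about y
```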
3. The dataset

    We will use the Boston house-prices dataset. This dataset contains information collected by the U.S. Census Service concerning housing in the area of Boston, Mass. The dataset consists of the following variables:

  • CRIM — per capita crime rate by town
  • ZN — proportion of residential land zoned for lots over 25,000 sq.ft.
  • INDUS — proportion of non-retail business acres per town
  • CHAS — Charles River dummy variable (1 if tract bounds river; 0 otherwise)
  • NOX — nitric oxides concentration (parts per 10 million)
  • RM — average number of rooms per dwelling
  • AGE — proportion of owner-occupied units built prior to 1940
  • DIS — weighted distances to five Boston employment centres
  • RAD — index of accessibility to radial highways
  • TAX — full-value property-tax rate per $10,000
  • PTRATIO — pupil-teacher ratio by town
  • B — 1000(Bk — 0.63)^2, where Bk is the proportion of blacks by town
  • LSTAT — % lower status of the population
  • MEDV — median value of owner-occupied homes in $1000's
4. Python Code & Working Example

    Let’s load and split the dataset into training (70%) and test (30%) sets.

    from sklearn.datasets import load_boston
    from sklearn.model_selection import train_test_split
    from sklearn.feature_selection import SelectKBest
    from sklearn.feature_selection import f_regression
    from sklearn.feature_selection import mutual_info_regression
    import matplotlib.pyplot as plt

    # load the data
    # (note: load_boston was removed in scikit-learn 1.2; an older version is assumed here)
    X, y = load_boston(return_X_y=True)

    # split into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)

    We will use the well-known scikit-learn machine learning library.

Case 1: Feature selection using the Correlation metric

    For the correlation statistic we will use the f_regression() function. This function can be used in a feature selection strategy, such as selecting the top k most relevant features (largest values) via the SelectKBest class.

    # feature selection
    f_selector = SelectKBest(score_func=f_regression, k='all')

    # learn relationship from training data
    f_selector.fit(X_train, y_train)

    # transform train input data
    X_train_fs = f_selector.transform(X_train)

    # transform test input data
    X_test_fs = f_selector.transform(X_test)

    # plot the scores for the features
    plt.bar([i for i in range(len(f_selector.scores_))], f_selector.scores_)
    plt.xlabel("feature index")
    plt.ylabel("F-value (transformed from the correlation values)")
    plt.show()

    Reminder: For the correlation statistic case:

  • The correlation between each regressor and the target is computed, that is, E[(X[:, i] - mean(X[:, i])) * (y - mean(y))] / (std(X[:, i]) * std(y)).

  • It is converted to an F-score and then to a p-value.

    Feature Importance plot

    The plot above shows that features 6 and 13 are more important than the others. The y-axis represents the F-values that were estimated from the correlation values.

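With k='all' the selector only scores the features; in practice one would pass a finite k so that transform() actually drops the low-scoring columns. A minimal sketch on synthetic data (make_regression and k=2 are illustrative choices, not part of the original example):

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression

# 10 features, only 2 of which are informative
X, y = make_regression(n_samples=200, n_features=10, n_informative=2, random_state=0)

selector = SelectKBest(score_func=f_regression, k=2)
X_reduced = selector.fit_transform(X, y)

print(X_reduced.shape)                     # (200, 2)
print(selector.get_support(indices=True))  # indices of the two kept features
```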
Case 2: Feature selection using the Mutual Information metric

    The scikit-learn machine learning library provides an implementation of mutual information for feature selection with numeric input and output variables via the mutual_info_regression() function.

    # feature selection
    f_selector = SelectKBest(score_func=mutual_info_regression, k='all')

    # learn relationship from training data
    f_selector.fit(X_train, y_train)

    # transform train input data
    X_train_fs = f_selector.transform(X_train)

    # transform test input data
    X_test_fs = f_selector.transform(X_test)

    # plot the scores for the features
    plt.bar([i for i in range(len(f_selector.scores_))], f_selector.scores_)
    plt.xlabel("feature index")
    plt.ylabel("Estimated MI value")
    plt.show()

    Feature Importance plot

    The y-axis represents the estimated mutual information between each feature and the target variable. Compared to the correlation feature selection method, we can clearly see many more features scored as being relevant. This may be because of the statistical noise that might exist in the dataset.

5. Conclusion

    In this article I have presented two ways to perform feature selection. Feature selection is the procedure of selecting a subset (some out of all available) of the input variables that are most relevant to the target variable. Target variable here refers to the variable that we wish to predict.

    Using either the Correlation metric or the Mutual Information metric, we can easily estimate the relationship between each input variable and the target variable.

    Correlation vs Mutual Information: Compared to the correlation feature selection method, the mutual information method clearly scores many more features as being relevant. This may be because of the statistical noise that might exist in the dataset.

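The difference between the two metrics is easy to demonstrate: for a strong but non-monotonic relationship, Pearson correlation is typically near zero while mutual information remains high. A sketch on synthetic, purely illustrative data:

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=1000)
y = x ** 2  # deterministic dependence, but symmetric around zero

r, _ = pearsonr(x, y)
mi = mutual_info_regression(x.reshape(-1, 1), y, random_state=0)

# correlation sees almost nothing; mutual information sees a strong dependency
print(r, mi[0])
```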
Stay tuned & support this effort

    If you liked and found this article useful, follow me to be able to see all my new posts.

    Questions? Post them as a comment and I will reply as soon as possible.

Get in touch with me

    • LinkedIn: https://www.linkedin.com/in/serafeim-loukas/

    • ResearchGate: https://www.researchgate.net/profile/Serafeim_Loukas

    • EPFL profile: https://people.epfl.ch/serafeim.loukas

    • Stack Overflow: https://stackoverflow.com/users/5025009/seralouk

    Translated from: https://towardsdatascience.com/how-to-perform-feature-selection-for-regression-problems-c928e527bbfa
