Variable Selection in Regression Analysis with a Large Feature Space
Introduction
Performing multiple regression analysis from a large set of independent variables can be a challenging task. Identifying the best subset of regressors for a model involves optimizing against concerns such as bias, multicollinearity, exogeneity/endogeneity, and threats to external validity. Such problems become difficult to understand and control in the presence of a large number of features. Professors will often tell you to “let theory be your guide” when going about feature selection, but that is not always so easy.
This blog considers the issue of multicollinearity and suggests a method of avoiding it. Proposed here is not a “solution” to collinear variables, nor is it a perfect way of identifying them. It is simply one measurement to take into consideration when comparing multiple subsets of variables.
The Problem
There are several ways of identifying the features that are causing problems in a model. The most common approach (and the basis of this post) is to calculate correlations between suspected collinear variables. While effective, it is important to acknowledge the shortcomings of this method. For instance, correlation coefficients are often biased by sample sizes, and bivariate correlation cannot detect two variables that are collinear only in the presence of additional variables. For these reasons, it is a good idea to consider other metrics/methods as well, some of which include the following: look at the significance of coefficients compared to the overall model; look for high standard error; calculate variance inflation factors for different features; conduct principal components analysis; and yes, let theory be your guide.
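As a sketch of the variance-inflation-factor check mentioned above, here is a plain-numpy version (the helper name `vif` and the synthetic data are illustrative, not from the original post):

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (n_samples x n_features).

    VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing column j
    on all remaining columns (with an intercept).
    """
    n, k = X.shape
    out = []
    for j in range(k):
        y = X[:, j]
        A = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        r2 = 1.0 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

# Two nearly collinear columns and one independent column.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
X = np.column_stack([a, a + 0.05 * rng.normal(size=200), rng.normal(size=200)])
print(vif(X))  # first two VIFs are large, the third is near 1
```

A common rule of thumb treats a VIF above 5 or 10 as a sign of problematic collinearity.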
With all of this in mind, let us now consider a technique that employs a collection of transformed Pearson correlation coefficients in a multiple-criteria evaluation problem (see Multiple-Criteria Decision Analysis). The goal of the technique is to find a subset of independent variables where every pairwise correlation within the set is as low as possible, while simultaneously, each variable’s correlation with the dependent variable is as high as possible. We may represent the problem in the following way:
Here, r is the Pearson correlation coefficient of two variables, and f(x) is the weighted mean of a set of correlation coefficients. To apply this function, the coefficients must first be transformed to correct for their bias. Arithmetic operations are invalid on raw correlation coefficients because their variance is not stable across different values, making them biased estimates of the population. To address this, we apply the Fisher z-transformation, which normalizes the distribution of correlations and approximately stabilizes their variance. The Fisher z-transformation is denoted as:
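In standard form, for a correlation coefficient r:

```latex
z = \operatorname{arctanh}(r) = \frac{1}{2}\,\ln\!\left(\frac{1+r}{1-r}\right)
```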
With this in mind, we now consider the “maximizing” and “minimizing” elements of the problem. Because the magnitude and not the direction of correlation is of concern, the absolute value of each coefficient is considered. We might think of maximizing correlation to mean “get as close to 1 as possible” and minimizing correlation to mean “get as close to 0 as possible”. Getting as close to 1 as possible is less intuitive after applying the z-transformation, because arctanh(1) = ∞. Therefore, we can change the maximization problem to a minimization problem by subtracting the absolute value of each correlation from 1. Now, we might phrase the problem as follows:
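One way to write the resulting pair of minimization objectives for a candidate subset of features x_1, …, x_k with dependent variable y (the notation here is a sketch; f is the weighted mean defined earlier):

```latex
\min f_1 = f\bigl(\operatorname{arctanh}\bigl(1 - |r_{x_i,\,y}|\bigr)\bigr),
\qquad
\min f_2 = f\bigl(\operatorname{arctanh}\bigl(|r_{x_i,\,x_j}|\bigr)\bigr), \quad i < j
```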
We find the set of features that minimizes both of these functions by calculating the distance of each set from the theoretical global minimum (0,0). This solution can be best represented graphically. The figure below plots the two functions against each other for every set of features in a sample dataset. Each blue point represents one subset of variables, while the red area is an arbitrary frontier to visualize which point has the shortest Euclidean distance from the theoretical minimum.
The subset corresponding to the point with the shortest distance to the origin can be understood as the set where every pairwise correlation is as low as possible, and simultaneously, each correlation with the dependent variable is as high as possible.
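A minimal numpy sketch of the whole measure might look as follows. The function name `z_distance` and the synthetic data are this sketch's choices; the self-weighting scheme (each value weighted by its share of the sum) follows the weighting described in the application below:

```python
import numpy as np

def z_distance(X, y):
    """Distance of a feature subset from the ideal point (0, 0).

    f1: weighted mean of arctanh(1 - |r|) over feature-target correlations
        (small when each feature tracks the target closely).
    f2: weighted mean of arctanh(|r|) over pairwise feature correlations
        (small when the features are mutually uncorrelated).
    """
    k = X.shape[1]
    corr = np.abs(np.corrcoef(np.column_stack([X, y]), rowvar=False))
    z_dep = np.arctanh(1.0 - corr[:k, k])                 # feature-target column
    z_ind = np.arctanh(corr[:k, :k][np.triu_indices(k, 1)])
    f1 = z_dep @ z_dep / z_dep.sum()                      # weights w_i = z_i / sum(z)
    f2 = z_ind @ z_ind / z_ind.sum()
    return float(np.hypot(f1, f2))                        # distance from (0, 0)

# A collinear pair should score worse than two independent regressors.
rng = np.random.default_rng(1)
x1, x2 = rng.normal(size=300), rng.normal(size=300)
x3 = x1 + 0.01 * rng.normal(size=300)                     # nearly a copy of x1
target = x1 + x2 + 0.5 * rng.normal(size=300)
print(z_distance(np.column_stack([x1, x2]), target))
print(z_distance(np.column_stack([x1, x3]), target))
```

The collinear pair {x1, x3} produces a large f2 and hence a much larger distance, even though both subsets correlate with the target about equally well.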
An Application
For more clarity, let’s now define a real-world example. Consider the popular Boston Housing dataset. The dataset provides information on housing prices in Boston as well as on several features of the houses and the housing market there. Say we want to build a model that contains as much explanatory power of housing prices as possible. There are 506 observations in the dataset, each corresponding to a Boston-area census tract. There are 13 candidate independent variables, but let’s say we only want to consider two different subsets with 5 independent variables each.
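The number of candidate subsets grows combinatorially. Using the 13 predictor column names from the standard Boston Housing file, there are C(13, 5) = 1287 five-variable subsets to score:

```python
from itertools import combinations
from math import comb

# Predictor columns of the standard Boston Housing dataset (MEDV/PRICE is the target).
features = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
            "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT"]

subsets = list(combinations(features, 5))
print(len(subsets), comb(13, 5))  # 1287 1287
```

Scoring each subset by its z-distance is far cheaper than fitting 1287 regressions and comparing their diagnostics.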
The first subset consists of the following variables: proportion of non-retail business acres in the area (INDUS); nitric oxides concentration (NOX); proportion of units built before 1940 in the area (AGE); property tax rate (TAX); and accessibility to radial highways (RAD). This subset will be referred to as {INDUS, NOX, AGE, TAX, RAD}.
The second subset consists of the following variables: distance to Boston employment centers (DIS); average number of rooms per dwelling (RM); pupil-to-teacher ratio in the area (PTRATIO); percent of lower status population in the area (LSTAT); and property tax-rate (TAX). This subset will be referred to as {DIS, RM, PTRATIO, LSTAT, TAX}.
These subsets will be used to predict the dependent variable, PRICE. Correlograms of the independent variables as well as the correlations with the dependent variable for both subsets are provided below.
The first step is to take the absolute value of every correlation coefficient, subtract correlations with the dependent variable from 1, and transform the correlations into z-scores.
Next, we calculate the weighted mean of each correlation with the dependent variable as well as the correlations within the independent variables. Weights are determined by each coefficient’s proportion of the sum of coefficients. With these aggregations, the distance of each set from the theoretical minimum (0,0) is also calculated. This is done for the {INDUS, NOX, AGE, TAX, RAD} subset as follows:
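Because each weight is the coefficient's share of the total, the weighted mean reduces to sum(z^2) / sum(z). A small arithmetic sketch (the z-values below are made up for illustration, not the post's actual figures):

```python
import numpy as np

def weighted_mean(z):
    """Self-weighted mean: w_i = z_i / sum(z), so the mean is sum(z^2) / sum(z)."""
    z = np.asarray(z, dtype=float)
    return float(z @ z / z.sum())

# Hypothetical transformed correlations for one subset:
f1 = weighted_mean([0.35, 0.30, 0.42, 0.28, 0.33])   # with the dependent variable
f2 = weighted_mean([0.25, 0.40, 0.18, 0.31, 0.52,
                    0.22, 0.36, 0.29, 0.44, 0.27])   # 10 pairwise values
distance = np.hypot(f1, f2)                          # Euclidean distance from (0, 0)
print(f1, f2, distance)
```

Note that the self-weighting pulls the mean toward the larger z-values, so a single strongly collinear pair penalizes the whole subset.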
And for the {DIS, RM, PTRATIO, LSTAT, TAX} subset as:
These two values indicate that subset {DIS, RM, PTRATIO, LSTAT, TAX} has higher correlation with PRICE and lower correlation within itself than does subset {INDUS, NOX, AGE, TAX, RAD}, as demonstrated by their respective distances from the origin. This tentatively suggests that subset {DIS, RM, PTRATIO, LSTAT, TAX} has the better explanatory power of PRICE. This is not a perfect indication, and other metrics must also be assessed.
We can verify which subset is better by actually fitting models now. Below, PRICE has been regressed on DIS, RM, PTRATIO, LSTAT, and TAX. We can immediately recognize that every variable is statistically significant to the model (see P>|t|). We also recognize that the model itself is statistically significant (see P(F)). Take note of the R2 values, the F-statistic, the root mean squared error, and the Akaike/Bayes Information Criteria.
Next, PRICE has been regressed on INDUS, NOX, AGE, TAX, and RAD. In this model, we can see that there are now at least two independent variables that are not statistically significant. The model itself is still significant, but it has a lower F-statistic than the previous model. Additionally, its R2 values are both lower than that of the previous model, implying less explanatory power. RMSE, AIC, and BIC are also higher here, implying lower quality. This confirms the findings calculated above.
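The comparison statistics can also be computed directly with plain numpy. The sketch below uses synthetic stand-in data rather than the Boston columns; `ols_metrics` is this sketch's helper, and its AIC drops the constant term of the Gaussian log-likelihood (rankings are unaffected):

```python
import numpy as np

def ols_metrics(X, y):
    """Fit y on X (with intercept) via least squares; return R^2, adj. R^2, AIC.

    AIC uses the Gaussian log-likelihood up to an additive constant:
    n * ln(SSR / n) + 2 * (k + 1), with k slope coefficients.
    """
    n, k = X.shape
    A = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    ssr = float(resid @ resid)
    sst = float(((y - y.mean()) ** 2).sum())
    r2 = 1.0 - ssr / sst
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)
    aic = float(n * np.log(ssr / n) + 2 * (k + 1))
    return r2, adj_r2, aic

# Synthetic comparison: informative regressors vs. a weak, noisy set.
rng = np.random.default_rng(2)
n = 506
good = rng.normal(size=(n, 5))
y = good @ np.array([1.0, -0.5, 0.8, 0.3, -0.7]) + rng.normal(size=n)
bad = np.column_stack([good[:, :1] + rng.normal(size=(n, 1)),  # weak proxy of one regressor
                       rng.normal(size=(n, 4))])               # pure noise
print(ols_metrics(good, y))   # higher R^2, lower AIC
print(ols_metrics(bad, y))
```

In practice one would read the same quantities off a statsmodels OLS summary, as the figures in the post do.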
The “z-distance” presented in this blog post has demonstrated its use in this example. The {DIS, RM, PTRATIO, LSTAT, TAX} subset has a shorter distance to the origin than the {INDUS, NOX, AGE, TAX, RAD} subset, and DIS, RM, PTRATIO, LSTAT, and TAX were then shown to be better predictors of PRICE. While it was easy to simply fit these two models and compare them, in a feature space of much higher dimension it might be faster to calculate the distances of several subsets first.
Conclusion
There are many factors to consider in feature selection. This post does not offer a solution to finding the best subset of variables, but merely a way for one to take a step in the right direction by finding sets of features that do not immediately demonstrate collinearity. It is important to remember that one must rely on more than just correlation coefficients when identifying multicollinearity.
A Python script for this solution and for automating feature combinations can be found at the following GitHub repository:
https://github.com/willarliss/z-Distance/
Original article: https://towardsdatascience.com/variable-selection-in-regression-analysis-with-a-large-feature-space-2f142f15e5a