How Multicollinearity Is a Problem in Multiple Linear Regression
Linear regression is one of the simplest and most widely used algorithms for supervised machine learning problems where the output is a numerical (quantitative) variable and the input is one or more independent variables.
The math behind it is easy to understand, which makes linear regression one of my favorite algorithms to work with. But this simplicity comes at a price.
When we decide to fit a linear regression model, we have to make sure certain conditions are satisfied, or else our model will perform poorly or give us incorrect interpretations. So what are these conditions?
1. Linearity: X and the mean of Y have a linear relationship.
2. Homoscedasticity: the variance of the error terms is the same for all values of X.
3. No collinearity: the independent variables are not highly correlated with each other.
4. Normality: Y is normally distributed for any value of X.
If the above four conditions are satisfied, we can expect our Linear Regression model to perform well.
如果滿足以上四個(gè)條件,我們可以期望線性回歸模型表現(xiàn)良好。
So how do we ensure these conditions are met? Covering all four in depth would make for a very long article, so here I will focus on the third condition, no collinearity: I will explain what multicollinearity is, why it is a problem in the first place, and what can be done to overcome it.
In a supervised machine learning regression problem, we have a set of independent variables and an output variable, which are used to train our model and make predictions and interpretations.
In a multivariate linear regression problem, we make predictions with the trained model and use the coefficients to interpret it, for example:
Y = B0 + B1·X1 + B2·X2 (multivariate linear regression)

The above equation states that a unit increase in X1 will result in a B1 increase in the value of Y, and a unit increase in X2 will result in a B2 increase in the value of Y, holding the other variable fixed.
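To make that interpretation concrete, here is a small sketch with hypothetical coefficient values (B0, B1, B2 are made up for illustration):

```python
# Hypothetical fitted coefficients for Y = B0 + B1*X1 + B2*X2.
B0, B1, B2 = 1.0, 2.5, -0.7

def predict(x1, x2):
    return B0 + B1 * x1 + B2 * x2

# A unit increase in X1, holding X2 fixed, changes the prediction by exactly B1.
delta = predict(4.0, 10.0) - predict(3.0, 10.0)
print(delta)   # 2.5, i.e. the value of B1
```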
The coefficients are essential for understanding which variable has the greatest influence on the model.
So how is multicollinearity a problem? When independent variables are highly correlated with each other, our coefficients won't be reliable, and we cannot make accurate interpretations based on their values.
To illustrate this point, I created two dummy input variables in Python and one dependent output variable.
import numpy as np

x3 = np.random.randint(0, 100, 100)
x4 = 3*x3 + np.random.randint(0, 100, 100)
y1 = 4*x3 + np.random.randint(0, 100, 100)
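As a quick sanity check (my own addition, using a seeded generator rather than the article's unseeded `np.random.randint`), the construction above makes x3 and x4 strongly correlated by design, since x4 is built directly from x3:

```python
import numpy as np

# Re-simulate the article's dummy data with a fixed seed for reproducibility.
rng = np.random.default_rng(42)
x3 = rng.integers(0, 100, 100)
x4 = 3 * x3 + rng.integers(0, 100, 100)   # x4 is built from x3, so they are collinear
y1 = 4 * x3 + rng.integers(0, 100, 100)

# Pearson correlation between the two inputs.
r = np.corrcoef(x3, x4)[0, 1]
print(r)   # close to 1, confirming the two inputs are highly correlated
```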
Creating scatterplots for the variables gives us:
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.xlabel('x3')
sns.scatterplot(x=x3, y=y1)
plt.subplot(1, 2, 2)
plt.xlabel('x4')
sns.scatterplot(x=x4, y=y1)

Scatterplots for the input variables against y1
The scatterplots show that both x3 and x4 have a linear relationship with y1. Let's look at the correlation matrix for the variables and see what else we can interpret. I put my variables into a DataFrame named S2 and created a correlation matrix.
S2.corr()

Correlation matrix

By the looks of the correlation matrix, it seems that x3 and x4 not only have a high positive correlation with y1 but are also highly correlated with each other. Let's see how this will affect our results.
Before I fit a linear regression model to my variables, we have to understand the concept of p-values and the null hypothesis.
The p-value is used to either reject or fail to reject the null hypothesis.
The null hypothesis in our case is that "the variable does not have a significant relationship with y".
If the p-value is less than the threshold of 0.05, we reject the null hypothesis; otherwise, we fail to reject it. So let's move forward.
I import the statsmodels library and use it to fit an ordinary least squares (OLS) model to my variables.
The independent variables are X3 and X4, and the dependent variable is y1.
import statsmodels.api as sm

X = S2[['X3', 'X4']]
y = S2['y1']
X = sm.add_constant(X)
est = sm.OLS(y,X)
est2 = est.fit()
print(est2.summary())
The results we get are:
我們得到的結(jié)果是:
Summary for the OLS method

We get a very high R² score, which suggests that our model explains the variance in the data quite well. The coefficients, on the other hand, tell an entirely different story.
The p-value for our X4 variable shows that we cannot reject the null hypothesis, meaning X4 does not have a significant relationship with y1.
Furthermore, the coefficient is negative, which should not be possible, as the scatterplots showed that y1 had a positive relationship with each independent variable.
So to sum it up, our coefficients are not reliable and our p-values cannot be trusted.
僅使用一個(gè)變量進(jìn)行回歸 (Regression with one variable only)
In the previous multivariate example, our results showed that X4 did not have a significant relationship with y1. So let us analyze y1 and X4 alone and see what we get.
import statsmodels.api as sm

X4 = S2['X4']
y1 = S2['y1']
X = sm.add_constant(X4)
est = sm.OLS(y1,X)
est2 = est.fit()
print(est2.summary())
After fitting our OLS model, we get:
The coefficient is now positive, and we can reject the null hypothesis that X4 is not related to y1. One more thing we can take from this model is that our R-squared value has dropped significantly, from 0.942 to 0.826. So what does that tell us? If our goal is prediction, we may need to think twice before removing variables; but if our goal is to interpret each coefficient, then collinearity can be troublesome, and we have to consider which variables to keep and which to remove.
[1] Gareth James et al., An Introduction to Statistical Learning. http://faculty.marshall.usc.edu/gareth-james/ISL/
Translated from: https://medium.com/analytics-vidhya/how-multicollinearity-is-a-problem-in-linear-regression-dbb76e25cd80