

How Multicollinearity Is a Problem in Linear Regression

Published: 2023/12/15

Linear Regression is one of the simplest and most widely used algorithms for supervised machine learning problems where the output is a numerical quantitative variable and the input is one or more independent variables.

The math behind it is easy to understand, and that's what makes Linear Regression one of my favorite algorithms to work with. But this simplicity comes at a price.

When we decide to fit a Linear Regression model, we have to make sure that some conditions are satisfied or else our model will perform poorly or will give us incorrect interpretations. So what are some of these conditions that have to be met?


1. Linearity: X and the mean of Y have a linear relationship.

2. Homoscedasticity: the variance of the error terms is the same for all values of X.

3. No collinearity: the independent variables are not highly correlated with each other.

4. Normality: Y is normally distributed for any value of X.

If the above four conditions are satisfied, we can expect our Linear Regression model to perform well.
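These conditions are usually checked empirically. As a minimal sketch (using made-up data, not the article's), the first two, linearity and homoscedasticity, can be eyeballed from the residuals of a fitted model:

```python
import numpy as np

# Made-up data that satisfies the conditions (illustrative only)
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = 2 + 3 * x + rng.normal(0, 1, 200)

# Fit y = b0 + b1*x by ordinary least squares
X = np.column_stack([np.ones_like(x), x])
b0, b1 = np.linalg.lstsq(X, y, rcond=None)[0]
residuals = y - (b0 + b1 * x)

# Linearity: residuals should be centered on zero across the range of x
print(round(residuals.mean(), 4))
# Homoscedasticity: residual spread should be similar in both halves of x
print(round(residuals[x < 5].std(), 2), round(residuals[x >= 5].std(), 2))
```

In practice these checks are usually done visually with a residuals-vs-fitted plot, but the numbers above convey the same idea.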

So how do we ensure the above conditions are met? Covering all of them in depth would make for a very long article, so here I will focus on the third condition, no collinearity: I will explain what multicollinearity is, why it is a problem in the first place, and what can be done to overcome it.

In a supervised machine learning regression problem, we have a set of independent variables and an output variable, which are used to train our model and to make predictions and interpretations.

In a multivariate linear regression problem, we make predictions based on the trained model and use the coefficients to interpret it, for example:

Y = B0 + B1·X1 + B2·X2 (multivariate linear regression)

The above equation states that a one-unit increase in X1 (with X2 held fixed) results in a B1 increase in the value of Y, and a one-unit increase in X2 (with X1 held fixed) results in a B2 increase in the value of Y.
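To make this interpretation concrete, here is a tiny sketch with made-up coefficient values (B0 = 2, B1 = 3, B2 = 5 are illustrative, not from the article):

```python
# Hypothetical coefficients for Y = B0 + B1*X1 + B2*X2
b0, b1, b2 = 2.0, 3.0, 5.0

def predict(x1, x2):
    return b0 + b1 * x1 + b2 * x2

# A one-unit increase in X1 (with X2 held fixed) raises Y by exactly b1
print(predict(4, 7) - predict(3, 7))  # 3.0
```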

The coefficients are essential for understanding which variable has the greatest influence on the model.

So how is multicollinearity a problem? When independent variables are highly correlated with each other, our coefficients are not reliable, and we cannot make accurate interpretations based on their values.

To explain this point further, I created two dummy input variables in Python and one dependent output variable.

    import numpy as np

    x3 = np.random.randint(0, 100, 100)
    x4 = 3*x3 + np.random.randint(0, 100, 100)
    y1 = 4*x3 + np.random.randint(0, 100, 100)

Creating a scatterplot for each variable gives us:

    import matplotlib.pyplot as plt
    import seaborn as sns

    plt.figure(figsize=(12, 5))
    plt.subplot(1, 2, 1)
    plt.xlabel('x3')
    sns.scatterplot(x=x3, y=y1)
    plt.subplot(1, 2, 2)
    plt.xlabel('x4')
    sns.scatterplot(x=x4, y=y1)
    plt.show()

[Figure: scatterplots of the input variables against y1]

The scatterplots show that both x3 and x4 have a linear relationship with y1. Let's look at the correlation matrix for the variables and see what else we can interpret. I put my variables into a DataFrame named S2 and created a correlation matrix.

    import pandas as pd

    S2 = pd.DataFrame({'X3': x3, 'X4': x4, 'y1': y1})
    S2.corr()

[Figure: correlation matrix]

Judging by the correlation matrix, X3 and X4 not only have a high positive correlation with y1 but are also highly correlated with each other. Let's see how this affects our results.

Before I fit a linear regression model to my variables, we have to understand the concept of p-values and the null hypothesis.

The p-value is used to either reject or fail to reject the null hypothesis.

The null hypothesis in our case is that "the variable does not have a significant relation with y".

If the p-value is less than the conventional threshold of 0.05, we reject the null hypothesis; otherwise, we fail to reject it. So let's move forward.

I import the statsmodels library and use it to fit an ordinary least squares (OLS) model to my variables.

The independent variables are X3 and X4, and the dependent variable is y1.

    import statsmodels.api as sm

    X = S2[['X3', 'X4']]
    y = S2['y1']
    X = sm.add_constant(X)
    est = sm.OLS(y, X)
    est2 = est.fit()
    print(est2.summary())

The results we get are:

[Figure: summary of the OLS regression results]

We get a very high R-squared score, which shows that our model explains the variance in the data quite well. The coefficients, on the other hand, tell an entirely different story.

The p-value for our X4 variable shows that we cannot reject the null hypothesis, meaning X4 does not have a significant relation with y1.

Furthermore, the coefficient is negative, which should not be possible, since the scatterplots showed that y1 has a positive relationship with the independent variable.

To sum it up, our coefficients are not reliable, and our p-values cannot be trusted.
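One way to see this unreliability directly is to refit the model on resampled data and watch the coefficient on x3 jump around. A sketch (the data-generating code mirrors the dummy variables above, but with its own random draw):

```python
import numpy as np

rng = np.random.default_rng(1)
x3 = rng.integers(0, 100, 100).astype(float)
x4 = 3 * x3 + rng.integers(0, 100, 100)
y1 = 4 * x3 + rng.integers(0, 100, 100)

def boot_coef_std(columns, n_boot=200):
    """Std of the x3 slope coefficient across bootstrap refits."""
    coefs = []
    for _ in range(n_boot):
        idx = rng.integers(0, 100, 100)  # bootstrap resample
        X = np.column_stack([np.ones(100)] + [c[idx] for c in columns])
        beta = np.linalg.lstsq(X, y1[idx], rcond=None)[0]
        coefs.append(beta[1])  # coefficient on x3
    return float(np.std(coefs))

std_alone = boot_coef_std([x3])      # x3 as the only predictor
std_both = boot_coef_std([x3, x4])   # x3 alongside its collinear twin x4
print(std_alone, std_both)  # the x3 coefficient is far noisier with x4 included
```

The spread of the x3 coefficient is several times larger when its collinear twin is in the model, which is exactly why its sign and p-value cannot be trusted.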

Regression with one variable only

In the previous multivariate example, our results showed that X4 did not have a significant relation with y1. So let us analyze y1 against X4 alone and see what we get.

    import statsmodels.api as sm

    X = sm.add_constant(S2['X4'])
    y1 = S2['y1']
    est = sm.OLS(y1, X)
    est2 = est.fit()
    print(est2.summary())

After fitting our OLS model, we get:

The coefficient is now positive, and we can reject the null hypothesis that X4 is not related to y1. One more thing we can take from this model is that our R-squared value has dropped significantly, from 0.942 to 0.826. So what does that tell us? If our goal is prediction, we may need to think twice before removing variables; but if our goal is the interpretation of each coefficient, then collinearity can be troublesome, and we have to consider which variables to keep and which to remove.


Translated from: https://medium.com/analytics-vidhya/how-multicollinearity-is-a-problem-in-linear-regression-dbb76e25cd80
