What Are the Commonly Used Statistical Tests in Data Science?
Business analytics and data science is a convergence of many fields of expertise. Professionals from multiple domains and educational backgrounds are joining the analytics industry in pursuit of becoming data scientists.
I have met two kinds of data scientists in my career. One kind pays attention to the details of algorithms and models: they always try to understand the mathematics and statistics behind the scenes, and want full control over the solution and the theory behind it. The other kind is more interested in the end result without looking at the theoretical details: they are fascinated by the implementation of new and advanced models, and are inclined to solve the problem at hand rather than dwell on the theory behind the solution.
Believers in both of these approaches have their own logic to support their stand. I respect their choices.
In this post, I shall share some statistical tests that are commonly used in data science. It is good to know some of these irrespective of the approach you believe in.
In statistics, there are two ways of drawing an inference from any exploration. One is estimation of parameters, where unknown values of population parameters are computed through various methods. The other is testing of hypotheses, which helps us test parameter values guessed from some prior knowledge.
I shall list some statistical test procedures that you will frequently encounter in data science.
“The only relevant test of the validity of a hypothesis is comparison of its predictions with experience.” — Milton Friedman
“關(guān)于假設(shè)有效性的唯一相關(guān)檢驗(yàn)是將其預(yù)測(cè)與經(jīng)驗(yàn)進(jìn)行比較。” —米爾頓·弗里德曼
作為數(shù)據(jù)科學(xué)家,我真的需要了解假設(shè)檢驗(yàn)嗎? (As a data scientist, do I really need to know hypothesis testing?)
In most decision-making procedures in data science, we are knowingly or unknowingly using hypothesis testing. Here is some evidence in support of this statement.
As data scientists, the kind of data analysis we do can be segregated into four broad areas:
1. Exploratory Data Analysis (EDA)
2. Regression and Classification
3. Forecasting
4. Data Grouping
Each of these areas includes some amount of statistical testing.
Exploratory Data Analysis (EDA)
It is an unavoidable part of data science on which every data scientist spends a significant amount of time. It establishes the foundation for creating machine learning and statistical models. Some common tasks that involve statistical testing in EDA are:
1. Test for normality
2. Test for outliers
3. Test for correlation
4. Test of homogeneity
5. Test for equality of distribution
Each of these tasks involves testing of hypotheses at some point.
1. How to test for normality?
Normality is everywhere in statistics. Most theories we use in statistics are based on the normality assumption. Normality means the data should follow a particular kind of probability distribution, the normal distribution, which has a particular shape and is represented by a particular function.
In Analysis of Variance (ANOVA), we assume normality of the data. While doing regression, we expect the residuals to follow a normal distribution.
To check the normality of data, we can use the Shapiro-Wilk test. The null hypothesis for this test is that the distribution of the data sample is normal.
Python implementation:
from scipy import stats

data = stats.norm.rvs(loc=2.5, scale=2, size=100)

# Shapiro-Wilk test: returns the test statistic and the p-value
shapiro_test = stats.shapiro(data)
print(shapiro_test)
2. How to test whether a data point is an outlier?
2.如何測(cè)試數(shù)據(jù)點(diǎn)是否為離群值?
When I start any new data science use case where I have to fit a model, one of the routine tasks I do is detecting outliers in the response variable. Outliers greatly affect regression models, and a careful elimination or substitution strategy is required for them.
An outlier is a global outlier if its value deviates significantly from the rest of the data. It is called a contextual outlier if it deviates only from the data points originating from a particular context. Also, a set of data points can be collectively outliers when they deviate considerably from the rest.
The Tietjen-Moore test is useful for detecting multiple outliers in a data set. The null hypothesis for this test is that there are no outliers in the data.
Python implementation:
import numpy as np
import scikit_posthocs

x = np.array([-1.40, -0.44, -0.30, -0.24, -0.22, -0.13, -0.05, 0.06,
              0.10, 0.18, 0.20, 0.39, 0.48, 0.63, 1.01])

# Tietjen-Moore test for exactly 2 suspected outliers
print(scikit_posthocs.outliers_tietjen(x, 2))
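The Tietjen-Moore test requires the number of suspected outliers to be fixed in advance (the 2 in the call above). When that count is unknown, the generalized ESD test, which scikit_posthocs also exposes as outliers_gesd, needs only an upper bound; a short sketch under that assumption:

import numpy as np
import scikit_posthocs

x = np.array([-1.40, -0.44, -0.30, -0.24, -0.22, -0.13, -0.05, 0.06,
              0.10, 0.18, 0.20, 0.39, 0.48, 0.63, 1.01])

# Screen for up to 3 potential outliers; with default arguments the
# function returns the sample with the detected outliers removed
print(scikit_posthocs.outliers_gesd(x, outliers=3))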
3. How to test the significance of the correlation coefficient between two variables?
In data science, we deal with a number of independent variables that explain the behavior of the dependent variable. Significant correlation between the independent variables may affect their estimated coefficients, and it makes the standard errors of the regression coefficients unreliable, which hurts the interpretability of the regression.
When we calculate the correlation between two variables, we should check the significance of that correlation. It can be checked with a t-test. The null hypothesis of this test assumes that the correlation between the variables is not significant.
Python implementation:
from scipy import stats
from scipy.stats import pearsonr

data1 = stats.norm.rvs(loc=3, scale=1.5, size=20)
data2 = stats.norm.rvs(loc=-5, scale=0.5, size=20)

# Returns the correlation coefficient and the two-sided p-value
stat, p = pearsonr(data1, data2)
print(stat, p)
4. How to test the homogeneity of a categorical variable in two data sets?
It is convenient to explain the test of homogeneity with an example. Suppose we want to check whether the viewing preferences of Netflix subscribers are the same for males and females. You can use the chi-square test for homogeneity here: you have to check whether the frequency distributions of the two groups are significantly different from each other.
The null hypothesis for the test is that the two data sets are homogeneous.
Python implementation:
import numpy as np
from scipy.stats import chi2_contingency

# Illustrative contingency table: viewing-preference counts for three
# genres, one row per group (e.g., males and females)
observed = np.array([[45, 30, 25],
                     [35, 40, 25]])

stat, p, dof, expected = chi2_contingency(observed)
print(stat, p)
5. How to check if a given data set follows a particular distribution?
Sometimes in data analysis we need to check whether the data follows a particular distribution. We may even want to check whether two samples follow the same distribution. In such cases we use the Kolmogorov-Smirnov (KS) test. We often use the KS test to check the goodness of fit of a regression model.
This test compares the empirical cumulative distribution function (ECDF) with the theoretical distribution function. The null hypothesis for this test assumes that the given data follows the specified distribution.
Python implementation:
import numpy as np
from scipy import stats

x = np.linspace(-25, 17, 6)

# One-sample KS test against the standard normal distribution
print(stats.kstest(x, 'norm'))
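For the two-sample case mentioned above, scipy also provides ks_2samp, which compares the ECDFs of the two samples directly; a brief sketch with simulated data:

from scipy import stats

sample1 = stats.norm.rvs(loc=0, scale=1, size=100, random_state=1)
sample2 = stats.norm.rvs(loc=0.5, scale=1, size=100, random_state=2)

# Null hypothesis: the two samples come from the same distribution
print(stats.ks_2samp(sample1, sample2))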
Regression and Classification
Most of the modeling we do in data science falls under either regression or classification. Whenever we predict some value or some class, we take the help of these two methods.
Both regression and classification involve statistical tests at different stages of decision making. Also, the data needs to satisfy some prerequisite conditions to be eligible for these tasks, and some tests have to be performed to check these conditions.
Some common statistical tests associated with regression and classification are:
1. Test for heteroscedasticity
2. Test for multicollinearity
3. Test of the significance of regression coefficients
4. ANOVA for the regression or classification model
1. How to test for heteroscedasticity?
Heteroscedasticity is quite a heavy term; it simply means unequal variance. Let me explain with an example. Suppose you are collecting income data from different cities. You will see that the variation of income differs significantly across cities.
If the data is heteroscedastic, it largely affects the estimation of the regression coefficients. It makes the regression coefficients less precise, and the estimates will be far from the actual values.
To test for heteroscedasticity in the data, White's test can be used. White's test considers the null hypothesis that the variance is constant over the data.
Python implementation:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from patsy import dmatrices
from statsmodels.stats.diagnostic import het_white
from statsmodels.compat import lzip

# Illustrative data: the noise grows with x_var, so the variance is not constant
x = np.random.uniform(1, 10, 200)
df = pd.DataFrame({'x_var': x, 'y_var': 2 * x + np.random.normal(0, x)})

y, X = dmatrices('y_var ~ x_var', df, return_type='dataframe')
olsr_results = sm.OLS(y, X).fit()

keys = ['LM stat', 'LM test p-value', 'F-stat', 'F-test p-value']
results = het_white(olsr_results.resid, X)
print(lzip(keys, results))
2. How to test for multicollinearity in the variables?
2.如何測(cè)試變量的多重共線性?
Data science problems often include multiple explanatory variables. Sometimes these variables become correlated due to their origin and nature. Also, we sometimes create more than one variable from the same underlying fact. In these cases the variables become highly correlated; this is called multicollinearity.
The presence of multicollinearity increases the standard errors of the coefficients of the regression or classification model. It makes some important variables insignificant in the model.
The Farrar–Glauber test can be used to check for the presence of multicollinearity in the data. Since it has no ready-made implementation in the mainstream Python statistics libraries, a related diagnostic is sketched below.
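A minimal sketch of such an alternative, the variance inflation factor (VIF) from statsmodels (the Farrar–Glauber procedure itself is available in R, for example in the mctest package), on made-up data where one variable nearly duplicates another:

import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Illustrative data: x2 is almost a linear copy of x1
rng = np.random.default_rng(1)
x1 = rng.normal(size=100)
df = pd.DataFrame({'x1': x1,
                   'x2': x1 + rng.normal(scale=0.05, size=100),
                   'x3': rng.normal(size=100)})

# A VIF above roughly 10 is a conventional warning sign of multicollinearity
# (the value reported for 'const' can be ignored)
X = sm.add_constant(df)
for i, col in enumerate(X.columns):
    print(col, variance_inflation_factor(X.values, i))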
3. How to test if the model coefficients are significant?
3.如何測(cè)試模型系數(shù)是否顯著?
In classification or regression models, we need to identify the important variables that have a strong influence on the target variable. The models perform some tests and provide us with the extent of significance of the variables.
A t-test is used in the models to check the significance of the variables. The null hypothesis of the test is that the coefficient is zero. You need to check the p-values of the tests to understand the significance of the coefficients.
Python implementation:
from scipy import stats

rvs1 = stats.norm.rvs(loc=5, scale=10, size=500)

# One-sample t-test: is the mean of rvs1 equal to 7?
print(stats.ttest_1samp(rvs1, 7))
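The one-sample t-test above only shows the mechanics; in an actual regression, the library computes a t statistic and a p-value for every coefficient. A minimal sketch with statsmodels OLS on made-up data:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 3 * x + rng.normal(size=100)

# The summary lists a t statistic and p-value for each coefficient
model = sm.OLS(y, sm.add_constant(x)).fit()
print(model.summary())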
4. How to test the statistical significance of a model?
While developing a regression or classification model, we perform Analysis of Variance (ANOVA). It checks the validity of the regression coefficients by comparing the variation due to the model with the variation due to error. If the variation due to the model is significantly different from the variation due to error, the effect of the variable is significant.
An F-test is used to make the decision. The null hypothesis in this test is that the regression coefficients are equal to zero.
Python implementation:
import scipy.stats as stats

data1 = stats.norm.rvs(loc=3, scale=1.5, size=20)
data2 = stats.norm.rvs(loc=-5, scale=0.5, size=20)

# One-way ANOVA F-test across the two groups
print(stats.f_oneway(data1, data2))
預(yù)測(cè) (Forecasting)
In data science we deal with two kinds of data: cross-sectional and time series. The profiles of a set of customers on an e-commerce website are cross-sectional data, whereas the daily sales of an item on that website over a year are time series data.
We often use forecasting models on time series data to estimate future sales or profits. But before forecasting, we go through some diagnostic checks of the data to understand its pattern and its fitness for forecasting.
As a data scientist, I frequently use these tests on time series data:
1. Test for trend
2. Test for stationarity
3. Test for autocorrelation
4. Test for causality
5. Test for temporal relationship
1. How to test for a trend in time series data?
Data generated over time by a business often shows an upward or downward trend. Be it sales, profit, or any other performance metric that depicts business performance, we always prefer to estimate the future movement.
To forecast such movements, you need to estimate or eliminate the trend component. To understand whether the trend is significant, you can use a statistical test.
The Mann-Kendall test can be used to test for the existence of a trend. The null hypothesis assumes that there is no significant trend.
Python implementation:
# pip install pymannkendall
import numpy as np
import pymannkendall as mk

data = np.random.rand(250, 1)
test_result = mk.original_test(data)
print(test_result)
2. How to test whether a time series is stationary?
Non-stationarity is an inherent characteristic of most time series data. We always need to test for stationarity before any time series modeling. If the data is non-stationary, it may produce unreliable and spurious results after modeling, leading to a poor understanding of the data.
The Augmented Dickey-Fuller (ADF) test can be used to check for non-stationarity. The null hypothesis for ADF is that the series is non-stationary. At the 5% level of significance, if the p-value is less than 0.05, we reject the null hypothesis.
Python implementation:
from statsmodels.tsa.stattools import adfuller

X = [15, 20, 21, 20, 21, 30, 33, 45, 56]

# Returns the ADF statistic, the p-value, and critical values, among others
result = adfuller(X)
print(result)
3. How to check autocorrelation among the values of a time series?
3.如何檢查時(shí)間序列值之間的自相關(guān)?
For time series data, a causal relationship between past and present values is a common phenomenon. In financial time series, we often see that the current price is influenced by the prices of the last few days. This feature of time series data is measured by autocorrelation.
To know whether the autocorrelation is strong enough, you can test for it. The Durbin-Watson test reveals its extent. The null hypothesis for this test assumes that there is no autocorrelation between the values.
Python implementation:
from statsmodels.stats.stattools import durbin_watson

X = [15, 20, 21, 20, 21, 30, 33, 45, 56]

# The statistic lies between 0 and 4; values near 2 suggest no autocorrelation
result = durbin_watson(X)
print(result)
4. How can you test whether one variable has a causal effect on another?
Two time series variables can share a causal relationship. If you are familiar with financial derivatives, financial instruments defined on underlying stocks, you would know that spot and futures values have causal relationships: they influence each other according to the situation.
The causality between two variables can be tested by the Granger causality test. This test uses a regression setup: the current value of one variable is regressed on lagged values of the other variable along with lagged values of itself. The null hypothesis of no causality is assessed with an F-test.
Python implementation:
import statsmodels.api as sm
from statsmodels.tsa.stattools import grangercausalitytests

data = sm.datasets.macrodata.load_pandas()
data = data.data[["realgdp", "realcons"]].pct_change().dropna()

# Run the test for every lag from 1 to 4
gc_res = grangercausalitytests(data, 4)
5. How can you check the temporal relationship between two variables?
Two time series sometimes move together over time. In financial time series, you will often observe that the spot and futures prices of derivatives move together.
This co-movement can be checked through a characteristic called cointegration, which can be tested by Johansen's test. The null hypothesis of this test assumes no cointegration between the variables.
Python implementation:
import statsmodels.api as sm
from statsmodels.tsa.vector_ar.vecm import coint_johansen

data = sm.datasets.macrodata.load_pandas()
data = data.data[["realgdp", "realcons"]].pct_change().dropna()

jres = coint_johansen(data, det_order=0, k_ar_diff=1)
print(jres.max_eig_stat)
print(jres.max_eig_stat_crit_vals)
Data Grouping
Many times in real-life scenarios, we try to find similarities among the data points. The intention is to group them into buckets and study them closely to understand how the different buckets behave.
The same applies to variables as well. We identify latent variables that are formed by the combination of a number of observable variables.
A retail store might be interested in forming segments among its customers, such as cost-conscious, brand-conscious, bulk-purchaser, etc. This requires grouping the customers based on characteristics like transactions, demographics, and psychographics.
In this area we often encounter the following tests:
1. Test of sphericity
2. Test for sampling adequacy
3. Test for clustering tendency
1. How to test for sphericity of the variables?
If the number of variables in the data is very high, regression models tend to perform badly. Besides, identifying the important variables becomes challenging. In this scenario, we try to reduce the number of variables.
Principal Component Analysis (PCA) is one method of reducing the number of variables and identifying major factors. These factors will help you build a regression model of reduced dimension and identify the key features of any object or incident of interest.
Now, variables can form factors only when they share some amount of correlation. This is tested by Bartlett's test of sphericity. The null hypothesis of this test is that the variables are uncorrelated.
Python implementation:
import pandas as pd
from factor_analyzer.factor_analyzer import calculate_bartlett_sphericity

a = [8.88, 9.12, 9.04, 8.98, 9.00, 9.08, 9.01, 8.85, 9.06, 8.99]
b = [8.88, 8.95, 9.29, 9.44, 9.15, 9.58, 8.36, 9.18, 8.67, 9.05]
c = [8.95, 9.12, 8.95, 8.85, 9.03, 8.84, 9.07, 8.98, 8.86, 8.98]
df = pd.DataFrame({'x': a, 'y': b, 'z': c})

# Bartlett's test of sphericity on the correlation matrix of the variables
stat, p = calculate_bartlett_sphericity(df)
print(stat, p)
2. How to test for sampling adequacy of variables?
2.如何檢驗(yàn)變量的抽樣充分性?
The PCA method will produce a reliable result when the sample size is large enough. This is called sampling adequacy, and it is to be checked for each variable.
The Kaiser-Meyer-Olkin (KMO) test is used to check sampling adequacy for the overall data set. The statistic measures the proportion of variance among the variables that could be common variance.
Python implementation:
import pandas as pd
from factor_analyzer.factor_analyzer import calculate_kmo

a = [8.88, 9.12, 9.04, 8.98, 9.00, 9.08, 9.01, 8.85, 9.06, 8.99]
b = [8.88, 8.95, 9.29, 9.44, 9.15, 9.58, 8.36, 9.18, 8.67, 9.05]
c = [8.95, 9.12, 8.95, 8.85, 9.03, 8.84, 9.07, 8.98, 8.86, 8.98]
df = pd.DataFrame({'x': a, 'y': b, 'z': c})

# Returns per-variable KMO values and the overall KMO statistic
kmo_all, kmo_model = calculate_kmo(df)
print(kmo_all, kmo_model)
3. How to test for clustering tendency of a data set?
3.如何測(cè)試數(shù)據(jù)集的聚類(lèi)趨勢(shì)?
To group the data into different buckets, we use clustering techniques. But before going for clustering, you need to check whether there is a clustering tendency in the data. If the data has a uniform distribution, it is not suitable for clustering.
The Hopkins test can check for spatial randomness of the variables. The null hypothesis in this test is that the data is generated from a uniform distribution, i.e., it has no meaningful clusters.
Python implementation:
from sklearn import datasets
from sklearn.preprocessing import scale
from pyclustertend import hopkins

X = scale(datasets.load_iris().data)

# Evaluate the Hopkins statistic on all 150 observations
print(hopkins(X, 150))
In this article, I mentioned some frequently used tests in data science. There are many others that I could not cover. Let me know if you find some that I haven't mentioned here.
References:
https://www.statsmodels.org/dev/generated/statsmodels.tsa.stattools.grangercausalitytests.html
https://pypi.org/project/pyclustertend/
Translated from: https://towardsdatascience.com/what-are-the-commonly-used-statistical-tests-in-data-science-a95cfc2e6b5e