The Chi-Squared Test Statistic Is a Must for Every Data Scientist: A Case Study in Customer Churn

Published: 2023/12/15

Introduction

The chi-square statistic is a useful tool for understanding the relationship between two categorical variables.

For the sake of example, let’s say you work for a tech company that has rolled out a new product and you want to assess the relationship between this product and customer churn. In the age of data, many companies, tech or otherwise, run the risk of treating evidence that is either anecdotal or a high-level visualization as proof of a given relationship. The chi-square statistic gives us a way to quantify and assess the strength of the relationship between a given pair of categorical variables.

Customer Churn

Let’s explore chi-square through the lens of customer churn.

You can download the customer churn dataset that we’ll be working with from Kaggle. This dataset provides details for a variety of telecom customers and whether or not they “churned”, that is, closed their account.

Regardless of what company, teams, products, or industries you work with, the following example should be very generalizable.

Now that we have our dataset, let’s quickly use dplyr's select command to pull out the fields we'll be working with. For simplicity’s sake, I'll also reduce StreamingTV to two levels. You can certainly run a chi-square test on categorical variables with more than two levels, but as we venture to understand it from the ground up, we'll keep it simple.

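library(dplyr)
# Keep only the fields we need; encode StreamingTV as 1 for 'Yes' and 0 for anything else
churn <- churn %>%
  select(customerID, StreamingTV, Churn) %>%
  mutate(StreamingTV = ifelse(StreamingTV == 'Yes', 1, 0))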

Churn is going to be classified as a Yes or a No. As you just saw, StreamingTV will be encoded with either a 1 or 0.

Exploratory Data Analysis

I won’t go into great depth on exploratory data analysis here, but I will give you two quick tools for assessing the relationship between two categorical variables.

Proportion Tables

Proportion tables are a great way to establish a basic understanding of the relationship between two categorical variables.

# Counts for each level
table(churn$StreamingTV)
table(churn$Churn)
# Proportions for each level, rounded to two decimals
round(prop.table(table(churn$StreamingTV)), 2)
round(prop.table(table(churn$Churn)), 2)

table() gives us a quick idea of the counts in any given level; wrapping that in prop.table() allows us to see the percentage breakdown.

Let’s now pass both variables to our table() function:

table(churn$StreamingTV, churn$Churn)
round(prop.table(table(churn$StreamingTV, churn$Churn),1),2)

Once you pass another variable into the proportion table, you’re able to choose the direction in which to assess relative proportion. In this case, the second argument we pass to the prop.table() function, 1, specifies that we'd like to see the relative proportion of records across each row, that is, within each value of StreamingTV. As you can see in the above table, customers who did not have streaming TV remained active 76% of the time; conversely, customers who did have streaming TV actually stuck around less often, at 70%.

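To make the margin argument concrete, here is a minimal sketch on a made-up ten-row data frame. The toy data frame and its values are assumptions for illustration only, not the Kaggle dataset. It shows how prop.table() behaves with margin 1, margin 2, and no margin at all.

```r
# Toy data (made up for illustration; not the Kaggle dataset)
toy <- data.frame(
  StreamingTV = c(0, 0, 0, 0, 1, 1, 1, 1, 1, 1),
  Churn       = c("No", "No", "No", "Yes", "No", "No", "No", "Yes", "Yes", "Yes")
)

counts <- table(toy$StreamingTV, toy$Churn)

# margin = 1: each row sums to 1 (Churn split within each StreamingTV value)
by_row <- prop.table(counts, 1)

# margin = 2: each column sums to 1 (StreamingTV split within each Churn value)
by_col <- prop.table(counts, 2)

# no margin: every cell is a share of the grand total
overall <- prop.table(counts)

print(round(by_row, 2))
print(round(by_col, 2))
print(round(overall, 2))
```

Margin 1 is the right choice when the question is "given the StreamingTV value, how likely is churn?", which is the question this article is asking.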

Now, before we get ahead of ourselves and declare that streaming TV is most certainly causing more people to churn, we need to assess whether or not we really have grounds to make such a claim. Yes, the proportion of retained customers is lower, but the difference could be random noise. More on this shortly.

Time to Visualize

This will give us similar information to what we just saw, but visualization tends to lend itself better to quickly understanding relative value.

Let’s start off with a quick bar plot with StreamingTV across the x-axis, and the fill as Churn.

churn %>%
ggplot(aes(x = StreamingTV, fill = Churn))+
geom_bar()

As you can see, nearly as many TV streamers churned as non-streamers, despite a substantially lower total customer count. Similar to what we saw with proportion tables, a 100% stacked bar helps assess the relative distribution among values of a categorical variable. All we have to do is pass position = 'fill' to geom_bar().

churn %>%
ggplot(aes(x = StreamingTV, fill = Churn))+
geom_bar(position = 'fill')

Diving into the Chi-square Statistic

Now there appears to be some sort of relationship between the two variables, yet we don’t have an assessment of its statistical significance. In other words, is there something real behind the relationship between TV streaming and churn, i.e. did streamers hate the service so much that they churned at a higher rate? Did their overall bill look way too high as a product of the streaming plan, such that they churned altogether?

All great questions, and we won’t have the answers to them just yet, but what we are doing is taking the first steps toward assessing whether this larger investigative journey is worthwhile.

Chi-square Explanation

Before we dive into the depths of creating a chi-square statistic, it’s very important that you understand the purpose conceptually.

We can see two categorical variables that appear to be related; however, we don’t definitively know if the disparate proportions are a product of randomness or of some other underlying effect. This is where chi-square comes in. The chi-square test statistic is effectively a comparison of our observed distribution to the distribution we would expect if the two variables were indeed perfectly independent.

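For intuition about "the distribution we would expect", here is a small sketch of the textbook way to get expected counts under independence, using a made-up 2x2 table (the counts are assumptions for illustration). Note that this article goes on to use a permuted sample as its stand-in for independence rather than these margin-based expected counts; the two approaches express the same idea.

```r
# Made-up 2x2 table of counts (rows: StreamingTV, columns: Churn)
obs <- matrix(c(30, 10,
                20, 20),
              nrow = 2, byrow = TRUE,
              dimnames = list(StreamingTV = c("0", "1"),
                              Churn = c("No", "Yes")))

n <- sum(obs)

# Under independence: expected cell count = row total * column total / grand total
expected <- outer(rowSums(obs), colSums(obs)) / n

print(expected)

# Within the expected table, every row has the same Churn split
print(prop.table(expected, 1))
```

The last line is the point: if the variables were independent, the churn proportions would be identical across StreamingTV values, and any gap in the observed table is what the statistic has to account for.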

So first things first, we need a dataset to represent said independence.

Generating Our Sample Dataset

We will be making use of the infer package. This package is incredibly useful for creating sample data for hypothesis testing, creating confidence intervals, etc.

I won’t break down all of the details on how to use infer, but at a high level, you're creating a new dataset. In this case, we want to create a dataset that looks a lot like what we just saw with the churn dataset, only this time, we want to ensure independent distribution, i.e. in cases when customers are tv streamers, we shouldn't see a greater occurrence of churn.

An easy way to think about infer is as a sequence of three steps: specify, hypothesize, and generate. We specify the relationship we’re modeling, we declare the intended null distribution (independence), and finally we specify the number of replicates we want to generate. A replicate in this case will mirror the row count of our original dataset. There are instances in which you would create many replicates of the same dataset and make calculations on top of that, but not for this part of the process.

churn_perm <- churn %>%
specify(Churn ~ StreamingTV) %>%
hypothesize(null = "independence") %>%
generate(reps = 1, type = "permute")

Let’s quickly take a look at this dataset.

head(churn_perm)

As you can see we have the two variables we specified, as well as replicate. All records in this table will be replicate: 1, as we only made a single replicate.

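Conceptually, a single permuted replicate is just the original data with one column shuffled. The base-R sketch below illustrates that idea on made-up data. It is not infer's actual implementation, and the toy data frame is an assumption for illustration.

```r
set.seed(42)  # arbitrary seed so the shuffle is reproducible

# Toy data (made up for illustration)
toy <- data.frame(
  StreamingTV = rep(c(0, 1), times = c(60, 40)),
  Churn       = c(rep(c("No", "Yes"), times = c(50, 10)),
                  rep(c("No", "Yes"), times = c(20, 20)))
)

# One permuted replicate: shuffle Churn, leave StreamingTV untouched
toy_perm <- toy
toy_perm$Churn <- sample(toy$Churn)

# The overall Churn counts are unchanged by shuffling...
print(table(toy$Churn))
print(table(toy_perm$Churn))

# ...but any link between the two columns is now due to chance alone
print(table(toy_perm$StreamingTV, toy_perm$Churn))
```

Shuffling breaks whatever association existed between the columns while keeping the row count and the overall churn rate the same, which is why a replicate mirrors the size of the original dataset.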

Sample Summaries

Let’s quickly plot our independent dataset to see the relative proportions now.

churn_perm %>%
ggplot(aes(x = StreamingTV, fill = Churn))+
geom_bar(position = 'fill')

As desired you can see that the relative proportions line up almost exactly. There is some randomness at play so we may not see that these two line up perfectly… but that’s really the point. We’re not doing this quite yet, but remember when I mentioned the idea of creating many replicates?

What might the purpose of that be?

If we create this sample dataset tons of times, do we ever see a gap as wide as the 70% versus 76% retention we saw in our observed dataset? If so, how often do we see it? Is it so often that we don’t have grounds to chalk up the difference to anything more than random noise?

Alright enough of that rant… On to making an assessment of how much our observed data varies from our sample data.

Let’s Get Calculating

Now that we really understand our purpose, let’s go ahead and calculate our statistic. Simply put, our intent is to calculate the distance between each cell of our table of observed counts and the corresponding cell of our table of sample counts.

The formula for said “distance” looks like this:

sum(((obs - sample)^2)/sample)

We subtract the sample count from the observed count in each cell, but square the differences so that they don’t cancel each other out. We divide each squared difference by the sample count to prevent any single cell from having too great a presence due to its size, and finally we take the sum.

The chi-square statistic that we get is: 20.1

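Here is a minimal sketch of that calculation on made-up counts. The helper name chi_sq_distance and both toy tables are hypothetical, introduced only for illustration; the 20.1 above comes from the real churn data, not from these numbers.

```r
# Hypothetical helper (not from the article): the "distance" formula as a function
chi_sq_distance <- function(obs, sample) {
  sum(((obs - sample)^2) / sample)
}

# Made-up 2x2 count tables (rows: StreamingTV, columns: Churn)
obs_toy    <- matrix(c(30, 10, 20, 20), nrow = 2, byrow = TRUE)
sample_toy <- matrix(c(25, 15, 25, 15), nrow = 2, byrow = TRUE)

# Cell by cell: (30-25)^2/25 + (10-15)^2/15 + (20-25)^2/25 + (20-15)^2/15
toy_stat <- chi_sq_distance(obs_toy, sample_toy)
print(toy_stat)
```

With the article's data, obs and sample would presumably be the count tables built from churn and churn_perm, for example table(churn$StreamingTV, churn$Churn), with the result stored as obs_chi_sq for use in the later plots. That wiring is an assumption, since the article does not show this step.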

So, great. We understand the purpose of the chi-square statistic, we even have it… but what we still don’t know is… is a chi-square stat of 20.1 meaningful?

Hypothesis Testing

Earlier in the post, we spoke about how we can use the infer package to create many, many replicates. A hypothesis test is precisely the time for that type of sampling.

Let’s use infer again, only this time we'll generate 500 replicates and calculate a chi-square statistic for each replicate.

churn_null <- churn %>%
specify(Churn ~ StreamingTV) %>%
hypothesize(null = "independence") %>%
generate(reps = 500, type = "permute") %>%
calculate(stat = "Chisq")
churn_null

Based on the above output, you can see that each replicate has its own stat.

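If it helps to see what is going on under the hood, here is a rough base-R equivalent of "permute 500 times and compute a chi-square statistic each time", run on made-up data. This is a sketch of the idea, not infer's internals; the toy vectors and the chi_sq_stat helper are assumptions for illustration.

```r
set.seed(123)  # arbitrary seed for reproducibility

# Toy data (made up for illustration)
streaming <- rep(c(0, 1), times = c(60, 40))
churned   <- c(rep(c("No", "Yes"), times = c(50, 10)),
               rep(c("No", "Yes"), times = c(20, 20)))

# Pearson chi-square statistic without continuity correction
chi_sq_stat <- function(x, y) {
  unname(chisq.test(table(x, y), correct = FALSE)$statistic)
}

# Shuffle the response 500 times and recompute the statistic each time
null_stats <- replicate(500, chi_sq_stat(streaming, sample(churned)))

print(summary(null_stats))
```

Each element of null_stats plays the role of one row of churn_null: a chi-square statistic computed on data where any association is due to chance.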

Let’s use a density plot to see what our distribution of chi-square statistics looks like.

churn_null %>%
ggplot(aes(x = stat)) +
# Add density layer
geom_density()

At first glance we can see that the distribution of chi-square statistics is heavily right-skewed. We can also see that our statistic of 20.1 is not even on the plot.

Let’s add a vertical line to show how our observed chi-square compares to the permuted distribution.

# obs_chi_sq holds the observed chi-square statistic calculated earlier (20.1)
churn_null %>%
  ggplot(aes(x = stat)) +
  geom_density() +
  geom_vline(xintercept = obs_chi_sq, color = "red")

When it comes to having sufficient evidence to reject the null hypothesis, this is promising. The null hypothesis here is that there is no relationship between the two variables.

Calculating P-value

As the final portion of this lesson on how to use chi-square statistics, let’s talk about how we should go about calculating the p-value.

Earlier I mentioned the idea that we might want to know if our simulated chi-square stat was ever as large as our observed chi-square stat, and if so how often it might have occurred.

That is the essence of the p-value.

When we take the chi-square stat of two variables that we know are independent of one another (the simulated case), what percentage of these replicates’ chi-square stats are greater than or equal to our observed chi-square stat?

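# Share of permuted statistics at least as large as the observed one.
# The chi-square test is right-tailed, so the proportion is not doubled.
churn_null %>%
  summarise(p_value = mean(stat >= obs_chi_sq))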

In the case of our sample, we’re getting a p-value of 0. That is to say, over the course of 500 replicates, not one reached a chi-square stat of 20.1. Strictly speaking, this tells us the p-value is smaller than 1/500 = 0.002 rather than exactly zero.

As such, we would reject the null hypothesis that churn and streaming tv are independent.

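As a small worked example of the p-value logic, the sketch below uses a made-up vector of ten simulated statistics and a made-up observed value. It then cross-checks the article's 20.1 against the theoretical chi-square distribution, assuming it can be treated as a Pearson statistic on a 2x2 table (1 degree of freedom). That assumption is mine, not the article's.

```r
# Hypothetical numbers for illustration: simulated null statistics and an observed value
null_stats <- c(0.1, 0.4, 0.9, 1.3, 2.2, 3.1, 4.0, 5.6, 0.2, 0.7)
obs_stat   <- 3.5

# Permutation p-value: share of null statistics at least as large as the observed one
p_perm <- mean(null_stats >= obs_stat)
print(p_perm)

# Theoretical cross-check for a statistic of 20.1, assuming 1 degree of freedom
p_theory <- pchisq(20.1, df = 1, lower.tail = FALSE)
print(p_theory)
```

In the toy vector, two of the ten simulated values (4.0 and 5.6) reach the observed 3.5, so the permutation p-value is 0.2. The theoretical tail probability for 20.1 is tiny, which is consistent with never seeing such a value in 500 permutations.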

Conclusion

We have done a lot in a short amount of time. It’s easy to get lost when dissecting statistics concepts like the chi-square statistic. My hope is that a strong foundational understanding of the need for this statistic, and of the corresponding calculation, builds the right instinct for recognizing the right opportunity to put this tool to work.

In just a few minutes, we have covered:

A bit of EDA for pairs of categorical variables
Proportion tables
Bar Charts
100% Stacked Bar
Chi-square explanation & purpose
How to calculate a chi-square statistic
Hypothesis testing with infer
Calculating p-value

If this was helpful, feel free to check out my other posts at datasciencelessons.com. Happy Data Science-ing!

Translated from: https://towardsdatascience.com/the-chi-squared-test-statistic-is-a-must-for-every-data-scientist-a-case-study-in-customer-churn-bcdb17bbafb7