
Customer Segmentation: K-Means Clustering and A/B Testing


Context

I have been working in Advertising, specifically Digital Media and Performance, for nearly 3 years, and customer behaviour analysis is one of the core focuses of my day-to-day job. With the help of different analytics platforms (e.g. Google Analytics, Adobe Analytics), my life has been made easier than before, since these platforms come with a built-in segmentation function that analyses user behaviours across dimensions and metrics.

However, despite the convenience provided, I was hoping to leverage Machine Learning to do customer segmentation that can be scalable and applicable to other optimizations in Data Science (e.g. A/B Testing). Then, I came across the dataset provided by Google Analytics for a Kaggle competition and decided to use it for this project.


Feel free to check out the dataset here if you’re keen! Be aware that the dataset has several sub-datasets and each has more than 900k rows!

A. Exploratory Data Analysis (EDA)

This always remains an essential step in every Data Science project: ensuring the dataset is clean and properly pre-processed before it is used for modelling.

First of all, let’s import all the necessary libraries and read the csv file:


import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

df_raw = pd.read_csv("google-analytics.csv")
df_raw.head()

1. Flatten JSON Fields

As you can see, the raw dataset above is a bit “messy” and not digestible at all, since some variables are formatted as JSON fields that compress different values of different sub-variables into one field. For example, for the geoNetwork variable, we can tell that there are several sub-variables such as continent, subContinent, etc. that are grouped together.

Thanks to the help of a Kaggler, I was able to convert these variables into more digestible ones by flattening those JSON fields:

import os
import json
from pandas import json_normalize

def load_df(csv_path="google-analytics.csv", nrows=None):
    json_columns = ['device', 'geoNetwork', 'totals', 'trafficSource']
    # Parse the JSON columns while reading, and keep the visitor ID as a string
    df = pd.read_csv(csv_path,
                     converters={column: json.loads for column in json_columns},
                     dtype={'fullVisitorId': 'str'}, nrows=nrows)
    for column in json_columns:
        # Flatten each JSON column into sub-columns prefixed with the original name
        column_converted = json_normalize(df[column])
        column_converted.columns = [f"{column}_{subcolumn}" for subcolumn in column_converted.columns]
        df = df.drop(column, axis=1).merge(column_converted, right_index=True, left_index=True)
    return df
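For reference, a minimal way to call this helper would look like the sketch below (the nrows argument is optional and only worth using if you want to sample the file while iterating):

df = load_df("google-analytics.csv")  # or load_df(nrows=100000) to work on a sample first
df.head()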

After flattening those JSON fields, we are able to see a much cleaner dataset, especially those JSON variables split into sub-variables (e.g. device split into device_browser, device_browserVersion, etc.).


2. Data Re-formatting & Grouping

For this project, I have chosen the variables that I believe have a stronger impact on, or correlation with, user behaviour:

df = df.loc[:, ['channelGrouping', 'date', 'fullVisitorId', 'sessionId', 'visitId', 'visitNumber',
                'device_browser', 'device_operatingSystem', 'device_isMobile', 'geoNetwork_country',
                'trafficSource_source', 'totals_visits', 'totals_hits', 'totals_pageviews',
                'totals_bounces', 'totals_transactionRevenue']]

df = df.fillna(value=0)
df.head()

Moving on: although the new dataset has fewer variables, they vary in data type, so I took some time to analyze each variable to ensure the data is “clean enough” prior to modelling. Below are some quick examples of the clean-up:

# Format the values
df.channelGrouping.unique()
df.channelGrouping = df.channelGrouping.replace("(Other)", "Others")

# Convert boolean type to string
df.device_isMobile.unique()
df.device_isMobile = df.device_isMobile.astype(str)
df.loc[df.device_isMobile == "False", "device"] = "Desktop"
df.loc[df.device_isMobile == "True", "device"] = "Mobile"

# Categorize similar values
df['traffic_source'] = df.trafficSource_source
main_traffic_source = ["google", "baidu", "bing", "yahoo", ...., "pinterest", "yandex"]

df.traffic_source[df.traffic_source.str.contains("google")] = "google"
df.traffic_source[df.traffic_source.str.contains("baidu")] = "baidu"
df.traffic_source[df.traffic_source.str.contains("bing")] = "bing"
df.traffic_source[df.traffic_source.str.contains("yahoo")] = "yahoo"
.....
df.traffic_source[~df.traffic_source.isin(main_traffic_source)] = "Others"

After re-formatting, I found that fullVisitorId has fewer unique values than the total number of rows in the dataset, meaning some visitors were recorded multiple times. Hence, I proceeded to group the variables by fullVisitorId and sort by Revenue:

df_groupby = df.groupby(['fullVisitorId', 'channelGrouping', 'geoNetwork_country', 'traffic_source',
                         'device', 'device_browser', 'device_operatingSystem']) \
              .agg({'totals_hits': 'sum', 'totals_pageviews': 'sum',
                    'totals_bounces': 'sum', 'totals_transactionRevenue': 'sum'}) \
              .reset_index()

df_groupby = df_groupby.sort_values(by='totals_transactionRevenue', ascending=False).reset_index(drop=True)

3. Outlier Handling

The last step of any EDA process that cannot be overlooked is detecting and handling outliers in the dataset. The reason is that outliers, especially the extreme ones, impact the performance of a machine learning model, mostly negatively. Therefore, we need to either remove those outliers from the dataset or convert them (to the mean or mode) so they fall within the range that the majority of the data points lie in:

# Seaborn boxplot to see how far outliers lie compared to the rest
sns.boxplot(df_groupby.totals_transactionRevenue)

As you can see, most of the data points in Revenue lie below USD 200,000, and there’s only one extreme outlier that hits nearly USD 600,000. If we don’t remove this outlier, the model will take it into consideration as well and produce a less representative result.

So let’s go ahead and remove it, and do the same for the other variables. Just a quick note: there are several methods of dealing with outliers (such as the inter-quartile range; a quick sketch of that approach follows the snippet below). However, in my case there’s only one, so I simply defined a cut-off that I believe fits well:

df_groupby = df_groupby.loc[df_groupby.totals_transactionRevenue < 200000]
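For completeness, here is a sketch of how the inter-quartile-range approach mentioned above could be applied instead of a hand-picked threshold. The 1.5 multiplier is the conventional choice, not something from the original analysis, and with heavily zero-inflated revenue this bound can be overly aggressive, which is presumably why a manual cut-off was preferred:

# Illustrative IQR-based filter on revenue (alternative to the manual threshold above)
q1 = df_groupby.totals_transactionRevenue.quantile(0.25)
q3 = df_groupby.totals_transactionRevenue.quantile(0.75)
iqr = q3 - q1
upper_bound = q3 + 1.5 * iqr  # conventional 1.5 * IQR rule
df_groupby = df_groupby.loc[df_groupby.totals_transactionRevenue <= upper_bound]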

B. K-Means Clustering

What is K-Means Clustering and how does it help with customer segmentation?


Clustering is the best-known unsupervised learning technique: it finds structure in unlabeled data by identifying similar groups/clusters, and K-Means is the most widely used algorithm for it.

K-Means tries to address two questions: (1) K: the number of clusters (groups) we expect to find in the dataset, and (2) Means: the average distance from the data points to their cluster center (centroid), which we try to minimize.
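To make that concrete (this formula is my own addition for clarity, not from the original write-up): for a chosen K, the algorithm searches for centroids that minimize the within-cluster sum of squared distances,

$$\min_{\mu_1,\dots,\mu_K} \sum_{i=1}^{n} \min_{j \in \{1,\dots,K\}} \lVert x_i - \mu_j \rVert^2$$

where the $x_i$ are the data points and the $\mu_j$ are the cluster centroids. This quantity is what scikit-learn reports as inertia_, which we will use later to pick K.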

Also, one thing of note is that K-Means comes with several initialization variants, typically:

  • init = ‘random’: selects the initial centroids of each cluster at random

  • init = ‘k-means++’: selects only the 1st centroid at random, while the remaining centroids are placed as far away from the ones already chosen as possible

In this project, I’ll use the second option to ensure that each cluster is well-distinguished from the others:

from sklearn.cluster import KMeans

data = df_groupby.iloc[:, 7:]

kmeans = KMeans(n_clusters=3, init="k-means++")
kmeans.fit(data)

labels = kmeans.predict(data)
labels = pd.DataFrame(data=labels, index=df_groupby.index, columns=["labels"])

Before applying the algorithm, we need to define “n_clusters”, which is the number of groups we expect to get out of the modelling. In this case, I arbitrarily set n_clusters = 3. Then, I went ahead and visualized how the dataset is grouped using 2 variables, Revenue and PageViews:
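The plot below references a dataframe called df_kmeans, which isn’t constructed in the snippets shown; presumably it is simply the grouped data with the predicted labels attached. A minimal sketch of that assumption:

# Assumption: df_kmeans is the grouped data with the predicted cluster labels attached
df_kmeans = pd.concat([df_groupby, labels], axis=1)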

plt.scatter(df_kmeans.totals_transactionRevenue[df_kmeans.labels == 0],
            df_kmeans.totals_pageviews[df_kmeans.labels == 0], c='blue')
plt.scatter(df_kmeans.totals_transactionRevenue[df_kmeans.labels == 1],
            df_kmeans.totals_pageviews[df_kmeans.labels == 1], c='green')
plt.scatter(df_kmeans.totals_transactionRevenue[df_kmeans.labels == 2],
            df_kmeans.totals_pageviews[df_kmeans.labels == 2], c='orange')
plt.show()

As you can see, the x-axis stands for Revenue while the y-axis stands for PageViews. After modelling, we can see a certain degree of separation among the 3 clusters. However, I was not sure whether 3 is the “right” number of clusters. To check, we can rely on an attribute of the fitted K-Means model, inertia_, which is the sum of squared distances from each sample to its closest centroid. In particular, we will compare the inertia for cluster counts ranging from 1 to 10 (in my case) and see how low it gets and where adding more clusters stops paying off:

# Find the best number of clusters
num_clusters = [x for x in range(1, 10)]
inertia = []

for i in num_clusters:
    model = KMeans(n_clusters=i, init="k-means++")
    model.fit(data)
    inertia.append(model.inertia_)

plt.plot(num_clusters, inertia)
plt.show()

From the chart above, inertia starts to fall more slowly from the 4th or 5th cluster onwards, meaning additional clusters bring little further reduction, so I decided to go with “n_clusters=4”:
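The plot below uses a dataframe called df_kmeans_n4; the refit itself isn’t shown in the original snippets, so here is a sketch of how it was presumably produced:

# Assumption: refit K-Means with 4 clusters and attach the new labels
kmeans_n4 = KMeans(n_clusters=4, init="k-means++")
kmeans_n4.fit(data)
labels_n4 = pd.DataFrame(kmeans_n4.predict(data), index=df_groupby.index, columns=["labels"])
df_kmeans_n4 = pd.concat([df_groupby, labels_n4], axis=1)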

plt.scatter(df_kmeans_n4.totals_pageviews[df_kmeans_n4.labels == 0],
            df_kmeans_n4.totals_transactionRevenue[df_kmeans_n4.labels == 0], c='blue')
plt.scatter(df_kmeans_n4.totals_pageviews[df_kmeans_n4.labels == 1],
            df_kmeans_n4.totals_transactionRevenue[df_kmeans_n4.labels == 1], c='green')
plt.scatter(df_kmeans_n4.totals_pageviews[df_kmeans_n4.labels == 2],
            df_kmeans_n4.totals_transactionRevenue[df_kmeans_n4.labels == 2], c='orange')
plt.scatter(df_kmeans_n4.totals_pageviews[df_kmeans_n4.labels == 3],
            df_kmeans_n4.totals_transactionRevenue[df_kmeans_n4.labels == 3], c='red')

plt.xlabel("Page Views")
plt.ylabel("Revenue")
plt.show()

# Note: PageViews is now on the x-axis and Revenue on the y-axis

    The clusters now look a lot more distinguishable from one another:


  • Cluster 0 (Blue): high PageViews yet little-to-no Revenue

  • Cluster 1 (Red): medium PageViews, low Revenue

  • Cluster 2 (Orange): medium PageViews, medium Revenue

  • Cluster 3 (Green): no clear PageViews trend, high Revenue

Except for clusters 0 and 3 (unclear patterns), which are beyond our control, clusters 1 and 2 can tell a story here as they seem to share some similarities.

To understand which factors might impact each cluster, I segmented each cluster by Channel, Device and Operating System (a sketch of how such a breakdown could be computed follows):
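The original post shows the resulting charts rather than the code, so this is only an illustration of the per-cluster breakdown; the helper name is my own, and the column names follow the grouped dataframe built earlier:

# Illustrative per-cluster revenue breakdown by channel, browser and operating system
def cluster_breakdown(df, cluster_label, by):
    subset = df[df.labels == cluster_label]
    return (subset.groupby(by)['totals_transactionRevenue']
                  .sum()
                  .sort_values(ascending=False))

for dim in ['channelGrouping', 'device_browser', 'device_operatingSystem']:
    print(f"Cluster 1 by {dim}:")
    print(cluster_breakdown(df_kmeans_n4, 1, dim))
    print(f"Cluster 2 by {dim}:")
    print(cluster_breakdown(df_kmeans_n4, 2, dim))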

(Charts: Cluster 1 and Cluster 2 breakdowns)

As seen from above, in Cluster 1 the Referral channel contributed the highest Revenue, followed by Direct and Organic Search. In contrast, it’s Direct that made the highest contribution in Cluster 2. Similarly, while Macintosh is the dominant operating system in Cluster 1, it’s Windows in Cluster 2 that achieved higher revenue. The only similarity between the 2 clusters is the Device Browser: Chrome is widely used in both.

Voilà! This further segmentation helps us tell which factors (in this case Channel, Device Browser and Operating System) work better for each cluster, hence we can better evaluate our investment moving forward!

C. A/B Testing through Hypothesis Testing

What is A/B Testing, and how can Hypothesis Testing complement the process?

A/B Testing is no stranger to those who work in Advertising and Media, since it’s one of the most powerful techniques for improving performance with better cost efficiency. In particular, A/B Testing divides the audience into 2 groups: Test vs Control. Then, we expose the ads / show a different design to the Test group only, to see if there’s any significant discrepancy between the 2 groups: exposed vs un-exposed.
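Purely for illustration (this split is not part of the original analysis), a random 50/50 Test/Control assignment at the visitor level could look like this:

# Randomly assign each visitor to Test or Control (illustrative 50/50 split)
rng = np.random.default_rng(42)
df_groupby['ab_group'] = np.where(rng.random(len(df_groupby)) < 0.5, 'test', 'control')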

Image credit: https://productcoalition.com/are-you-segmenting-your-a-b-test-results-c5512c6def65?gi=7b445e5ef457

In Advertising, there are a number of automated tools on the market that can run an A/B Test in one click. However, I still wanted to try a different method from Data Science that can do the same: Hypothesis Testing. The methodology is pretty much the same, as Hypothesis Testing compares the Null Hypothesis (H0) against the Alternative Hypothesis (H1) to see if there’s any significant discrepancy between the two!

    Assume that I run a promotion campaign that exposes an ad to the Test group. Here’s a quick summary of steps that need to be followed to test the result with Hypothesis Testing:


  • Sample Size Determination

  • Pre-requisite Requirements: Normality and Correlation Tests

  • Hypothesis Testing

For the 1st step, we can rely on Power Analysis, which helps determine the sample size to draw from a population. Power Analysis requires 3 parameters: (1) effect size, (2) power and (3) alpha. If you are looking for details on how Power Analysis works, please refer to the in-depth article here that I wrote some time ago.

Below is a quick note on each parameter:

# Effect size: (expected mean - actual mean) / actual std
# df_group1_ab: the revenue data for one of the groups (its construction is not shown in these snippets)
effect_size = (280000 - df_group1_ab.revenue.mean()) / df_group1_ab.revenue.std()  # expected mean set to $280,000
print(effect_size)

# Power: the probability of correctly rejecting the null hypothesis
power = 0.9

# Alpha: the significance level (acceptable Type I error rate)
alpha = 0.05

With the 3 parameters ready, we use TTestPower() to determine the sample size:

import statsmodels.stats.power as sms

n = sms.TTestPower().solve_power(effect_size=effect_size, power=power, alpha=alpha)
print(n)

The result is 279, meaning we need to draw 279 data points from each group, Test and Control. As I don’t have real data, I used np.random.normal to generate revenue data with sample size = 279 for each group:

# Draw the samples for each group: control vs test
# (control_rev / test_rev are the revenue series for each group, not shown in these snippets)
control_sample = np.random.normal(control_rev.mean(), control_rev.std(), size=279)
test_sample = np.random.normal(test_rev.mean(), test_rev.std(), size=279)

Moving to the 2nd step, we need to ensure the samples are (1) normally distributed and (2) independent (not correlated). Again, if you want a refresher on the tests used in this step, refer to my article linked above. In short, we are going to use (1) Shapiro-Wilk as the normality test and (2) Pearson as the correlation test.

# Step 2. Pre-requisites: Normality, Correlation
from scipy.stats import shapiro, pearsonr

stat1, p1 = shapiro(control_sample)
stat2, p2 = shapiro(test_sample)
print(p1, p2)

stat3, p3 = pearsonr(control_sample, test_sample)
print(p3)

The Shapiro p-values are 0.129 and 0.539 for the Control and Test groups respectively, both > 0.05. Hence, we don’t reject the null hypothesis and can say that the 2 groups are normally distributed.

The Pearson p-value is 0.98, which is > 0.05, meaning there is no significant correlation and we can treat the 2 groups as independent of each other.

The final step is here! As there are 2 groups to be tested against each other (Test vs Control), we use a two-sample T-Test to see if there’s any significant discrepancy in Revenue after running the A/B Test:

# Step 3. Hypothesis Testing
from scipy.stats import ttest_ind

tstat, p4 = ttest_ind(control_sample, test_sample)
print(p4)

The resulting p-value is 0.35, which is > 0.05. Hence, the A/B Test indicates that the Test Group exposed to the ads doesn’t show any significant improvement over the Control Group with no ad exposure.

Voilà! That’s the end of this project: Customer Segmentation & A/B Testing! I hope you find this article useful and easy to follow.

Do look out for my upcoming projects in Data Science and Machine Learning in the near future! In the meantime, feel free to check out my GitHub here for the complete repository:

GitHub: https://github.com/andrewnguyen07
LinkedIn: www.linkedin.com/in/andrewnguyen07

    Thanks!


Original article: https://towardsdatascience.com/customer-segmentation-k-means-clustering-a-b-testing-bd26a94462dd
