Cohort Analysis Using Python and Pandas
Backstory
I stumbled upon an interesting task while doing a data exercise for a company. It was about cohort analysis based on user activity data; I got really interested, so I thought of writing this post.
This article provides insight into what cohort analysis is and how to analyze data for plotting cohorts. There are various ways to do this; I discuss a specific approach that uses pandas and Python to track user retention, and then provide some further analysis to figure out the best traffic sources (organic/inorganic) for an organization.
Cohort Analysis
Let’s start by introducing the concept of cohorts. The dictionary definition of a cohort is a group of people with some common characteristic. Examples of cohorts include birth cohorts (a group of people born during the same period, like 90’s kids) and academic cohorts (a group of people who started working through the same curriculum to finish a degree together).
Cohort analysis is especially useful for analyzing user growth patterns for products. In terms of a product, a cohort can be a group of people with the same sign-up date, the same usage start month/date, or the same traffic source.
Cohort analysis is an analytics method by which these groups can be tracked over time to find key insights. This analysis can further be used for customer segmentation and to track metrics like retention, churn, and lifetime value. There are two types of cohorts: acquisitional and behavioral.
Acquisitional cohorts — groups of users based on their sign-up date, first use date, etc.
Behavioral cohorts — groups of users based on their activities in a given period of time. Examples could be when they install the app, uninstall the app, delete the app, etc.
In this article, I will demonstrate acquisitional cohort creation and analysis using a dataset. Let’s dive into it:
Setup
I am using pandas, NumPy, seaborn, and matplotlib for this analysis, so let’s start by importing the required libraries.
```python
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
```
The Data
This dataset consists of usage data from customers for the month of February. Some of the users in the dataset started using the app in February (if ‘isFirst’ is true) and some are pre-February users.
```python
df = pd.read_json("data.json")
df.head()
```
The data has 5 columns:
- date: the date of use (for the month of February)
- timestamp: the usage timestamp
- uid: a unique id assigned to each user
- isFirst: true if this is the user’s first use ever
- utmSource: the traffic source from which the user came

We can compute the shape and info of the dataframe as follows:
```python
df.shape
# (4823567, 5)

df.info()
```

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4823567 entries, 0 to 4823566
Data columns (total 5 columns):
date         datetime64[ns]
isFirst      bool
timestamp    datetime64[ns]
uid          object
utmSource    object
dtypes: bool(1), datetime64[ns](2), object(2)
memory usage: 151.8+ MB
```
Below is a table of contents for the data analysis. I will first show the data cleaning that I did for this dataset, followed by the questions that this exercise will answer. The most important part of any analysis-based project is the set of questions that will be answered by the end of it. I will answer three questions (listed below), followed by some further analysis, a summary, and conclusions.
Table of Contents
Data Cleaning
Question 1: Show the daily active users over the month.
Question 2: Calculate the daily retention curve for users who used the app for the first time on specific dates. Also, show the number of users from each cohort.
Question 3: Determine if there are any differences in usage based on where the users came from. From which traffic source does the app get its best users? Its worst users?
Conclusions
Data Cleaning
Here are some of the tasks I performed to clean the data.
Null values:
- Found the null values in the dataframe: utmSource had 1,674,386 null values.
- Created a new column ‘trafficSource’ in which null values are marked as ‘undefined’.
Merge Traffic Sources:
- Merged traffic sources using regular expressions: facebook.* to facebook, gplus.* to google, twitter.* to twitter.
- Merged the traffic sources with fewer than 500 unique users into ‘others’. This was done because 11 sources had only 1 unique user, another 11 sources had fewer than 10 unique users, and another 11 sources had fewer than 500 unique users.
- This finally reduced the number of traffic sources from 52 to 11.
```python
df.isnull().sum()
```

```
date               0
isFirst            0
timestamp          0
uid                0
utmSource    1674386
dtype: int64
```
It looks like utmSource has a lot of null values: almost ~34% of the values are null. I created a new column ‘trafficSource’ in which null values are marked as ‘undefined’.
Next, I took care of similar traffic sources like facebook, facebookapp, src_facebook, etc by merging them.
Next, I found the number of unique users from each traffic source. This was done to check whether some traffic sources have very few unique users compared to the others; if so, they can all be merged. This reduces the number of sources we have to analyze without any significant loss in the accuracy of the analysis. I therefore merged the traffic sources with fewer than 500 (0.2% of the total) unique users into ‘others’, finally reducing the number of traffic sources from 52 to 11.
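The cleaning steps above can be sketched in pandas roughly as follows. The original code is not shown here, so this is a toy reconstruction on a tiny sample frame with the same schema; the 500-user cutoff from the article is scaled down to 2 so it has an effect on the sample data.

```python
import pandas as pd

# Toy sample with the same columns as the real dataset.
df = pd.DataFrame({
    "uid": ["u1", "u2", "u3", "u4", "u5"],
    "utmSource": ["facebookapp", "facebook", None, None, "tiny_source"],
})

# 1) Null sources become 'undefined'.
df["trafficSource"] = df["utmSource"].fillna("undefined")

# 2) Merge variants of the same source with regular expressions.
df["trafficSource"] = (
    df["trafficSource"]
    .str.replace(r"facebook.*", "facebook", regex=True)
    .str.replace(r"gplus.*", "google", regex=True)
    .str.replace(r"twitter.*", "twitter", regex=True)
)

# 3) Fold sources with too few unique users into 'others'
#    (the article used a 500-user cutoff; 2 is used here).
users_per_source = df.groupby("trafficSource")["uid"].nunique()
small = users_per_source[users_per_source < 2].index
df.loc[df["trafficSource"].isin(small), "trafficSource"] = "others"

print(df["trafficSource"].tolist())
# ['facebook', 'facebook', 'undefined', 'undefined', 'others']
```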
Now let’s answer the questions.
Question 1: Show the daily active users (DAU) over the month.
A user is considered active for the day if they used the app at least once on a given day. Tasks performed to answer this question:
- Segregated the users who started using the app in February from all the users.
- Calculated the DAU for:
  - users who started in the month of February
  - the total number of active users
- Plotted both on a graph.
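The steps above can be sketched with a couple of groupbys. This is a hedged reconstruction on toy data, since the original code is not embedded here; the real dataframe has the schema shown in df.info().

```python
import pandas as pd

# Toy usage log: dates normalized to days, one row per ping.
df = pd.DataFrame({
    "date": pd.to_datetime(
        ["2020-02-01", "2020-02-01", "2020-02-01", "2020-02-02", "2020-02-02"]
    ),
    "uid": ["u1", "u2", "u1", "u1", "u3"],
    "isFirst": [True, False, False, False, False],
})

# Users whose first-ever use happened in this (February) log.
feb_starters = df.loc[df["isFirst"], "uid"].unique()

# DAU = number of distinct uids seen on each calendar day.
dau_all = df.groupby("date")["uid"].nunique()
dau_feb = df[df["uid"].isin(feb_starters)].groupby("date")["uid"].nunique()

print(dau_all.tolist(), dau_feb.tolist())  # [2, 2] [1, 1]
```

Note that a user who first appeared in February counts toward the February DAU on every later day they are active, not just on the isFirst row itself.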
Figure 1 shows the daily active users (DAU) for the month of February. I have plotted two graphs: one is the DAU plot for all users, and the other is the DAU plot for the users who started using the app in February. As the graph shows, the daily active count for February users keeps increasing, while the DAU plot for all users has significant periodic dips (which could be attributed to lower usage on weekends) and slower net growth compared to the February-only users.
Question 2: Calculate the daily retention curve for users who used the app for the first time on specific dates. Also, show the number of users from each cohort.
The dates considered for creating cohorts were Feb 4th, Feb 10th, and Feb 14th. The tasks done to answer this question are:
- Created cohorts of all the users who started on the above dates.
- Calculated the daily retention for each day of February, with the above dates as starting dates. The daily retention curve is defined as the percentage of users from the cohort who used the product on a given day.
The function dailyRetention takes a dataframe and a date (of cohort creation) as input and creates a cohort of all the users who started using the app on that date. It outputs the total number of unique users in the cohort and the retention percentage, relative to the starting date, for each day of February.
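Since the original gist is not embedded here, a possible implementation consistent with that description might look like this; the exact name, signature, and internals are assumptions.

```python
import pandas as pd

def dailyRetention(df, date):
    """Cohort = users whose first use was on `date`. Returns the
    cohort size and the % of the cohort active on each day."""
    cohort = df.loc[df["isFirst"] & (df["date"] == date), "uid"].unique()
    activity = df[df["uid"].isin(cohort)]
    retention = activity.groupby("date")["uid"].nunique() / len(cohort) * 100
    return len(cohort), retention

# Tiny demo: u1 and u2 start on Feb 4; only u1 returns on Feb 5.
df = pd.DataFrame({
    "date": pd.to_datetime(["2020-02-04", "2020-02-04", "2020-02-05"]),
    "uid": ["u1", "u2", "u1"],
    "isFirst": [True, True, False],
})
size, curve = dailyRetention(df, pd.Timestamp("2020-02-04"))
print(size, curve.tolist())  # 2 [100.0, 50.0]
```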
Figure 2 shows the total number of unique users from each cohort.
Figure 2. Number of unique users from each cohort

The code below prepares the data for a heatmap by adding a cohort index and then pivoting the data, with the cohort start dates as the index, the days of February as the columns, and the percentage of unique users who used the app on each day as the values. It then plots the heatmap.
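The pivot-and-heatmap step might look roughly like this. The retention table below is a small hand-made stand-in for the real per-cohort output, and the column names are assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # safe for headless environments
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Stand-in for the per-cohort retention table built earlier.
retention = pd.DataFrame({
    "cohort": ["Feb 4", "Feb 4", "Feb 10", "Feb 10"],
    "day": [4, 5, 10, 11],
    "pct": [100.0, 55.0, 100.0, 60.0],
})

# Index: cohort start dates; columns: days of February;
# values: % of the cohort active on that day.
pivot = retention.pivot(index="cohort", columns="day", values="pct")

fig, ax = plt.subplots(figsize=(8, 3))
sns.heatmap(pivot, annot=True, fmt=".0f", cmap="Blues", ax=ax)
ax.set(xlabel="Day of February", ylabel="Cohort start date")
fig.tight_layout()
```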
Figure 3. Retention rate (%)

Figure 3 shows a heatmap of daily retention for users who used the app for the first time on Feb 4th, Feb 10th, and Feb 14th. The heatmap shows 100% retention on the first day of usage, with retention decreasing to as low as ~31% on some days.
Figure 4 shows the daily retention curve for the month of February. Initially the retention is 100%, but it keeps decreasing and stabilizes after about a week.
Figure 4. Daily retention curve

This retention curve immediately reveals an important insight: about 40–50% of users stop using the app after the 1st day. After that initial large drop, a second brisk drop occurs after the 10th day, to under 35–45%, before the curve starts to level off after the 13th day, leaving about 37–43% of the original users still active in the app on the last day of February.
The above retention curve indicates that a lot of users are not experiencing the value of the app, resulting in drop-offs. Hence, one way to fix that is to improve the onboarding experience which can help the users in experiencing the core value as quickly as possible, thereby boosting retention.
Question 3: Determine if there are any differences in usage based on where the users came from. From which traffic source does the app get its best users? Its worst users?
The tasks performed to answer this question are:
- Data cleaning: cleaned up user ids with duplicate sources.
- Feature engineering: engineered new features to find the best and worst sources.
Data cleaning: Identifying the best or worst sources required some data cleaning, because some users had more than one traffic source. I did some data cleaning to remove/merge these multiple sources.
- 1.64% of user ids, i.e. 4058 unique uids, had more than one source.
- Since the number of uids with duplicate traffic sources is not significant, and there was no reliable way to attribute a single source to these uids, I simply removed them from my analysis.
The code below groups all users with multiple sources, drops those users from our dataframe, and creates a new dataframe ‘dfa’.
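A sketch of that cleanup on toy data might look as follows; ‘dfa’ matches the name used in the article, but the grouping details are assumptions.

```python
import pandas as pd

# Toy frame where u1 is attributed to two different sources.
df = pd.DataFrame({
    "uid": ["u1", "u1", "u2", "u3"],
    "trafficSource": ["facebook", "google", "twitter", "twitter"],
})

# uids that are associated with more than one traffic source.
sources_per_uid = df.groupby("uid")["trafficSource"].nunique()
multi = sources_per_uid[sources_per_uid > 1].index

# Drop those uids so each remaining user maps to a single source.
dfa = df[~df["uid"].isin(multi)].copy()
print(sorted(dfa["uid"].unique()))  # ['u2', 'u3']
```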
Feature engineering: In this section, I engineered two different metrics to find differences in usage based on traffic source. Here are the metrics:
1) Total number of unique active users per source per day

The first metric is purely quantitative, calculated from how many users we get from each source and their activity per day throughout the month. The code below calculates this metric and plots a graph to visualize the results.
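One way this metric could be computed is the groupby-and-unstack below; the exact original code is not shown, so this is a hedged sketch on toy data standing in for ‘dfa’.

```python
import pandas as pd

# Toy deduplicated usage log.
dfa = pd.DataFrame({
    "date": pd.to_datetime(
        ["2020-02-01", "2020-02-01", "2020-02-02", "2020-02-02"]
    ),
    "uid": ["u1", "u2", "u1", "u3"],
    "trafficSource": ["twitter", "shmoop", "twitter", "twitter"],
})

# One row per day, one column per source, values = unique users;
# daily.plot() would then draw the per-source line plot.
daily = (
    dfa.groupby(["date", "trafficSource"])["uid"]
    .nunique()
    .unstack(fill_value=0)
)
print(daily["twitter"].tolist(), daily["shmoop"].tolist())  # [1, 2] [1, 0]
```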
Figure 6 plots this information as a line plot. From the plot, one can see that biznesowe+rewolucje and the undefined source bring in the most users, but usage dips on weekends. Sources like program, answers, shmoop, twitter, grub+street, and handbook have constant usage throughout the month but contribute few unique users.
2) Number of days active per source

The second metric I calculated is the number of days active per source. For this, I grouped the data by traffic source and uid and counted the number of unique dates, which gives the number of days each uid was active for each traffic source. I plotted this information on a KDE graph; analyzing it, it is evident that the distribution for all sources is bimodal, with peaks near 0 and 29 days. The best traffic sources can be defined as those with a peak at 29 and the worst as those with a peak at 0.
Figure 7. Number of days active per source

Figure 7 shows a KDE graph of the number of days active per source for the month of February. From the graph, the best sources, with a mode at 29 (most of the users from these sources used the app for 29 days), are shmoop and twitter, closely followed by program, salesmanago, and grub+street with peaks at 29, 28, and 27 respectively. The worst source is undefined, with a mode of 0 despite bringing in the most users, followed by answers and biznesowe+rewolucje. Ordering the traffic sources from best to worst based on this graph gives: shmoop, twitter, program, salesmanago, grub+street, other, handbook, mosalingua+fr, biznesowe+rewolucje, answers, and undefined.
Analysis
User behavior depends on the kind of metric that is important for a business. For some businesses daily activity(pings) can be an important metric and for some businesses, more activity (pings) on certain days of the month has more weight than the daily activity. One would define the worst users and best users based on what is important for the product/organization.
Summary
If the total number of unique active users is an important metric for the product, then the first graph can be used to see which sources are best/worst: a greater number of users indicates a better traffic source.
But if we want to look at activity over the month and analyze how many days the users from a particular source were active, then the second metric becomes important. In this case, we found that even if a source (e.g., shmoop or twitter) brings in fewer unique active users per day, the users it does bring use the app for a longer period of time.
Conclusions
In this article, I showed how to carry out Cohort Analysis using Python’s pandas, matplotlib, and seaborn. During the analysis, I have made some simplifying assumptions, but that was mostly due to the nature of the dataset. While working on real data, we would have more understanding of the business and can draw better and more meaningful conclusions from the analysis.
You can find the code used for this article on my GitHub. As always, any constructive feedback is welcome. You can reach out to me in the comments.
Translated from: https://medium.com/swlh/cohort-analysis-using-python-and-pandas-d2a60f4d0a4d