Why and how to use the Bland-Altman plot for A/B testing (Python code)
Table of contents
Why use the Bland-Altman plot
The data used
How the Bland-Altman plot is built
How to interpret the Bland-Altman plot for the A/B test
Conclusion
Why use the Bland-Altman plot
The Bland-Altman plot comes from the medical field, where it is used to compare the measurements of two instruments. The first objective of John Martin Bland and Douglas Altman was to answer this question:
"Do the two methods of measurement agree sufficiently closely?" (D. G. Altman and J. M. Bland [1])
In practice: if you have two instruments, one being the expensive state of the art and the other 10 times cheaper, are the results of the cheaper method comparable to the reference, and could it replace the reference with sufficient accuracy? For example, is the heart rate reported by a $20 connected watch sufficiently close to the result of an electrocardiogram? The second objective was to produce a method whose results are easily understandable to non-statisticians.
In analytics, A/B testing (also known as Champion-Challenger) is a common test methodology for comparing the results of a new action / new treatment / new design / … applied to population_A with those of population_B, which keeps the current action. Once we have the test's results, they have to be analysed and presented to a decision-making team composed mostly of non-statisticians. That is why the Bland-Altman plot is relevant: it compares the results of the A/B test on a single plot, with all of the statistical measures displayed in an understandable way.
In their paper, they also showed why the correlation coefficient, statistical tests comparing means, and regression are inappropriate for deciding whether two measurements agree. In our A/B testing case, that means deciding how the challenger performs compared to the champion.
The data used
For this article I will use a dataset available on Kaggle (coming from a DataCamp project) called "Mobile Games A/B Testing with Cookie Cats". The link is in the references [2].
Cookie Cats is a popular mobile puzzle game in which, as players progress through the levels, they encounter "gates" that force them either to wait for some time before continuing to play or to make a purchase. In this industry retention is one of the key metrics, and the team in charge of the game wanted to see the impact on 7-day retention of moving the first gate from level 30 to level 40. To observe the effect of such a move they ran an A/B test, and they provided the dataset of its results. We will see how the Bland-Altman plot answers the following question: "How should we analyze the A/B results on 7-day retention when the first gate moves from level 30 to level 40?"
The dataset is composed of 90,189 rows containing the player's unique id, the A/B test version (waiting time at gate_30 / gate_40), the total number of game rounds, retention_1 (a boolean saying whether the player came back the next day) and retention_7 (a boolean saying whether the player came back after 7 days). To have the relevant data to answer our question, some cleaning is necessary. I will only keep the players having:
retention_1 = True (because if retention_1 = False, retention_7 is False as well),
a number of game rounds ≥ 30 (because players who never reach level 30 are not impacted by the gate),
a number of game rounds < 170 (because if we take a game to last 5 minutes, a player who plays 2 hours per day for 7 days plays 120 × 7 / 5 = 168 games; a higher number would be considered abnormal usage).
After this filter, the dataset is composed of 20,471 rows, as shown in Figure 1 below. Moreover, the dataset is evenly balanced between gate_30 and gate_40.
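The cleaning rules above could be sketched with pandas as follows. This is a minimal illustration, not the original code: the column name `sum_gamerounds` is assumed to match the Kaggle CSV, and `clean_cookie_cats` is a hypothetical helper name.

```python
import pandas as pd

def clean_cookie_cats(df: pd.DataFrame) -> pd.DataFrame:
    """Keep day-1 returners who played at least 30 and fewer than 170 rounds."""
    mask = (
        df["retention_1"].astype(bool)       # came back the next day
        & (df["sum_gamerounds"] >= 30)       # reached the first gate (assumed column name)
        & (df["sum_gamerounds"] < 170)       # drop abnormal usage
    )
    return df.loc[mask].reset_index(drop=True)
```

With the real file, one would typically load the CSV with `pd.read_csv` first and pass the resulting frame to this helper.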
Figure 1. CookieCats dataset
How the Bland-Altman plot is built
In this section we will see how to adapt the original Bland-Altman plot in order to apply it to an A/B test. First I will explain how the plot is built in its original version [1], [2], and then how to build it with our A/B test data.
Because the original Bland-Altman plot compares the measurements of 2 instruments, the two series have the same length by design. For example, when comparing the heart rate from the $20 connected watch with the electrocardiogram, the measurements are taken at the same time under the same conditions, which gives the same number of measurements for the 2 methods. So we can represent each row of the dataset as one experiment, as in the example in Figure 2 below.
Figure 2. Instrument measurement per experiment
This is where we encounter the first "pain point". An A/B test counts as a single experiment, while as we saw above, we need several experiments to build the plot. To bypass this limitation we will create from the A/B test many bootstrapped samples, some sharing the same length and some of different lengths.
We generate 300 non-unique random integers between 200 and 1,000. These integers are the lengths of the bootstrapped samples and, to benefit from the bootstrap's statistical properties, each of them is repeated 50 times. These numbers were chosen to obtain a diversity of samples, but they are arbitrary, and the lengths depend on the size of the original dataset. The resulting 15,000 (300 × 50) bootstrapped samples, each with a length between 200 and 1,000, are obtained by random sampling with replacement from the original dataset and are then concatenated together. This can be represented as in Figure 3.
Figure 3. Bootstrapped dataset building
Creating the bootstrapped dataset from the original data can take time, because the bootstrapped dataset has 9,184,350 rows (with a different random_state we would get on average ((200 + 1,000) / 2) × 300 × 50 = 9,000,000 rows).
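A minimal sketch of this bootstrap step is shown below, assuming a cleaned DataFrame with `version` and `retention_7` columns. The function name `build_bootstrap`, its parameters and the use of a NumPy seed in place of `random_state` are illustrative assumptions, not the original code.

```python
import numpy as np
import pandas as pd

def build_bootstrap(df, n_sizes=300, n_repeats=50, low=200, high=1000, seed=0):
    """Concatenate n_sizes * n_repeats samples drawn with replacement from df."""
    rng = np.random.default_rng(seed)
    sizes = rng.integers(low, high + 1, size=n_sizes)  # non-unique sample lengths
    sizes = np.repeat(sizes, n_repeats)                # each length is used n_repeats times
    base = df[["version", "retention_7"]]
    parts = []
    for sample_id, size in enumerate(sizes):
        idx = rng.integers(0, len(base), size=size)    # row positions, with replacement
        parts.append(
            base.iloc[idx].assign(n_sample=sample_id, n_sample_2=int(size))
        )
    return pd.concat(parts, ignore_index=True)
```

With the default parameters this builds roughly nine million rows, so it is worth trying small values of `n_sizes` and `n_repeats` first.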
Then we group by n_sample (the id of each of the 15,000 bootstrapped samples), n_sample_2 (the length of each bootstrapped sample) and version, in order to get the number of players retained at 7 days per gate, as in Figure 4.
Figure 4. Bootstrapped dataset after groupby
We can read this output as: bootstrapped sample no. 0 (ids run from 0 to 14,999) is composed of 564 rows, in which 98 players are still playing at 7 days with the waiting time at gate_30, while 105 players are still playing at 7 days with the waiting time at gate_40.
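As a sketch of this aggregation, assuming the bootstrapped frame carries the columns `n_sample`, `n_sample_2`, `version` and a boolean `retention_7` (the helper name `retained_per_sample` is hypothetical):

```python
import pandas as pd

def retained_per_sample(boot: pd.DataFrame) -> pd.DataFrame:
    """Count players retained at 7 days per bootstrapped sample and per gate."""
    return (
        boot.groupby(["n_sample", "n_sample_2", "version"])["retention_7"]
        .sum()              # summing booleans counts the True values
        .reset_index()
    )
```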
Then we use a statistical property of the bootstrap: the mean over bootstrap samples is a good estimator of the true mean of a distribution. We group by n_sample_2 and version in order to get, for each unique sample length, the average number of players retained at 7 days per gate, as in Figure 5.
Figure 5. Average number of players still playing at 7 days for each unique sample length and per gate
We can read this output as: when the sample has 200 rows, there are on average 34.60 players still playing at 7 days with the waiting time at gate_30, versus 34.38 players still playing at 7 days with the waiting time at gate_40.
Then we unstack the dataset to put it in a clearer format, as in Figure 6.
Figure 6. Unstacked version of the dataset
At this stage we have all the information needed to build the Bland-Altman plot, and the dataset has the same layout as in Figure 2 above.
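The averaging and unstacking steps could look like the sketch below. It assumes the per-sample counts are stored in a frame with columns `n_sample_2`, `version` and `retention_7`; the helper name `mean_retained_by_length` is hypothetical.

```python
import pandas as pd

def mean_retained_by_length(per_sample: pd.DataFrame) -> pd.DataFrame:
    """Average retained players per sample length, one column per gate."""
    db = (
        per_sample.groupby(["n_sample_2", "version"])["retention_7"]
        .mean()
        .unstack("version")   # gate_30 and gate_40 become columns
        .reset_index()
    )
    db.columns.name = None    # drop the leftover 'version' label on the columns
    return db
```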
The Bland-Altman plot has 2 axes. The x axis is the average of the two methods being compared, so for each row i:
$$x_i = \frac{\text{gate\_30}_i + \text{gate\_40}_i}{2}$$
The y axis is the difference between method A and method B, so for each row i:
$$y_i = \text{gate\_30}_i - \text{gate\_40}_i$$
And here is our second "pain point". If we keep the y axis as it is, the variability of the differences grows with the sample size. As a result, the statistical measures we obtain later would be over-weighted by the biggest samples. To bypass this limitation, we express the y axis as a percentage [3]. The calculation of y for each row i then becomes:
$$y_i = \frac{(\text{gate\_30}_i - \text{gate\_40}_i) \times 100}{(\text{gate\_30}_i + \text{gate\_40}_i) / 2}$$
The dataset then looks like Figure 7.
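The two axes can be computed in a couple of lines. The sketch below assumes a frame with `gate_30` and `gate_40` columns; `add_axes` is a hypothetical helper name.

```python
import pandas as pd

def add_axes(db: pd.DataFrame) -> pd.DataFrame:
    """Add the Bland-Altman x (pair mean) and y (percentage difference) columns."""
    out = db.copy()
    out["x"] = (out["gate_30"] + out["gate_40"]) / 2
    out["y"] = (out["gate_30"] - out["gate_40"]) * 100 / out["x"]
    return out
```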
Figure 7. x and y axes
We have to check that y is normally distributed in order to trust the confidence interval that will be displayed. You can assess this with the Shapiro-Wilk test, or at least with a histogram. If the distribution is not Normal, you can apply a transformation such as a logarithmic transformation. In our case, I consider the distribution to be Normal.
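One possible way to run the normality check is SciPy's `scipy.stats.shapiro`. The wrapper `looks_normal` and the 0.05 threshold below are illustrative assumptions, not part of the original article.

```python
import numpy as np
from scipy import stats

def looks_normal(y, alpha=0.05):
    """Shapiro-Wilk test: True when normality is NOT rejected at level alpha."""
    statistic, p_value = stats.shapiro(np.asarray(y, dtype=float))
    return p_value > alpha, p_value
```

A histogram of y remains a useful visual complement, since with a few hundred points the test can reject normality for deviations that hardly matter in practice.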
Figure 8. Histogram of the y axis
The Bland-Altman plot is composed of 3 lines (see Figure 9):
the average value of y
the upper bound of the confidence interval of y (at 95% here, hence the 1.96 multiplier)
the lower bound of the confidence interval of y (at 95%)
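These three lines are the mean of y and the mean plus or minus 1.96 standard deviations. A minimal NumPy sketch (the helper name `limits_of_agreement` is hypothetical, and the sample standard deviation with `ddof=1` is an assumption):

```python
import numpy as np

def limits_of_agreement(y, k=1.96):
    """Return (lower bound, mean, upper bound) of y, with bounds at mean -/+ k * SD."""
    y = np.asarray(y, dtype=float)
    mean = y.mean()
    sd = y.std(ddof=1)        # sample standard deviation
    return mean - k * sd, mean, mean + k * sd
```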
Putting it all together, the pyCompare package draws the Bland-Altman plot very easily, without having to build db['y'] by hand.
It takes method A (the champion) first and then method B (the challenger). If percentage = True, it automatically performs the calculation we made above. There are some other parameters that we will discuss later.
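A call along these lines should reproduce the plot. This is a sketch under assumptions: `percentage` and `limitOfAgreement` are named in the text above, while `confidenceInterval` and `title` are written as I recall them from the pyCompare documentation and should be checked against the installed version. The wrapper `draw_bland_altman` is a hypothetical name.

```python
def draw_bland_altman(db, limit_of_agreement=1.96):
    """Draw the Bland-Altman plot of gate_30 (champion) against gate_40 (challenger)."""
    # Imported here so this module still loads where pyCompare is not installed.
    import pyCompare

    pyCompare.blandAltman(
        db["gate_30"].to_numpy(),             # method A, the champion
        db["gate_40"].to_numpy(),             # method B, the challenger
        percentage=True,                      # y axis as a percentage of the pair mean
        limitOfAgreement=limit_of_agreement,  # multiplier of the SD for the bounds
        confidenceInterval=95,                # assumed parameter name
        title="Bland-Altman plot for the A/B test",  # assumed parameter name
    )
```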
How to interpret the Bland-Altman plot for the A/B test
Here we are! This is the Bland-Altman plot for the A/B test:
Figure 10. Bland-Altman plot applied to the A/B test
First of all, the mean and the mean's confidence interval (light blue stripe) are different from 0 (higher in our case). This means that the retention levels of gate_30 and gate_40 are significantly different (the mean difference is named the bias in the original paper). Because 2.93 > 0, we have A > B, that is Champion > Challenger, and therefore a waiting period at gate_30 gives a higher retention than a waiting period at gate_40.
The two salmon bars represent the 95% confidence interval (named the limits of agreement in the original paper), meaning we expect 95% of the values to fall within [-0.78% ; 6.63%]. In our example this is very powerful, because we can say that the retention of gate_30 will almost always be higher than that of gate_40.
As you can see, there are 2 values above the upper salmon stripe and 4 below the lower one, which is 6/300 = 0.02 < 0.05. Since 95% of the values are expected within the 2 boundaries, up to 5% can fall above or below them; in our case it is 2%, which is perfectly normal ;)
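The share of points outside the boundaries can also be computed rather than counted by eye. A minimal sketch, with `share_outside` as a hypothetical helper name:

```python
import numpy as np

def share_outside(y, k=1.96):
    """Fraction of y values falling outside mean -/+ k * SD."""
    y = np.asarray(y, dtype=float)
    mean, sd = y.mean(), y.std(ddof=1)
    outside = (y < mean - k * sd) | (y > mean + k * sd)
    return outside.mean()     # mean of booleans = proportion of True
```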
In the pyCompare package, the limitOfAgreement parameter changes the confidence boundaries. Here a relevant question would be: "With what level of confidence can I say that the retention of gate_30 will be higher than that of gate_40?" To answer it, the lower boundary has to be equal to 0, so we have to find the limitOfAgreement value that brings it to 0 (Figure 11).
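Instead of trial and error, the multiplier can be solved for directly: the lower boundary is mean - k × SD, so it reaches 0 when k = mean / SD. The sketch below is illustrative and both helper names are hypothetical; the second one recovers the SD from the figures reported above (mean 2.93, boundaries [-0.78 ; 6.63] at k = 1.96).

```python
import numpy as np

def k_for_zero_lower_bound(y):
    """Multiplier k such that mean(y) - k * SD(y) equals 0."""
    y = np.asarray(y, dtype=float)
    return y.mean() / y.std(ddof=1)

def k_from_summary(mean, lower, upper, k_used=1.96):
    """Same multiplier, computed from a mean and the boundaries obtained with k_used."""
    sd = (upper - lower) / (2 * k_used)   # the boundaries span 2 * k_used * SD
    return mean / sd
```

For instance, `k_from_summary(2.93, -0.78, 6.63)` comes out at about 1.55, which could then be passed as limitOfAgreement.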
Figure 11. Output when changing the limitOfAgreement parameter
We see that when limitOfAgreement = 1.55, the lower boundary is almost equal to 0. The standard Normal distribution table gives 0.9394 at 1.55, so (1 − (1 − 0.9394) × 2) × 100 = 87.88% of the values are expected to lie between the two boundaries. We can therefore be about 87.88% sure that the retention of gate_30 will be higher than that of gate_40.
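The table lookup can be reproduced with Python's standard library. In this sketch `coverage` is a hypothetical helper; it returns both the table value Φ(k) and the two-sided coverage 2Φ(k) − 1 used above.

```python
from statistics import NormalDist

def coverage(k):
    """Return (Phi(k), share of a Normal distribution within -/+ k SD of its mean)."""
    phi = NormalDist().cdf(k)   # value read in the standard Normal table
    return phi, 2 * phi - 1
```

Note that the two-sided figure also excludes the values above the upper boundary, where gate_30 is still ahead. If only the lower boundary matters, the one-sided value Φ(1.55) ≈ 0.9394 may be the more natural reading.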
A last point: whatever the average value of the samples, the points are spread uniformly across the plot, which means that our interpretation holds whatever the size of the sample. Indeed, if we had seen a funnel-shaped (conic) pattern in the values, we would have concluded that the sample size has an impact on the results, and we could not have made a valid interpretation.
Conclusion
We saw why the Bland-Altman plot is useful for getting a single view of the results of an A/B test on a simple plot, how to create the plot from an A/B test, and how to interpret it. This only works if the differences are normally distributed; otherwise it will be necessary to transform the data.
Moreover, I checked the app and the gate appears to be at level 40, while we showed that the retention at 7 days was better with the gate at level 30. This suggests that retention may not be the best metric to follow compared to profitability!
Translated from: https://towardsdatascience.com/why-how-to-use-the-bland-altman-plot-for-a-b-testing-python-code-78712d28c362