

An Intuitive Simulation of A/B Testing, Part 2

Published: 2023/11/29

In this post, I would like to invite you to continue our intuitive exploration of A/B testing, as seen in the previous post:


Resuming what we saw, we were able to prove through simulations and intuition that there was a relationship between Website Version and Signup: we constructed a test with a Statistical Power of 79% that allowed us to reject, with 95% confidence, the hypothesis stating otherwise. In other words, we showed that behavior as biased as ours would be found at random only 1.6% of the time.


Even though we were satisfied with the results, we still need to prove, with a defined statistical confidence level, that there was a higher-performing version. In practice, we need to prove our hypothesis that, on average, we should expect version F to win over any other version.


Before we start

Let us remember and explore the working data from our prior post, where we ended up with 8,017 dice throws, as defined by our Statistical Power target of 80%.


# Biased Dice Rolling Function
DiceRolling <- function(N) {
  Dices <- data.frame()
  for (i in 1:6) {
    if (i == 6) {
      Observed <- data.frame(Version=as.character(LETTERS[i]), Signup=rbinom(N/6, 1, 0.2))
    } else {
      Observed <- data.frame(Version=as.character(LETTERS[i]), Signup=rbinom(N/6, 1, 0.16))
    }
    Dices <- rbind(Dices, Observed)
  }
  return(Dices)
}

# In order to replicate
set.seed(11)
Dices <- DiceRolling(8017) # We expect 80% Power
t(table(Dices))

As a reminder, we designed an R function that simulates a biased die with a 20% probability of landing on 6 and a 16% chance of landing on any other number.


Additionally, we generated a dummy dataset of 8,017 samples, as calculated for 80% Power, representing six different versions of a signup form and the number of leads observed for each. For this dummy set to be random yet have a winner version (F) to serve as our ground truth, we generated the table by simulating throws of a biased die.


The output:


higher frequency of observed F's (251)

Which should allow us to produce this report:


Note the difference between our observed Signup Rate and the expected (designed) probabilities.

As seen above, we can observe different Signup Rates across our landing page versions. What is interesting is that even though we planned precise Signup Probabilities (Signup Rates), our observed rates turned out noticeably different from the expected (planned) ones.


Let us pause and conduct a "sanity check" of, say, Version C, which shows the largest difference between its Observed (0.14) and Expected (0.16) rates, in order to check whether there is something wrong with our data.


Sanity Check

Even though this step is not needed, it will serve us as a good starting point for building the intuition that will be useful for our primary goal.


As mentioned earlier, we want to prove that our results, even though initially different from what we expected, should not be far from it, since they vary according to the underlying probability distribution.


In other words, for the particular case of Version C, our hypotheses are as follows:

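The hypothesis figure did not survive extraction; in plain notation (a reconstruction, using the designed rate of 0.16), the pair reads:

```latex
H_0: \mu_C = 0.16 \qquad\qquad H_1: \mu_C \neq 0.16
```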

我們?yōu)槭裁词褂檬侄?#xff1f; (Why did we use means?)

This particular case allows us to use either proportions or means: since we designed our variable to be dichotomous, with values 0 or 1, the mean represents the same value as our ratio or proportion.

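As a quick illustration of this equivalence (a hypothetical toy vector, not the post's data):

```python
# For a dichotomous (0/1) variable, the mean equals the proportion of 1s.
signups = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]  # hypothetical Signup values

mean_value = sum(signups) / len(signups)      # arithmetic mean
proportion = signups.count(1) / len(signups)  # share of 1s

print(mean_value, proportion)  # 0.3 0.3
```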

# Results for Version C
VersionC <- Dices[which(Dices$Version=="C"),]

# Mean calculation
mean(VersionC$Signup)

Version C Observed Mean

p-Value

We need to find our p-Value, which will allow us to accept or reject our hypothesis based on the probability of finding results “as extreme” as the one we got for Version C within the underlying probability distribution.


This determination, that our mean is significantly different from a true value (0.16), is usually addressed through a variation of the well-known Student's Test called the "One-Sample t-Test." Note: since we are also working with proportions, we could use a "Proportion Test" instead, though that is not the purpose of this post.


To obtain the probability of finding results as extreme as ours, we would need to repeat our data collection process many times. Since that procedure is expensive and unrealistic, we will instead use a method called "Bootstrapping", similar to the resampling by permutation we did in our last post.


Bootstrapping

Bootstrapping is done by reshuffling one of our columns, in this case Signups, while maintaining the other one fixed. What is different from the permutation resample we have done in the past is that we will allow replacement as shown below:


Note that this time, Signup data from our "Original Sample" is even repeated within our Replacements.

It is important to note that we need to allow replacement within this reshuffling process; otherwise, simple permutation will always result in the same mean as shown below.


Let us generate 10 Resamples without replacement:


for (i in 1:10) {
  Resample <- sample(VersionC$Signup, replace=FALSE)
  cat(paste("Resample #", i, " : ", mean(Resample), "\n", sep=""))
}

Note that using permutation, they will always be the same.

And 10 Resamples with replacement:


for (i in 1:10) {
  Resample <- sample(VersionC$Signup, replace=TRUE)
  cat(paste("Resample #", i, " : ", mean(Resample), "\n", sep=""))
}

With a sample with replacement, they will vary according to a probability distribution.

Simulation

Let us simulate 30k bootstrap resamples of Version C with our data.


# Let's generate a Bootstrap and find our p-Value, Intervals and t-Scores
set.seed(1984)
Sample <- VersionC$Signup
tScores <- NULL
means <- NULL
for (i in 1:30000) {
  Bootstrap <- sample(Sample, replace = TRUE)
  means <- rbind(means, mean(Bootstrap))
  # t-score of this resample against the observed sample mean
  SimulationtTest <- (mean(Bootstrap) - mean(Sample)) / (sd(Bootstrap) / sqrt(length(Bootstrap)))
  tScores <- rbind(tScores, SimulationtTest)
}

As a result, we got:


mean equal to 0.14

Initially, one might have expected a distribution similar in shape but centered around 0.16, resembling the "true population mean" distribution. Even though we did not recreate the exact "ground truth distribution" (the one we designed), since ours is centered at the sample mean (0.14) instead, we did recreate one that should have roughly the same shape and Standard Error, and whose range should contain our true mean.


We can compare our "bootstrapped standard error" with the "true mean standard error" by using the Central Limit Theorem together with the Standard Deviation formula for the Binomial Distribution.


Which allows us to obtain:

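The formula image is missing here; a reconstruction from the Central Limit Theorem and the binomial standard deviation (assuming the designed p = 0.16 and n of roughly 1336 throws per version):

```latex
SE_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \sqrt{\frac{p(1-p)}{n}} = \sqrt{\frac{0.16 \times 0.84}{1336}} \approx 0.01
```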

Which seems to be quite near to our bootstrapped Standard Error:


# Standard Error of the bootstrapped sampling distribution
round(sd(means), 6)

This data should be enough for us to approximate the original true mean distribution by simulating a Normal Distribution with a mean equal to 0.16 and a Standard Error of 0.01. We could find the percent of times a value as extreme as 0.14 is observed with this information.

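A minimal sketch of that lookup, assuming the normal approximation N(0.16, 0.01) described above (standard library only; not the post's original code):

```python
from math import erf, sqrt

def normal_cdf(x, mu, sigma):
    """P(X <= x) for X ~ Normal(mu, sigma)."""
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

mu, se, observed = 0.16, 0.01, 0.14

# Two-tailed: probability of landing at least |mu - observed| away from the mean
p_two_tailed = 2 * normal_cdf(mu - abs(mu - observed), mu, se)
print(round(p_two_tailed, 4))  # 0.0455
```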

Estimated Sampling Distribution vs. True Mean Distribution

As seen above, both our True Mean Distribution (green) and our Bootstrapped Sample Distribution (blue) seem very similar, except the latter is centered around 0.14.


At this point, we could solve our problem by finding the percent of times a value as extreme as 0.14 is found within our true mean distribution (the green curve). Alternatively, we could find the percent of times a value as extreme as 0.16 is found within our bootstrapped sample distribution (the blue curve). We will proceed with the latter since this post focuses on simulations based solely on our sample data.


Resuming, we need to calculate how many times we observed values as extreme as 0.16 within our bootstrapped sample distribution. It is important to note that in this case our sample mean (0.14) was lower than the expected mean of 0.16, but that is not always the case: as we saw in our results, Version D got 0.17.


In particular, we will perform a "two-tailed test", which means finding the probability of obtaining values as extreme, that is, as far from the mean, as 0.16. Since our sample mean for Version C equals 0.14, this is equivalent to saying as low as 0.12 or as high as 0.16, since both values are equally extreme.


For this case, we found:


# Expected Mean, Upper and Lower interval (0.14 and 0.16)
ExpectedMean <- 0.16
upper <- mean(means) + abs(mean(means) - ExpectedMean)
lower <- mean(means) - abs(mean(means) - ExpectedMean)
PValue <- mean(means <= lower | means >= upper)
Sum <- sum(means <= lower | means >= upper)
cat(paste("We found values as extreme: ", PValue*100, "% (", Sum, "/", length(means), ") of the times", sep=""))

This is our p-Value

Ok, we have found our p-Value, which is relatively low. Now we would like to find the 95% confidence interval of our mean, which should shed some light on which values it might take considering a Type I Error (Alpha) of 5%.


# Data aggregation
freq <- as.data.frame(table(means))
freq$means <- as.numeric(as.character(freq$means))

# Sort Ascending for right-most proportion
freq <- freq[order(freq$means, decreasing = FALSE),]
freq$cumsumAsc <- cumsum(freq$Freq)/sum(freq$Freq)
UpperMean <- min(freq$means[which(freq$cumsumAsc >= 0.975)])

# Sort Descending for left-most proportion
freq <- freq[order(freq$means, decreasing = TRUE),]
freq$cumsumDesc <- cumsum(freq$Freq)/sum(freq$Freq)
LowerMean <- max(freq$means[which(freq$cumsumDesc >= 0.975)])

# Print Results
cat(paste("95 percent confidence interval:\n ", round(LowerMean,7), " ", round(UpperMean,7), sep=""))

Note that 0.16 is not contained within this 95% confidence interval.

Let us calculate our Student's t-score, which is defined as follows:

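The formula image did not survive extraction; the One-Sample t-score it refers to is the standard one:

```latex
t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}
```

where x-bar is the sample mean (0.14), mu_0 the hypothesized mean (0.16), s the sample standard deviation, and n the sample size (1336).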

Since we already calculated this formula for every one of our 30k resamples, we can generate our critical t-Scores for 90%, 95%, and 99% confidence intervals.


# Which are the t-Values expected for each Confidence level?
library(dplyr)
histogram <- data.frame(X=tScores)
histogram %>%
  summarize(
    # Find the 0.90 quantile of the simulated t-scores
    q.90 = quantile(X, p = 0.90),
    # ... the 0.95 quantile
    q.95 = quantile(X, p = 0.95),
    # ... and the 0.99 quantile
    q.99 = quantile(X, p = 0.99)
  )

Simulated t-Values for 90%, 95% and 99%

These values are very near the original Student's Score Table for 1335 (N-1) degrees of freedom, as seen here:


Student's Score Table

Resuming, we can observe that our calculated p-Value was around 3.69%, our 95% interval did not include 0.16, and our absolute t-score of 2.1, as seen in the table, fell between the critical scores for Alpha 0.05 and 0.01. All of this points to the same outcome: we reject the null hypothesis with 95% confidence, meaning we cannot confirm that Version C's true mean is equal to 0.16.


We designed this test ourselves, and we know for sure our null hypothesis was correct. Rejecting a true null hypothesis is called a False Positive or Type I Error, which can be avoided by increasing our confidence level from 95% to, say, 99%.


So far, we have performed the equivalent of a "One-Sample t-Test" through simulations, which implies we have determined whether the "sample mean" of 0.14 was statistically different from a known or hypothesized "population mean" of 0.16, which is our ground truth.


For now, this will serve as a building block for what comes next, since we will proceed with a very similar approach to compare our Landing Versions against each other and see if there is a winner.


Finding our winner version

We have explored how to compare if a Sample Mean was statistically different from a known Population Mean; now, let us compare our Sample Mean with another Sample Mean.


For this particular example, let us compare Version F vs. Version A.


This procedure of comparing two independent samples is usually addressed with a test called the "Unpaired Two-Sample t-Test"; it is unpaired since we will use different (independent) samples; we assume they behave randomly, with normal distribution and zero covariance, as we will observe later.


If we were to use the same sample, say at different moments in time, we would use a “Paired Two Sample t-Test” which, in contrast, compares two dependent samples, and it assumes a non-zero covariance which would be reflected in the formula.


In simple words, we want to know how often we observe a positive difference in means, which is equivalent to saying Version F has a higher mean than Version A and, thus, better performance. We know our current difference in means is as follows:


There's a 3.37% positive difference between Sample F and Sample A

Since we know our Sample Means are just single measurements of the real Population Means for Version F and Version A, and not the true means of either, we need to compute the estimated sampling distribution for both versions as we did earlier. Unlike before, we will also calculate the difference in means for each resample to observe how it is distributed.


Let us simulate 40k samples with replacement for Version F and Version A and calculate the difference in means on each:


# Let's select data from Version F and Version A
VersionF <- Dices[which(Dices$Version=="F"),]
VersionA <- Dices[which(Dices$Version=="A"),]

# We simulate 40k resamples and their difference in means
Diff <- NULL
for (i in 1:40000) {
  BootstrapA <- sample(VersionA$Signup, replace = TRUE)
  BootstrapF <- sample(VersionF$Signup, replace = TRUE)
  MeanDiff <- mean(BootstrapF) - mean(BootstrapA)
  Diff <- rbind(Diff, MeanDiff)
}

# We plot the result
totals <- as.data.frame(table(Diff))
totals$Diff <- as.numeric(as.character(totals$Diff))
plot(totals$Freq ~ totals$Diff, ylab="Frequency", xlab="Difference", main="Sampling Difference Distribution")

Sampling distribution of the difference between means

As we might expect from what we learned earlier, we got a normally distributed shape centered at our previously calculated sample difference of 0.0337. As before, we also know the difference between the true population means for Versions A and F should fall within the range of this distribution.


Additionally, our bootstrap should have provided us a good approximation of the Standard Error of the difference between the true means. We can compare our “bootstrapped standard error” with the “true mean difference standard error” with both Central Limit Theorem and the Binomial Distribution.


Which allows us to obtain:

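Reconstructing the missing formula (zero covariance assumed, with the designed rates 0.2 and 0.16 and roughly 1336 samples per version):

```latex
SE_{\bar{x}_F-\bar{x}_A} = \sqrt{SE_F^2 + SE_A^2} = \sqrt{\frac{0.2 \times 0.8}{1336} + \frac{0.16 \times 0.84}{1336}} \approx 0.0148
```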

Expected Standard Error of the differences is 0.0148 with zero covariance

Just like before, it seems to be quite near our bootstrapped Standard Error for the difference of the means:


# Simulated Standard Error of the differences
round(sd(Diff), 6)

(very near our 0.0148 target)

As designed, we know the true expected difference of means is 0.04. We should have enough data to approximate a Normal Distribution with a mean equal to 0.04 and a Standard Error of 0.0148, in which case we could find the percent of times a value as extreme as 0 is found.

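Assuming a known N(0.04, 0.0148), that lookup is a short standard-library calculation (a sketch, not the post's original code):

```python
from math import erf, sqrt

def normal_cdf(x, mu, sigma):
    """P(X <= x) for X ~ Normal(mu, sigma)."""
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

# Probability of a difference in means as extreme as 0 (one tail)
p_zero = normal_cdf(0.0, 0.04, 0.0148)
print(round(p_zero, 4))  # 0.0034
```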

This scenario is unrealistic, though, since we would not usually have the population means, which is the whole purpose of estimating through confidence intervals.


Contrary to our previous case, where we compared the sample distribution of Version C against a hypothesized population mean of 0.16, here we compare two individual samples with no further information, as would happen in a real A/B test.


In particular, we want to prove that Version F is superior to Version A, meaning that the difference between means is greater than zero. For this case, we need to perform a “One-Tailed” test answering the following question: which percent of the times did we observe a difference in means greater than zero?


Our hypotheses are as follows:

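The hypothesis figure is missing; in plain notation, the one-tailed pair is:

```latex
H_0: \mu_F - \mu_A \le 0 \qquad\qquad H_1: \mu_F - \mu_A > 0
```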

The answer:


# Percent of times greater than Zero
mean(Diff > 0)

This means P-Value = 0.011

Since our p-Value represents the proportion of resamples in which we did not observe a difference in means greater than zero, we can calculate it as 0.011 (1 - 0.989). Additionally, since it is lower than 0.05 (Alpha), we can reject our null hypothesis; therefore, Version F performed better than Version A.


If we calculate both 95% confidence intervals and t-Scores for this particular test, we should obtain similar results:

如果我們?yōu)榇颂囟y(cè)試計(jì)算95%的置信區(qū)間和t分?jǐn)?shù),則我們應(yīng)獲得類似的結(jié)果:

Confidence interval at 95%:


# Data aggregation
freq <- as.data.frame(table(Diff))
freq$Diff <- as.numeric(as.character(freq$Diff))

# Right-most proportion (Inf)
UpperDiff <- Inf

# Sort Descending for left-most proportion
freq <- freq[order(freq$Diff, decreasing = TRUE),]
freq$cumsumDesc <- cumsum(freq$Freq)/sum(freq$Freq)
LowerDiff <- max(freq$Diff[which(freq$cumsumDesc >= 0.95)])

# Print Results
cat(paste("95 percent confidence interval:\n ", round(LowerDiff,7), " ", round(UpperDiff,7), sep=""))

Note our right-most interval is Infinite, since we did a one-tailed test.

As expected, our confidence interval tells us that with 95% confidence, we should expect a difference of at least 0.0097, which is above zero; therefore, it shows a better performance.


Unpaired Two-Sample t-Test score:

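The score formula image did not survive extraction; the unpaired two-sample t-statistic it refers to is:

```latex
t = \frac{\bar{x}_F - \bar{x}_A}{\sqrt{\frac{s_F^2}{n_F} + \frac{s_A^2}{n_A}}}
```

which evaluates to roughly 2.31 for our samples.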

Similar to our previous values, checking our t-Table for T = 2.31 and 2653 Degrees of Freedom, we also find a p-Value of roughly 0.01.


Student's Score Table

Pairwise Comparison

So far, we have compared our Landing Page Version C with a hypothesized mean of 0.16. We have also compared Version F with Version A and found which was the highest-performer.


Now we need to determine our absolute winner. We will do a Pairwise Comparison, meaning we will test every page against each other until we determine our absolute winner, if one exists.


Since we will run a One-Tailed test for each pair and do not need to test a version against itself, we can reduce the total number of tests, as calculated below.


# Total number of versions
VersionNumber <- 6

# Number of versions compared
ComparedNumber <- 2

# Combinations
factorial(VersionNumber)/(factorial(ComparedNumber)*factorial(VersionNumber-ComparedNumber))

As output we obtain: 15 pairs of tests.

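That count is just the combination formula evaluated for 6 versions taken 2 at a time:

```latex
\binom{6}{2} = \frac{6!}{2!\,(6-2)!} = \frac{720}{2 \times 24} = 15
```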

Note, we only need to find the values in green; each white cell is 1 minus its corresponding green pair.

We will skip the process of repeating this 15 times and jump straight to the results:


In red we have marked wherever p-Value &lt; Alpha (0.05), meaning the Row Version was superior to the Column Version.

As seen above, Version F performed better than Versions A and C, and came close to beating B, D, and E, whose p-Values were just above our selected Alpha of 5%. In contrast, Version C seems to be an extraordinary case: both D and E appear to show a difference in means greater than zero against it, which we know is impossible since all three were designed with the same probability of 0.16.


In other words, we failed to reject our Null Hypothesis at 95% confidence even though it is false for F vs. B, D, and E; this situation (Type II Error) can be addressed by increasing our Statistical Power. In contrast, we rejected a true null hypothesis for D vs. C and E vs. C, which indicates we incurred a Type I Error; this can be addressed by defining a lower Alpha, that is, a higher Confidence level.


We indeed designed our test to have an 80% statistical power. However, we designed it solely for testing differences between our total observed and expected frequencies, not for testing differences between individual means. In other words, we have switched from a "Chi-Squared Test" to an "Unpaired Two-Sample t-Test".


Statistical Power

We have obtained our results. Even though we could use them as-is and select the ones with the highest overall differences, such as the ones with the lowest P-Values, we might want to re-test some of the variations in order to be entirely sure.


As we saw in our last post, Power is calculated as follows:

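The formula image is missing; in symbols, the relationship is:

```latex
\text{Power} = 1 - \beta = P(\text{reject } H_0 \mid H_1 \text{ is true})
```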

Similarly, Power is a function of:


  • Our significance criterion is our Type I Error or Alpha, which we decided to be 5% (95% confidence).


  • Effect Magnitude or Size: This represents the difference between our observed and expected values regarding the standardized statistic of use. In this case, since we are using a Student’s Test Statistic, this effect (named d) is calculated as the “difference between means” divided by the “Pooled Standard Error”. It is usually categorized as Small (0.2), Medium (0.5), and Large (0.8).


  • Sample size: This represents the total amount of samples (in our case, 8017).


Effect Magnitude

We designed an experiment with a relatively small effect magnitude, since our die was only biased on one face (6), with only a slightly higher chance of landing in its favor.


In simple words, our effect magnitude (d) is calculated as follows:

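Reconstructing the missing formula (Cohen's d with a pooled standard deviation, evaluated here with the designed rates and their binomial variances 0.16 and 0.1344):

```latex
d = \frac{\bar{x}_1 - \bar{x}_2}{s_{pooled}} = \frac{0.20 - 0.16}{\sqrt{\frac{0.16 + 0.1344}{2}}} \approx 0.10
```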

(note the relationship with the t-score formula we previously used)

If we calculate this for the expected values of Version F vs. Version A, using the formulas we have learned so far, we obtain:


Note that our calculated effect size is 50% of what is usually considered a "small effect"

Sample Size

As we commented in our last post, we can expect an inverse relationship between sample sizes and effect magnitude. The more significant the effect, the lower the sample size needed to prove it at a given significance level.

正如我們?cè)谏弦黄恼轮性u(píng)論的那樣,我們可以預(yù)期樣本量與效應(yīng)量之間存在反比關(guān)系。 效果越顯著,在給定的顯著性水平下證明該結(jié)果所需的樣本量越少。

Let us try to find the sample size needed in order to have a 90% Power. We can solve this by iterating different values of N until we minimize the difference between our Expected Power and the Observed Power.


# Basic example on how to obtain a given N based on a target Power.
# Playing with initialization variables might be needed for different scenarios.

set.seed(11)
CostFunction <- function(n, d, p) {
  df <- (n - 1) * 2
  tScore <- qt(0.05, df, lower = FALSE)
  value <- pt(tScore, df, ncp = sqrt(n/2) * d, lower = FALSE)
  Error <- (p - value)^2
  return(Error)
}

SampleSize <- function(d, n, p) {
  # Initialize variables
  N <- n
  h <- 0.000000001
  LearningRate <- 3000000
  HardStop <- 20000
  power <- 0
  # Iteration loop: gradient descent on the squared power error
  for (i in 1:HardStop) {
    dNdError <- (CostFunction(N + h, d, p) - CostFunction(N, d, p)) / h
    N <- N - dNdError * LearningRate
    tLimit <- qt(0.05, (N - 1) * 2, lower = FALSE)
    new_power <- pt(tLimit, (N - 1) * 2, ncp = sqrt(N/2) * d, lower = FALSE)
    if (round(power, 6) >= p) {
      cat(paste0("Found in ", i, " Iterations\n"))
      cat(paste0(" Power: ", round(power, 2), "\n"))
      cat(paste0(" N: ", round(N)))
      break()
    }
    power <- new_power
  }
}

set.seed(22)
SampleSize((0.2-0.16)/sqrt((0.16+0.1344)/2), 1336, 0.9)

As seen above, after several iterations over N, we obtained a recommended sample of 1576 per die to reach 0.9 Power.


Let us repeat the experiment from scratch and see which results we get with this new sample size of 9456 (1576*6), as suggested by aiming for a solid Statistical Power of 0.9.


# Repeat our experiment with sample size 9456
set.seed(11)
Dices <- DiceRolling(9456) # We expect 90% Power
t(table(Dices))

Version F seems to be the winner again with 339 Signups

Let us make a quick sanity check to see if our experiment now has a Statistical Power of 90% before we proceed; this can be answered by asking the following question:


  • If we were to repeat our experiment X times and calculate the P-Value for each experiment, what percent of the time should we expect a P-Value as extreme as 5%?

    如果我們要重復(fù)實(shí)驗(yàn)X次并在每個(gè)實(shí)驗(yàn)中計(jì)算出我們的P值(占百分比的百分比),那么我們應(yīng)該期望P值達(dá)到5%的極限嗎?

Let us try answering this question for Version F vs. Version A:


# Proving by simulation
MultipleDiceRolling <- function(k, N) {
  pValues <- NULL
  for (i in 1:k) {
    Dices <- DiceRolling(N)
    VersionF <- Dices[which(Dices$Version == "F"), ]
    VersionA <- Dices[which(Dices$Version == "A"), ]
    pValues <- cbind(pValues, t.test(VersionF$Signup, VersionA$Signup,
                                     alternative = "greater")$p.value)
  }
  return(pValues)
}

# Let us replicate our experiment (9456 throws of a biased dice) 10k times
start_time <- Sys.time()
Rolls <- MultipleDiceRolling(10000, 9456)
end_time <- Sys.time()
end_time - start_time

How many times did we observe P-Values as extreme as 5%?


cat(paste(length(which(Rolls <= 0.05)),"Times"))

Which percent of the times did we observe this scenario?


Power <- length(which(Rolls <= 0.05))/length(Rolls)
cat(paste(round(Power*100,2),"% of the times (",length(which(Rolls <= 0.05)),"/",length(Rolls),")",sep=""))

As calculated above, we observe a Power equivalent to roughly 90% (0.896), which proves our new sample size works as planned. This implies we have a 10% (1 — Power) probability of making a Type II Error or, equivalently, a 10% chance of failing to reject our Null Hypothesis at a 95% confidence interval even though it is false, which is acceptable.

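The same figure can also be obtained analytically from the noncentral t distribution, with no simulation at all; a quick sketch using the sample size of 1576 per dice:

```r
# Analytic power for n = 1576 per group at a one-sided 5% significance level.
n <- 1576
d <- (0.2 - 0.16) / sqrt((0.16 + 0.1344) / 2) # effect size used throughout
df <- (n - 1) * 2
tLimit <- qt(0.05, df, lower = FALSE) # critical value under the null
AnalyticPower <- pt(tLimit, df, ncp = sqrt(n / 2) * d, lower = FALSE)
round(AnalyticPower, 3) # roughly 0.9, in line with the simulated 0.896
```

The 10,000-replication simulation is essentially a Monte Carlo estimate of this single number.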

Absolute winner

Finally, let us proceed on finding our absolute winner by repeating our Pairwise Comparison with these new samples:

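The full pairwise table can be produced with a short loop over every ordered pair of versions. This is a sketch, assuming `Dices` is the data frame of 9456 throws generated above (columns `Version` and `Signup`); each cell holds the one-sided p-value for the row version outperforming the column version:

```r
# Pairwise one-sided t-tests between all dice versions.
PairwiseMatrix <- function(Dices) {
  Versions <- sort(unique(as.character(Dices$Version)))
  k <- length(Versions)
  pValues <- matrix(NA, k, k, dimnames = list(Versions, Versions))
  for (Row in Versions) {
    for (Col in setdiff(Versions, Row)) {
      # H1: the row version's true signup mean exceeds the column version's
      pValues[Row, Col] <- t.test(Dices$Signup[Dices$Version == Row],
                                  Dices$Signup[Dices$Version == Col],
                                  alternative = "greater")$p.value
    }
  }
  round(pValues, 4) # cells below 0.05 mark a significant winner
}
PairwiseMatrix(Dices)
```

Note that running many pairwise tests inflates the family-wise error rate; a correction such as `p.adjust(..., method = "bonferroni")` could be applied to the resulting matrix if stricter control is needed.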

In red we have marked every cell where our p-Value < Alpha (0.05), which means the Row Version was superior to the Column Version.

As expected, our absolute winner is Version F amongst all other versions. Additionally, it is also clear now that there is no significant difference between any other version’s true means.


Final Thoughts

We have explored how to perform simulations for two types of tests: the Chi-Squared test and Student's t-test for one and two independent samples. Additionally, we have examined concepts such as Type I and Type II errors, Confidence Intervals, and the calculation and interpretation of Statistical Power in both scenarios.


It is worth noting that, in typical use cases, performing such tests with specialized functions would save considerable time and even yield more accurate results, so following this simulation path is not recommended in practice. The value of this kind of exercise lies in helping us develop a more intuitive understanding, which is what I wanted to achieve.

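For instance, since signups are binary, the specialized power.prop.test function plans the same comparison directly on proportions in a single call; a sketch using the signup rates assumed earlier:

```r
# One-line sample-size planning on the raw signup proportions.
power.prop.test(p1 = 0.20, p2 = 0.16, sig.level = 0.05, power = 0.9,
                alternative = "one.sided")
# n per group lands close to the 1576 derived with the t-based approach above
```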

If you have any questions or comments, do not hesitate to post them below.


Translated from: https://towardsdatascience.com/intuitive-simulation-of-a-b-testing-part-ii-8902c354947c
