An Intuitive Simulation of A/B Testing

Published: 2023/11/29


Many of us have heard about, read about, or even performed an A/B test before, which means we have conducted a statistical test at some point. Most of the time, we have worked with data from first- or third-party sources and performed these tests with ease, using tools that range from Excel to statistical software and even more automated solutions such as Google Optimize.

If you are like me, you might be curious about how these types of tests work and how concepts such as Type I and Type II Error, Confidence Intervals, Effect Magnitude, Statistical Power, and others interact with each other.


In this post, I would like to invite you to take a different approach for one specific type of A/B test, which makes use of a particular statistic called Chi-Squared. In particular, I will try to explore and walk through this type of test by taking the great but long road of simulations, avoiding libraries and tables, hopefully managing to explore and build some of the intuition behind it.


Before we start

Even though we could use data from our past experiments or even from third-party sources such as Kaggle, it is more convenient for this post to generate our own data. This allows us to compare our conclusions with a known ground truth, which would otherwise most likely be unknown.

For this example, we will generate a dummy dataset representing six different versions of a signup form and the number of leads we observed on each. For this dummy set to be random and still have a winning version that serves as our ground truth, we will generate the table by simulating throws of a biased dice.

For this, we wrote an R function that simulates a biased dice with a 20% probability of landing on 6 and a 16% chance of landing on any other number.

# Biased Dice Rolling Function
DiceRolling <- function(N) {
  Dices <- data.frame()
  for (i in 1:6) {
    if(i==6) {
      Observed <- data.frame(Version=as.character(LETTERS[i]),Signup=rbinom(N/6,1,0.2))
    } else {
      Observed <- data.frame(Version=as.character(LETTERS[i]),Signup=rbinom(N/6,1,0.16))
    }
    Dices <- rbind(Dices,Observed)
  }
  return(Dices)
}

# Let's roll some dices
set.seed(123) # this is for result replication
Dices <- DiceRolling(1800)

Think of each Dice number as a representation of a different landing version (1–6 or A-F). For each version, we will throw our Dice 300 times, and we will write down its results as follows:


If we are on version A (1), throw the Dice, and it lands on 1, we count it as a Signup; otherwise, it is just a visit.
We repeat this 300 times for each version.

Sample Data

As commented earlier, this is what we got:


# We shuffle our results
set.seed(25)
rows <- sample(nrow(Dices))
t(Dices[head(rows,10),])

Note that we marked Signups as a binary categorical variable (1 for Signup and 0 for Didn’t Signup).

We can observe from our first ten results that we got one Signup for F, D, and A. In aggregated terms, our table looks like this:


# We aggregate signups per version before plotting them
Result <- aggregate(Signup ~ Version, Dices, sum)
t(Result)

library(ggplot2)
ggplot(Result, aes(x=Version, y=Signup)) + geom_bar(stat="identity", position="dodge") + ggtitle("Summary Chart")

As designed, there’s a higher proportion of F’s (6).

From now on, think of this table as dice throws, eCommerce conversions, survey answers, or landing page signup conversions, as we will do here. It does not matter which; use whatever is most intuitive for you.

For us, it will be signups, so we should produce this report:


Note the variability of the proportions (conversion rates) and how they differ from the “ground truth” ones we planned.
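A report of this kind can be put together with a few lines of base R. The sketch below is a minimal illustration rather than the author's code: it regenerates a comparable dummy dataset (the seed and the names `rates`, `sim` and `report` are assumptions made for this example), so its counts will differ from the figures in this post.

```r
# Hypothetical re-creation of the dummy data: 300 visits per version,
# with a 16% signup rate for A-E and 20% for F (assumed seed).
set.seed(1)
rates <- c(A = 0.16, B = 0.16, C = 0.16, D = 0.16, E = 0.16, F = 0.20)
sim <- data.frame(
  Version = rep(names(rates), each = 300),
  Signup  = rbinom(6 * 300, 1, rep(rates, each = 300))
)

# Signups and visits per version, then the observed conversion rate
signups <- tapply(sim$Signup, sim$Version, sum)
visits  <- tapply(sim$Signup, sim$Version, length)
report <- data.frame(
  Version = names(signups),
  Visits  = as.vector(visits),
  Signups = as.vector(signups),
  Rate    = as.vector(signups / visits)
)
print(report)
```

Running it a few times with different seeds gives a feel for how much the observed rates wander around the planned 16% and 20%.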

Observed Frequencies

We will now aggregate our results, including both our Signup (1) and Did not Signup (0) results, which will allow us to understand better how these differ from our expected values or frequencies; this is also called a Cross Tabulation or Contingency Table.


# We generate our contingency table
Observed <- table(Dices)
t(Observed)

In summary:


Expected Frequencies

Now that we know what our Cross Tabulation looks like, we can generate a table showing what we should expect our results to be if all versions performed the same. This is equivalent to saying that each version had the same signup conversion rate (probability) or, if you prefer, the expected result of an unbiased dice.

# We generate our expected frequencies table
Expected <- Observed
Expected[,1] <- (sum(Observed[,1])/nrow(Observed))
Expected[,2] <- sum(Observed[,2])/nrow(Observed)
t(Expected)
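More generally, the expected count of a cell under independence is its row total times its column total divided by the grand total; the code above is the special case in which every version has the same number of visits. A minimal sketch, assuming a freshly simulated table (the seed and the names `sim`, `obs`, `exp_counts` and `shortcut` are illustrative only):

```r
set.seed(1)  # assumed seed, only for reproducibility of the sketch
sim <- data.frame(
  Version = rep(LETTERS[1:6], each = 300),
  Signup  = rbinom(1800, 1, 0.16)
)
obs <- table(sim)  # 6 x 2 contingency table: Version x Signup

# Expected count per cell = row total * column total / grand total
exp_counts <- outer(rowSums(obs), colSums(obs)) / sum(obs)
print(round(exp_counts, 1))

# The shortcut used in this post: column total / number of versions
shortcut <- matrix(colSums(obs) / nrow(obs),
                   nrow = nrow(obs), ncol = ncol(obs), byrow = TRUE)
print(all.equal(as.vector(exp_counts), as.vector(shortcut)))
```

The general form matters when traffic is not split evenly, in which case the shortcut would no longer give the right expected counts.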

In summary:


Hypothesis Testing

We know our test had a higher-performing version not only by visually inspecting the results but because we purposely designed it to be that way.


This is the moment we have waited for: is it possible for us to prove this solely based on the results we got?

The answer is yes, and the first step is to define our Null and Alternative Hypotheses, which we will later try to accept or reject.

Our alternative hypothesis (H1) is what we want to prove correct: it states that there is, in fact, a relationship between the landing version and the result we observed. In contrast, our null hypothesis (H0) states that there is no relationship, meaning there is no significant difference between our observed and expected frequencies.

Statistic

Our goal is to find how often our observed data is located in a universe where our null hypothesis is correct, meaning, where our observed and expected signup frequencies have no significant difference.


A useful statistic that can sum up all these values, six columns (one for each version) and two rows (one for each signup state), into a single value is Chi-Square, which is calculated as follows:
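Written out in the usual Pearson form, which is consistent with the R expression used in this post to compute it, the statistic sums the squared gap between observed counts O and expected counts E over every version i and signup state j:

```latex
\chi^2 = \sum_{i=1}^{6} \sum_{j=1}^{2} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}
```

Dividing by the expected count puts cells with very different sizes (signups versus non-signups) on a comparable scale.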

We will not go into the details of how this formula is derived, nor into its assumptions or requirements (such as the Yates Correction), because that is not the subject of this post. Instead, we would like to take a numerical approach through simulations, which should shed some light on these types of hypothesis tests.

Continuing, if we compute this formula with our data, we get:

# We calculate our X^2 score
Chi <- sum((Expected-Observed)^2/Expected)
Chi
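As a sanity check, the hand-computed statistic can be compared with base R's `chisq.test`, which applies no continuity correction to a 6 x 2 table and uses (6 - 1) x (2 - 1) = 5 degrees of freedom. The sketch below runs on freshly simulated data (the seed and variable names are assumptions), so its value will not match the score obtained in this post.

```r
set.seed(1)  # assumed seed, only for reproducibility of the sketch
sim <- data.frame(
  Version = rep(LETTERS[1:6], each = 300),
  Signup  = rbinom(1800, 1, rep(c(0.16, 0.16, 0.16, 0.16, 0.16, 0.20), each = 300))
)
obs <- table(sim)

# Expected counts when every version shares the same signup rate
exp_counts <- obs
exp_counts[, 1] <- sum(obs[, 1]) / nrow(obs)
exp_counts[, 2] <- sum(obs[, 2]) / nrow(obs)

# Manual statistic versus the built-in test
chi_manual  <- sum((exp_counts - obs)^2 / exp_counts)
builtin     <- chisq.test(obs)
chi_builtin <- unname(builtin$statistic)
df          <- unname(builtin$parameter)

cat("Manual:", chi_manual, " chisq.test:", chi_builtin, " df:", df, "\n")
```

The two statistics agree because each version has the same number of visits, so the column-total shortcut yields the same expected counts as the built-in test.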

Null Distribution Simulation

We need to obtain the probability of finding a statistic as extreme as the one we observed, which in this case, is represented by Chi-Square equal to 10.368. This, in terms of probability, is also known as our P-Value.


For this, we will simulate a Null Distribution as a benchmark. This means we need to generate a scenario in which our Null Hypothesis is correct, that is, a situation where there is no relationship between the landing version and the observed signup results (frequencies) we got.

A solution that rapidly comes to mind is to repeat our experiment from scratch, either by re-collecting results many times or, as in the context of this post, using an unbiased dice to compare how our observed results behave in contrast to these tests. Even though this might seem intuitive at first, in real-world scenarios, this solution might not be the most efficient one since it would require extreme use of resources such as time and budget to repeat this A/B test many times.


Resampling

An excellent solution to the problem discussed above is called resampling. What resampling does is make one variable independent of the other by shuffling one of them randomly. If there were an initial relationship between them, this relation would be lost due to the random sampling method.


In particular, we need to use the original (unaggregated) samples for this scenario. We will then permute one of the columns several times, in this case the Signup status.

In particular, let us see an example of 2 shuffles for the first “10 samples” shown earlier:
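A toy version of such a shuffle can be sketched as follows. The ten rows are made up for illustration (they are not the actual first ten samples of this post), and the seed and the names `toy`, `shuffle1` and `shuffle2` are assumptions.

```r
# Ten made-up visits (hypothetical values)
toy <- data.frame(
  Version = c("F", "C", "D", "A", "B", "E", "A", "D", "F", "B"),
  Signup  = c(1, 0, 1, 1, 0, 0, 0, 0, 0, 0)
)

set.seed(1)  # assumed seed, only for reproducibility of the sketch
shuffle1 <- transform(toy, Signup = sample(Signup))
shuffle2 <- transform(toy, Signup = sample(Signup))

# Versions stay in place; only the Signup labels move around
print(cbind(toy, Shuffle1 = shuffle1$Signup, Shuffle2 = shuffle2$Signup))
```

The total number of signups is identical in every shuffle; what changes is which version they are attributed to, and that is exactly the link the null hypothesis says does not exist.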


Let us try it now with the complete (1800) sample set:


Permutation #1


Perm1 <- Dices
set.seed(45)
Perm1$Signup <- sample(Dices$Signup)
ResultPerm1 <- aggregate(Signup ~ Version, Perm1, sum)
cat("Permutation #1:\n\n")
cat("Summary\n\n")
t(ResultPerm1)
cat("Chi-Squared")
Perm1Observed <- table(Perm1)
sum((Expected-Perm1Observed)^2/Expected)

Permutation #2


Perm1 <- Dices
set.seed(22)
Perm1$Signup <- sample(Dices$Signup)
ResultPerm1 <- aggregate(Signup ~ Version, Perm1, sum)
cat("Permutation #2:\n\n")
cat("Summary\n\n")
t(ResultPerm1)
cat("Chi-Squared")
Perm1Observed <- table(Perm1)
sum((Expected-Perm1Observed)^2/Expected)

As seen in both permutations of our data, we got utterly different summaries and Chi-Squared values. We will repeat this process a bunch of times to explore what we can obtain at a massive scale.


Simulation

Let us simulate 15k permutations of our data.


# Simulation Function
Simulation <- function(Dices,k) {
  dice_perm <- data.frame()
  # Local working copy, so the function does not depend on a global variable
  permutation <- Dices
  i <- 0
  while(i < k) {
    i <- i + 1
    # We permutate our Results
    permutation$Signup <- sample(Dices$Signup)
    # We generate our contingency table
    ObservedPerm <- table(permutation)
    # We generate our expected frequencies table
    ExpectedPerm <- ObservedPerm
    ExpectedPerm[,1] <- (sum(ObservedPerm[,1])/nrow(ObservedPerm))
    ExpectedPerm[,2] <- sum(ObservedPerm[,2])/nrow(ObservedPerm)
    # We calculate X^2 test for our permutation
    ChiPerm <- sum((ExpectedPerm-ObservedPerm)^2/ExpectedPerm)
    # We add our test value to a new dataframe
    dice_perm <- rbind(dice_perm,data.frame(Permutation=i,ChiSq=ChiPerm))
  }
  return(dice_perm)
}

# Let's resample our data 15,000 times
start_time <- Sys.time()
set.seed(12)
permutation <- Simulation(Dices,15000)
end_time <- Sys.time()
end_time - start_time # Time spent running the simulation
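Because shuffling only reorders the Signup labels, the row and column totals of the table never change, so the expected frequencies are the same in every permutation and can be computed once. The sketch below uses that observation together with `replicate` to avoid growing a data frame inside a loop. It is an alternative under stated assumptions (freshly simulated data, an arbitrary seed, only 2,000 permutations to keep it quick, and a hypothetical helper named `chi_stat`), not the method used in this post.

```r
set.seed(1)  # assumed seed, only for reproducibility of the sketch
sim <- data.frame(
  Version = rep(LETTERS[1:6], each = 300),
  Signup  = rbinom(1800, 1, rep(c(0.16, 0.16, 0.16, 0.16, 0.16, 0.20), each = 300))
)

obs        <- table(sim$Version, sim$Signup)
exp_counts <- outer(rowSums(obs), colSums(obs)) / sum(obs)  # fixed across permutations

# Chi-Square statistic for a given vector of signup labels (hypothetical helper)
chi_stat <- function(signup) {
  sum((table(sim$Version, signup) - exp_counts)^2 / exp_counts)
}

chi_observed <- chi_stat(sim$Signup)
chi_null     <- replicate(2000, chi_stat(sample(sim$Signup)))

# Share of permutations at least as extreme as the observed statistic
p_value <- mean(chi_null >= chi_observed)
cat("Observed X^2:", chi_observed, " permutation p-value:", p_value, "\n")
```

Returning a plain numeric vector also makes the later steps shorter, since quantiles and tail proportions can be taken from `chi_null` directly.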

Resample Distribution

As we can observe below, the Chi-Squared values of our 15k permutations follow a distinct shape that resembles, as expected, a Chi-Square distribution. With this information, we can now calculate in how many of the 15k iterations we observed a Chi-Squared value as extreme as our initial calculation of 10.368.

totals <- as.data.frame(table(permutation$ChiSq))
totals$Var1 <- as.numeric(as.character(totals$Var1))
plot(totals$Freq ~ totals$Var1, ylab="Frequency", xlab="Chi-Squared Values", main="Null Distribution")

P-Value

Let us calculate how many times we obtained a Chi-Square value equal to or higher than 10.368 (our calculated score).


Higher <- nrow(permutation[which(permutation$ChiSq >= Chi),])
Total <- nrow(permutation)
prob <- Higher/Total
cat(paste("Total Number of Permutations:",Total,"\n"))
cat(paste(" - Total Number of Chi-Squared Values equal to or higher than",round(Chi,2),":",Higher,"\n"))
cat(paste(" - Percentage of times it was equal to or higher (",Higher,"/",Total,"): ",round(prob*100,3),"% (P-Value)",sep=""))

Decision Limits

We now have our P-Value, which means that if the Null Hypothesis is correct, that is, if there is no relationship between Version and Signups, we should encounter a Chi-Square this extreme only 6.5% of the time. If we think of this purely as dice results, we should expect “results as biased as ours” from an unbiased dice at most 6.5% of the time.

Now we need to define our decision limits on which we accept or reject our null hypothesis.


We calculated our decision limits for the 90%, 95%, and 99% confidence levels, that is, the Chi-Squared values we should expect to be exceeded only 10%, 5%, and 1% of the time when the null hypothesis is true.

# Decision Limits
totals <- as.data.frame(table(permutation$ChiSq))
totals$Var1 <- as.numeric(as.character(totals$Var1))
totals$Prob <- cumsum(totals$Freq)/sum(totals$Freq)
Interval90 <- totals$Var1[min(which(totals$Prob >= 0.90))]
Interval95 <- totals$Var1[min(which(totals$Prob >= 0.95))]
Interval975 <- totals$Var1[min(which(totals$Prob >= 0.975))]
Interval99 <- totals$Var1[min(which(totals$Prob >= 0.99))]
cat(paste("Chi-Squared Limit for 90%:",round(Interval90,2),"\n"))
cat(paste("Chi-Squared Limit for 95%:",round(Interval95,2),"\n"))
cat(paste("Chi-Squared Limit for 99%:",round(Interval99,2),"\n"))

Fact Check

Chi-Square Distribution Table

As the classical “Chi-Square Distribution Table” shows, the theoretical values are very similar to the ones we obtained from our simulation, which means our decision limits and P-Values should be very accurate.
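The table lookup can also be done in R with the theoretical distribution functions. The sketch below assumes 5 degrees of freedom, (6 versions - 1) x (2 signup states - 1), and reuses the statistic of 10.368 reported earlier; the variable names are illustrative.

```r
df <- (6 - 1) * (2 - 1)  # versions minus one, times signup states minus one

# Critical values of the theoretical Chi-Square distribution
limits <- qchisq(c(0.90, 0.95, 0.99), df = df)
names(limits) <- c("90%", "95%", "99%")
print(round(limits, 3))

# Theoretical tail probability of the statistic reported in this post
p_theory <- pchisq(10.368, df = df, lower.tail = FALSE)
print(round(p_theory, 4))
```

The theoretical tail probability should come out close to the 6.5% obtained by permutation, and the critical values can be compared directly with the simulated decision limits.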

Hypothesis Testing

As we expected, we can reject the Null Hypothesis and claim that there is a significant relationship between versions and signups. Still, there is a small caveat, and this is our level of confidence. As the calculations above show, our P-Value (6.5%) falls between the 90% and 95% confidence levels, which means that, even though we can reject our Null Hypothesis with 90% confidence, we cannot reject it at 95% or any higher confidence level.

If we claim to have 90% confidence, then we are also claiming there is a 10% chance of wrongly rejecting our null hypothesis (also called Type I Error, False Positive, or Alpha). Note that, in practice, such standard if arbitrary values (90%, 95%, 99%) are used, but we could just as easily claim we are 93.5% certain, since we calculated a 6.5% probability of a Type I Error.

Interestingly, even though we know for sure there is a relationship between version and signups, we cannot prove it by mere observation, by simulation, or by running this hypothesis test at a standard 95% confidence level. Failing to reject our Null Hypothesis even though we know it is wrong is called a false negative or Type II Error (Beta). Its probability depends on the Statistical Power of the test, which measures the probability that this does not happen.

Statistical Power

In our hypothesis test, we saw we were unable to reject our Null Hypothesis at standard confidence levels of 95% or more. This is due to the Statistical Power (or Power) of the test we randomly designed, which is particularly sensitive to the statistical significance criterion discussed above (alpha or Type I Error), the effect magnitude, and the sample size.

Power is calculated as follows:
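In the standard definition, which is assumed here to be what the original formula showed, Power is the probability of rejecting the null hypothesis when the alternative is true, that is, one minus the Type II Error rate (Beta):

```latex
\text{Power} = P(\text{reject } H_0 \mid H_1 \text{ is true}) = 1 - \beta
```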


In particular, we can calculate our current statistical Power by answering the following question:


If we were to repeat our experiment X times and calculate our P-Value for each experiment, in what percentage of them should we expect a P-Value of 5% or less?

Let us try answering this question:


MultipleDiceRolling <- function(k,N) {
  pValues <- NULL
  for (i in 1:k) {
    # Each iteration is a brand-new experiment with N throws
    Dices <- DiceRolling(N)
    Observed <- table(Dices)
    pValues <- c(pValues,chisq.test(Observed)$p.value)
  }
  return(pValues)
}

# Let's replicate our experiment (1800 throws of a biased dice) 10k times
start_time <- Sys.time()
Rolls <- MultipleDiceRolling(10000,1800)
end_time <- Sys.time()
end_time - start_time

How many times did we observe P-Values as extreme as 5%?


cat(paste(length(which(Rolls <= 0.05)),"Times"))

In what percentage of the experiments did we observe this scenario?

Power <- length(which(Rolls <= 0.05))/length(Rolls)
cat(paste(round(Power*100,2),"% of the times (",length(which(Rolls <= 0.05)),"/",length(Rolls),")",sep=""))

Only 21.91% of the time do we encounter P-Values this small.

As calculated above, we observe a Power of 21.91% (0.219), which is quite low since the gold standard is around 0.8 or even 0.9 (90%). In other words, we have a 78.09% (1 - Power) probability of making a Type II Error or, equivalently, a 78% chance of failing to reject our Null Hypothesis at a 95% confidence level even though it is false, which is what happened here.
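The simulated Power can be cross-checked against theory. Under the alternative hypothesis, the statistic approximately follows a noncentral Chi-Square distribution with noncentrality w^2 * N, the same relation used by the sample size code later in this post. The sketch below plugs in the effect size w = 0.04 reported in the Effect Magnitude section; the variable names are illustrative assumptions.

```r
alpha <- 0.05
df    <- 5      # (6 versions - 1) * (2 signup states - 1)
w     <- 0.04   # effect size reported in this post
n     <- 1800   # total sample size

# Rejection threshold under the null, then the chance of exceeding it under H1
critical <- qchisq(1 - alpha, df = df)
power    <- pchisq(critical, df = df, ncp = w^2 * n, lower.tail = FALSE)
cat("Theoretical power:", round(power, 3), "\n")
```

The result should land in the neighborhood of the simulated figure; small differences are expected because the simulation has sampling noise and the noncentral distribution is itself an approximation.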

As mentioned, Power is a function of:


Our significance criterion: this is our Type I Error or Alpha, which we decided to be 5% (95% confidence).
Effect Magnitude or Size: this represents the difference between our observed and expected values in terms of the standardized statistic in use. In this case, since we used the Chi-Square statistic, this effect (named w) is calculated as the square root of the normalized Chi-Square value and is usually categorized as Small (0.1), Medium (0.3), or Large (0.5) (Ref: Cohen, J. (1988)).
Sample size: this represents the total number of samples (in our case, 1800).

Effect Magnitude

We designed an experiment with a relatively small effect magnitude since our Dice was only biased in one face (6) with only a slight additional chance of landing in its favor.


In simple words, our effect magnitude (w) is calculated as follows:

1) Where our Observed Proportions are calculated as follows:

Table: Probabilities of our alternative hypothesis, divided by the total (6)

2) And our Expected Proportions:

Table: Probabilities of our null hypothesis, divided by the sum (6)

Finally, we can obtain our effect size as follows:

Our effect size is 0.04 and, as expected, is considered very small (<0.1).
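The calculation can be sketched in R using Cohen's w, the square root of the sum over all cells of (observed proportion - expected proportion)^2 / expected proportion. The cell proportions below are built from the design probabilities of this post (16% for versions A to E, 20% for F), each divided by 6 because every version receives one sixth of the traffic; the variable names are illustrative assumptions.

```r
# Signup probability per version under the alternative hypothesis (the biased dice)
p_alt <- c(rep(0.16, 5), 0.20)
# Under the null hypothesis every version shares the average rate
p_null <- rep(mean(p_alt), 6)

# Cell proportions of the 2 x 6 table: each version gets 1/6 of the traffic
P1 <- rbind(signup = p_alt,  no_signup = 1 - p_alt)  / 6
P0 <- rbind(signup = p_null, no_signup = 1 - p_null) / 6

# Cohen's w
w <- sqrt(sum((P1 - P0)^2 / P0))
print(round(w, 4))
```

Both proportion tables sum to 1, and the resulting w reproduces the 0.04 quoted above.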

Sample Size

As with our effect size, our sample size, even though it seems large enough (1800), is not big enough to spot the relationship (or bias) at 95% confidence, since our effect size, as we calculated, was very small. We can expect an inverse relationship between sample size and effect magnitude: the larger the effect, the smaller the sample size needed to prove it at a given significance level.

At this point, it might be more comfortable to think of the sample size of our A/B test as dice or even coin throws. It is somewhat intuitive that with one dice/coin throw we will be unable to spot a biased dice/coin, but if 1800 throws are not enough to detect this small effect at a 95% confidence level, this leads us to the following question: how many throws do we need?

The same principle applies to the sample size of our A/B test. The smaller the effect, such as small variations in conversion from small changes in each version (colors, fonts, buttons), the bigger the sample and, therefore, the longer the time we need to collect the data required to accept or reject our hypothesis. A common problem in many A/B tests of website conversion in eCommerce is that tools such as Google Optimize can take many days, if not weeks, and most of the time we still do not get a conclusive answer.

To solve this, first, we need to define the Statistical Power we want. Next, we will try answering this question by iterating different values of N until we minimize the difference between our Expected Power and the Observed Power.


# Basic example on how to obtain a given N based on a target Power.
# Playing with initialization variables might be needed for different scenarios.
CostFunction <- function(n,w,p) {
  value <- pchisq(qchisq(0.05, df = 5, lower.tail = FALSE), df = 5, ncp = (w^2)*n, lower.tail = FALSE)
  Error <- (p-value)^2
  return(Error)
}

SampleSize <- function(w,n,p) {
  # Initialize variables
  N <- n
  h <- 0.000000001
  LearningRate <- 40000000
  HardStop <- 20000
  power <- 0

  # Iteration loop (gradient descent on the squared Power error)
  for (i in 1:HardStop) {
    dNdError <- (CostFunction(N + h,w,p) - CostFunction(N,w,p)) / h
    N <- N - dNdError*LearningRate
    ChiLimit <- qchisq(0.05, df = 5, lower.tail = FALSE)
    new_power <- pchisq(ChiLimit, df = 5, ncp = (w^2)*N, lower.tail = FALSE)
    # Stop as soon as the Power no longer increases
    if(round(power,6) >= round(new_power,6)) {
      cat(paste0("Found in ",i," Iterations\n"))
      cat(paste0(" Power: ",round(power,2),"\n"))
      cat(paste0(" N: ",round(N)))
      break
    }
    power <- new_power
  }
}
set.seed(22)
SampleSize(0.04,1800,0.8)
SampleSize(0.04,1800,0.9)
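Since Power increases monotonically with N, the same question can also be answered with a one-dimensional root finder instead of a hand-tuned learning rate. The sketch below swaps the gradient descent for base R's `uniroot`; the helper names `power_for_n` and `required_n` and the search interval are assumptions chosen for w = 0.04, not part of the original post.

```r
# Power of the Chi-Square test for a total sample size n (hypothetical helper)
power_for_n <- function(n, w, df = 5, alpha = 0.05) {
  critical <- qchisq(1 - alpha, df = df)
  pchisq(critical, df = df, ncp = w^2 * n, lower.tail = FALSE)
}

# Sample size (as a real number) at which the Power reaches the target.
# The interval is assumed wide enough for w = 0.04; widen it for smaller effects.
required_n <- function(target, w) {
  uniroot(function(n) power_for_n(n, w) - target, interval = c(10, 40000))$root
}

n80 <- required_n(0.8, w = 0.04)
n90 <- required_n(0.9, w = 0.04)
cat("N for 80% power:", ceiling(n80), "\n")
cat("N for 90% power:", ceiling(n90), "\n")
```

The results should land very close to the sample sizes found by the iterative search, with no learning rate or stopping rule to tune.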

As seen above, after iterating over different values of N, we obtained recommended sample sizes of 8,017 and 10,293 for Power values of 0.8 and 0.9, respectively.

Let us repeat the experiment from scratch and see which results we get with the new sample size of 8,017, suggested by aiming for the commonly used Power of 0.8.

start_time <- Sys.time()

# Let's roll some dices
set.seed(11) # this is for result replication
Dices <- DiceRolling(8017) # We expect 80% Power
t(table(Dices))

# We generate our contingency table
Observed <- table(Dices)

# We generate our expected frequencies table
Expected <- Observed
Expected[,1] <- (sum(Observed[,1])/nrow(Observed))
Expected[,2] <- sum(Observed[,2])/nrow(Observed)

# We calculate our X^2 score
Chi <- sum((Expected-Observed)^2/Expected)
cat("Chi-Square Score:",Chi,"\n\n")

# Let's resample our data 15,000 times
permutation <- Dices
set.seed(20)
permutation <- Simulation(Dices,15000)

Higher <- nrow(permutation[which(permutation$ChiSq >= Chi),])
Total <- nrow(permutation)
prob <- Higher/Total
cat(paste("Total Number of Permutations:",Total,"\n"))
cat(paste(" - Total Number of Chi-Squared Values equal to or higher than",round(Chi,2),":",Higher,"\n"))
cat(paste(" - Percentage of times it was equal to or higher (",Higher,"/",Total,"): ",round(prob*100,3),"% (P-Value)\n\n",sep=""))

# Let's replicate this new experiment (8017 throws of a biased dice) 10k times
set.seed(20)
Rolls <- MultipleDiceRolling(10000,8017)
Power <- length(which(Rolls <= 0.05))/length(Rolls)
cat(paste(round(Power*100,3),"% of the times (",length(which(Rolls <= 0.05)),"/",length(Rolls),")",sep=""))

end_time <- Sys.time()
end_time - start_time

Final Thoughts

As expected from our new experiment design with a sample size of 8,017, we were able to reduce our P-Value to 1.9%.

Additionally, we observe a Statistical Power equivalent to 0.79 (very near our goal), which implies we were able to reduce our Type II Error (non-rejection of our false null hypothesis) to just 21%!


This allows us to conclude with 95% confidence (in reality 98.1%) that there is, as we always knew, a statistically significant relationship between Landing Version and Signups. Now we need to test, with a given confidence level, which version was the higher performer; this will be covered in a similar future post.


If you have any questions or comments, do not hesitate to post them below.


Translated from: https://towardsdatascience.com/intuitive-simulation-of-a-b-testing-191698575235
