Software Engineering, Machine Learning and Innovation blog.

A/B Testing, from scratch

If you work in a diligent web development business you probably know what an A/B test is. However, its fascinating statistical theory is usually left behind. Understanding the basics can help you avoid common pitfalls, better design your experiments, and ultimately do a better job in improving the effectiveness of your website.

Please hold tight, and enjoy a pleasant statistical journey with the help of R and some maths. You will not be disappointed.

This blog post is also published on my company blog which inspired those studies.

Introduction

An A/B test is a randomized, controlled experiment in which the performance of two product variants are compared. Those variants are usually called A and B 1. From a business perspective we want to know if the performance of a certain variant outperforms the other.

As an example let's take a typical checkout page where we want to assess if a green checkout button is more effective than an orange one.

Figure 1: Experiment example

After 1 week we may collect the following numbers:

Converted

Total

Variant A

100

10000

Variant B

120

10000

If at this point you are willing to conclude that the B variant outperforms A be aware you are taking a very naïve approach. Results can vary on a weekly basis because of the intrinsic randomness of the experiment. Put simply, you may be plain wrong. A more thorough approach would be to estimate the likelihood of B being better than A given the number we measured, and statistics is the best tool around for this kind of job.

Statistical modeling

Statisticians love jars and, guess what, our problem can be modeled as an extraction from two different jars. Both jars have a certain ratio of red and green balls. Red balls are customers who end up paying while green balls are customers who leave.

Does the jar belonging to variant B have a greater proportion of red balls compared to the other one? We can estimate this ratio by doing an extraction with replacement 2from the jar. We extract a finite number of balls from each jar and we measure the proportions. In this type of experiment each time we pick a ball we also put it back to the jar to keep the proportions intact.

Figure 2.a: Variant A population

Figure 2.b: Variant B population

Now the good news is that the Binomial distribution 3 models exactly this type of experiment. It will tell you the expected number of successes in a sequence of independent yes/no experiments, each of which yields success with probability .

In other words, each time we extract balls from our jar we are conducting an experiment, and a yes corresponds to a red ball while a no corresponds to a green ball. The binomial distribution 4 computes the probability of picking exactly red balls after extractions as follows:

(a)

which can be checked using the dbinom() function in R. Assuming we have a jar with 30% of red balls, what is the probability that we extract exactly 10 red balls out of a 100? This is equivalent to which is:

R

1

2

dbinom(10,100,0.3)

# => 1.17041796785404e-06

Very small. Fair enough, the jar has definitely a bigger proportions of red and I was the unlucky guy here. Now, let's plot these values for x (number of successes) ranging between 0 and 100:

Probability mass function of the binomial distribution

R

1

2

3

4

5

x=1:100

y=dbinom(x,100,0.3)

options(repr.plot.width=7,repr.plot.height=3)

qplot(x,y,xlab="Number of successes",ylab="Probability")+xlim(0,60)

Makes sense, doesn't it? The chances of having exactly k successes cumulates around the value of 30, which is the true proportion of red/green balls in our jar.

Naïve experiment assessment

Now that we know how statisticians model our problem let's go back to our conversions table.

One way of assessing if B is better than A is to plot their expected distributions. Assuming that A follows a Binomial distribution with (we had 100 conversions over 10000 trials) and that B follows a Binomial distribution with (we had 120 conversions over 10000 trials) this is how they relate to each other:

R

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

x_a=1:10000

y_a=dbinom(x_a,10000,0.01)

x_b=1:10000

y_b=dbinom(x_b,10000,0.012)

data=data.frame(x_a=x_a,y_a=y_a,x_b=x_b,y_b=y_b)

options(repr.plot.width=7,repr.plot.height=3)

cols=c("A"="green","B"="orange")

ggplot(data=data)+

labs(x="Number of successes",y="Probability")+xlim(0,200)+

geom_point(aes(x=x_a,y=y_a,colour="A"))+

geom_point(aes(x=x_b,y=y_b,colour="B"))+

scale_colour_manual(name="Variants",values=cols)

So if the true mean of the two distribution is and respectively we can conclude that B is better than A. If we repeat the experiment several times (always with 10000 participants) for A we will get most of the time values between and while for B we will get values between and . You can check these boundaries on the plot, or you can compute them manually with the 3 times standard deviation rule of thumb 5.

R

1

2

3

4

5

n=10000;p=0.01;q=1-p;mean=100

paste(mean-3*sqrt(n*p*q),",",mean+3*sqrt(n*p*q))

n=10000;p=0.012;q=1-p;mean=120

paste(mean-3*sqrt(n*p*q),",",mean+3*sqrt(n*p*q))

But hold on one minute. How do we know that and are indeed the true means of our distributions? In the end we simply did one extraction from our jars. If these numbers are wrong our distributions will look different and the previous analysis will be flawed. Can we do better?

More rigorous experiment assessment

In order to estimate what is the true mean of our variants statisticians rely on the Central Limit Theorem (CLT) 6 which states that the distribution of the mean of a large number of independent, identically distributed variables will be approximately normal, regardless of the underlying distribution.

In our case we are trying to estimate the distribution of the mean of the proportions and for our variants. Suppose you run your A/B test experiment times and that each time you collect samples you will end up having for variant A:

and the CLT tells us that these are normally distributed with parameters:

,

where is the standard deviation of the binomial distribution. I know, this is hard to believe and proving these numbers goes definitely beyond the scope of this blog post so you will find some maths-heavy material in the footnotes 78.

Back to our question, what is the true mean of and ? Well, we don't know really, but they are distributed normally like this:

R

1

2

3

4

5

6

7

8

9

10

11

12

13

14

x_a=seq(from=0.005,to=0.02,by=0.00001)

y_a=dnorm(x_a,mean=0.01,sd=sqrt((0.01*0.99)/10000))

x_b=seq(from=0.005,to=0.02,by=0.00001)

y_b=dnorm(x_b,mean=0.012,sd=sqrt((0.012*0.988)/10000))

data=data.frame(x_a=x_a,y_a=y_a,x_b=x_b,y_b=y_b)

options(repr.plot.width=7,repr.plot.height=3)

cols=c("A"="green","B"="orange")

ggplot(data=data)+

labs(x="Proportions value",y="Probability Density Function")+

geom_point(aes(x=x_a,y=y_a,colour="A"))+

geom_point(aes(x=x_b,y=y_b,colour="B"))+

scale_colour_manual(name="Variants",values=cols)

As you can see we are dealing with a risky business here. There is a good chance that the estimation of the true values of and are not correct since they can span anywhere between all the values plotted above. For a good number of them may actually outperform violating the conclusion we did in the previous section. There is no magic bullet that will solve this problem, this is the intrinsic nature of the probabilistic world in which we live. However, we can do our best to quantify the risk and take a conscious decision.

Quantitative evaluation

In the previous section we have seen that it is likely that Variant B is better than Variant A, but how can we quantify this statement? There are different ways in which we can look at this problem, but the ones that statisticians use is Hypothesis testing.

In a series of papers in the early 20th century, J. Neyman and E. S. Pearson developed a decision-theoretic approach to hypothesis-testing 9. The theory was later extended and generalised by Wald 10. For a full account of the theory, see the book of Lehmann and Romano 11.

In this framework we state a hypothesis (also called null-hypothesis) and by looking at the number we will try to reject it. In our example we hypothesize that the true conversion of our visitors is and that the proportion we collected in the B variant is simply due to chance. In other words we assume that our real world visitors behave like in variant A and we quantify the probability of seeing variant B's proportions under this hypothesis.

So, what is the probability of having or more conversions if the true mean of the binomial distribution is ? We simply have to sum the probability of all possible events:

To actually compute the number you can use the probability mass function in Formula (a) or you can use R:

R

1

binom.test(120,10000,p=0.01,alternative="greater")

which gives us the following:

R

1

2

3

4

5

6

7

8

9

10

Exact binomial test

data:120and10000

number of successes=120,number of trials=10000,p-value=0.0276

alternative hypothesis:trueprobability of success isgreater than0.01

95percent confidence interval:

0.010265231.00000000

sample estimates:

probability of success

0.012

I deliberately specified alternative = "greater"in the function call to compute the chance of getting more than 120. But there are other ways 12 to approach the problem. The value we are looking for is the p-value 130.0276 which is exactly the probability of getting more than 120 successes, i.e. . Visually it corresponds to the area under the right end tail of the distribution of A:

We are now ready to quantify to what degree our null-hypothesis is true or false according to the numbers we collected in our experiment.

Type I and Type II errors

It is well known that statisticians do not have the same talent in the art of giving names as computer scientists do. This is well proved by the definition of Type I and Type II errors which are equivalent to the Machine learning definitions of False positive and False negative.

A Type I error is the probability of rejecting the null-hypothesis when the null-hypothesis is true. In our example this happens when we conclude there is an effect in our AB test variant, while in reality there is none.

A Type II error is the probability of failing to reject the null-hypothesis when the null-hypothesis is false. In our example this happens when we conclude the AB test variant has no effect, while in reality it actually did.

In the previous section we quantified the probability of Type I error, which is equivalent to the p-value returned by the binom.test function. To quantify the Type II error we need to know for whatwe are willing to reject the null-hypothesis. It is common practice to use so I'll just go with that.

Now, what is the critical number of conversions such that we reject the null-hypothesis at ? This is called the percentile 14 and it is simply the such that:

which can also be computed easily in R with the following:

R

1

2

3

alpha=0.05

qbinom(1-alpha,10000,0.01)

# => 117

Now we know that starting from observations we will reject the null-hypothesis.

To compute the Type II error we need to assume that the null-hypothesis is false () and quantify how likely it is to measure a value of or less since this is exactly what will lead us to a mistake!

which is equivalent to:

R

1

2

pbinom(117,10000,0.012)

# => 0.414733285324603

which means that we have a ~= 40% chance to conclude our experiment did not have any effect, while in reality there was some.

That's seems harsh to me. What can we do? Apply the first law of the modern age of statistics which is, get more data. Assuming we can double our data and get visitors instead of and that we measure the same proportions, our Type II error will go down drammatically:

R

1

2

3

4

5

qbinom(0.95,20000,0.01)# critical value at which we reject the null-hypothesis

# => 223

pbinom(223,20000,0.012)# type II error

# => 0.141565461885161

now we have ~= 14% chance of concluding our experiment did not have any effect. We can even go one step further and see for every possible observation count what's the expected Type II error by simply brute forcing the computation as follows:

R

1

2

3

4

5

6

7

8

9

v=c();n=1000:50000

for(iinn){

critical=qbinom(0.95,i,0.01)

t2_error=pbinom(critical,i,0.012)

v=append(v,t2_error)

}

options(repr.plot.width=7,repr.plot.height=3)

qplot(n,v,xlab="P(type II error)",ylab="# Observations")

which seems a fair result, starting from visitors we can safely assume that our probability or Type II error will be low.

The analysis as I just presented it is flawed by a fundamental assumption. To estimate the Type I error we assumed that is exactly while to estimate the Type II error we assumed that is exactly . We know from the CLT that this is not exactly true and the distributions of those values span across a certain range (see here). So let's see what happens if I take several points across these distributions, for example the 1%, 25%, 50%, 75%, 99% percentiles and check what happens to our hypothesis testing errors.

For Type II errors I first collect the possible values of for my parametric analysis:

where the green line is the same as the one plotted here. It is quite interesting to observe the pink line at . This is the worst case scenario for us since we picked a value that is greater than and because of that our Type II error actually increases with the number of observations! However it is also worth mentioning that this value is very unlikely because the more data I collect the more the uncertainty around the values of decrease. You may also notice that the lines in the graph are quite thick, and this is because discrete tests like the binomial one 15 have quite large oscillations 16.

The same type of parametric analysis can be performed on the Type I error. Previously we saw how to compute it with the binom.test() function, but we can do the computation manually as follows when and :

R

1

pbinom(119,10000,0.01,lower.tail=FALSE)

where is the value starting from which we fail to reject the null-hypothesis. The parametric analysis is a generalisation of this code like we did for the Type II error:

as in the Type II error we notice that for the Type I error actually increase with the amount of data but this value become more and more unlikely as the data grows.

It also very interesting to notice how the value of the two type of errors goes down at a completely different rate. Overall, with this design, we are more likely to stick with the "button colour makes no difference" conclusion. When the reality is that button colour makes no difference, the tests will stick to reality most of the times (Type I error goes down quickly). When the reality is that button colour does make a difference, the test does take the risk of saying there is actually no difference between the two (Type II error goes down slowly).

Estimate the sample size

Before wrapping up let's make a step back and position ourself back in time before we started the experiment. How we can estimate for how long we should run our experiment in order to be confident that our results are statistically significant? To answer this question you simply need to use the same tools we just saw, from a different perspective.

First, you need to make an estimate around your current baseline conversion. If you use google analytics it should be straightforward to know what is the conversion of your checkout page with the green button.

Second, you need to make a guess on what type of effect size you are willing to detect (minimum detectable effect size). In our example we would have chosen an effect size of 20%. In various disciplines effect sizes have been standardised to make different experiments comparable. Most famous effect size measure is Cohen's d 1718.

Third, you need to assert what risk you are willing to take on the Type I error, or equivalently, what level you are willing to choose for your p-value.This is the value you will refer to at the end of the experiment to make your conclusion.

Finally, you need to assert what risk you are willing to take on the Type II error, or equivalently, how likely you are willing to say your orange button did not perfomed better when in reality there was an effect (i.e. orange button is better). This is equivalent to the power, where you assert how likely you are going to be correct when you conclude that the orange button is better, when it is actually real.

Mathematically speaking computing the required sample size seems to be a difficult problem in general and I would like to point you to the footnotes for a deeper discussion 192021. Here I will show the approach taken from the Engineering statistics handbook 22.

Now, you have to use some faith and intuition. What is the smallest value of , say , that we care about? Our minimum detectable effect size! You can imagine is a function of both (1) and (2). The smallest value in (1) for which we start rejecting is:

while the smallest value in (2) for which we start rejecting is:

where putting it all together we have:

Take the N out of this equation and you get:

So, using our example we state that: (i) our baseline conversion for the green button is 1% or ; (ii) our estimate of effect size is 20% which leads to the orange button converting at 1.2% or ; (iii) we accept a Type I error probability of 5% or ; (iv) we want a Power of 80% to make sure we detect the effect when it is there. This is translated in R as follows:

R

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

p_a=0.01

p_b=0.012

alpha=0.05

beta=0.2

delta=abs(p_a-p_b)

t_alpha=qnorm(1-alpha)

t_beta=qnorm(1-beta)

sd1=sqrt(p_a*(1-p_a))

sd2=sqrt(p_b*(1-p_b))

n=((t_alpha*sd1+t_beta*sd2)/abs(p_a-p_b))^2

# => ~= 16294

which seems to be consistent with the simulations we did previously. On the other hand it is 30% lower than the one reported by a fairly famous online tool 2324. If you know why please let me know in the comments.

Wrapping up

In this blog post we took a simple conversion table and we did a step by step statistical analysis of our results. Our experiment is designed like an extraction with replacement from a jar which is modelled with a binomial distribution.

We concluded that given the observed data the checkout page with the orange button is 20% more likely ((0.012 - 0.01) / 0.01) to convert than the one with the green button. If the green button is equivalent to the orange button (no effect), we came to the wrong conclusion with a probability of 2.76% (P(type I error), also called p-value). If the orange button is better than the green button, we came to the right conclusion with a probability of 60% (1 - P(type II error), also called the power). In this same scenario (orange button is better) if we double the number of visitors in our experiment our probability of being correct (power) will go up to 86%. Finally, we reviewed how to estimate the required sample size for our experiment in advance with a handy formula.

It is worth mentioning that in practice you rarely use a binomial test but I belive the simplicity of the distribution makes it adequate for this type of presentation. A better resource on choosing the right test can be found in the references 25.

If you enjoyed this blog post you can also follow me on twitter. Let me know your comments!

This is great and astonishing that so many statisticians and scientists completely overlook effect size and power. What are your thoughts on a Bayesian approach that actually calculates a posterior? We have powerful computers now that can explicitly compute these probabilities.

Hi Barry, thank you for your comment.
In the Bayesian approach I think the main advantage is that the results are much easier to interpret compared to the hypothesis testing framework. In terms of the quantitative result it should be equivalent on the long run. Writing an article on the bayesian approach is definitely on my todo list, so I'll get into the details 🙂

Thank you so much for such thorough post. I have a question regarding the 'Estimate the sample size' past. It says that the minimum value of Delta that we starts rejecting null hypothesis is z_(1-alpha)/sqrt(p_a(1-p_a)/N). Should it be z_(1-alpha)/sqrt(p_a(1-p_a)/N + p_b*(1-p_b)/N) as var(p_a-p_b) = Var(p_a) + Var(p_b)?

This is great and astonishing that so many statisticians and scientists completely overlook effect size and power. What are your thoughts on a Bayesian approach that actually calculates a posterior? We have powerful computers now that can explicitly compute these probabilities.

Hi Barry, thank you for your comment.
In the Bayesian approach I think the main advantage is that the results are much easier to interpret compared to the hypothesis testing framework. In terms of the quantitative result it should be equivalent on the long run. Writing an article on the bayesian approach is definitely on my todo list, so I'll get into the details 🙂

Thu Le

Thank you so much for such thorough post. I have a question regarding the 'Estimate the sample size' past. It says that the minimum value of Delta that we starts rejecting null hypothesis is z_(1-alpha)/sqrt(p_a(1-p_a)/N). Should it be z_(1-alpha)/sqrt(p_a(1-p_a)/N + p_b*(1-p_b)/N) as var(p_a-p_b) = Var(p_a) + Var(p_b)?