Is Your A/B Testing Tool Finding Differences That Aren’t?

You may not know it, but your A/B testing tool may be finding differences that really aren’t there. The tool may tell you that version 2 performs better than version 1 because it found a statistically significant difference between the performance of the two versions, but in reality there may be no difference. As a result you may end up investing time and resources in more tests, fine-tuning minor differences.

The problem comes from three fronts:

Definition of performance: Using percentage conversion rates, a choice forced by the next point.

Using extremely large samples: Samples larger than 300, a choice forced by the use of t-test on conversion rates.

A/B testing is about finding whether there is a statistically significant difference, at a preset confidence level (usually 95%), between the performances of the two versions under test. The statistical test used by some of the tools is the Student t-test, and the performance metric compared is the percentage conversion rate.

Let p1 and p2 be the conversion rates of the two versions. If the difference p2-p1 (or vice versa) is found to be statistically significant, we are told version 2 WILL perform better. Worse, some may even conclude version 2 WILL perform 47.5% (or some such number) better, based on the math (p2-p1)/p1 expressed as a percentage.

For the sake of running valid tests, these tools run the tests over long periods of time and collect large amounts of data. Then they run the t-test on the entire data set, typically thousands of data points.

In a paper titled Rethinking Data Analysis published in the International Journal of Marketing Research, Vol 52, Issue 1, Prof. Ray Kent writes,

For large samples – certainly samples over 300 or so–any association large enough to attract the attention of the researcher will always be statistically significant

With large samples we are violating the random sampling requirement for statistical testing. When everything else is held constant, large samples (most of the tests I see use sample sizes upwards of 5000) increase statistical significance. Differences that are too small to show up in small samples are magnified in large samples. Large samples have one big problem: they lose all information about segmentation. While you may find no difference between the versions within each segment, put together you will find a statistically significant difference with large samples.
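The effect of sample size on the test statistic can be sketched in a few lines. This is only an illustration: the two conversion rates below are hypothetical, the helper name t_stat is made up for this sketch, and the standard error is approximated as sqrt(p(1-p)/n) with the same number of visitors n assumed for both versions.

```python
import math

def t_stat(p1, p2, n):
    """t-statistic for two conversion rates, each measured on n visitors."""
    se1 = math.sqrt(p1 * (1 - p1) / n)
    se2 = math.sqrt(p2 * (1 - p2) / n)
    return (p2 - p1) / math.sqrt(se1 ** 2 + se2 ** 2)

# Hypothetical conversion rates, not taken from any real test.
P1, P2 = 0.010, 0.012
CRITICAL = 1.645  # one-tailed critical value at the 95% level

sample_sizes = [300, 1000, 5000, 20000]
t_values = [t_stat(P1, P2, n) for n in sample_sizes]
for n, t in zip(sample_sizes, t_values):
    verdict = "significant" if t > CRITICAL else "not significant"
    print(f"n = {n:>6}: t = {t:.2f} ({verdict})")
```

With the rates held fixed, the statistic grows in proportion to the square root of n, so the same observed difference moves from "not significant" to "significant" purely because more data was collected.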

Imagine this: suppose you collected 5200 samples for version 1 and 5300 for version 2. Let us say the samples include equal numbers of males and females. While you may find no statistically significant difference for males and females separately, you might find one for the total. What if you don’t know the segmentation dimensions? What about the hidden demographic and psychographic segmentation dimensions that are not teased out? (See below for detailed math.)

The net is, convinced by the magical words, “statistically significant difference”, you end up magnifying differences that are not real differences at all and continue to invest in more and more tests picking Red arrows over Green arrows.

How can you fix it? Stay tuned for the next in this series.

Here is the A/B test math as practiced by popular tools:

Let us use data published in a 2006 article by Avinash Kaushik (used only for illustrative purposes here and, I believe, in that post as well).

Let us start with the hypotheses:

H0: p1=p2, any difference in conversion rate is due to chance
H1: p2 > p1, the alternate hypothesis (we will be using a one-tailed test)

Then you do the experiment. You send out two offers to potential customers. Here is how the outcomes look:

First we compute the standard error SE for each offer, which is approximated as sqrt( p(1-p)/n ). Note the “n” in the denominator: the higher the “n”, the lower the SE.

In this example, SE1 = 0.001276 and SE2 = 0.001516. Then we compute the common SE between samples, which is the square root of the sum of the squares of SE1 and SE2, that is, sqrt( SE1^2 + SE2^2 ). Here SE = 0.00198.

Then we compute the t-stat, (p2-p1)/SE = 1.72. From the t-table, for degrees of freedom = ∞ (more than 120 is infinity for this table), we find the one-tailed critical value at the 0.05 level is 1.645. Since 1.72 > 1.645, we declare statistical significance.
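The arithmetic above can be reproduced with a short sketch. The input figures are an assumption: the outcome table is not reproduced here, so the rates and sample sizes below (0.87% of 5300 and 1.21% of 5200) were back-derived to be consistent with the SE1, SE2 and t-stat values quoted in the text. The function names are made up for this illustration.

```python
import math

def standard_error(p, n):
    """Approximate standard error of a conversion rate p observed on n offers."""
    return math.sqrt(p * (1 - p) / n)

def two_sample_t(p1, n1, p2, n2):
    """Return (se1, se2, combined_se, t) for the difference p2 - p1."""
    se1 = standard_error(p1, n1)
    se2 = standard_error(p2, n2)
    combined = math.sqrt(se1 ** 2 + se2 ** 2)
    return se1, se2, combined, (p2 - p1) / combined

# ASSUMED inputs, chosen to match the standard errors quoted in the text.
P1, N1 = 0.0087, 5300
P2, N2 = 0.0121, 5200
CRITICAL = 1.645  # one-tailed, 95% level, large degrees of freedom

se1, se2, se, t = two_sample_t(P1, N1, P2, N2)
print(f"SE1 = {se1:.6f}, SE2 = {se2:.6f}, SE = {se:.5f}")
print(f"t = {t:.2f}, significant: {t > CRITICAL}")
```

Under these assumed inputs the sketch lands on the same standard errors and a t-stat of about 1.72, just above the 1.645 cutoff.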

Now let us say that the offers were sent to two Geos, US and EMEA. Let us assume exactly half the number of each offer was sent to each Geo. Let us also assume that we received exactly equal numbers of responses from each Geo.

Your p1 and p2 remain the same but your SE1 increases from 0.001276 to 0.001804 and SE2 increases from 0.001516 to 0.002144. So SE increases to 0.0028.

When you do the t-test for US and EMEA separately the t-stat you compute will be 1.216, less than the 1.645 from the t-table. In other words, there is no statistically significant difference between the two offers for US and so is the case for EMEA. But when we put these together, we found otherwise.
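The per-Geo calculation can be sketched the same way. As before, the rates and totals are assumed values back-derived from the standard errors quoted in the text, and the helper name t_stat is hypothetical. Each Geo gets half of each offer and keeps the same conversion rates.

```python
import math

def t_stat(p1, n1, p2, n2):
    """t-statistic for p2 - p1 using the sqrt(p(1-p)/n) approximation."""
    se1 = math.sqrt(p1 * (1 - p1) / n1)
    se2 = math.sqrt(p2 * (1 - p2) / n2)
    return (p2 - p1) / math.sqrt(se1 ** 2 + se2 ** 2)

# ASSUMED inputs, consistent with the standard errors quoted in the text.
P1, N1 = 0.0087, 5300
P2, N2 = 0.0121, 5200
CRITICAL = 1.645  # one-tailed, 95% level

t_total = t_stat(P1, N1, P2, N2)
# Same rates, but only half of each offer goes to a single Geo.
t_per_geo = t_stat(P1, N1 // 2, P2, N2 // 2)

print(f"combined: t = {t_total:.2f}, significant: {t_total > CRITICAL}")
print(f"per Geo:  t = {t_per_geo:.2f}, significant: {t_per_geo > CRITICAL}")
```

Halving both sample sizes divides the t-stat by the square root of 2, which is why the per-Geo value (about 1.21 here, against the 1.216 quoted above, a gap that comes down to rounding) falls below 1.645 while the combined value does not.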

You could counter this by saying we should collect 5200 samples for each Geo. But what if the segmentation dimensions are not known in advance? What about other demographic and psychographic segmentation?

Large samples will find statistically significant differences that are in reality not significant at all.

Another mistake is to quote % difference between versions. It is just wrong to say one version performed better than the other by x%. Note that the alternate hypothesis is p2>p1 and it DOES NOT say anything about by how much. So when we find statistically significant difference, we reject H0 but there is nothing in our hypothesis or the method to say p2 performed better than p1 by x%!

Certainly there is a lot of confusion (sometimes misinformation) in the world regarding what is the correct statistical analysis for A/B testing, so your articles serve a great purpose in helping to clarify this. However there are a couple of statements in your articles that are misleading or wrong.

(1) You said “Large samples will find statistically significant difference that are in reality not significant at all.” The key point of sampling is, for any given experiment, increasing sample size increases the confidence level. That isn’t an excuse for a poorly-designed experiment: if the way you get a larger sample is to throw together data that doesn’t meet the preconditions of the test (e.g. independence, which is a big worry when there’s segmentation), then a larger sample may not give a good result, but that’s because the sampling approach is itself defective, not the sample size per se. When the conditions required by the specific statistical test are satisfied by the sample, and keeping all other factors static, bigger samples ALWAYS deliver a higher confidence level. This is like homeopathy: you can’t say a smaller sample is more powerful than a bigger sample, that’s absurd. (Unless the sample is both smaller and also DIFFERENT, which would mean it’s actually a different experiment altogether).

(2) You also said “Note that the alternate hypothesis is p2>p1 and it DOES NOT say anything about by how much. So when we find statistically significant difference, we reject H0 but there is nothing in our hypothesis or the method to say p2 performed better than p1 by x%!” That’s true but may be misleading to some readers. You can test if p2 is some amount greater than p1 by changing the test from p2>p1 to p2>(N x p1) or p2>(M + p1). Just make a substitution of variables in the original equation, carry through the math, and it all works. You can test ANY measurable hypothesis. In this case, you just need to change the equation that’s used to define the hypothesis.

And of course, the larger the difference you are trying to judge, the larger the sample size will be needed to achieve the same level of significance in the result. Cheers.
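A rough sketch of the substitution described in point (2), under stated assumptions: the rates and sample sizes are back-derived from the standard errors quoted in the post, the margins M = 0.001 and N = 1.1 are hypothetical, and the function names are invented for this illustration. For the multiplicative form, the standard error of p2 - N x p1 is taken as sqrt( SE2^2 + N^2 x SE1^2 ), the usual rule for a linear combination of independent estimates.

```python
import math

def se(p, n):
    """Approximate standard error of a conversion rate."""
    return math.sqrt(p * (1 - p) / n)

def t_additive(p1, n1, p2, n2, m):
    """Statistic for H1: p2 > p1 + m (m is an absolute margin)."""
    return (p2 - p1 - m) / math.sqrt(se(p1, n1) ** 2 + se(p2, n2) ** 2)

def t_multiplicative(p1, n1, p2, n2, k):
    """Statistic for H1: p2 > k * p1 (k is a relative factor)."""
    combined = math.sqrt(se(p2, n2) ** 2 + (k * se(p1, n1)) ** 2)
    return (p2 - k * p1) / combined

# ASSUMED inputs consistent with the post's quoted standard errors.
P1, N1 = 0.0087, 5300
P2, N2 = 0.0121, 5200
CRITICAL = 1.645  # one-tailed, 95% level

t_add = t_additive(P1, N1, P2, N2, m=0.001)       # hypothetical margin M
t_mul = t_multiplicative(P1, N1, P2, N2, k=1.1)   # hypothetical factor N
print(f"p2 > p1 + 0.001: t = {t_add:.2f}, significant: {t_add > CRITICAL}")
print(f"p2 > 1.1 x p1:   t = {t_mul:.2f}, significant: {t_mul > CRITICAL}")
```

With these assumed figures, neither shifted hypothesis clears the 1.645 cutoff, which is in line with the closing remark that judging a larger difference needs a larger sample.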

Ryan
Thanks for the feedback.
Your statement is wrong. Statistical tests for significance are not about users having to decide between A and B. The parametric tests (e.g. t-test) and non-parametric tests (Chi-sqr) test whether the observed difference is due to chance or not.
-rags

I work in SEM and have been doing some research on analyzing A/B tests. Most tools I have found try to measure if there is a significant difference between the values. However, the statistical tests being used were meant for tests where a user is shown A and B and has to decide between the two. I do not think this is the correct way to think about the problem.

In another article you mention a coin toss problem. I think A/B tests for Landing Page or Ad Copy tests are closer to the example of a coin toss. Instead of seeing A and B and deciding between the two, they are seeing A or B and making some action or not (the probability they convert on A plus the probability they convert on B is NOT equal to 1). My thought is to treat A and B as two different coins that have probabilities of coming up Heads (conversion). From there we can find confidence intervals for the number of Heads (conversions) given the number of flips (users).

When I applied your numbers to the tool I made I get the CR of n1 is between 0.6% and 1.1% with 95% confidence and the CR of n2 is between 0.9% and 1.5% with 95% confidence.
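The method behind the tool mentioned above is not stated, so the following is only a guess at one simple way to get intervals of that kind: a normal-approximation (Wald) interval, p ± 1.96 x SE. The rates and sample sizes are assumed values back-derived from the standard errors quoted in the post, and wald_interval is a name made up for this sketch.

```python
import math

Z_95 = 1.96  # two-sided 95% normal quantile

def wald_interval(p, n, z=Z_95):
    """Normal-approximation confidence interval for a conversion rate."""
    se = math.sqrt(p * (1 - p) / n)
    return p - z * se, p + z * se

# ASSUMED inputs consistent with the post's quoted standard errors.
P1, N1 = 0.0087, 5300
P2, N2 = 0.0121, 5200

lo1, hi1 = wald_interval(P1, N1)
lo2, hi2 = wald_interval(P2, N2)
print(f"version 1: {lo1:.1%} to {hi1:.1%}")
print(f"version 2: {lo2:.1%} to {hi2:.1%}")
```

Under these assumptions the intervals round to roughly 0.6% to 1.1% and 0.9% to 1.5%, matching the figures in the comment, and the two ranges overlap.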