A/B testing

The title refers to the famous anecdote about Marissa Mayer testing 40 shades of blue to determine the right color for the links. (Unfortunately I am colorblind, I know just one blue.)

Mayer is famous for many things at Google, but the one that always sticks out – and defines her in some ways – is the “Forty Shades of Blue” episode.

she ordered that 40 different shades of blue would be randomly shown to each 2.5% of visitors; Google would note which colour earned more clicks. And that was how the blue colour you see in Google Mail and on the Google page was chosen.

Thousands of such tests happen in the web world, every website running multiple experiments in a day. Contrary to what most in webapp development may believe AB testing does not have its origins in webapp world. It is simply an application of statistical testing, Randomized Control Trial, to determine if a ‘treatment’ made a difference on the performance of treatment group compared to performance of control group.

The simplest test is testing if the observed difference between the two sample means are statistically significant. What that means is measuring the probability, p-value, the difference is just random. If p-value is less than a preset level we declare the treatment made a difference.

“I have published about 800 papers in peer-reviewed journals and every single one of them stands and falls with the p-value. And now here I find a p-value of 0.0001, and this is, to my way of thinking, a completely nonsensical relation.”

Should you test 40 shades of blue to find the one that produces most click-thrus or conversions? xkcd has the answer:

Can Ms. Mayer test the way out of Yahoo’s current condition? Remember all these split testing are about finding lower hanging fruits not quantum leaps. And as Jim Manzi wrote in his book Uncontrolled,

Perhaps the single most important lesson I learned in commercial experimentation, and that I have since seen reinforced in one social science discipline after another, is that there is no magic. I mean this in a couple of senses. First, we are unlikely to discover some social intervention that is the moral equivalent of polio vaccine. There are probably very few such silver bullets out there to be found. And second, experimental science in these fields creates only marginal improvements. A failing company with a poor strategy cannot blindly experiment its way to success …

Suppose you have a service – be it a web based service or a brick and mortar service. Visitors walk through the front door. Most just leave without taking an action that is favorable to you. Some do and become converts.
As you function along, you form a belief/gut-feel/hypothesis that color of the door affects how many will convert. Specifically, certain color will improve conversion. (I am color blind, else I will call out the colors I used below)
You verify this hypothesis by running a split test. You evenly split your visitor stream, randomly sending them through Door A of current color or Door B of the new color which is supposed to increase conversion. This is the A/B split test.

How do you verify your hypothesis?

The most common way that is practiced by every A/B test tool in the market is shown below

These tools keep both Converts and Non-Converts for a given Door together and treats each as a separate population. Those who went through Door A (both Converts and Non-Converts) are kept separate from those who went through Door B. They test the hypothesis that the proportion of converts in the Door B population is higher than proportion of converts in the Door A population. The tools assume that the population data are normally distributed and use a 2-sample t-test to verify the difference between the two proportions is statistically significant.

What is wrong with this approach? For starters, you can see how it rewrites the hypothesis and re-wires the model. This approach treats conversion as an attribute of the visitor. This is using the t-test for the wrong purpose or using the wrong statistical test for A/B testing.

For example, if you want to test whether there is higher prevalence of heart disease among Indians living in US vs. India, you will draw random samples from the two populations and, measure the proportion of heart disease in each sample and do a t-test to see if the difference is statistically significant. That is a valid use of t-test for population proportions.

Conversion isn’t same as measuring proportion of population characteristic like heart disease. Treating the conversion rate as a characteristic of the visitor is contrived. You also need to keep the Converts and Non-Converts together while you only need to look at those who converted.

Is there another way?

Yes. Take a look at this model that closely aligns with the normal flow. We really do not care about the Non-Converts and we test the correct hypothesis that more Converts came through Door B than through Door A.

This method grabs a random sample of Converts and tests whether there are more that came through Door B than through Door A. It uses Chi-square test to verify that the difference is not just due to randomness. No other assumptions needed like assuming normal distribution and it tests the right hypothesis. Most importantly it fits the flow and model before we introduced Door B.

Want to know more? Want to know the implications of this and how you can influence your A/B test tool vendors to change? Drop me a note.

In A/B testing, you control for many factors and test only one hypothesis – be it two different calls to action or two different colors for BuyNow buttons. When you find statistically significant difference in conversion rates between the two groups, you declare one version is superior to other.

Hidden in this hypothesis testing are many implicit hypotheses that we treat as truth. If any one of them prove to be not true then our conclusion from the A/B testing will be wrong.

I work for an ecommerce site that has a price range of anywhere from $3 an item to $299 an item. So, I feel like in some situations only looking at conversion rate is looking at 1 piece of the puzzle.

I’ve often used sales/session or tried to factor in AOV when looking at conversion, but I’ve had a lot of trouble coming up with a statistical method to ensure my tests’ relevance. I can check to see if both conversion and AOV pass a null hypothesis test, but in the case that they both do, I’m back at square one.

Dave’s question is, whether the result from the conversion test experiment hold true across all price ranges.

He is correct in stating that looking at conversion rate alone is looking at one part of the puzzle.

When items vary in price, like he said from $3 to $299, the test for statistical significance of difference between conversion rates assumes an implicit hypothesis that is treated as truth.

A1: The difference in conversion rates does not differ across price ranges.

and the null hypothesis (same, just added for completeness)

H0: Any difference between the conversion rates is due to randomness

When your data tells you that H0 can or cannot be rejected, it is conditioned on the implicit assumption A1 being true.

But what if A1 is false? In Dave’s case he uncovered one. What about many other such hypotheses? Other examples include, treating the target population as the same (no male/female difference, no Geo specific difference etc) and products as the same.

I point out to two different results from the same data set by segmenting and not segmenting the population in one of my previous posts.

That is the peril of hidden hypotheses.

What is the solution for a situation like Dave’s? Either you explicitly test this assumption first or as simpler option, segment your data and test each segment for statistical significance. Since you have a range of price points I recommend you test over 4-5 price ranges.

What is the solution for the bigger problem of many different hidden hypotheses?

Take a quick look at pricing pages of most web services and products. Most offer 3 or 4 versions that differ in features, usage (number of users, responses etc) and of course price. In every pricing page I visited (sampling, not comprehensive) the first attribute is always price. Some of the pricing pages use font and other highlighting to make pricing prominent.

What if price isn’t the first attribute you present to your customers?

What if your pricing page pitches the benefits of each version before it talks about price?

What if price is the last attribute for each version listed in your pricing page?

Last week I wrote about the difference between the Price leader and Price-Less leader*. The core idea was to start the conversation with your customers about all other attributes but price. When price is not prominent, you get to talk to customers about factors that are relevant to them.

A version of the concept of Price-Less leader was published in Journal of Marketing Research Dec 2009. The article used the term “Benefits leader” instead of “Price-Less leader” and they made a very relevant finding,

“When customers choose benefits leader (purely based on benefits and without price information) they tend to stick with that choice even when the price information is revealed. Even when faced with a higher price, they tend to stick with their choice based on benefits”

Applying these findings to pricing page, I hypothesize, when price is listed as the last attribute:

More customers will pick your higher priced versions

More customers will signup for your basic version (higher conversion)

This hypothesis is based on previous research on pricing but from a different context. So it is worth testing for your pricing page before you roll out. This is definitely worth adding to the A/B testing that you probably are already doing for the rest of the pages. I recommend this A/B testing despite my earlier warnings about A/B testing.

Note that I am not recommending that you do not show the price at all or show it only after customers sign up – I am recommending that you move the price to be last attribute you list under each version.

I am every interested in hearing your results. Send me a note on your results, even if you did not find statistically significant difference.

For the analytically inclined: If you do not want to do the traditional A/B testing you can use Bayesian. But I do not recommend a full blown Bayesian verification in this case.

You have been using A/B split testing to improve your mail campaigns and web designs. The core idea is to randomly assign participants to group A or B and measure the resulting performance – usually in terms of conversion. Then perform statistical testing, either t-test (incorrect) or Chi-square test to see if the difference in performance between A and B is statistically significant at 95% confidence level.

There are significant flaws with this approach:

Large Samples: Use of large samples that are most likely to find statistical significance even for small differences. When using large samples (larger than 300) you lose segmentation differences.

Focus on Statistical Significance: Every test tool, sample size calculator and articles are narrowly focused on achieving statistical significance, treating that as final word on the superiority of one version over.

Ignoring Economic Significance: There may be statistical significance or not, but no test tool will tell you the economic significance of that for your decision making.

Misleading Metrics: When tools report Version A is X% better than version B, they are simply wrong. The hypothesis testing used in A/B testing is simply one version is better than other and not by what percent.

All or nothing: When the test results are inconclusive, there is nothing to learn from these tests.

Discontinuous: There is no carryover of knowledge gained from previous tests. We do not apply any knowledge gained from a test in later tests.

Test Everything and Test Often: The method wrests control from the decision maker in the name of “data driven”. This pushes one to suspend all prior knowledge (because these are considered hunches and intuition) and test every thing and test often, resulting in significant costs for minor improvements. Realize that the test tool makers are incentivized by your regular and excessive testing.

Mistaking X implies Y is same as Y implies X: The hypothesis testing is flawed. What we test is, “how well does the data fit the hypothesis that we assumed”. But at the end of the test we state, “the hypothesis is supported by the data and is true for all future data”.

The root cause of all the mistakes is in using A/B testing for decision making. When you are deciding between two versions you are deciding which option will deliver you better returns. The uncertainty is in deciding the version. If there is no uncertainty at all, why bother?

You are not in the hypothesis testing business. You are in the business of adding value to your shareholders (that is you, your investors). To deliver value you need to make decisions in the presence of uncertainties. With all its flaws, A/B testing is not the right solution for decision making!

So stop using A/B testing!

What do I recommend? Send me a note to read a preview of my article on “Iterative Bayesian (TM)”.

Like this:

Last time I wrote about the use of prior knowledge in A/B testing there was considerable push back from the analytics community. I think I touched a nerve when I suggested the use of “how confident you were before the test” to interpret the results after the test. While the use of such information may sound like gut-feel and arbitrary, we must recognize that we implicitly use considerable information priors in A/B testing. The Bayesian methods I used just made the implicit assumptions explicit.

When you finally get down to test two (or three) versions with A/B split testing, you have implicitly eliminated many other versions. You should stop and ask why you are not testing every possible combination. The answer is you applied tacit knowledge that you have, either based on your own prior testing or well established best practices and eliminated many versions that required no testing. That is the information prior!

Now let us take this one step further. Of the two versions you selected, make a call on how confident you are that one will perform better than the other. This can be based on prior knowledge about the design elements and user experience or an estimate that is biased. This should not surprise you, after all we all seem to be finding reasons why one performed better than the other after the fact. In fact the latter scenario has hindsight bias whereas I am simply asking you to state your prior expectation of which version will perform better.

Note that I am not asking you to predict by how much, only how confident you are that there will be real (not statistically significant, but economically significant) difference between the two versions. You should write this down, before you start testing and not after (I prefer to call A/B testing as collecting data). As long as the information is obtained through methods other than this test in question, it is a valid prior. It may not be precise but it is valid.

What we have is the application of information priors in A/B testing – valid and relevant.

Next up, I will be asking you get rid of the test for statistical significance and look at A/B testing as a mean to reduce uncertainty in decision making.