We ran a test comparing 3 different button colors for a call to action (CTA). The CTA led to our offers page (where we had special offers to sign up for a TV package). None of the 3 variations showed a statistically significant improvement over the control on our primary goal of getting users to sign up for offers.

However, 1 variation did show a statistically significant increase over the control, but on a completely unrelated goal: Pay Per View movie purchases. Pay Per View movies aren't even linked from our offers page, so we can't really explain the difference.

We used VWO, and we saw a 20% improvement for our green button over the control (orange) button. The winning variation had 255 conversions per 49,064 visitors, with a 98% chance to beat the control.

Should we launch this winning variation to live? How do we explain it to the business?

2 Answers

First of all, I think your statistical test is giving you a 1-tailed p-value rather than the 2-tailed p-value you should be using in what sounds like exploratory work. I think you’re saying your p-value is 0.02 (i.e., there is a 2% chance of getting the observed difference in conversions by random luck). However, if the number of visitors to your control condition is about the same as to the variation, the 2-tailed value should be closer to the 0.04 to 0.05 range (I can’t calculate the exact value because (a) I’d need to know the number of visits and conversions for the control, and (b) a sample size of nearly 50,000 per variation blows the mind of my little ol’ home-made Fisher Exact calculator).
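For concreteness, here's a minimal sketch (Python with scipy) of the comparison I'm describing. The question doesn't give the control's numbers, so the 212 conversions out of 49,000 visitors below are a hypothetical figure back-calculated from the reported ~20% lift:

```python
# Hypothetical re-check of the one- vs two-tailed p-value.
# Variation counts are from the question; control counts are assumed.
from scipy.stats import fisher_exact

variation = [255, 49_064 - 255]   # conversions, non-conversions (from the question)
control   = [212, 49_000 - 212]   # hypothetical control, ~20% lower conversion rate

table = [variation, control]

_, p_one_tailed = fisher_exact(table, alternative="greater")
_, p_two_tailed = fisher_exact(table, alternative="two-sided")

print(f"one-tailed p ~ {p_one_tailed:.3f}")   # roughly the 0.02 implied by "98% to beat"
print(f"two-tailed p ~ {p_two_tailed:.3f}")   # closer to the 0.04-0.05 range
```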

Elevated Type I Error?

Still, the p-value is low enough in my book that it’s worth believing you have a systematic rather than random effect… except that it sounds like you’re doing a lot of tests. The way inferential statistics works, 1 in 20 variations that in fact have no real effect will, on average, come out “statistically significant” anyway. Such an event is called a Type I error. It implies that if you test a lot of variations for effects they really shouldn’t have, you should expect about one in twenty to show a spurious effect.

So did you do 20 tests? Is this exactly what you should expect if all your variations in fact do nothing at all? Even if you didn’t do 20 tests, the more tests you do, the higher the chance that one or more of them produces a Type I error. For example, it seems you did three tests to compare the three variations with the control on the primary goal, plus three more tests comparing each variation with the control on the unrelated goal, for at least six tests in total. If in fact none of your variations affect anything, you’d have a 0.26 chance of at least one coming out “statistically significant.” That’s a pretty high chance. If you did 15 tests (e.g., 3 variations tested on 5 goals), you’d have a 0.54 chance – you’ll probably get at least one spurious result. My guess is that’s what’s happening here.
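If it helps to see the arithmetic, those figures come from the familywise error-rate formula 1 − (1 − α)^k, assuming k independent tests at α = 0.05:

```python
# Chance of at least one spurious "significant" result across k independent tests.
alpha = 0.05
for k in (1, 6, 15, 20):
    print(f"{k:>2} tests -> P(at least one Type I error) = {1 - (1 - alpha) ** k:.2f}")
# prints roughly 0.05, 0.26, 0.54, 0.64
```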

Economic Impact?

In any case, if this is a real effect, our best guess is that we'll get only about 40 more conversions per 50,000 visitors. It may literally not be worth the cost of moving the winning variation to production. Whether it is or not depends on the number of visitors you get per month, the profit from each conversion, and how much work it is to put the variation into production. You should be able to calculate how many months it’ll take until it pays off. If it takes years, I wouldn’t bother.
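As a sketch of that payback calculation (every input here except the ~40 extra conversions per 50,000 visitors is a made-up placeholder to swap your own numbers into):

```python
# Back-of-envelope payback period for shipping the winning variation.
extra_conversions_per_50k = 40       # best-guess lift from the test
monthly_visitors    = 100_000        # hypothetical
profit_per_signup   = 30.0           # hypothetical, per conversion
implementation_cost = 5_000.0        # hypothetical one-off engineering cost

monthly_gain = extra_conversions_per_50k * (monthly_visitors / 50_000) * profit_per_signup
print(f"~{implementation_cost / monthly_gain:.1f} months to break even")
```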

Potential Lesson Learned

The lesson may be that you can’t blindly trust what on-line A-B testing services tell you. Many of them give you only an approximately correct (i.e., wrong) p-value. In addition to giving only one-tailed values, they often force you to test only one variation against the control at a time, increasing the number of tests and therefore the chance of a spurious result. There are pretty simple and commonly known procedures for testing all variations against the control (and each other) at once on a given goal, yielding a single p-value (a Chi-square or G-test of independence with more than 2 columns), but on-line services don’t give you that option. There is also a simple adjustment, called the Bonferroni correction, that you can apply when testing multiple goals to control for these spurious results (I can tell you that if you apply the correction to your data, your result would no longer be anywhere close to significant).
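For illustration, here's a rough sketch of both ideas in Python/scipy; all counts except the winning variation's are invented, since only that variation's numbers were reported:

```python
# One omnibus chi-square test of independence across control + all variations,
# plus a Bonferroni-adjusted threshold for per-goal follow-up tests.
from scipy.stats import chi2_contingency

# columns: control, variation A, variation B (reported winner), variation C
# rows:    converted, did not convert  (all counts except B are invented)
observed = [
    [212,   240,   255,   230],
    [48788, 48760, 48809, 48770],
]
chi2, p, dof, _ = chi2_contingency(observed)
print(f"omnibus p = {p:.3f} with {dof} degrees of freedom")

n_tests = 6                          # e.g. 3 variations x 2 goals
print(f"Bonferroni-adjusted alpha per test = {0.05 / n_tests:.4f}")
```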

I discuss some of the errors you see in on-line A-B tests at Stat 203. For a non-mathematical intro to stats for user performance testing, see Stat 101.

Without having the site in front of us it's probably hard for us to come up with some specific theories. However - some things to think about.

How sure are you of the experimental methodology? Could there have been an error?

Are you sure that the different options were presented randomly over the whole lifetime of the experiment? If not, external factors (e.g. a separate promotion for PPV) might uplift one variation more than another. This is really an instance of the previous problem - bad methodology - but one I've come across a few times (e.g. folk presenting option A, then option B, then option C - rather than all three in parallel).

While the variation didn't uplift offer signups, did it have other effects on user behaviour? For example, say your offer signup process is:

1. people click on the button you are testing
2. people arrive on a page asking for their details
3. people get a final page to confirm their purchase
4. purchase made

If the variation makes it more likely that folk get from (1) to (3), but doesn't make it more likely that people get to (4), then you won't see any increase in offer signups. However, if pages (2) and (3) are also showing PPV options in navigation/sidebars, then you have the effect of presenting PPV to people who have already come to the site expecting to make a purchase - and maybe PPV is a more attractive option for them at that point. Hence the PPV uplift.
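If you have step-level data, a quick funnel comparison would show whether that's happening. A tiny sketch with made-up step counts, just to illustrate the shape of the check:

```python
# Step-through rates for the 4-step funnel above (all counts are invented).
control   = [2000, 900, 600, 212]    # click CTA, details page, confirm page, purchase
variation = [2400, 1150, 780, 218]

def step_rates(counts):
    return [round(b / a, 3) for a, b in zip(counts, counts[1:])]

print("control step-through rates:  ", step_rates(control))
print("variation step-through rates:", step_rates(variation))
# A lift at steps (1)-(3) with a flat final purchase count would match the theory above.
```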

Of course this is complete guesswork without seeing the site and understanding more about your experimental methodology.

Should we launch this winning variation to live?

Assuming that it's not caused by some kind of experimental error - yes.

How do we explain it to the business?

"Our experiments showed that this made more money in PPV purchases. We're not sure why yet and will carry on investigating. Here's a couple of theories (if you have 'em). In the mean time we proposed putting this live (or maybe running another longer term test) to see whether it performs in the real world".

Hey everyone, thanks for the answers. The site is here: http://www.directv.com.ar We varied just the color of the button at the top of the page (Ofertas), which leads to our offers listing page. Pay Per View purchases can only be made by logging into the site with your username/password, then clicking on any movie (presented in various places on the site) and then clicking on the "Comprar" button. I can't include a link since this is the secured portion of the site.
– Charles Shimooka, Nov 21 '12 at 20:48

Also, we showed the 3 variations distributed equally and concurrently to all visitors (33%, 33%, 34%) using the Visual Website Optimizer tool. We usually run about 10-15 tests per month on our site.
– Charles Shimooka, Nov 21 '12 at 21:00