Home Page Split Test Reveals Major Shortcoming Of Popular Testing Tools

Running split tests is addictive. You develop a theory about something that’s not working as hard as it could be on your website, and that becomes your new test idea. You develop the creative – a copy improvement, a usability fix, or perhaps a product repositioning – and then you launch your experiment. The excitement is palpable!

Here’s the addictive part: Whether you’ll openly admit it to others or not, I’m guessing you’ll check the testing tool multiple times per day – and more than likely, multiple times per hour – to see how your revised page is performing. It’s hard not to check.

And you want so badly for your testing tool to declare a winner. When it’s your test idea, it’s easy to let your emotions take over. If your new recipe pulls ahead of the default page, you feel confident, amazing, invincible. But if your recipe falls behind the default content, it’s like someone came along and shot your dog.

Trust In The Tool

Adobe Test & Target, Google Content Experiments, Optimizely, and Visual Website Optimizer are all great options for running your testing program. There is a tool for every budget. And no matter which one you go with, once you have a test underway, it will at some point – statistics allowing – declare a winner.

As subscribers to these tools, we (start-up founders, marketers, and product developers) trust what they tell us. I’d argue that the more you pay, the more you trust what the tool tells you.

Unless you’re a statistics expert, the tool is the authority. You rely on it for advice on when to stop a test. So when the tool congratulates you on achieving a winning variation, what do you do? I’d bet my next paycheck on the fact that you’ll take the money and run.

After all, why would you let the test continue to run if it’s telling you there is a 99.6% chance of beating the original page? That number seems as close to 100% as one would ever hope to achieve with a test. In my experience, most marketers will bank a win at 95% confidence… even 90%. But be careful – that lift as reported by your tool of choice may not be what it appears.

Wait, Where’d That Lift Disappear To?

If you’re a smart marketer, you probably spend time reading about other people’s tests. And if you’ve been reading recently, you may have come across a post from Neil Patel, where he poses the question (within a detailed post about his testing results in general), “Where did my lift go?”

Neil, it’s quite possible (even likely) that you’re not seeing the lift in sales or revenue from your test because it was never there in the first place. You may have unknowingly received a “false positive” in your test – known as a Type I statistical error, otherwise known as an incorrect rejection of a true null hypothesis. That’s a mouthful, so I simply remember it as a false positive.

False positives are insidious because they generally result in the experimenter taking action based on something that does not exist.

In the pharmaceutical business, you can imagine how much damage would result from companies acting on false positives during drug testing.

While perhaps not as economically far-reaching or emotionally damaging as giving patients false hope, acting on false positives in your web tests could, at the very least, put you in a sticky situation with your boss, senior leaders, or investors. Worse than that, it could turn you off testing altogether.

I personally cringe at the idea of getting false positives because I always expect to learn from tests. A false positive for Copy Hackers means we think we’ve learned something about our visitors – when in fact there was no learning. We end up going down a dirty rabbit hole as we try to apply that learning throughout our site and other marketing materials (e.g., emails).

On the other hand, a false negative is generally benign. It means you’ve missed the opportunity to take action on something real because it was not revealed as part of your test. You don’t take any action, so you’ve really lost nothing (unless you factor in opportunity cost).

Example Of A False Positive

Recently Joanna and I decided to run a simple two-way split test on the Copy Hackers home page.

Here is the default version of the home page hero section:

Our desired measure of conversion was clicks on the primary call-to-action (i.e., clicks on the big green button) by new visitors to the home page. But because there are many ways into the site from the home page, we created an alternate conversion metric: engagement. Engagement simply means that a visitor clicks any link on the home page. Think of it as the opposite of a visitor bounce.

We launched the test on November 11th, 2012, and for the first 2 days, we saw a lot of fluctuation in the performance of the two pages. Then the performance settled into a nice rhythm until, after 6 days, our testing tool declared a winner with 95% confidence. Knowing what we know about confidence levels, Joanna and I let the test run for another day, just to be sure – after which the tool calculated a 23.8% lift (we have a winner!) at a confidence level of 99.6% (which, taken at face value, leaves only a 0.4% chance of a false positive):

Were we excited? Stunned is more the word.

Why? Because our split test was an A/A test – not an A/B test. In other words, the tested variation was identical to the default page… to the pixel!

On occasion we’ll run an A/A test to validate that the results will turn out as we expect. And most of the time, there are no surprises. But not this time. With nearly 100 conversions per recipe and a week’s worth of data, an identical copy of the home page was declared a substantial winner over the default page.

There is virtually no way to predict such an outcome, in part because it happens only a fraction of the time, but more importantly, because most tests involve two or more different variations – and there is no way to know for sure if you’ve received a false positive outside of allowing the test to continue to run for a longer period.

But our experiment clearly illustrates that popular testing tools still have plenty of room for improvement.

Here is what happened when we let the A/A test continue to run:

As you can see above, on about day 12, the two conversion rates converged, and the lift disappeared completely.

How Does This Happen?

Evan Miller, the author of the above-mentioned post, explains that accurately measuring significance requires that your sample size be fixed in advance of the experiment. But that’s not what happens when you run your tests. Instead, you let a test run until the tool proclaims that you have a significant difference. And in order to make that calculation on the fly, the tool must make repeated significance tests – which are actually flawed.

In fact, the more frequently the tool tests for significance as the test progresses, the more inaccurate the calculation becomes – and you end up with a far higher probability of seeing the dreaded false positive.
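To make the effect of repeated checking concrete, here is a minimal simulation sketch in Python. It is not how any particular testing tool works: the pooled two-proportion z-test, the 10% conversion rate, the 1,000 visitors per recipe, and the function names are all assumptions chosen for illustration. It simulates many A/A tests (two identical pages) and counts how often a "winner" is declared at the 95% level.

```python
import math
import random


def z_score(conv_a, n_a, conv_b, n_b):
    """Pooled two-proportion z statistic; returns 0.0 when it is undefined."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0
    return (conv_b / n_b - conv_a / n_a) / se


def false_positive_rate(peek_every, n_per_arm=1000, rate=0.1,
                        trials=1000, z_crit=1.96, seed=1):
    """Fraction of simulated A/A tests that are ever declared significant.

    peek_every: check significance after every `peek_every` visitors per arm.
    Setting peek_every equal to n_per_arm means one check, at the very end.
    All default values are illustrative assumptions.
    """
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(trials):
        conv_a = conv_b = 0
        for n in range(1, n_per_arm + 1):
            # Both arms share the same true conversion rate (an A/A test).
            conv_a += int(rng.random() < rate)
            conv_b += int(rng.random() < rate)
            if n % peek_every == 0 and abs(z_score(conv_a, n, conv_b, n)) > z_crit:
                # The "tool" stops the test and declares a winner.
                false_positives += 1
                break
    return false_positives / trials


if __name__ == "__main__":
    print("single check at the end:", false_positive_rate(peek_every=1000))
    print("check every 50 visitors:", false_positive_rate(peek_every=50))
```

With a single check at the end, the rate should land near the nominal 5%. With a check every 50 visitors, it typically comes out several times higher, even though nothing about the pages differs.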

What Do You Do Now?

For starters, keep testing using your tool of choice!

Joanna and I have witnessed massive, real conversion gains on our clients’ websites – validated through multiple iterations and by reconciling conversion data with financial data. This risk does not, in our opinion, outweigh the amazing benefits of continual optimization.

But just knowing that false positives are a possible outcome will benefit you.

For example, knowing this may cause you to let a test run longer (i.e., beyond the point at which the tool tells you it’s okay to stop the test). Or armed with this information, you may decide to run a test multiple times longitudinally.

Our recommendation is to calculate the sample size (i.e., number of visitors) required to accurately assess your test data – before you launch the test. Put another way, you’ll want to pre-determine the duration of your test (based on a number of required visitors).

And to help, here is an excellent post by Noah Lorang at 37signals on how to calculate the desired sample size for your next test. If you’re concerned about getting a false positive like we are, use Noah’s formula to arrive at the exact number of visitors who will need to enter your experiment in order to determine whether or not you have a statistically meaningful difference between your 2 (3, 4, 5, etc.) variations.
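As a rough illustration of what such a calculation looks like, here is a sketch in Python using the common normal-approximation formula for comparing two conversion rates. This is not necessarily the formula from Noah's post. The function name, the 5% significance level, and the 80% power are assumptions you would adjust to your own situation.

```python
import math
from statistics import NormalDist


def visitors_per_variation(baseline_rate, min_relative_lift,
                           alpha=0.05, power=0.80):
    """Approximate visitors needed in EACH variation of a two-way test.

    baseline_rate:     current conversion rate, e.g. 0.10 for 10%
    min_relative_lift: smallest lift worth detecting, e.g. 0.20 for +20%
    alpha, power:      assumed two-sided significance level and power
    """
    p1 = baseline_rate
    p2 = baseline_rate * (1 + min_relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


if __name__ == "__main__":
    # Hypothetical inputs: a 10% baseline rate and a 20% relative lift.
    print(visitors_per_variation(0.10, 0.20))
```

The smaller the lift you want to detect, the more visitors you need, which is why low-traffic sites often have to test bolder changes or run tests for longer.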

For the toolmakers, we’d challenge you to solve this problem around confidence. Not statistical confidence, but the confidence people place in your tools to guide them on when to stop a test. Can you implement a new test set-up experience that will save people from making costly mistakes like acting on a false positive – even something as simple as a sample size calculator? Given the similarity in how popular testing tools declare a winner, developing a new user experience could be a key differentiator for you.

Google’s results have always been suspect. Any intelligent marketer looking at the results just knows that they smell a funny colour.

You hit the nail on the head when you said that Google changes the sample size and significance constantly, and this is a fact I think almost nobody understands or gets the significance of.

Anyone who’s run a number of tests is sure to come across a results graph where the lines cross over at some stage. In other words, the results reverse. This is a prime example of the significance and sample size being changed.

If we apply Pareto’s law (80% of your revenue is going to come from 20% of your traffic) then what happens is that your high value buyers all buy at the start of the test, and then Google gets rid of them for the remainder of the test.

So what you end up testing is more and more on the crud, low value end of your list or traffic.

So if you make a decision based on that, you are essentially optimising your site for low value customers and ignoring the high value ones who make up 80% of your revenue or business.

One way around it is to add additional custom tracking variables to your variations so that you can see exactly what the different versions produced.

When you do that, you end up with results which are way different to what Google Experiments reports tell you.

Nick

Is there any reason why you chose the ‘Engagement’ goal to measure against for this test?

The problem I’ve found with Engagement is that it’s just too easy to register a conversion and because the tools place dependence on reaching 25 conversions before declaring a winner it’s very easy to get false positives.

Do you think the results would be the same if you aimed at a harder goal like ‘Guides Sold’?

Hi, Ophir! Thanks so much for stopping by, and I hope your new start-up gig is awesome! Thank you for posting the link, too… it was a very useful read. Also useful were some of the comments to your post – great discussion! I’m not sure the average marketer cares too much about this phenomenon, but great marketers, like the people who read Copy Hackers :-), want to know stuff like this… stuff that helps them better understand optimization and its potential pitfalls.

Lance Jones

Hi Sapphire! So that’s what’s happening with Feedburner(!) — now I feel much better about what I’ve been seeing in my numbers!

To your second point… “better” is so subjective, it’d be difficult to assign that word to any winning test. But there is [almost] always a reason for a winning test. The hard part is uncovering it. Creating a very specific hypothesis and also ensuring that you isolate the design or copy change to a single variable (easier said than done!) are great places to start on your quest to learn “why” one variation won, but effective experimental design is the topic for an entirely separate post. Thank you for the idea!

http://vvires.com/ Sapphire

Obviously A/B tests are still a lot more reliable than “intuitive” guesses, but it is an important lesson that you can’t trust any metric. It’s like how Feedburner’s subscriber count wildly fluctuates because they count the number by who loaded their RSS readers every day. Instead of going into a panic about why your subscriber count dropped by half one day, consider that maybe it’s a national holiday and everyone is off their computers!

The other weakness of A/B testing, besides false positives, is that you don’t know the reason why something converts better. For example, if you’re doing email marketing, one version of the email may be declared a winner. But it may not necessarily be because its copy is better, its headline is more enticing, etc. It could be that more people are offended or annoyed by that specific version so they keep clicking. Negative reactions drive responses too.

http://www.astonishinc.com Annette Walker

Excellent. Thanks! In the rush to test, test, test and get better results, we are too quick to trust the numbers. There’s some software testing adage that goes something like “Just because you’ve counted all the trees doesn’t mean you see the forest.” You can be missing some fundamental stuff that leads you astray.

Running A/A tests is a GREAT idea. The screenshot of the A/A test results after you let it run & results converge sticks in my brain & will help me remember. And the hilarious image at the top (“is your winner truly a winner?”), helps make it stick, too!

Lance Jones

Hi, Annette! Well, someone else gets credit for the recommendation to run A/A tests — and Joanna gets credit for the hilarious image at the top of the post. I suppose I pulled it all together though. 😉 I’m glad you enjoyed the post, and be sure to let Joanna or me know if you see any funkiness in your own testing endeavors.

http://visualwebsiteoptimizer.com/ Paras Chopra

Hi Lance,

Thanks for the insightful post. As Tyler pointed out, we do recommend that a test be run for at least 7 days even after reaching statistical significance, and we also recommend that people do A/A testing. However, if you ask about the proper way to do testing, here is our answer:

It is true that statistical confidence may arrive early in the test, and in most cases it does remain consistent, but it is never a guarantee that the trend you have detected is real. There’s always some uncertainty in the results. To reduce that uncertainty, as a business user (depending on what sort of precision you are looking for, how much uncertainty you are comfortable with, and how much improvement you expect to detect), you have to calculate the number of visitors to test before starting the test and ideally make decisions only after that number of visitors has been tested. The more certainty you desire, the more time you will have to spend running tests. But please remember that there is never a point where you can say with 100% certainty that your results are what they appear. You can always reduce uncertainty but never eliminate it completely. If you want maximum certainty in future, here is what we advise our users to do:
– Set their chance to beat original thresholds to 99% or more
– Use our test duration calculator to estimate, before starting the test, how many visitors you would test, and ONLY after testing that many visitors, see if you have got significant results (this will prevent the repeated poking at significance and hence drawing erroneous conclusions)
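
The "decide only once" discipline in the two points above can be sketched in a few lines of Python. This is an illustration, not VWO's own "chance to beat original" calculation: it assumes a plain two-sided, pooled two-proportion z-test, and the function name, the `planned_n` parameter, and the 1% threshold (standing in for "99% or more") are hypothetical.

```python
import math
from statistics import NormalDist


def evaluate_once(conv_a, n_a, conv_b, n_b, planned_n, alpha=0.01):
    """Refuse to judge the test until both arms reach the planned sample size.

    Returns None while the test is still running, otherwise a
    (p_value, is_significant) pair from a two-sided two-proportion z-test.
    """
    if min(n_a, n_b) < planned_n:
        return None  # too early: no peeking at significance
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0, False  # no variation at all, so nothing to detect
    z = (conv_b / n_b - conv_a / n_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_value, p_value < alpha


if __name__ == "__main__":
    # Hypothetical counts: still short of the planned 1,000 visitors per arm.
    print(evaluate_once(10, 100, 20, 100, planned_n=1000))
    # Hypothetical counts: planned size reached, so the test is judged once.
    print(evaluate_once(100, 1000, 160, 1000, planned_n=1000))
```

The planned size would come from a sample size or test duration calculator, worked out before the test starts.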

Also, beyond statistical significance one has to see if the variations have any “newness effects” or “learning effects” which may be temporarily increasing or decreasing the conversion rate for a variation.

-Paras, CEO of Visual Website Optimizer

Lance Jones

Hi, Paras! Thank you for adding on to this discussion. It’s great to see the leader for one of the big testing tool organizations chime in with advice on how to reduce the chances of a false positive. I am still seeing much evidence that users of the popular testing products do not factor in your advice or follow your steps to preventing a misread on the data, so there is plenty of work to be done to educate them!

http://contentverve.com Michael Lykke Aagaard

Hi Lance and Joanna – thanks for a great article on a very important and overlooked subject!

One basic element that I don’t think you mentioned here is the “standard error”.

In the first screenshot of the A/A test, there is a standard error of 6.8% on Variation 1 and a standard error of 7.9% on the Original Page. This means that – with the current sample size – there is a 99.6% chance that the conversion rate for variation 1 is somewhere between 66.8% and 80.4%, and that the conversion rate for the original page is somewhere between 51.7% and 67.4%.

What this tells us is that we need a larger sample size in order to get a lower standard error and thus a higher level of statistical confidence. Ideally, the standard error should be < 1%. As illustrated in the second screenshot, the standard error decreases, as the sample size increases.

Being aware of the standard error is a great and very simple way of avoiding falling for false positives.

Anyways, just a little piece of advice that might come in handy

Thanks!

– Michael
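
For readers who want to check the standard error themselves, here is a small Python sketch based on the usual formula for a proportion, sqrt(p(1 − p) / n). The function names are made up for illustration, and the testing tools may compute their reported figures differently.

```python
import math


def standard_error(conversions, visitors):
    """Standard error of an observed conversion rate (normal approximation)."""
    rate = conversions / visitors
    return math.sqrt(rate * (1 - rate) / visitors)


def visitors_for_standard_error(rate, target_se):
    """Approximate visitors needed for the standard error to fall to target_se."""
    return math.ceil(rate * (1 - rate) / target_se ** 2)


if __name__ == "__main__":
    # Hypothetical counts: the same 50% rate at two sample sizes.
    print(standard_error(50, 100))
    print(standard_error(200, 400))
    # Hypothetical target: a 1% standard error at a 70% conversion rate.
    print(visitors_for_standard_error(0.70, 0.01))
```

Because the standard error shrinks with the square root of the sample size, quadrupling the number of visitors only halves it.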

Lance Jones

Hi, Michael — you’re absolutely right — thank you for your tip on observing standard error. My biggest issue is that the tools don’t provide the kind of guidance you mention above. Why would someone question the tool when it declares a winner at 99.6% confidence? The marketer wants a winner, the tool declares a winner… and that’s typically where it ends. It’s pretty clear that we, as conversion consultants — and the toolmakers, too — need to do more work around educating people how to avoid pitfalls like false positives.

http://www.wordaim.com Tyler

Awesome post Lance! As Neil discussed, while CRO is powerful, it’s not a silver bullet. Since posts with headlines about huge lifts get so much attention, I think sharing this type of data that shows the other side of the story is extremely important. I also liked your challenge to the toolmakers. Interestingly, after a little digging through VWO’s Knowledgebase, I found a question about A/A testing, and part of VWO’s answer was “Even if statistical significance arrives, please allow the test to run for at least 7 more days.”

Lance Jones

Hey Tyler… interesting discovery you made in the VWO KB. Of course this issue does not just apply to VWO — false positives are tool agnostic. It’d be ideal if people didn’t have to dig for this stuff, would you agree?

http://www.websitesaleslab.com.au Kris

Good article Joanna. Dr Flint mentions exactly this and refers to it as environmental factors. It’s not talked about much but definitely interesting.

You never really know where the people are coming from or why they are coming. This is why testing with a reliable source like PPC may mitigate some of the environmental factors you are exposed to.

My question is, this test was on your homepage, right? Was it open to all visitors? Or was it narrowed down to one segment or stream? I took a look at your backlink profile and you have a lot of comments on sites. You’ll have some search traffic and some referrals. People would be clicking over to see more about you if you wrote a good post. This may explain why the start was rocky. Obviously someone searching for ‘copy optimization ebook’ isn’t going to have the same level of motivation as someone who clicked through from a post about spam keywords, liked your post, and wanted to see your site out of curiosity.

Joanna

Great points, Kris. Yeah, this was a test of new visitors only to my home page. Surely we get a lot of people coming through with various motivations – nearly every site does (with porn always being the exception). And surely environmental factors can influence conversion. But can we really say that the traffic that moved through my home page early on was rife with unimaginable motivations and burdened by wild environmental factors… and then it all smoothed out? A tool that randomizes is supposed to manage those differences, right?

There were 2 identical treatments. Presented randomly. To two halves of a whole [100% of X type of traffic to Y page]. With 99.6% confidence reached on a “winning” treatment. …Not confidence-inspiring.

This isn’t the first time we’ve seen an A/A test do this. And it won’t be the last. It really is one of the big challenges with testing on a site that doesn’t get loads of traffic. My traffic is still sporadic for my young li’l company. Most startups and small bizzes deal with fluctuations in traffic and often low traffic. But we’re all trying to test! And if we continue relying on the testing tool’s calculations, we’re going to end up with false positives all over the place. What to do, what to do???

Lance Jones

Hey Kris — thank you for reading and leaving a thoughtful comment!

You’re so right about environmental factors, but A/B testing tools are supposed to factor out any such effects by randomizing which version each visitor sees. If you get a flood of traffic from someone else’s blog, that traffic is supposed to be equally distributed across the test variations — eliminating any effects of that particular traffic source’s increased motivation, etc.
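
As a rough picture of what that randomization can look like, here is a hypothetical Python sketch that assigns each visitor to a variation by hashing a visitor ID. The function name, the ID format, and the use of SHA-256 are assumptions for illustration. Each testing tool has its own assignment mechanism, which is not documented here.

```python
import hashlib


def assign_variation(visitor_id, experiment,
                     variations=("original", "variation_1")):
    """Deterministically map a visitor to one variation of an experiment.

    Hashing the experiment name together with the visitor ID gives a stable,
    roughly even split that does not depend on where the visitor came from.
    """
    key = f"{experiment}:{visitor_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    bucket = int.from_bytes(digest[:8], "big") % len(variations)
    return variations[bucket]


if __name__ == "__main__":
    # Hypothetical visitor IDs; the same ID always lands in the same bucket.
    for visitor in ("visitor-1", "visitor-2", "visitor-3"):
        print(visitor, assign_variation(visitor, "home-page"))
```

An even split balances traffic sources on average, but with a small sample the two halves can still differ by chance, which is exactly what an A/A test can expose.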

http://www.sat-essay.net Rodney

Thanks for posting this. It makes me wonder how many times I’ve been fooled by a false positive.