Conversion Optimization Testing: Validity threats from running multiple tests at the same time

A/B testing is popular among marketers and businesses because it gives you a way to determine what really works between two (or more) options.

However, truly extracting value from your testing program requires more than simply throwing some headlines or images into a website testing tool. There are ways you can undermine your testing tool that the tool itself can’t prevent.

It will still spit out results for you. And you’ll think they’re accurate.

These are called validity threats. In other words, they threaten the ability of your test to give you information that accurately reflects what is really happening with your customer. Instead, you’re seeing skewed data from not running the test in a scientifically sound manner.

In the MECLABS Institute Online Testing certification course, we cover validity threats like history effect, selection effect, instrumentation effect and sampling distortion effect. In this article, we’ll zoom in on one example of a selection effect that might cause a validity threat and thus misinterpretation of results — running multiple tests at the same time — which increases the likelihood of a false positive.

Interaction Effect — different variations in the tests can influence each other and thus skew the data

The goal of an experiment is to isolate a scenario that accurately reflects how the customer experiences your sales and marketing path. If you’re running two tests at the same time, the first test could influence how customers experience the second test and therefore their likelihood to convert.

This is a psychological phenomenon known as priming. If we talk about the color yellow and then I ask you to mention a fruit, you’re more likely to answer banana. But if we talk about red and I ask you to mention a fruit, you’re more likely to answer apple.

Another way the interaction effect can threaten validity is through a selection effect. In other words, the way you advertise near the beginning of the funnel affects the type of customer you bring through your funnel and that customer’s motivations.

“We run an SEO test where a treatment that uses the word ‘cheap’ has a higher clickthrough rate than the control, which uses the word ‘trustworthy.’ At the same time, we run a landing page test where the treatment also uses the word ‘cheap’ and the control uses ‘trustworthy.’ The treatments in both tests with the ‘cheap’ language work very well together to create a higher conversion rate, and the controls in each test using the ‘trustworthy’ language work together just as well. Because of this, the landing page test is inconclusive, so we keep the control. Thus, the SEO ad with ‘cheap’ language is implemented and the landing page with ‘trustworthy’ language is kept, resulting in a lower conversion rate due to the lack of continuity in the messaging.”
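To see why the landing page test in that scenario can come out inconclusive, here is a minimal sketch with invented numbers. The conversion rates and the even traffic split between the two ads are assumptions made purely for illustration; they are not results from any real test.

```python
# Hypothetical conversion rates (in percent) for each ad / landing page pairing.
# Matching language converts well; mismatched language converts poorly.
# All numbers are invented for illustration.
rates = {
    ("cheap", "cheap"): 5.0,
    ("cheap", "trustworthy"): 3.0,
    ("trustworthy", "cheap"): 3.0,
    ("trustworthy", "trustworthy"): 5.0,
}

def landing_page_rate(page, ad_share):
    """Blended conversion rate one landing page shows across all ad traffic."""
    return sum(share * rates[(ad, page)] for ad, share in ad_share.items())

# Assumption: half of the landing page visitors arrive from each ad.
ad_share = {"cheap": 0.5, "trustworthy": 0.5}

for page in ("trustworthy", "cheap"):
    print(page, "landing page:", landing_page_rate(page, ad_share))

# What actually ships: the "cheap" ad paired with the "trustworthy" page.
print("after rollout:", rates[("cheap", "trustworthy")])
```

Under these assumptions both landing pages show the same blended rate, so the test looks like a tie, even though the combination that ends up live is one of the two weakest pairings.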

Running multiple tests and hoping for little to no validity threat

The level of risk depends on the size of the change and the amount of interaction. However, that can be difficult to gauge before, and even after, the tests are run.

“Some people believe (that) unless you suspect extreme interactions and huge overlap between tests, this is going to be OK. But it is difficult to know to what degree you can suspect extreme interactions. We have seen very small changes have very big impacts on sites,” Bartlinski said.

Another example Bartlinski provides is where there is little interaction between tests. For example, testing PPC landing pages that do not interact with organic landing pages that are part of another test, or testing separate things in mobile and desktop at the same time. “This lowers the risk, but there still may be overlap. It’s still an issue if a percentage gets into both tests; not ideal if we want to isolate findings and be fully confident in customer learnings,” Bartlinski said.

How to overcome the interaction effect when testing at the speed of business

In a perfect scientific experiment, multiple tests would not be run simultaneously. However, science often has the luxury of moving at the speed of academia. In addition, many scientific experiments are seeking to discover knowledge that can have life or death implications.

If you’re reading this article, you likely don’t have the luxury of taking as much time with your tests. You need results — and quick. You also are dealing with business risk, and not the high stakes of, for example, human life or death.

There is a way to run simultaneous tests while limiting validity threats — running multiple tests on (or leading to) the same webpage but splitting traffic so people do not see different variations at the same time.

“Running mutually exclusive tests will eliminate the above validity threats and will allow us to accurately determine which variations truly work best together,” Bartlinski said.
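One common way to make tests mutually exclusive is to assign each visitor to exactly one test before choosing a variation. The sketch below is one possible approach, not a description of how any particular testing tool works; the test names, the `assign` function and the visitor ID format are all hypothetical.

```python
import hashlib

# Hypothetical tests running on (or leading to) the same page.
TESTS = {
    "headline_test": ["control", "treatment"],
    "cta_test": ["control", "treatment"],
}

def _bucket(key: str, buckets: int) -> int:
    """Map a string to a stable bucket number in [0, buckets)."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % buckets

def assign(visitor_id: str):
    """Place a visitor in exactly one test, then pick a variation within it."""
    test_names = sorted(TESTS)
    # First hash decides which single test this visitor belongs to.
    test = test_names[_bucket("layer:" + visitor_id, len(test_names))]
    # Second hash (salted with the test name) decides the variation.
    variations = TESTS[test]
    variation = variations[_bucket(test + ":" + visitor_id, len(variations))]
    return test, variation

print(assign("visitor-123"))
```

Because the assignment is derived from a hash of the visitor ID rather than a random draw, the same visitor lands in the same test and variation on every visit, and no visitor is ever exposed to both tests.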

There is a downside, though. It will slow down testing since an adequate sample size is needed for each test. If you don’t have a lot of traffic, it may end up taking the same amount of time as running tests one after another.
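A rough back-of-the-envelope calculation shows why. The sketch below uses the standard normal-approximation sample size formula for comparing two conversion rates; the baseline rate, expected rate, daily traffic, significance level and power are all assumed values chosen for illustration, and the estimate ignores practical concerns such as weekly traffic cycles.

```python
import math
from statistics import NormalDist

def visitors_per_variation(base_rate, expected_rate, alpha=0.05, power=0.80):
    """Approximate visitors needed per variation (two-sided test of two proportions)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = base_rate * (1 - base_rate) + expected_rate * (1 - expected_rate)
    effect = expected_rate - base_rate
    return math.ceil((z_alpha + z_power) ** 2 * variance / effect ** 2)

def days_to_finish(n_per_variation, variations, daily_visitors):
    """Days needed for one test given the traffic it receives each day."""
    return math.ceil(n_per_variation * variations / daily_visitors)

# Assumed inputs: 4% baseline, hoping to detect a lift to 5%, 2,000 visitors a day.
n = visitors_per_variation(0.04, 0.05)
daily = 2000

# Two tests run one after another, each getting all of the traffic.
sequential = 2 * days_to_finish(n, 2, daily)
# Two mutually exclusive tests run in parallel, each getting half of the traffic.
exclusive = days_to_finish(n, 2, daily / 2)

print("visitors per variation:", n)
print("sequential days:", sequential, "| mutually exclusive days:", exclusive)
```

With two tests of the same size, halving the traffic per test roughly doubles each test’s duration, so the parallel, mutually exclusive setup finishes at about the same time as running the tests back to back.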

What’s the big idea?

Another important factor to consider is that the results from grouping the tests should lead to a new understanding of the customer — or what’s the point of running the test?

Bartlinski explains, “Grouping tests makes sense if tests measure the same goal (e.g., reservations), they’re in the same flow (e.g., same page/funnel), and you plan to run them for the same duration.”

If you’re running multiple tests on different parts of the funnel and aligning them, you should think of each flow as a test of a certain assumption about the customer as part of your overall hypothesis.

It is similar to a radical redesign. Much like testing multiple steps of the funnel can cause an interaction effect, testing multiple elements on a single landing page or in a single email can cause an attribution issue. Which change caused the result we see?

Bartlinski provides this example, “On the same landing page, we run a test where both the call-to-action (CTA) and the headline have been changed in the treatment. The treatment wins, but is it because of the CTA change or the headline? It is possible that the increase comes exclusively from the headline, while the new CTA is actually harming the clickthrough rate. If we tested the headline in isolation, we would be able to determine whether the combination of the new headline and old CTA actually has the best clickthrough, and we are potentially missing out on an even bigger increase.”

While running single-factorial A/B tests is the best way to isolate variables and determine with certainty which change caused a result, if you’re testing at the speed of business you don’t have that luxury. You need results and you need them now!

However, if you align several changes in a single treatment around a common theme that represents something you’re trying to learn about the customer (aka radical redesign), you can get a lift while still attaining a customer discovery. And then, in follow-up single-factorial A/B tests, narrow down which variables had the biggest impact on the customer.

Another cause of an attribution issue is running multiple tests on different parts of a landing page because you assume they don’t interact. Perhaps you run a test on two different ways to display locations on a map in the upper left corner of the page. Then a few days later, while that test is still running, you launch a second test on the same page but in the lower right corner on how star ratings are displayed in the results.

You could assume these two changes won’t have an effect on each other. However, the variables haven’t been isolated from the tests, and they might influence each other. Again, small changes can have big effects. The speed of your testing might necessitate testing like this; just know the risk involved in terms of skewed results.

To avoid that risk, you could run multivariate tests or mutually exclusive tests, which would essentially match each combination of the variables together into a separate treatment. Again, the “cost” would be that it would take longer for the test to reach a statistically significant sample size since the traffic is split among more treatments.
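As a rough sketch of how a multivariate (full factorial) setup pairs every variation of one element with every variation of the other, the snippet below enumerates the treatments for the map and star rating example. The factor names, variation labels and daily traffic figure are hypothetical placeholders.

```python
from itertools import product

# Hypothetical page elements and their variations.
factors = {
    "map_display": ["map_a", "map_b"],
    "star_ratings": ["stars_a", "stars_b"],
}

def build_treatments(factors):
    """Return one treatment per combination of variations (full factorial)."""
    names = list(factors)
    combos = product(*(factors[name] for name in names))
    return [dict(zip(names, combo)) for combo in combos]

treatments = build_treatments(factors)
for number, treatment in enumerate(treatments, start=1):
    print(number, treatment)

# Assumed traffic: the same visitors are now divided among more treatments.
daily_visitors = 2000
per_treatment = daily_visitors / len(treatments)
print("visitors per treatment per day:", per_treatment)
```

Two elements with two variations each already produce four treatments, and every additional element multiplies that count, which is why the traffic each treatment receives shrinks so quickly.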

Test strategically

The big takeaway here is that you can’t simply trust a split testing tool to give you accurate results. And it’s not necessarily the tool’s fault. It’s yours. The tool can’t possibly know the ways you are threatening the validity of your results outside that individual split test.

If you take a hypothesis-driven approach to your testing, you can test fast AND smart, getting a result that accurately reflects the real-world situation while discovering more about your customer.
