Rebuilding Trust In Other People’s A/B Tests

Looking at and using other people’s A/B tests can be extremely beneficial to designers, marketers, data scientists and business owners. By observing patterns of effects that repeat under certain conditions with various degrees of probability, we open the door to prediction. When we compare multiple test results, including those of other people, we grow our understanding, our confidence and our chances of predicting effects correctly. At the same time, much has been written about maintaining a healthy dose of skepticism when looking at tests as a source of best practices. Skepticism is important, but some of the harsher criticisms of our own confidence and of meaningless tests may lead readers to misinterpret the message and end up in a cynical dead end: trusting no one’s experiments but their own, or avoiding experimentation altogether. To dispel such unscientific side effects, I wish to encourage the search for repeatable patterns while keeping 5 pointers in mind when looking at each other’s A/B tests.

The Benefit Of Learning From Others

People might be driven to learn from each other’s successes and failures because they believe that the same success or failure is more likely than not to repeat. For those effects that are in fact more general and do repeat, it makes total business sense to understand when, how and under which conditions they hold true. Running your own experiments has the benefit of testing something specific to your audience and context (and should be done), but it also takes more time and effort. Hence copying others or using past test data can be an immense shortcut and advantage (if successful). Let’s not forget that imitating other people is also baked into our childhood development through thousands of years of evolution. Perhaps our chances of survival are higher if we build on the mistakes and learnings of our parents (and neighbours or competitors) rather than having to discover everything on our own from scratch. So with that, here are some of the things we might look for before we place our trust in an A/B test that belongs to someone else.

1. Understand What Exactly Is Being Measured And How

The first thing we look for when glancing over a published A/B test is how it defines a conversion. We want to see a conversion defined as specifically as “signups measured by visits to a post-signup page” and not just as “a conversion” or “signup” in general. A test can measure different metrics at different depths. We already know that some of the shallower metrics (such as clicks) can often be manipulated or inflated without carrying through to the deeper metrics that actually matter for a business (real signups or sales). To build additional confidence in a test, it’s good to see exactly how the metric is measured.

2. Understand The Strength Of The Effect

Secondly, we want to have a solid grasp of how strong a test’s effect really is by looking at its sample size, effect magnitude, p-value and number of conversions. Knowing only the reported effect is simply not enough. As an example, when a test claims a 106% enrollment uplift without showing the complete conversion and sample size data, it might be a good reason to step back. Our skepticism also increases when we see a small change with big results (+40% or so), which is possible but rather rare. Before we put more weight on a reported effect, we might wish to see a high number of conversions from a large sample, a very low p-value, and minimal overlap between the effect ranges of the variations. (On the GoodUI Evidence project we have recently tagged strong tests as ones having at least 300 conversions per variation and p-values lower than 0.03.)
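To make these checks concrete, here is a minimal Python sketch of how one might assess a published test from its raw counts. It assumes a two-sided two-proportion z-test and simple 95% normal-approximation ranges, which is only one of several reasonable choices and not necessarily how any given publisher or the GoodUI Evidence project computes its figures. The function name, its parameters and the example counts are hypothetical; only the default thresholds (300 conversions per variation, p-value below 0.03) come from the text above.

```python
import math


def assess_test(conv_a, n_a, conv_b, n_b, min_conversions=300, max_p=0.03):
    """Summarise an A/B test from raw counts (illustrative sketch only).

    conv_a, conv_b: conversions in the control (A) and variation (B).
    n_a, n_b: visitors in each group. Assumes conv_a > 0 and n_a, n_b > 0.
    """
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b

    # Two-sided two-proportion z-test using the pooled conversion rate.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se_pooled = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / se_pooled
    p_value = math.erfc(abs(z) / math.sqrt(2))

    # 95% normal-approximation range for each conversion rate.
    def rate_range(rate, n):
        half_width = 1.96 * math.sqrt(rate * (1 - rate) / n)
        return (rate - half_width, rate + half_width)

    range_a = rate_range(rate_a, n_a)
    range_b = rate_range(rate_b, n_b)

    return {
        "relative_uplift": (rate_b - rate_a) / rate_a,
        "p_value": p_value,
        "range_a": range_a,
        "range_b": range_b,
        "ranges_overlap": range_a[1] >= range_b[0] and range_b[1] >= range_a[0],
        # "Strong" in the sense of the thresholds quoted above.
        "strong": min(conv_a, conv_b) >= min_conversions and p_value < max_p,
    }


# Hypothetical counts: 300 of 10,000 visitors convert on A, 400 of 10,000 on B.
large = assess_test(300, 10_000, 400, 10_000)

# Hypothetical counts: a +100% uplift claimed from only 10 vs. 20 conversions.
small = assess_test(10, 100, 20, 100)

print(large["strong"], small["strong"])
```

With these made-up numbers, the small test reports the larger uplift yet fails the conversion threshold, which is exactly the situation where the reported effect alone would mislead.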

3. Compare The Test To Other Similar Tests

Next on the path to building confidence in other people’s tests is an honest assessment of how many similar tests we have observed with similar effects. If we are acting on the result of a single A/B test, we should not trust it as much as a change that behaves similarly in 8 out of 10 tests.
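One rough way to tally such agreement is a sign test: count how many similar tests moved in the same direction and ask how surprising that split would be if the change had no real effect. The Python sketch below is an illustration under that assumption, not a description of how GoodUI Evidence compares tests. The function name and the list of effects are hypothetical, and a sign test ignores effect size and sample size, so it complements the per-test checks rather than replacing them.

```python
from math import comb


def direction_agreement(effects):
    """Count effect directions and run a two-sided sign test.

    effects: relative effects from similar tests (e.g. 0.12 for +12%).
    Tests with an effect of exactly zero are left out of the tally.
    """
    positives = sum(1 for e in effects if e > 0)
    negatives = sum(1 for e in effects if e < 0)
    n = positives + negatives
    k = max(positives, negatives)

    # Probability of a split at least this lopsided if each direction
    # were equally likely (binomial with p = 0.5), doubled for two sides.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return {
        "positive": positives,
        "negative": negatives,
        "p_value": min(1.0, 2 * tail),
    }


# Hypothetical effects from 10 similar tests: 8 positive, 2 negative.
similar_tests = [0.12, 0.05, 0.20, 0.08, 0.03, 0.15, 0.07, 0.10, -0.04, -0.02]
print(direction_agreement(similar_tests))
```

For an 8 out of 10 split this gives a p-value of about 0.11, so the count of agreeing tests is best read together with the strength of each individual test.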

4. Understand The Number Of Changes

It’s also important to keep an eye on the number of changes within a test. As A/B tests introduce changes to a UI, they are also prone to including unintentional ones. As an example, a test might claim that a button’s contrast was increased, when in reality the button was also moved higher up the page. When multiple changes contribute to an effect, it is no longer valid to attribute the effect to only one of them. The effect could have been caused by both changes, by either one of them, or by an interaction between them.

5. Check For Traces Of Publication Bias

Finally, one last thing to look for in published A/B tests is how much we can trust the publisher. This is probably the most subjective pointer and the hardest to verify, but some questions worth asking include: Does the publisher also share failed or insignificant tests? Is the publisher incentivized to show only successful tests, or only certain tests? What is the submission and selection process? It’s probably good to look for sources that also include failed tests and not just the pretty green ones.

Launching The GoodUI Evidence Project

With this in mind, I’d like to announce the start of a new and exciting project where we hope to collect and compare people’s A/B tests while following the above guidelines: meet GoodUI Evidence. The idea is this. If we can compare similar A/B tests together, openly and transparently, we should help people build more confidence, faster, and increase our chances of predicting the effects of some of the more general UI patterns. If things unfold as hoped, with enough momentum and data, I hope that one day we’ll earn the right to call some UI ideas, tactics and practices truly “best” or at least “better than others”. But until that happens, we keep on testing and sharing our findings with a dose of healthy skepticism.