The Conversion Certainty Framework: Building Certainty From Multiple A/B Tests

Whether or not effects from similar UI changes will repeat across web sites is a useful thing to know for anyone doing conversion optimization. That’s why we started using the Conversion Certainty Framework for comparing and grouping multiple a/b tests together. It provides us with a quantifiable degree of certainty about whether or not an individual change will replicate were it to be tested or implemented. It’s a very simple framework based on the idea that the more similar repeating tests we have, the more sure we can be that future tests will also repeat. This essentially allows us to move towards prediction using past test data – a very useful thing indeed if done right. Here are the steps we follow to arrive at an expression of certainty for a given UI change.

Step 1: Grouping Similar Tests

The first thing we need is a set of similar a/b tests to compare. The best tests for comparison are ones which have isolated a similar change (e.g. removal of a particular form field) as closely as possible. Tests which have grouped multiple changes together into a single variant (larger redesigns) are of less interest here, as the cause of the effect is most likely diluted. The chosen tests can be ones which you’ve run yourself or ones that have been run and shared by others. In order to combat publication bias (a tendency to share positive results more often than negative ones), do try hard to seek out the losers as well.

Step 2: Quantifying Certainty Based On Test Strength

For each of the tests which we collect and decide to compare, we then assign a score based on the strength of the test. We do this using a combination of significance levels and conversion count thresholds. For positive tests we assign a positive value. For negative tests we assign a negative value.

±1.0 Certainty From A Strong Test

The most certainty is attributed to fully transparent tests with a high significance. We give a +1 or -1 certainty for tests which have at least 300 conversions (for a given variant) and a p-value of 0.03 or lower.

±0.5 Certainty From A Possible Test

Many tests are weaker and only suggestive, but we do want to make use of them as well. Therefore we assign a +0.5 or -0.5 certainty for tests which have at least 100 conversions (for a given variant) and a p-value of 0.25 or lower.

±0.25 Certainty From An Insignificant Or Non-Transparent Test

And then there are tests which are either insignificant or whose authors decided not to publish their absolute conversion counts / sample size data (as in most blog posts). Since without full data transparency we cannot gauge how strong a test really is, we only assign the lowest certainty of +0.25 or -0.25 in this case.
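The three tiers above can be sketched as a small Python function. This is only an illustration under stated assumptions: the function name certainty_score and its parameters are hypothetical, and treating every transparent test that misses both thresholds as "insignificant" (±0.25) is our reading of the rules rather than something spelled out above.

```python
from typing import Optional


def certainty_score(positive: bool,
                    conversions: Optional[int] = None,
                    p_value: Optional[float] = None) -> float:
    """Return the certainty contribution of a single test.

    positive    -- True if the variant won, False if it lost
    conversions -- conversions for the variant, or None if unpublished
    p_value     -- reported p-value, or None if unpublished
    """
    if conversions is None or p_value is None:
        magnitude = 0.25  # non-transparent test: data not published
    elif conversions >= 300 and p_value <= 0.03:
        magnitude = 1.0   # strong test
    elif conversions >= 100 and p_value <= 0.25:
        magnitude = 0.5   # possible test
    else:
        magnitude = 0.25  # insignificant test (assumed catch-all)
    # Winning tests count in favour of the change, losing tests against it.
    return magnitude if positive else -magnitude
```

For instance, a winning test with 350 conversions and a p-value of 0.01 would score +1.0, while a losing test with 150 conversions and a p-value of 0.2 would score -0.5.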

Step 3: Adding Everything Together

Finally, all certainty counts are added together to arrive at a net sum. We are then left with a positive or negative expression of certainty in favour of or against our UI change, which we can use to predict whether its effect will or will not repeat. The larger the sum in either direction, the more sure we are of that prediction.

Example: +1.25 Certainty That Removing Coupon Fields Increases Sales

Here is a simple certainty count based on two similar tests which have removed the coupon or gift fields from a checkout page. The first test showed a strong +2.6% increase to sales from the removal of a coupon code field, and hence we assign a +1 certainty to it. The second test showed a +24% increase to revenue (and sales) from the removal of a gift field. Since we did not see full conversion counts / sample size data, we only assign a +0.25 certainty to it. Given these two tests, we now have a +1.25 certainty that doing a similar change on another checkout page will also result in a similar positive effect. It’s not much, but it’s definitely a start, and the count will be updated as we continue to seek out similar tests.
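The arithmetic behind this example is just a sum of the per-test scores. A minimal Python sketch, where net_certainty is a hypothetical helper name and the two scores are the ones assigned above:

```python
def net_certainty(scores):
    """Add up per-test certainty scores into a net sum."""
    return sum(scores)


coupon_field_test = 1.0   # strong test: +2.6% increase to sales
gift_field_test = 0.25    # non-transparent test: +24% increase to revenue

total = net_certainty([coupon_field_test, gift_field_test])
print(f"{total:+.2f}")  # prints "+1.25"
```

If a losing test of the "possible" kind (-0.5) were later found for the same change, which is a made-up scenario and not a real result, the same sum would drop to +0.75.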

By the way, this observation is an example comparison of what we offer through our Datastories publication. To help you become more certain about top converting ideas, each month we make new and update existing observations with our analysis. Stay ahead and increase your chance of finding winning tests by subscribing.

Extending With Additional Certainty

It’s important to note here that looking at past a/b tests is only one source of certainty. Our framework easily allows for the expression of additional certainty from other sources such as: subjective expressions (we’ve allowed anywhere between 0 and 3 subjective certainty points in the past), qualitative research (+1 certainty point from customer insights), and analytics (+1 point from GA analysis).
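One way these extra sources might be folded in is by adding them to the net sum from past tests. The Python sketch below assumes simple addition; the function total_certainty and its parameter names are hypothetical, and the point values are the ones mentioned above.

```python
def total_certainty(test_scores, subjective=0.0,
                    qualitative=False, analytics=False):
    """Combine past-test certainty with other sources (assumed additive)."""
    if not 0 <= subjective <= 3:
        raise ValueError("subjective certainty must be between 0 and 3")
    total = sum(test_scores) + subjective
    if qualitative:
        total += 1.0  # +1 point from customer insights
    if analytics:
        total += 1.0  # +1 point from analytics (e.g. GA analysis)
    return total
```

With the coupon field example, total_certainty([1.0, 0.25], subjective=2, qualitative=True) would come to 4.25.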

Two Use Cases For Having Certainty About UI Changes

There are at least two powerful use cases for a framework such as ours which allows for the prediction of effects from various UI changes.

Use Case I: Deciding Whether To Test Or To Implement

When you are optimizing you don’t need to test everything. Optimization is actually a balancing game of exploration vs. exploitation. When you decide to test, you explore and generate certainty. When you decide to implement a change, you exploit and make use of your existing certainty. Testing is usually slower than implementation (assuming a healthy dev team and relatively low effort changes). Hence, this quantifiable measure of certainty could potentially be used as a deciding input for whether to test or implement a UI change. As an example: a company might define its “implementation standard” as requiring a minimum certainty of +6. If a UI change does not have this certainty, the change moves into further testing. If, on the other hand, a similar change has been observed to repeat numerous times, it may be worthy of exploitation (implementation), bypassing testing.
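As a rough Python sketch, the test-or-implement decision could look like the following. The function name decide is hypothetical, and the default threshold of 6 is only the example figure above, not a recommendation.

```python
def decide(certainty: float, implementation_standard: float = 6.0) -> str:
    """Return "implement" if certainty meets the standard, else "test"."""
    if certainty >= implementation_standard:
        return "implement"  # exploit the certainty we already have
    return "test"           # explore and generate more certainty
```

Under this sketch the coupon field example, at +1.25, would still go to testing. How to treat strongly negative certainty is not covered here and is left out of the sketch.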

Use Case II: Testing Idea Prioritization

The second use case for our framework is to better prioritize testing ideas. The framework moves beyond a purely subjective estimate to one which actually incorporates past test data. If this turns out to be an effective way to predict future tests, choosing to test higher certainty ideas ahead of lower ones should also lead to more winning tests. As an example, a change which has been shown to generate a positive effect in 4 out of 5 tests has a higher chance of repeating in future ones.
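Prioritizing by certainty amounts to sorting ideas by their net sum. In the Python sketch below, the helper prioritize is hypothetical, and so are ideas B and C and their scores. Only the coupon field scores come from the example earlier in this post.

```python
def prioritize(ideas):
    """Return idea names ordered from highest to lowest net certainty."""
    return sorted(ideas, key=lambda name: sum(ideas[name]), reverse=True)


# Per-test certainty scores for each idea (B and C are made up).
ideas = {
    "hypothetical idea C": [-1.0, 0.25],        # net -0.75
    "remove coupon field": [1.0, 0.25],         # net +1.25
    "hypothetical idea B": [0.5, -0.5, 0.25],   # net +0.25
}

for name in prioritize(ideas):
    print(f"{sum(ideas[name]):+.2f}  {name}")
```

Ideas at the top of the list would be tested first. Ideas with a negative net sum end up at the bottom, since past tests count against them.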

With this in mind, we’re beginning to apply this framework ourselves on most of our projects and hope to share what we find in a number of months in a follow up post.