Blog: Empirically Sound AB Testing: A Guide for Non-Statisticians

AB testing is the primary method by which most technology companies decide between product changes. An AB test compares two versions of the same product (Version A and Version B) that are identical except for one variation that might affect a user's behaviour. The team then observes the difference in performance between the two versions to estimate the effect of the change. Essentially, if Version B outperforms Version A, you adopt the change from Version B.

AB tests are a form of statistical hypothesis testing that use random assignment to infer causality from product changes. If any of the terms in the last sentence are unfamiliar to you, don’t panic! Below, I’ll walk through a strong process for conducting not only empirically valid tests, but also tests that are measuring the correct criteria for your product.

Step 1: Determine Your Success Criteria

What does success look like in your experiment? There are two key considerations here.

The first is choosing a metric that aligns with the value users get from your website. To better explain this idea, let's suppose we're running a test for Airbnb Categories, a new feature announced last week that aims to help people find the perfect home for their trips by increasing the number of "filters" users can add to their searches. There are many different metrics we could track to evaluate the efficacy of this feature, including total searches, the number of filters selected, and the number of homes clicked. So how do we decide which metrics matter most?

Bonnie Barrilleaux, a Staff Data Scientist at LinkedIn, suggests that the solution lies in identifying the Member Value Metric – a measure of the value created by a product. This measure, she suggests, can be found by looking at the user journey associated with the product and identifying where the value is created. For our Airbnb example, the user journey runs roughly: search for homes, apply category filters, click into listings, and book a home.

Since we’re testing the Categories feature, it may be tempting to use something like the number of searches with categories or number of categories selected per search for our success criteria. That, however, is not where the value is created. The value is created for a user when they find the perfect home for their trip. A proxy measure for this point is when they successfully book a home.

Tip #1: Your Success Metric should be at the point of user value creation.

An adjacent point is to avoid vanity metrics – numbers that go up easily but say little about user value, such as total page views, registered accounts, or app downloads.

Step 2: Determine Test Parameters

Now that we've determined what a successful outcome looks like, we need to consider when an outcome counts as a success.

For example, if 50 people see Version B of our site and they book 25% more often than those who see Version A, is that considered a success?

To answer this question, we need to determine three things:

Sample Averages

Margin of Error

Sample Size

I’ll explain these statistical concepts by continuing our Airbnb example. Let’s say our test of the new search feature yields the following results:

Version A: 1000 users, 150 Booked a Home

Version B: 1000 users, 300 Booked a Home

What is the average booking rate for Version B? You may be tempted to say 30%, but 30% is just the sample average for this 1,000-person sample.
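To make the arithmetic concrete, here is how the sample booking rates fall out of the hypothetical numbers above (these counts are illustrative, not real Airbnb data):

```python
# Hypothetical results from the Airbnb example above
version_a = {"users": 1000, "bookings": 150}
version_b = {"users": 1000, "bookings": 300}

rate_a = version_a["bookings"] / version_a["users"]  # sample booking rate, Version A
rate_b = version_b["bookings"] / version_b["users"]  # sample booking rate, Version B

print(f"Version A sample average: {rate_a:.0%}")  # 15%
print(f"Version B sample average: {rate_b:.0%}")  # 30%
```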

AB tests find the average success rate of a new feature for a certain sample of users and attempt to estimate a range in which the average success rate for all users (i.e. the population) lies.

To determine this range, we may decide to set a margin of error of +/- 1%. This is similar to the margins of error you may see in political polls, where a polling organization surveys a subset of people and attempts to report the whole population's opinion about a candidate or party.

Intuitively, the margin of error means that, given the sample success rate, the population's true success rate is estimated to lie within +/- 1% of it. Every test has a margin of error (it cannot be zero), and the smaller you want the margin of error to be, the larger the sample size you will need (more on that below).
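As a sketch of how that range is computed, here is the standard normal-approximation interval for a sample proportion, applied to Version B's 30% booking rate. The post doesn't prescribe a specific method, so treat this as one common choice rather than the only one:

```python
from math import sqrt
from statistics import NormalDist

p_hat, n = 0.30, 1000            # Version B's sample booking rate and sample size
z = NormalDist().inv_cdf(0.975)  # ~1.96, for a 95% confidence level

standard_error = sqrt(p_hat * (1 - p_hat) / n)
margin_of_error = z * standard_error

low, high = p_hat - margin_of_error, p_hat + margin_of_error
print(f"margin of error: +/- {margin_of_error:.1%}")  # about +/- 2.8%
print(f"95% interval for the population rate: {low:.1%} to {high:.1%}")
```

Notice that with 1,000 users the margin of error works out to roughly +/- 2.8%; getting it down to the +/- 1% mentioned above would require a much larger sample.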

Tip #2: Understand that the success rate of your AB test is representative of the sample, not the population. Include a margin of error to understand where your population success rate may lie.

The last thing to understand is how big the sample size needs to be to have a valid test. The sample size is how many people you need to have in your experiment to make the results representative of the population at the margin of error you’ve selected. While there are formulas to calculate the exact number, I recommend using a sample size calculator.
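If you'd like to see what such a calculator does under the hood, here is the standard two-proportion sample-size formula. The 5% significance level and 80% power are conventional defaults, and the baseline and target booking rates below are assumptions for illustration, not numbers from the post:

```python
from math import ceil, sqrt
from statistics import NormalDist

def sample_size_per_group(p1, p2, alpha=0.05, power=0.80):
    """Users needed in EACH group to detect a change from rate p1 to rate p2."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance threshold
    z_power = NormalDist().inv_cdf(power)          # desired statistical power
    p_bar = (p1 + p2) / 2                          # pooled rate under H0
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_power * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p1 - p2) ** 2)

# e.g. detecting a lift in booking rate from 15% to 18% (assumed numbers)
print(sample_size_per_group(0.15, 0.18))  # 2402 users per group
```

Note how sensitive the answer is to the size of the effect you want to detect: halving the detectable lift roughly quadruples the required sample.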

Step 3: Ensure the Integrity of Your Results

Now that the foundation has been laid, there are a few other things to consider to make sure your results are relevant:

Do the results of group B affect the results of group A or vice versa?

In our Airbnb example, people using the new filter features may be so successful that they are booking many more homes. This affects group A! Hypothetically, if group B booked 100% of the time they searched, then there would be fewer homes available for group A, meaning their booking rate may go down. Be cautious of this problem and design the test so that the groups' results are independent.

Decide on a time period in advance and stick to it!

You may be tempted to stop the test early if the results are looking good. Doing this, however, may prevent your results from truly reflecting your user base. A general rule of thumb in statistics is that the more data points (in this case, users) you have, the more accurate the results will be. Brown University has an awesome visualization of why this is the case.
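To see why peeking is dangerous, here is a small simulation I've added for illustration (it is not from the original post). Both groups convert at the same 15% rate, so any "significant" result is a false positive; checking repeatedly and stopping at the first significant reading triggers far more false alarms than checking once at the pre-decided end point:

```python
import random
from statistics import NormalDist

def z_test_p_value(successes_a, n_a, successes_b, n_b):
    """Two-sided p-value for a two-proportion z-test with a pooled rate."""
    p_pool = (successes_a + successes_b) / (n_a + n_b)
    se = (p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b)) ** 0.5
    if se == 0:
        return 1.0
    z = (successes_b / n_b - successes_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

random.seed(42)
SIMS, N, RATE = 300, 2000, 0.15               # A/A test: both groups convert at 15%
CHECKPOINTS = [400, 800, 1200, 1600, 2000]    # interim "peeks" plus the final look

peek_hits = final_hits = 0
for _ in range(SIMS):
    a = [random.random() < RATE for _ in range(N)]
    b = [random.random() < RATE for _ in range(N)]
    p_values = [z_test_p_value(sum(a[:k]), k, sum(b[:k]), k) for k in CHECKPOINTS]
    peek_hits += any(p < 0.05 for p in p_values)  # stop at the first "significant" peek
    final_hits += p_values[-1] < 0.05             # only look once, at the end

print(f"false positive rate with peeking:  {peek_hits / SIMS:.2f}")
print(f"false positive rate, fixed horizon: {final_hits / SIMS:.2f}")
```

The fixed-horizon rate stays near the 5% you chose; the peeking rate is substantially higher, which is exactly the trap stopping early sets for you.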

Further Reading

If you’ve followed the steps above, you can more confidently launch AB tests for your product. If you’d like to dig into the ideas above in more detail, however, here are some great resources to check out: