Find important customer segments in A/B tests

When A/B testing at uSwitch, our goal is often to find which of two designs (or ‘variants’) gives a better performance metric for the users who see it.

For example, we might look at the success rate (proportion of users who perform a specific action) of users who saw design A and compare it to the success rate for users who saw design B, then pick the design with the higher success rate to roll out on the site.
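As a minimal Python sketch of this comparison: the observations below are made up, and the function name `success_rates` is an illustrative assumption rather than anything we use in production.

```python
from collections import Counter

# Hypothetical observations as (variant_id, success) pairs; the values are made up.
observations = [
    ("A", True), ("A", False), ("A", True), ("A", True),
    ("B", False), ("B", True), ("B", False), ("B", False),
]

def success_rates(rows):
    """Return {variant_id: proportion of successful observations}."""
    totals = Counter()
    successes = Counter()
    for variant_id, success in rows:
        totals[variant_id] += 1
        successes[variant_id] += int(success)
    return {variant: successes[variant] / totals[variant] for variant in totals}

rates = success_rates(observations)
# The variant with the higher overall success rate.
best = max(rates, key=rates.get)
print(rates, best)
```

With so few observations the comparison means little, of course; the sketch only shows the arithmetic.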

These overall metrics can hide big differences for isolated customer segments and, for this reason, we often break out these metrics by categorical variables like browser, device type, traffic source or date.

A variable value (e.g., device_type = 'mobile') might mark users with a large difference in the performance metric between the two designs, and this difference often needs to be investigated: this can uncover unexpected interaction differences and technical bugs present in one design but not the other.

At uSwitch, we have a method and an R function that help us find these important customer segments quickly and reliably. Before we get to that, we need to explore the question:

What are the most important customer segments to investigate?

We assume here that we are running an A/B test on a website with an equal split of traffic going to each variant. Each observation in the test has a success variable which marks whether the observation was associated with a successful action.

We’ll take ‘segment’ to mean a single value of a category we see in the dataset. An example of a segment would be the 'Chrome' value of the category 'browser'.

We’ll take ‘most important to investigate’ to mean ‘strongest evidence for a difference in success rate between the two variants’. We’ll measure this with the p-value from a chi-squared test, with lower p-values indicating stronger evidence for a difference.
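To make the p-value concrete, here is a rough Python sketch of a chi-squared test on the 2x2 table of variant against outcome. The function name `chi_squared_p_value` is hypothetical, and the sketch uses the plain Pearson statistic without a continuity correction, so it may differ slightly from implementations that apply one.

```python
import math

def chi_squared_p_value(success_a, fail_a, success_b, fail_b):
    """Pearson chi-squared test (no continuity correction) on a 2x2 table.

    Returns the p-value for the null hypothesis that both variants share
    the same success rate. Returns 1.0 when a row or column total is zero,
    since the statistic is undefined there.
    """
    n = success_a + fail_a + success_b + fail_b
    row_a, row_b = success_a + fail_a, success_b + fail_b
    col_success, col_fail = success_a + success_b, fail_a + fail_b
    if 0 in (row_a, row_b, col_success, col_fail):
        return 1.0
    statistic = (
        n * (success_a * fail_b - fail_a * success_b) ** 2
        / (row_a * row_b * col_success * col_fail)
    )
    # For 1 degree of freedom, P(X > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(statistic / 2))

# Made-up counts: 60/100 successes for A against 40/100 for B.
p_different = chi_squared_p_value(60, 40, 40, 60)
# Identical success rates give no evidence of a difference.
p_same = chi_squared_p_value(50, 50, 50, 50)
print(p_different, p_same)
```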

The meanings here for importance and segment aren’t the only useful ones; I’ve suggested some more later.

Now that we've made the question less vague, we can move on to:

The Method

We start with data from an A/B test that looks like

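| uuid | variant_id | success | browser | device_type | traffic_medium |
|------|------------|---------|---------|-------------|----------------|
| 1    | A          | fail    | Chrome  | desktop     | direct         |
| 2    | B          | success | Firefox | desktop     | organic        |
| 3    | A          | success | Safari  | mobile      | direct         |
| 4    | A          | success | Chrome  | desktop     | email          |
| 5    | B          | fail    | Chrome  | desktop     | email          |
| 6    | B          | success | Safari  | tablet      | paid-search    |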

Here, 'success' records whether or not the user completed the action the test aims to influence, 'uuid' is a user ID, 'variant_id' is the test bucket the user is in, and the rest of the fields are categorical variables about the user.

The defining characteristics of this dataset are:

Each row is a unique observation in the test

There is a variable for the test bucket the observation is in

There are several categorical variables

Next:

For every category value in the test data, filter the dataset down to include only observations that fall into that category value, then run a chi-squared test on the filtered observations to compare success rates between the two buckets.

The result is a table with each row corresponding to exactly one unique category and value pair, along with the associated total observations metrics and p-value. This table should be ordered by ascending p-value.
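The steps above can be sketched in Python roughly as follows. This is not our R function: the names `rank_segments` and `make_rows`, the output column names, and the data are all assumptions made for illustration. Rather than filtering the dataset once per segment, the sketch counts successes and failures per segment in a single pass, which gives the same 2x2 table for each segment.

```python
import math
from collections import defaultdict

def chi_squared_p_value(success_a, fail_a, success_b, fail_b):
    """Pearson chi-squared p-value (1 degree of freedom, no continuity correction)."""
    n = success_a + fail_a + success_b + fail_b
    margins = (success_a + fail_a, success_b + fail_b,
               success_a + success_b, fail_a + fail_b)
    if 0 in margins:
        return 1.0  # statistic undefined; treat as no evidence
    statistic = n * (success_a * fail_b - fail_a * success_b) ** 2 / math.prod(margins)
    return math.erfc(math.sqrt(statistic / 2))

def rank_segments(rows, categories, variants=("A", "B")):
    """Return one result per (category, value) pair, sorted by ascending p-value."""
    # counts[(category, value)][variant] = [successes, failures]
    counts = defaultdict(lambda: {variant: [0, 0] for variant in variants})
    for row in rows:
        outcome = 0 if row["success"] == "success" else 1
        for category in categories:
            counts[(category, row[category])][row["variant_id"]][outcome] += 1

    results = []
    for (category, value), by_variant in counts.items():
        (s_a, f_a), (s_b, f_b) = by_variant[variants[0]], by_variant[variants[1]]
        results.append({
            "category": category,
            "value": value,
            "total_a": s_a + f_a,
            "total_b": s_b + f_b,
            "p_value": chi_squared_p_value(s_a, f_a, s_b, f_b),
        })
    return sorted(results, key=lambda result: result["p_value"])

def make_rows(variant, device, successes, failures):
    """Build made-up observations for one variant and device type."""
    base = {"variant_id": variant, "browser": "Chrome", "device_type": device}
    return ([dict(base, success="success")] * successes
            + [dict(base, success="fail")] * failures)

# Hypothetical data: no difference on desktop, a clear difference on mobile.
rows = (make_rows("A", "desktop", 50, 50) + make_rows("B", "desktop", 50, 50)
        + make_rows("A", "mobile", 60, 40) + make_rows("B", "mobile", 40, 60))

ranked = rank_segments(rows, ["browser", "device_type"])
for result in ranked:
    print(result)
```

In this made-up data the mobile segment comes out on top, ahead of the all-Chrome browser segment (which mixes the mobile difference with the flat desktop traffic), with desktop last.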

From this, we first see the segments of users with the strongest evidence for a difference in success rates. We can also balance this by looking at the total observations columns to check the relative sizes of the buckets, which can affect the order in which we investigate the top few segments.

Other interpretations of the question

There are other ways to interpret the question 'What are the most important customer segments to investigate?' given the dataset format.

Customer segment could be expanded to include combinations of category values (e.g., pairs of category variable values).
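As a small sketch of what pairs of category values could look like in Python (the helper `pair_segments` is hypothetical), each observation can be mapped to one segment per pair of categories:

```python
from itertools import combinations

def pair_segments(row, categories):
    """Yield ((cat_1, cat_2), (value_1, value_2)) for every pair of categories."""
    for cat_1, cat_2 in combinations(categories, 2):
        yield (cat_1, cat_2), (row[cat_1], row[cat_2])

# The third observation from the example dataset above.
row = {"browser": "Safari", "device_type": "mobile", "traffic_medium": "direct"}
segments = list(pair_segments(row, ["browser", "device_type", "traffic_medium"]))
for segment in segments:
    print(segment)
```

Bear in mind that the number of segments grows quickly with combinations, so each segment holds fewer observations and more low p-values will turn up by chance.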

‘Most important’ could mean:

Largest expected difference in success rates between the two buckets;

Largest 5% lower bound on the difference in success rates, assuming success rates for each bucket for that segment were generated from a uniform distribution (i.e., an uninformative prior).
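One possible reading of the lower-bound idea, sketched in Python: with a uniform prior, the posterior for each bucket's success rate is Beta(1 + successes, 1 + failures), and the 5% lower bound can be estimated by sampling the difference. The function name `lower_bound_on_difference`, the choice of the signed difference B minus A, and the number of draws are all assumptions of this sketch.

```python
import random

def lower_bound_on_difference(success_a, fail_a, success_b, fail_b,
                              quantile=0.05, draws=20000, seed=0):
    """Estimate the given posterior quantile of (rate_B - rate_A).

    Each rate gets a uniform Beta(1, 1) prior, so its posterior is
    Beta(1 + successes, 1 + failures). The quantile is estimated by
    Monte Carlo sampling, so the result is approximate.
    """
    rng = random.Random(seed)  # fixed seed keeps the estimate reproducible
    differences = sorted(
        rng.betavariate(1 + success_b, 1 + fail_b)
        - rng.betavariate(1 + success_a, 1 + fail_a)
        for _ in range(draws)
    )
    return differences[int(quantile * draws)]

# Made-up counts: B converts 60/100 against 40/100 for A.
bound_different = lower_bound_on_difference(40, 60, 60, 40)
# Identical counts: the lower bound should fall below zero.
bound_same = lower_bound_on_difference(50, 50, 50, 50)
print(bound_different, bound_same)
```

A positive lower bound suggests B is very likely better than A for that segment; to catch differences in either direction, the same sampling could be applied to the absolute difference instead.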

Final thoughts

There is a danger of data dredging and p-value hacking if this method is used incorrectly. Because many segments are tested at once, some low p-values will appear by chance alone, so at no point should the low p-values here be presented as an indication of a significant change in the success rate. Instead, this method should be used for guiding investigations.

That being said, the method described in this post has helped us at uSwitch several times by highlighting customers who were experiencing issues with recently launched A/B tests. It has saved us time and effort, and improved the consistency with which we spot issues and unexpected design consequences in tests. Hopefully it can help you in a similar way.