Inside The Enterprise | The PoliteMail Blog

The right way to conduct A/B testing

Remember science class? If you did any sort of basic scientific testing in the lab, odds are you’ve already conducted an A/B test. You just didn’t call it that.

Any experiment with one or more test groups and a control group is in effect an A/B test; the purpose is to observe and measure the difference in the effect.

Most often you’ll hear about A/B testing of marketing and advertising campaigns or for testing effectiveness of alternate webpage designs. This exhaustive guide from CXL largely focuses on web page testing, but many of the tips can be applied to internal email newsletters and other internal communications.

Here is a summary to help you successfully conduct A/B testing on your email messaging.

Start with your hypothesis

A hypothesis is “an assumption about certain characteristics of a population.”

As a communicator, you most likely know your audience and what is important to them. (If not, consider conducting a simple survey or digging into your analytics to find out.)

With that in mind, develop a testable hypothesis, such as “We expect that sending from the leader’s name, instead of a shared mailbox, will cause our attention rate to go up”.

The A/B basics

An A/B test (or A/B/n test if you’re testing more than two variations) is a direct comparison of the results between two or more versions of the same email.

Only one variable or element differs between the versions, so you might test a different From address, or a different subject line, or a different call to action, but not a combination of those.

A combination test is called a multivariate test, which is more complicated to analyze because it tests multiple changes (variables) at once.

To conduct the test, first decide what element you are testing and what metric will serve as your measure. You might use the open rate or attention rate when testing From addresses or subject lines, or the click-through rate when testing different calls to action.

Next, you will need samples of your list (A and B), which should be randomly selected and not overlap. To determine the size of test lists, you will need to do a little number crunching.
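To make the sampling step concrete, here is a minimal Python sketch of drawing two non-overlapping random lists from an audience (the function name and list contents are illustrative, not from any particular email tool):

```python
import random

def split_ab(audience, size_a, size_b, seed=None):
    # Shuffle a copy of the audience, then slice off two
    # non-overlapping groups: A (control) and B (variant).
    rng = random.Random(seed)
    shuffled = list(audience)
    rng.shuffle(shuffled)
    return shuffled[:size_a], shuffled[size_a:size_a + size_b]
```

Because both groups come from a single shuffle, no one lands in both lists, and every member of the audience had an equal chance of selection.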

Understanding statistical significance

You will want confidence that your test measures real cause and effect, and that the difference you see is not the result of randomness or chance. This is what statistical significance means.

While most communicators are much more comfortable with writing and design than with statistics, there is no need to be scared away by the terminology or the math. Let's walk through the concepts of sample size, confidence level, and margin of error.

In a typical A/B test, A is the control group (those who get your original email) and B is the experimental group (those who get the alternate version).

You will need enough sample data from the experiment to accurately represent the whole population. Generally, the more data the better, but there is a minimum amount, which can be calculated.

Population, confidence and margin of error

First, we need to know our population size, or audience. Let's say we are testing the From address as sent to the All Employees list. The population is not simply the count of people on your All Employees list; it is the reachable, measurable subset of that list, which we can estimate using the open rate.

When we look at the past four messages sent to the All Employees list, we see the average number of employees sent to is 12,000 and the average open rate is 81%. So, 12,000 * 0.81 = a population size of 9,720. The reason we are using averages here is to account for the variability in both the size of the list and the resulting reach.
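The same arithmetic as a quick Python sanity check (the four per-send figures below are hypothetical examples chosen to average out to the 12,000 recipients and 81% open rate above):

```python
# (recipients, open_rate) for the last four All Employees sends -- hypothetical data
sends = [(12400, 0.83), (11800, 0.80), (12100, 0.79), (11700, 0.82)]

avg_recipients = sum(n for n, _ in sends) / len(sends)   # 12,000
avg_open_rate = sum(r for _, r in sends) / len(sends)    # ~0.81
population = round(avg_recipients * avg_open_rate)       # 9,720 reachable, measurable people
```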

You've likely heard the term "margin of error" in opinion polls and survey results; it is the plus-or-minus figure included to account for the inexact nature of sampling. Technically, the margin of error is half the width of the "confidence interval," and the margin of error you want largely determines the sample size you need. Generally, the larger the sample, the tighter (smaller) your margin of error, but note that the margin shrinks with the square root of the sample size, so doubling your sample does not cut your margin of error in half. A margin of error of 5% is considered reasonable.

Next, we need to select a "confidence level," which is how certain you can be that the true result falls within the margin of error. If you conducted the same test 100 times, 95% of the time you could expect your answer to fall within the same range, plus or minus the margin of error. A higher confidence level requires a larger sample. It's worth noting that sometimes your results will land in that other 5%, which is why it's worth running tests more than once.

So, when we select the 95% confidence level, and plug in our confidence interval/margin of error of 5% and population of 9,720, we see we need a sample size of at least 370, which is surprisingly small at just 4% of your population.

Interestingly, if the example company were ten times smaller, with just 1,200 employees (so a population of 972), we would still need a sample size of 276 (28% of the population) to reach a statistically valid result.

If we wanted a 99% confidence level for the same test, we would need a sample size of 623.

If instead we wanted a tighter margin of error, say 2.0% (at the 95% confidence level), we would need a much larger sample of 1,926.
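These sample sizes come from a standard calculation (Cochran's formula with a finite-population correction, using the most conservative proportion p = 0.5, and z ≈ 1.96 for 95% confidence or z ≈ 2.58 for 99%). A short Python sketch:

```python
import math

def sample_size(population, z=1.96, margin=0.05, p=0.5):
    # Cochran's formula for an unlimited population...
    n0 = (z ** 2) * p * (1 - p) / margin ** 2
    # ...then apply the finite-population correction and round up.
    return math.ceil(n0 / (1 + n0 / population))

print(sample_size(9720))               # 370  (95% confidence, 5% margin)
print(sample_size(972))                # 276  (same settings, smaller company)
print(sample_size(9720, z=2.58))       # 623  (99% confidence)
print(sample_size(9720, margin=0.02))  # 1926 (95% confidence, 2% margin)
```

Any standard online sample-size calculator applies the same formula, so you don't need to code this yourself.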

Now, it's important to note that this sample size is not the size of the list you need to email to; it is the number of actual data samples we need. To figure out the list size, we need to take reach (open rate) and timing into consideration.

To get at least 370 opens, given that average open rate of 81%, we need to send to more people than the sample size itself: 370 / 0.81 ≈ 457, so a list size of about 457 people.

It's also very important that your sample be truly random; otherwise you cannot generalize your results to the whole population or rely on your margin of error. If you have a global population of all employees but only draw samples from your U.S. location (or from your own time zone), that flaw invalidates your results.

For this From test example, it would be a good idea to perform the test more than once, to make sure the result of any single test isn't an outlier. You could test three different sample sets over three rounds of email messages, then look at the overall results to decide. Another sampling method for this type of longer-term testing is to simply split your list in half and test each half.

For more time-sensitive tests, where you test samples first and then, after you have collected enough data, send the remainder of the audience the winning variation, you will have to adjust your sample size for timing.

We know from our benchmark data that 80% of your email results will come in within 4 hours of your send going out, and that 80% of those (64% of the total) will come in within ninety minutes. To get 370 responses in time, we divide by the open rate and by the fraction of responses we can wait for: if we can wait four hours, we need a sample list of 370 / (0.81 × 0.80) ≈ 571 people; if we can only wait ninety minutes, we need 370 / (0.81 × 0.64) ≈ 714.
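The list-size adjustment can be sketched the same way: divide the required number of samples by the open rate and by the fraction of eventual responses you can wait for (the helper name here is my own):

```python
import math

def list_size(needed_samples, open_rate, window_fraction=1.0):
    # Inflate the required sample count to a send-list size: only
    # open_rate of recipients respond at all, and only window_fraction
    # of those responses arrive within the measurement window.
    return math.ceil(needed_samples / (open_rate * window_fraction))

print(list_size(370, 0.81))        # 457 -- no time limit
print(list_size(370, 0.81, 0.80))  # 571 -- results needed within four hours
print(list_size(370, 0.81, 0.64))  # 714 -- results needed within ninety minutes
```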

Note that the sample sizes apply to both A and B. So you need two randomized lists, A and B, of sufficient size to generate the statistically valid sample size.

Measurement

Now use your email analytics reporting tools to compare your metrics, and see if your hypothesis proves correct or not.

It is very important to note that if the difference between your test results is less than the combined margin of error, you cannot draw a conclusion. In this case, with a 5% margin on each sample, if the difference in attention rate between A and B was less than 10%, no conclusion should be drawn. To get there, you would need a much larger sample size, which allows a smaller margin of error.

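As a crude decision rule, this check looks like the following (a proper two-proportion significance test is more precise, but this captures the rule of thumb above; the function name is illustrative):

```python
def is_conclusive(rate_a, rate_b, margin_of_error):
    # Each group's measured rate carries its own +/- margin, so the
    # observed difference must exceed the combined (2x) margin before
    # we draw a conclusion.
    return abs(rate_a - rate_b) > 2 * margin_of_error

print(is_conclusive(0.40, 0.55, 0.05))  # True  -- a 15-point gap beats the 10% combined margin
print(is_conclusive(0.44, 0.50, 0.05))  # False -- a 6-point gap is inside the noise
```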
A/B testing is a scientific process and it takes some math, time, and diligence to do well. A slapdash approach won’t necessarily help you improve your internal email communications, but if you do it the right way, it will pay off.