Why is your split test sample size important?

Do a split test – 50% to one group, 50% to the other. Use the same subject line, creative and everything for each split. Click launch.

If the world were perfectly predictable, the results would be exactly the same, because you’re sending the exact same thing to the entire group. But guess what: the results won’t be the same! Groups A and B will have different results.

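This is because of random variance. In any given sample group, chance alone shifts who happens to open and who doesn’t. If your sample size is too small, that noise can be bigger than any real difference between subject lines, and you will face false positives: “winners” that only won by luck.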
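To see that noise for yourself, here is a minimal simulation sketch. It assumes every recipient opens the email independently with the same 20% probability; that open rate, the group sizes and the function names are all illustrative assumptions, not Phrasee figures or Phrasee code.

```python
import random

def simulate_aa_test(n_per_group, open_rate, rng):
    """Send the identical email to two groups; return each group's observed open rate."""
    opens_a = sum(rng.random() < open_rate for _ in range(n_per_group))
    opens_b = sum(rng.random() < open_rate for _ in range(n_per_group))
    return opens_a / n_per_group, opens_b / n_per_group

def mean_abs_gap(n_per_group, open_rate=0.20, trials=100, seed=1):
    """Average absolute gap between groups A and B over many identical tests."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    gaps = []
    for _ in range(trials):
        rate_a, rate_b = simulate_aa_test(n_per_group, open_rate, rng)
        gaps.append(abs(rate_a - rate_b))
    return sum(gaps) / trials

# Same email, same true open rate: only the group size changes.
for n in (200, 5000):
    print(f"{n} per group: average gap between A and B = {mean_abs_gap(n):.4f}")
```

Both groups get exactly the same email, so any gap between them is pure chance. The gap shrinks as the groups grow, which is the whole argument for getting the sample size right.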

For example – we’ve heard of people running 15 splits to a sample size of 500 each… and thinking the “winner” is the true winner. If you’re doing this, you’re using bad statistics… and making questionable decisions.
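A rough way to see why many small splits are risky: the more comparisons you make, the more likely at least one looks significant by luck. The sketch below assumes a conventional alpha of 0.05 and treats the comparisons as independent, which is only an approximation when every variant is compared against the same baseline. The function name is illustrative.

```python
def family_wise_error_rate(alpha, comparisons):
    """Chance of at least one false positive across independent comparisons."""
    return 1 - (1 - alpha) ** comparisons

splits = 15
comparisons = splits - 1  # assumption: each variant is compared against one baseline
alpha = 0.05              # assumption: a conventional significance level

rate = family_wise_error_rate(alpha, comparisons)
print(f"Chance of at least one false 'winner': {rate:.0%}")
```

Under these assumptions, a 15-way split has roughly a coin-flip chance of crowning a winner that isn’t really better, even before the small sample of 500 per split is taken into account.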

How Phrasee calculates your sample sizes

There are several factors that impact the power of an analysis, such as:

Defining good hypotheses

Determining test variables

Controlling other sources of variance (where possible)

And of course, ensuring you’re using the correct sample size to learn as much as possible as quickly as possible.

You want to maximise the statistical power of your split tests… and Phrasee does this for you.

Determining your effect size

First, we need to estimate the effect size: how big a difference in results we would need to see before calling a given subject line a “success” rather than a “failure”.

Having run thousands of split tests for customers, Phrasee knows how to do this.

First, we use the global Phrasee data set to predict the likely effect size. Then, we calculate the smallest effect size we consider to be relevant. Lastly, we insert a small level of randomness to control for experimental bias.
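To make “effect size” concrete, here are two common ways of expressing the difference between two open rates. This article does not say which measure Phrasee uses, so treat this as a general illustration: the absolute lift, and Cohen’s h, a standard effect size for comparing two proportions. The open rates are made-up numbers.

```python
import math

def absolute_lift(baseline_rate, variant_rate):
    """Difference in open rate, in absolute terms (0.02 means two percentage points)."""
    return variant_rate - baseline_rate

def cohens_h(baseline_rate, variant_rate):
    """Cohen's h: difference between arcsine-transformed proportions."""
    return 2 * math.asin(math.sqrt(variant_rate)) - 2 * math.asin(math.sqrt(baseline_rate))

baseline, variant = 0.20, 0.22  # illustrative open rates, not real data
print(f"absolute lift: {absolute_lift(baseline, variant):.3f}")
print(f"Cohen's h:     {cohens_h(baseline, variant):.3f}")
```

The smaller the effect you want to be able to detect, the more recipients each split needs.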

Calculating your sample size

Phrasee then creates a test family using the t-test and sample size estimation for correlation coefficients.

Then, we calculate the appropriate alpha level, that is, the probability of rejecting a null hypothesis that is actually true (a false positive). This ranges from 0.009 for intricate tests to 0.1 for fundamental tests. Lastly, we set a statistical power level: the probability that the test detects a real difference of the chosen effect size when one actually exists.
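To show how effect size, alpha and power combine into a sample size, here is a minimal sketch using the textbook normal-approximation formula for comparing two proportions. This is a stand-in for illustration, not Phrasee’s engine, which the article describes as t-test based. The open rates, the default alpha of 0.05 and the default power of 0.80 are assumptions.

```python
import math
from statistics import NormalDist

def sample_size_per_split(baseline_rate, variant_rate, alpha=0.05, power=0.80):
    """Recipients needed per split for a two-sided, two-proportion comparison."""
    if baseline_rate == variant_rate:
        raise ValueError("The two rates must differ; a zero effect cannot be detected.")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the false-positive rate
    z_power = NormalDist().inv_cdf(power)          # critical value for the desired power
    variance = baseline_rate * (1 - baseline_rate) + variant_rate * (1 - variant_rate)
    n = (z_alpha + z_power) ** 2 * variance / (variant_rate - baseline_rate) ** 2
    return math.ceil(n)

# Detecting a lift from a 20% to a 22% open rate (illustrative numbers).
for alpha in (0.1, 0.05, 0.009):
    n = sample_size_per_split(0.20, 0.22, alpha=alpha)
    print(f"alpha={alpha}: about {n} recipients per split")
```

A stricter alpha, higher power or a smaller effect size each push the required sample size up.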

This gives you your number of splits and sample size

We use a maximum of 30% of your overall email list as a test group. Sometimes we’ll use a lot, sometimes not so much: it all depends on what our statistical engine requires.

You then send out your generated subject lines to samples of this size… then send whichever subject line wins to the remaining audience.
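The arithmetic behind the 30% cap can be sketched as a simple check. The list size, number of splits and per-split sample size below are made-up inputs, and the function and field names are hypothetical. The sketch only reports whether a plan fits under the cap; what Phrasee does when a plan does not fit is not described here.

```python
from dataclasses import dataclass

@dataclass
class SplitPlan:
    test_total: int          # recipients used across all splits
    cap: int                 # maximum recipients allowed in the test group
    fits_within_cap: bool    # True if the test stays within the cap
    remaining_audience: int  # recipients left to receive the winning subject line

def plan_split_test(list_size, n_splits, n_per_split, max_test_percent=30):
    """Check a proposed split test against the maximum test-group share."""
    test_total = n_splits * n_per_split
    cap = list_size * max_test_percent // 100  # integer maths avoids float rounding
    remaining = max(list_size - test_total, 0)
    return SplitPlan(test_total, cap, test_total <= cap, remaining)

# Illustrative inputs: a 100,000-person list, 4 subject lines, 6,507 recipients each.
print(plan_split_test(100_000, n_splits=4, n_per_split=6_507))
```

With these example numbers the test uses 26,028 recipients, which is under the 30,000 cap, leaving 73,972 people to receive the winner.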