Marketing Experiment: Learn from our split testing mistakes

We recently held a subject line experiment contest with the fine folks at Copyblogger. On Tuesday, we published the results of this experiment on the MarketingExperiments Blog to hopefully help you improve your own A/B testing.

While Tuesday’s post was exciting, today’s post is a word of caution.

Essentially, I’ll discuss what we (intentionally) did wrong so you can learn from our mistakes. By running a split test publicly for teaching purposes, we necessarily introduced some validity threats.

Split tests should run in a controlled environment

An A/B test should be conducted in a controlled environment so that the only variable affecting the key performance indicators (KPIs) is the one being tested.

Since the purpose of this test was educational, to teach you about A/B testing, we did not isolate these variables in a controlled environment. That tradeoff introduced some validity threats.

For example, the power of an A/B test (over a focus group or opinion survey, for example) is that your potential customers do not know they are being tested.

They are not giving you their opinion; they are making real decisions with real money in real-world conditions. However, by writing about this test so publicly before it ran, we likely primed some of the email's recipients, who then knew they were being tested.

A hypothesis should be created before the experiment is designed

Here is the hypothesis for the split test, as written by Rebecca Strally, Optimization Manager, MECLABS:

By comparing the clickthrough and open rates generated by five different subject lines based on different elements of incentive and value (Early Bird, $300, urgency, Summit and Vegas), we will determine which incentive and value congruence between subject line and email body copy is most impactful.
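The comparison the hypothesis describes could be analyzed, for example, with a chi-square test of independence across the five subject-line treatments. The sketch below is purely illustrative: every count is invented for the example and is not the experiment's actual data.

```python
# Hypothetical illustration: testing whether open rates differ across five
# subject-line treatments. All counts below are invented for the example.

# (opens, non-opens) per treatment out of 2,000 sends each -- made-up numbers.
treatments = {
    "Early Bird": (420, 1580),
    "$300":       (455, 1545),
    "Urgency":    (510, 1490),
    "Summit":     (395, 1605),
    "Vegas":      (430, 1570),
}

def chi_square_statistic(table):
    """Pearson chi-square statistic for an r x c contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand
            stat += (observed - expected) ** 2 / expected
    return stat

table = [list(counts) for counts in treatments.values()]
chi2 = chi_square_statistic(table)

# Standard critical value for dof = (5-1)*(2-1) = 4 at the 5% level.
CRITICAL_4DF_05 = 9.488
print(f"chi2 = {chi2:.2f}")
print("Open rates differ significantly across subject lines"
      if chi2 > CRITICAL_4DF_05
      else "No significant difference detected")
```

The same table could also feed `scipy.stats.chi2_contingency` to get an exact p-value; the hand-rolled statistic above just keeps the example dependency-free.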

A hypothesis, when written before you see the final test results, can help ensure you focus on truly learning from your test, not trying to justify results on the back end.

In addition, a hypothesis should be written before the experiment is designed and conducted. The entire point of designing an experiment is to test the hypothesis.

In this case, although we wrote the hypothesis before the test was run, it was created after the experiment was designed because, essentially, we did not design this experiment.

You did.

The MarketingExperiments Blog audience was deeply engaged in the experiment design by writing the subject lines. We couldn’t control that; we could only pick from what you wrote.

In the case of Copyblogger’s audience, we weren’t even involved in picking from its audience’s subject lines.

We did this because we didn’t want to prime you on the front end to write a subject line around anything specific. We wanted your raw, unvarnished take.

However, since the subject lines weren’t written to test a specific hypothesis, they aren’t perfectly crafted to test each value and incentive category.

“Right there was an opportunity for bias,” said Taylor Lightfoot, Data Scientist, MECLABS.

We coded the 10 chosen subject lines as belonging to one of five subject line categories (e.g., urgency, $300, etc.), which may or may not have reflected the intent of the authors who wrote them. This did not affect the outcome of the test, only our interpretations and conclusions.

In order to interpret the results in the way that we did, we assumed the subject lines truly fell into the categories they were assigned.

“Their interpretation is hinged on the categories, so how you define those categories is really how you’re going to interpret them,” Taylor explained. “This inherently creates a bias, but if everyone agrees that the subject lines belong in those categories, that may reduce some of the bias to a degree.”
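Taylor's point, that agreement on the categories reduces the bias, can be quantified. One common measure is Cohen's kappa, which scores how often two raters assign the same category beyond what chance alone would predict. The category codings below are invented for illustration; they are not the actual assignments from our meeting.

```python
# Hypothetical sketch: measuring inter-rater agreement on subject-line
# categories with Cohen's kappa. All labels below are invented.
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    # Chance agreement: probability both raters pick the same category.
    expected = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Invented codings of ten subject lines by two hypothetical raters.
rater_1 = ["urgency", "$300", "urgency", "summit", "vegas",
           "early bird", "$300", "summit", "vegas", "urgency"]
rater_2 = ["urgency", "$300", "vegas", "summit", "vegas",
           "early bird", "$300", "urgency", "vegas", "urgency"]

print(f"kappa = {cohens_kappa(rater_1, rater_2):.2f}")
```

A kappa near 1.0 would suggest the categories are well defined; a low kappa would flag exactly the bias Taylor describes, where interpretation hinges on one person's labels.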

For the subject lines from the MarketingExperiments audience, we held a seven-person meeting to assign the categories, because the final five we tested, one from each category, were chosen from 300 submissions.

I must admit, I alone assigned those categories after receiving the subject lines from Sonia Simone at Copyblogger. Hey, we’re all busy and time is tight, so I chose not to spend the time collaborating on that one. Being busy isn’t the only way time can threaten your split testing.

Time isn’t always on your side when A/B testing

History effects occur when an extraneous variable tied to the passage of time influences the variable you are testing.

In plain English, sometimes the clock and calendar can work against your goal to learn what really works.

In this example, we sent the email we were testing in the middle of the week before Christmas. The top two performing subject lines leveraged urgency. Urgency may have performed so well because of the time of the year, and might not perform as well in January or February.

We can also surmise that marketers who received the email were likely preparing to go on vacation and leave the office for the holidays.

Or maybe, if you’re in e-commerce or retail, you were working on a last-minute holiday push, while your peers in the B2B space were hoping to close Q4 strong.

For these reasons, we would have to be careful assuming that an urgency-related subject line would perform as well during other times of the year.

“This is particularly true considering the winning subject line for both prizes mentioned both Christmas and last-minute sales,” Rebecca said. “I think this would have played a huge role in the open rate win and clearly contributed to the clickthrough rate win.”

You should seek to learn the motivations of individual segments of your list

One last thing you can hopefully gain from our mistakes involves learning more about the types of customers on your list.

We sent this test to a large, unsegmented list because we wanted to test several of your subject lines (10 in total) and still have statistically significant results.

Specific segments of a list may favor different elements of value or incentive, and your goal should be to discover motivations of your audience as granularly as possible through your A/B testing.
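As an illustration of that granular analysis, here is a minimal sketch that breaks open rates out by segment. The segment names, categories, and counts are all invented for the example; the point is only the shape of the analysis.

```python
# Hypothetical sketch: breaking A/B results out by list segment to see whether
# different segments respond to different appeals. All numbers are invented.
results = {
    # segment: {subject_line_category: (opens, sends)}
    "e-commerce": {"urgency": (310, 1000), "$300": (240, 1000)},
    "B2B":        {"urgency": (205, 1000), "$300": (285, 1000)},
}

for segment, variants in results.items():
    for variant, (opens, sends) in variants.items():
        print(f"{segment:10s} {variant:8s} open rate = {opens / sends:.1%}")
    best = max(variants, key=lambda v: variants[v][0] / variants[v][1])
    print(f"{segment}: best-performing category was '{best}'\n")
```

One caveat: slicing a list this way shrinks the sample behind each cell, so each segment-level comparison needs enough sends to remain statistically meaningful on its own.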

If you keep asking the same questions, you’ll keep getting the same answers

I just went through a litany of what we did wrong to help you avoid these mistakes in your A/B tests.

However, we did one major thing right – we collaborated.

At the end of the day, that’s the huge benefit of testing. We all have an intrinsic confirmation bias: a tendency to interpret evidence as supporting the beliefs we already hold, even when it points to the contrary. If you don’t believe me, just look at the world of politics.

By conducting split tests, you prevent yourself from launching a campaign or writing a headline in a vacuum. You’re inviting potential customers to give you feedback and help you learn what really works.

But even before the test’s launch, you must collaborate to design split tests that will give you the information needed to be successful. After all, if you keep asking the same questions, you’ll keep getting the same answers.

So for this experiment, we collaborated with the MarketingExperiments Blog audience. We collaborated with the staff at Copyblogger, as well as their audience. We pulled dozens and dozens of people internally into the test to gain their feedback, help us identify blind spots, and sometimes, just be a devil’s advocate.

To show you another example of that, I also asked Rebecca to be a devil’s advocate for this blog post. She was kind enough to let me publish her feedback and give an important counterpoint to the “mistakes we made” angle.

As part of writing a blog post to teach, it is easy to overstate and overemphasize, which I may have done. Of course, there were many things we did right in this test, and validity threats we did account for, as Rebecca explained:

It seems to imply that our test is not reliable or powerful and despite the issues you list, I would consider these pretty good results. Is it worth mentioning in this post that we did work hard to isolate our variables to remove some of the possible validity threats?

Daniel Burstein, Senior Director of Editorial Content, MECLABS Institute Daniel oversees all editorial content coming from the MarketingExperiments and MarketingSherpa brands while helping to shape the editorial direction for MECLABS – working with our team of reporters to dig for actionable information while serving as an advocate for the audience. Daniel is also a frequent speaker and moderator at live events and on webinars. Previously, he was the main writer powering the MarketingExperiments publishing engine – from Web clinics to Research Journals to the blog. Prior to joining the team, Daniel was Vice President of MindPulse Communications – a boutique communications consultancy specializing in IT clients such as IBM, VMware, and BEA Systems. Daniel has more than 15 years of experience in copywriting, editing, internal communications, sales enablement and field marketing communications.

