First: Know What You Want To Know

We'll turn to Popeye as an example.

Remember your elementary school science class? We return there for the bedrock of experimentation: the hypothesis. A hypothesis puts onto paper that gut feeling of how you think the results will pan out. When we run tests, we need not one, but two hypotheses:

The null hypothesis, which describes the status quo: what we assume to be true until the data says otherwise. For instance, if you wanted to test whether or not eating more spinach made you stronger like Popeye, your null hypothesis would read: "Consuming spinach had no effect on strength."

The alternative hypothesis, which is why you're testing in the first place. In the same example, your alternative hypothesis would read: "Consuming spinach had an effect on strength." The alternative must be the exact opposite of the null.

Notice that neither hypothesis mentions stronger or weaker. That's because even though you might think spinach makes you stronger, you would still want to test in both directions, known as a two-sided (or two-tailed) test. The name refers to the two tails of the normal distribution curve, where results far above or far below "no effect" would land. This approach prevents bias and entertains all possibilities you might encounter, up or down, regardless of your hunch.

If you're really sure one way or the other, or the other half isn't relevant to what you want to know, then you would perform a one-sided test. It's done more or less the same way, but you're only looking at one possible outcome. In my Popeye scenario, I'd only be looking at whether I became stronger, not bothering to see whether I was made weaker by the consumption of leafy greens.
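To make the difference between the two kinds of test concrete, here is a minimal Python sketch. It assumes the spinach experiment has already been boiled down to a single standardized score (a z statistic) that follows a standard normal curve if spinach does nothing. The function name and the example score of 1.8 are made up for illustration.

```python
from statistics import NormalDist

def tail_probabilities(z: float) -> dict:
    """Chance of a result at least this extreme if spinach has no effect.

    Assumes z is a standardized test statistic that follows a standard
    normal distribution when the null hypothesis is true.
    """
    std_normal = NormalDist()  # mean 0, standard deviation 1
    stronger = 1.0 - std_normal.cdf(z)      # one-sided: only "stronger" counts
    weaker = std_normal.cdf(z)              # one-sided: only "weaker" counts
    either = 2.0 * min(stronger, weaker)    # two-sided: both tails count
    return {"stronger": stronger, "weaker": weaker, "either": either}

# Hypothetical result: the spinach group scored 1.8 standard errors higher.
result = tail_probabilities(1.8)
print(result)
```

With a positive score, the two-sided probability is exactly double the one-sided "stronger" probability. That is why the same data can look convincing in a one-sided test and fall short in a two-sided one.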

Creating The Best Customer Experience

In digital marketing, we run one kind of test all the time: the A/B test. In web development, things move fast, and small, incremental changes can make a big difference to the user experience. Whether you're measuring the success of an ad campaign or the smallest design change to an arrow, running an A/B test tells you which version would "win."

An A/B test changes one specific feature, like a new send button, and ONLY one. It shows a random sample of web visitors the treatment (the newly proposed, more awesome button) and a second random sample of web visitors the control (the current button). From there, you look at the metrics for each of the samples, such as views, bounce rates, units sold, or total revenue in ecommerce, and determine whether or not the change made a difference.
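The split described above could be sketched in Python as follows. This is one common approach rather than the only one: it hashes a visitor ID so that each visitor always lands in the same group. The experiment name, the ID format, and the 50/50 split are all assumptions made for the example.

```python
import hashlib

def assign_group(visitor_id: str, experiment: str = "send-button") -> str:
    """Place a visitor in 'treatment' or 'control', about 50/50.

    Hashing the (hypothetical) experiment name with the visitor ID gives a
    pseudo-random bucket that stays the same on every visit.
    """
    key = f"{experiment}:{visitor_id}".encode("utf-8")
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % 100
    return "treatment" if bucket < 50 else "control"

def conversion_rate(conversions: int, visitors: int) -> float:
    """Share of visitors in a group who completed the goal action."""
    return conversions / visitors if visitors else 0.0

# Illustrative usage with made-up visitor IDs.
groups = [assign_group(f"visitor-{i}") for i in range(10000)]
print(groups.count("treatment"), groups.count("control"))
```

Keeping the assignment stable matters because a visitor who sees the new button on Monday and the old one on Tuesday would muddy both samples.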

Making Sure Your Results Matter

To know which one "won," you need to look not just at the metrics, but at the significance of those metrics. We need to know how likely our results would be, assuming the null hypothesis is true. If it's highly unlikely, then we can reject the null hypothesis because our result is too far from a likely outcome.

We use something called a p-value to mathematically measure it. The p-value is the probability of seeing a result at least as extreme as ours if the null hypothesis were true. We compare it against a threshold chosen before the test, called the significance level. When we want to be 95% confident in our sample (otherwise known as the confidence level), the significance level is .05. If the p-value is less than .05, there's strong enough evidence to reject the null hypothesis. If it's greater than .05, we would fail to reject the null hypothesis. We can never accept the null, only fail to reject it and live to test another day.

Note this image shows a one-sided, not two-sided test.
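The decision rule above can be sketched in a few lines of Python. This example assumes the metric is a conversion rate and uses a pooled two-proportion z-test, which is one standard choice for comparing two rates but not the only one. Unlike the image, it is two-sided. The visitor and conversion counts are invented for illustration.

```python
from math import sqrt
from statistics import NormalDist

ALPHA = 0.05  # significance level that pairs with a 95% confidence level

def two_proportion_p_value(conv_control, n_control, conv_treatment, n_treatment):
    """Two-sided p-value for a difference in conversion rates (pooled z-test)."""
    rate_control = conv_control / n_control
    rate_treatment = conv_treatment / n_treatment
    # Under the null, both groups share one rate, so pool the data to estimate it.
    pooled = (conv_control + conv_treatment) / (n_control + n_treatment)
    std_error = sqrt(pooled * (1 - pooled) * (1 / n_control + 1 / n_treatment))
    if std_error == 0:
        return 1.0  # no variation at all, so no evidence of a difference
    z = (rate_treatment - rate_control) / std_error
    return 2 * (1 - NormalDist().cdf(abs(z)))

def decide(p_value, alpha=ALPHA):
    """We reject or fail to reject the null. We never accept it."""
    return "reject the null" if p_value < alpha else "fail to reject the null"

# Made-up numbers: 200 of 5,000 control visitors converted vs. 250 of 5,000.
p = two_proportion_p_value(200, 5000, 250, 5000)
verdict = decide(p)
print(round(p, 4), verdict)
```

Note that the threshold is fixed before looking at the data. Picking it afterward would defeat the purpose.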

Requiring enough evidence makes sure that you're not taking the numbers for granted. Performing test after test without really knowing if your results are correct and significant will only lead you in the wrong direction. We generally see two types of errors in these tests:

Type I errors result from rejecting the null hypothesis, even though the null hypothesis is true. For example, say a construction company wants to know if a bridge is safe for a 2-ton truck to drive on, and the null hypothesis is that the bridge is safe. A type I error would mean the company would tell the driver it was unsafe, even though it's perfectly fine.

Type II errors result from failing to reject the null hypothesis, even though the null hypothesis is false. In cases like this one, type II errors are much scarier in real life. In the same example above, this would mean the company would tell the driver it was perfectly safe, even though it wasn't. Cut to the bridge collapsing.
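Both kinds of error can be seen in a small simulation. The Python sketch below runs many pretend A/B tests with known true conversion rates and counts how often a two-sided test at the .05 level rejects the null. The rates, sample sizes, and function names are assumptions chosen for illustration, not figures from a real experiment.

```python
import random
from math import sqrt
from statistics import NormalDist

def p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_error = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_error == 0:
        return 1.0
    z = (conv_b / n_b - conv_a / n_a) / std_error
    return 2 * (1 - NormalDist().cdf(abs(z)))

def rejection_rate(rate_a, rate_b, visitors=500, trials=1000, alpha=0.05, seed=42):
    """Fraction of simulated tests that reject the null hypothesis."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    rejections = 0
    for _ in range(trials):
        conv_a = sum(rng.random() < rate_a for _ in range(visitors))
        conv_b = sum(rng.random() < rate_b for _ in range(visitors))
        if p_value(conv_a, visitors, conv_b, visitors) < alpha:
            rejections += 1
    return rejections / trials

# Null is true (no real difference): every rejection is a type I error.
type_one = rejection_rate(0.05, 0.05)
# Null is false (a real but small lift): every non-rejection is a type II error.
type_two = 1 - rejection_rate(0.05, 0.06)
print(type_one, type_two)
```

When the two groups truly behave the same, the type I rate should land near the chosen significance level. When the real lift is small and the samples are small, the test misses it most of the time. That is the type II error at work, and it is the reason sample size matters.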

Test, Test, Test Again

So, we keep testing and testing and testing. Too few tests mean we're relying too much on intuition, and tests spaced too far apart assume that our customers' expectations haven't changed, when that might not be the case.

Just as it's not recommended to only follow our gut, we also need to cast a critical eye on the A/B test itself, particularly if we run multiple tests at once. When tests overlap, the same visitors can be exposed to more than one change, so it's possible to see additive results that don't exist, or to have one test skew the results of another.

The most important thing to keep in mind? Data-driven analysis keeps us informed and grounded in reality. If you're not sure, test it. Being willing to fail--and encouraging it--makes sure that our business decisions make sense.