
How and when to calculate statistical significance

Luckily for marketers, there is no international statute against misrepresenting data in a business setting. If there were, many could be found guilty of such intolerable crimes as fuzzy math, data-dredging, and the particularly pernicious sin of p-hacking, the last of which a Wharton study found 57 percent of marketers frequently and unknowingly commit. It’s a similar story for product, support, and analytics teams the world over. Few professionals assess the statistical accuracy of their studies.

Not knowing whether data is valid renders data useless. And worse: It inspires teams to think they are driven by data when they aren’t. Misled teams are less likely to double-check themselves and more likely to only discover errors after they’ve committed them.

What keeps teams from checking the statistical significance of their results? For one, the method itself could use marketing help. The official definition is “a result that is unlikely to have occurred given a null hypothesis,” and it’s typically found alongside riveting descriptions of “parametric tests” in such page-turning classics as the 1925 Statistical Methods for Research Workers.

For teams that don’t have time to return to school for another degree, here’s a plain language version and guide.

What is a statistical significance test?

A statistical significance test measures whether test results from a sample population are likely to apply to the entire population. Teams can use it to determine whether they should trust the results of an A/B test. For example, if they learn that 20 percent of a test segment of their subscriber base loved an email, they can verify that the result was significant before sending the email to the entire list.

Businesses today run lots of tests and generate lots of data, but they also must demonstrate the validity of their results. Without the presumption of validity, numbers are dangerously fungible. Or as the line Mark Twain popularized has it, “There are three kinds of lies: lies, damned lies, and statistics.”

Any team that wants to see an example of questionable statistics need only run a Google search for “the best email subject lines.” They’ll find pages of definitive-sounding studies, none of which explain their methodology, cite their demographics (typically customers of just one company), or calculate their own statistical significance. These are not to be trusted.

Any team that conducts A/B testing should do so for their own audience, and must confirm that any relationships they discover are valid: for example, whether a particular headline really influences email open rates, or whether the color of a call to action (CTA) button actually increases clicks.

Basic errors and how to avoid them

Testing is important because any time teams test a sample of a larger population, there’s always a chance that the sample only includes, say, diehard fans of the color orange, whereas the broader population’s tastes resemble a rainbow. If a test result is statistically significant, it means the likelihood that the result is a fluke of the sample (one that happened to contain only orange lovers) is lower than a predetermined threshold, almost always five percent. That is to say, the odds are low and the result is probably valid. (If the teams run a test on their entire population, there’s no need to test statistical significance.)

There are a wide variety of biases to consider when assessing a statistical test. During World War II, the statistician Abraham Wald had something resembling the graphic below, which showed where planes returning from battle had been hit, and had to determine where the planes should receive more armor.

What to do? The seemingly logical answer, to place more armor where the planes have been hit, is in fact the wrong one. Why? These are the planes that returned. The unseen part of the population, the planes that did not make it back, are the ones that were hit in the spots that appear unmarked on the graphic. This effect is called survivorship bias, and it is one of many statistical biases to consider in assessing both the design and results of an experiment. Consider what is pushing and pulling a sample in ways that make it less representative of the whole population: Is it overwhelmingly biased toward one geographic region? Is it reliant on people responding to surveys? Is the experiment affected by the mere fact that participants know it is occurring? Answers to these sorts of questions should inform your reaction to results.

How to calculate statistical significance

The most common way to test statistical significance is Pearson’s chi-squared test, so called because it was devised by a man named Pearson, “chi” (χ) is a letter of the Greek alphabet that resembles an “x,” and the test asks users to square the differences in their data to accentuate them.

Chi-squared tests are used for categorical data sets, or data that is counted in discrete categories rather than measured on a spectrum. For example, marketing conversions, where visitors either convert or don’t convert (they’re either a one or a zero) and can’t fall somewhere in between.
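As a rough illustration of what categorical data looks like in practice, the sketch below tallies a made-up visit log into counts per page and outcome. The `visits` list is hypothetical and exists only to show the shape of the data; a real test would read these records from an analytics tool.

```python
from collections import Counter

# Hypothetical visit log: (page shown, outcome), with a conversion coded as 1
# and a non-conversion coded as 0. There is no in-between value.
visits = [
    ("A", 1), ("A", 0), ("A", 0), ("B", 0),
    ("B", 1), ("A", 1), ("B", 0), ("B", 0),
]

# Count how many visits fall into each (page, outcome) category.
counts = Counter(visits)

for page in ("A", "B"):
    converted = counts[(page, 1)]
    not_converted = counts[(page, 0)]
    print(f"Page {page}: {converted} converted, {not_converted} did not")
```

These four counts are exactly the kind of numbers a chi-squared test works with.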

A test is deemed statistically significant if there’s a very low probability the result could have occurred by chance. That is, if the probability (p) is lower than a threshold the team selects ahead of time (ɑ), also called the alpha.

Statistically significant = Probability (p) < Threshold (ɑ)

There are six steps to run an A/B test and then apply the chi-squared test:

Step 1: State a null hypothesis

Teams first state the null hypothesis for their A/B test. The null hypothesis is the default assumption that there is no real difference between the variations being tested. So the null hypothesis could be something like “Prospects convert at the same rate on our old landing page and our new one.” The test will either reject the null hypothesis or fail to reject it. It can never prove the null hypothesis true.

Step 2: State an alternative hypothesis

Teams state a hypothesis they hope to prove. For example, “Customers prefer our new landing page.”

Step 3: Set a threshold

Teams determine a percentage threshold below which the result will be considered significant, known as their ɑ (the Greek letter alpha). The lower ɑ is, the more stringent the test. A threshold of five percent is quite strict. Another way to think about it is that if the two variations were truly identical, there would be only a one in 20 chance of the test showing a significant difference by mistake. A higher threshold for error might be more suitable for tests in businesses. However, it is imperative to choose the threshold before the experiment, to avoid letting the desired outcome determine what counts as significant.
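To make the one-in-20 idea concrete, here is a minimal simulation sketch. It assumes two landing pages that truly convert at the same rate, runs many pretend A/B tests, and counts how often a chi-squared test at the five percent threshold flags a difference anyway. The function names, the 20 percent conversion rate, and the sample sizes are all illustrative assumptions, not figures from a real study.

```python
import random

# Chi-squared cutoff for a five percent threshold on a two-by-two table
# (one degree of freedom), as found in standard distribution tables.
CRITICAL_VALUE = 3.841


def chi_squared_2x2(conv_a, n_a, conv_b, n_b):
    """Chi-squared statistic for conversions on two pages."""
    observed = [[n_a - conv_a, n_b - conv_b], [conv_a, conv_b]]
    row_totals = [sum(row) for row in observed]
    col_totals = [n_a, n_b]
    total = n_a + n_b
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            if expected > 0:  # skip empty rows to avoid dividing by zero
                stat += (expected - observed[i][j]) ** 2 / expected
    return stat


def false_positive_rate(true_rate=0.2, visitors=500, experiments=1000, seed=42):
    """Share of simulated tests that look significant despite identical pages."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(experiments):
        # Both pages draw from the same true conversion rate.
        conv_a = sum(rng.random() < true_rate for _ in range(visitors))
        conv_b = sum(rng.random() < true_rate for _ in range(visitors))
        if chi_squared_2x2(conv_a, visitors, conv_b, visitors) > CRITICAL_VALUE:
            hits += 1
    return hits / experiments


rate = false_positive_rate()
print(f"Share of identical-page tests flagged as significant: {rate:.1%}")
```

The printed share should land in the neighborhood of five percent, which is what the threshold promises: a stricter ɑ would flag fewer of these false alarms, a looser one more.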

Step 4: Run the test

Teams run their A/B test. For example, they test a new variation of a landing page against the old version, and record the results. Below are sample results where new landing page A appears to have outperformed the old landing page B, in line with the alternative hypothesis. The chi-squared test in the next step checks whether that difference is significant.

To recreate the chart below, teams record the results of their landing page test, then add their results across rows and columns.

OBSERVED RESULTS

| Converted? | Landing Page A | Landing Page B | TOTAL |
|---|---|---|---|
| No | 7,611 | 7,850 | 15,461 |
| Yes | 2,345 | 1,999 | 4,344 |
| TOTAL | 9,956 | 9,849 | 19,805 |
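The row and column totals can be tallied in a few lines. The sketch below uses the observed counts from this example; the variable names are illustrative, and the per-page conversion rates are an extra convenience rather than part of the chi-squared method itself.

```python
# Observed counts from the landing page example.
observed = {
    "A": {"No": 7611, "Yes": 2345},
    "B": {"No": 7850, "Yes": 1999},
}

# Column totals: all visitors who saw each page.
col_totals = {page: sum(cells.values()) for page, cells in observed.items()}

# Row totals: all visitors with each outcome, across both pages.
row_totals = {
    outcome: sum(observed[page][outcome] for page in observed)
    for outcome in ("No", "Yes")
}

grand_total = sum(col_totals.values())

# Conversion rate per page, for a quick look at the size of the gap.
rates = {page: observed[page]["Yes"] / col_totals[page] for page in observed}

print("Column totals:", col_totals)
print("Row totals:", row_totals)
print("Grand total:", grand_total)
for page, page_rate in rates.items():
    print(f"Page {page} conversion rate: {page_rate:.1%}")
```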

Step 5: Run the chi-squared test

The chi-squared test compares the observed results from the A/B test to the expected results, or the numbers the team could have expected to see if there were no differences between the two landing pages. In this example, the expected overall conversion rate is 22 percent—the total conversions for both landing pages (4,344) divided by the total visits for both pages (19,805).

Teams then replace the observed numbers in the four inner cells with the expected numbers. To calculate each expected value, teams multiply the column total by the row total and divide by the total visitors.

Expected = (column total * row total) / total visitors

Expected = (9,956 * 15,461) / 19,805

= 7,772

Repeat the calculation for each of the four boxes. The resulting chart is a view of the numbers that the team would have recorded had both landing pages performed identically. All totals remain the same.

EXPECTED RESULTS

| Converted? | Landing Page A | Landing Page B | TOTAL |
|---|---|---|---|
| No | 7,772 | 7,689 | 15,461 |
| Yes | 2,184 | 2,160 | 4,344 |
| TOTAL | 9,956 | 9,849 | 19,805 |
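The expected values can be reproduced with a short script. This is a minimal sketch using the observed counts from this example; the variable names are illustrative.

```python
# Observed counts: rows are "No" and "Yes", columns are pages A and B.
observed = [
    [7611, 7850],  # did not convert
    [2345, 1999],  # converted
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected = (column total * row total) / total visitors, for each cell.
expected = [
    [row_total * col_total / grand_total for col_total in col_totals]
    for row_total in row_totals
]

for label, row in zip(("No", "Yes"), expected):
    print(label, [round(value) for value in row])
```

Rounded to whole visitors, the output matches the expected results for this example: 7,772 and 7,689 for "No," 2,184 and 2,160 for "Yes."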

The team then uses the chi-squared method to check whether the observed results are significantly different from the expected results. For each of the four inner cells, teams subtract the observed from the expected, square the result, and divide that result by the expected.

Chi-squared = (expected – observed)² / expected

Chi-squared = (7,772 – 7,611)² / 7,772

Chi-squared = 3.34

CHI-SQUARED

| Converted? | Landing Page A | Landing Page B | TOTAL |
|---|---|---|---|
| No | 3.34 | 3.37 | 6.71 |
| Yes | 11.87 | 12.00 | 23.87 |
| TOTAL | 15.21 | 15.37 | 30.58 |
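The per-cell calculation and the overall total can be checked with the sketch below. It assumes the expected values rounded to whole visitors, as this example does; feeding in unrounded expected values would give a slightly higher total.

```python
# Observed and expected counts: rows are "No" and "Yes", columns are pages A and B.
observed = [
    [7611, 7850],
    [2345, 1999],
]
# Expected values rounded to whole visitors, as in the example.
expected = [
    [7772, 7689],
    [2184, 2160],
]

# (expected - observed) squared, divided by expected, for each cell.
contributions = [
    [(e - o) ** 2 / e for o, e in zip(obs_row, exp_row)]
    for obs_row, exp_row in zip(observed, expected)
]

# The chi-squared statistic is the sum of all four cells.
chi_squared = sum(sum(row) for row in contributions)

for label, row in zip(("No", "Yes"), contributions):
    print(label, [round(value, 2) for value in row])
print("Chi-squared total:", round(chi_squared, 2))
```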

Now the team can complete the test. The overall total in the chi-squared table above is the chi-squared statistic. If it is greater than its corresponding critical value on a chi-squared distribution table for the five percent threshold, the p-value is below five percent and the team has discovered a statistically significant relationship.

In this example, the chi-squared statistic of 30.58 is greater than the five percent critical value of 3.84 for a two-by-two table (one degree of freedom). Thus, the results are statistically significant.
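For teams that would rather compute the probability directly than look it up in a table, the sketch below converts a chi-squared statistic into a p-value. It relies on the fact that, for a two-by-two table (one degree of freedom), the chi-squared tail probability can be written with the complementary error function from Python's standard library. The function name is illustrative, and the shortcut does not apply to larger tables.

```python
import math


def chi_squared_p_value_1df(statistic):
    """P-value for a chi-squared statistic with one degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2))


ALPHA = 0.05            # threshold chosen before the experiment
CRITICAL_VALUE = 3.841  # table value for alpha = 0.05, one degree of freedom
statistic = 30.58       # chi-squared total from the landing page example

p_value = chi_squared_p_value_1df(statistic)
significant = p_value < ALPHA

print(f"p-value: {p_value:.2e}")
print("Statistically significant:", significant)

# Comparing the statistic to the critical value gives the same verdict.
print("Exceeds critical value:", statistic > CRITICAL_VALUE)
```

Teams that have SciPy installed could likely get a comparable answer from `scipy.stats.chi2_contingency`. By default it applies a continuity correction to two-by-two tables, so its statistic may differ slightly from the hand calculation.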

Step 6: Apply the results

If a team determines that the results of its A/B test are statistically significant, they should feel confident applying what they’ve learned to their entire user population. In this example, marketers could use landing page A with their entire audience. Had the results not been statistically significant, the team could have instead tested again with a larger sample.

When not to use significance testing

Significance testing needn’t be applied to every test. Unless the team can calculate it quickly, they should save it for instances where knowing whether test results are valid saves them significant time, effort, money, or credibility. For example, when an improperly designed feature would be difficult to remove later, or if a marketing campaign to the company’s entire subscriber list could devastate users’ trust. But if the downside is inconsequential, significance tests may only slow progress.

“When decisions are low-cost or reversible, just try it. Most things are reversible anyway,” says serial entrepreneur and CTO of Helpful.com Farhan Thawar. “Trying and failing is learning. But if there are consequences you can’t reverse—or as Jeff Bezos puts it, doors you can’t walk back through—then test.”

Best paired with sound judgement

As the old mathematician’s aphorism goes, all models are wrong, but some are useful. Statistical significance isn’t a bed of coals to rake colleagues across when they show up to a meeting chattering excitedly about numbers they’ve just run. Nor is it a commandment. It’s simply a tool for reducing errors and making decisions with greater confidence.

There are also often more important criteria for testing a study’s validity than statistical significance, such as ensuring the data wasn’t contaminated by the tester’s biases. As Tom Redman, author of Data Driven, told the Harvard Business Review, the important question is, “Does the result stand up in the market, if only for a brief period of time? I’m all for using statistics, but always wed it with good judgment.”