You May Be P-Hacking and Don’t Even Know It

P-Hacking is a big problem. It can lead to bad decisions, wasted effort, and misplaced confidence in how your business works.

P-Hacking sounds like something you do to pass a drug test. Actually, it’s something you do to pass a statistical test. “P” refers to the “P” value: the probability that you would see a result at least as strong as the one you observed if nothing real were going on and only random chance was at work. “Hacking” in this case means manipulating, so P-Hacking is manipulating an experiment in order to make the P value look more significant than it really is, so that it looks like you discovered a real effect when in fact there may be nothing there.

It’s the equivalent of smoke and mirrors for statistics nerds. And it’s really, really common. So common that some of the foundational research in the social sciences has turned out to not be true. It’s led to a “Replication Crisis” in some fields, forcing a fresh look at many important experiments.

And as scientific techniques like A/B testing have become more common in the business world, P-Hacking has followed. A recent analysis of thousands of A/B tests run through a commercial platform found convincing evidence of P-Hacking in over half the tests where a little P-Hacking might make the difference between a result that’s declared “significant” and one that’s just noise.

The problem is that P-Hacking is subtle: it’s easy to do without realizing it, hard to detect, and extremely tempting when there’s an incentive to produce results.

One common form of P-Hacking, and the one observed in the recent analysis, is stopping an A/B test early when it shows a positive result. This may seem innocuous, but in reality it distorts the P value and gives you a better chance of hitting your threshold for statistical significance.

Think of it this way: If you consider a P value of less than 0.05 to be “significant” (a common threshold), that means that there’s supposed to be a 5% chance that you would have gotten the same result by random chance if there was actually no difference between your A and B test cases. It’s the equivalent of rolling one of those 20-sided Dungeons and Dragons dice and declaring that “20” means you found something real.

But if you peek at the results of your A/B test early, that’s a little like giving yourself extra rolls of the dice. So Monday you roll 8 and keep the experiment running. Tuesday you roll 12 and keep running. Wednesday you roll 20 and declare that you found something significant and stop. Maybe if you had continued the experiment you would have kept rolling 20 on Thursday and Friday, but maybe not. You don’t know because you stopped the experiment early.

The point is that by taking an early look at the results and deciding to end the test as soon as the results crossed your significance threshold, you’re getting to roll the dice a few more times and increase the odds of showing a “significant” result when in fact there was no effect.
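The effect of peeking can be checked with a small simulation. The sketch below is only an illustration under stated assumptions, not the method used in the analysis mentioned above: it runs many A/A tests (both arms have the same true conversion rate, so any “significant” result is bogus), looks at the P value once per simulated day, and compares how often the test crosses p < 0.05 at any peek with how often it is significant at the planned end. The function names, traffic numbers, conversion rate, and the choice of a pooled two-proportion z-test are all assumptions made for the example.

```python
import math
import random


def p_value_two_proportions(conv_a, n_a, conv_b, n_b):
    """Two-sided P value from a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no conversions (or all conversions): no evidence of a difference
    z = (conv_a / n_a - conv_b / n_b) / se
    # Two-sided tail area of the standard normal distribution.
    return math.erfc(abs(z) / math.sqrt(2))


def false_positive_rates(n_tests=2000, days=5, visitors_per_day=200,
                         rate=0.10, alpha=0.05, seed=1):
    """Simulate A/A tests (no real difference) with one peek per day.

    Returns (share of tests significant at ANY peek,
             share of tests significant at the planned end).
    All parameter values are hypothetical.
    """
    rng = random.Random(seed)
    peeking_hits = 0
    final_hits = 0
    for _ in range(n_tests):
        conv_a = conv_b = n = 0
        crossed_at_some_peek = False
        for _ in range(days):
            # Both arms convert at the same true rate.
            conv_a += sum(rng.random() < rate for _ in range(visitors_per_day))
            conv_b += sum(rng.random() < rate for _ in range(visitors_per_day))
            n += visitors_per_day
            p = p_value_two_proportions(conv_a, n, conv_b, n)
            if p < alpha:
                crossed_at_some_peek = True  # a peeker would stop here
        peeking_hits += crossed_at_some_peek
        final_hits += p < alpha  # p is the P value after the last day
    return peeking_hits / n_tests, final_hits / n_tests


peeking_rate, final_rate = false_positive_rates()
print(f"Bogus 'wins' when stopping at the first significant peek: {peeking_rate:.1%}")
print(f"Bogus 'wins' when waiting for the planned end:            {final_rate:.1%}")
```

Under these assumptions the second number should land near the nominal 5%, while the first should come out noticeably higher. The daily peeks are not independent rolls, since each day's P value includes all the earlier data, so the inflation is smaller than the dice analogy taken literally would suggest, but it is still substantial.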

If there is a real effect, we expect the P value to keep dropping (showing more and more significance) as we collect more data. But the P value can bounce around, and even when the experiment is run perfectly with no P-Hacking, a test of something that has no real effect still has a one-in-20 chance of showing a “significant” result that’s completely bogus. If you’re P-Hacking, the odds of a bogus result can increase a lot.

What makes this so insidious is that we are all wired to want to find something. Null results–finding the things that don’t have any effect–are boring. Positive results are much more interesting. We all want to go to our boss or client and talk about what we discovered, not what we didn’t discover.

How can you avoid P-Hacking? It’s hard. You need to be very aware of what your statistical tests mean and how they relate to the way you designed your study. Here are some tips:

Be aware that every decision you make while an A/B test is underway could be another roll of the dice. Don’t change anything about your study design once data collection has started.

Every relationship you analyze is also another roll of the dice. If you look at 20 different metrics that are just random noise, you should expect about one of them, on average, to show a statistically significant trend with p < 0.05.

When in doubt, collect more data. When there’s a real effect or trend, the statistical significance should improve as you collect more data. Bogus effects tend to go away.

Don’t think of statistical significance as some hard threshold. In reality, this is just a tool for estimating whether the results of an analysis are real or bogus, and there’s nothing magical about crossing p < 0.05, p < 0.01, or any other threshold.
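The arithmetic behind the tip about looking at 20 metrics is short enough to write out. This is a minimal sketch that assumes the 20 metrics are independent of one another and are all pure noise; real metrics are usually correlated, so treat the result as a rough guide. The variable names are only for illustration.

```python
alpha = 0.05     # significance threshold
n_metrics = 20   # number of pure-noise metrics examined (assumed independent)

# Average number of metrics that cross the threshold by chance alone.
expected_false_positives = alpha * n_metrics

# Chance that at least one of the metrics crosses the threshold.
p_at_least_one = 1 - (1 - alpha) ** n_metrics

print(f"Expected bogus 'significant' metrics: {expected_false_positives:.1f}")
print(f"Chance of at least one bogus result:  {p_at_least_one:.0%}")
```

Under those assumptions you expect one bogus result on average, and the chance of seeing at least one is roughly two in three.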

Another useful tip is to change the way you think and speak about statistical significance. When I discuss data with clients, I prefer to avoid the phrase “statistically significant” entirely: I’ll use descriptive phrases like, “there’s probably something real” when the P value is close to the significance threshold, and “there’s almost certainly a real effect” when the P value is well below the significance threshold.

I find this gives my clients a much better understanding of what the data really means. All statistics are inherently fuzzy, and anointing some results as “statistically significant” tends to give a false impression of Scientific Truth.

Peter U. Leppik is president and CEO of Vocalabs. He founded Vocal Laboratories Inc. in 2001 to apply scientific principles of data collection and analysis to the problem of improving customer service. Leppik has led efforts to measure, compare and publish customer service quality through third party, independent research. At Vocalabs, Leppik has assembled a team of professionals with deep expertise in survey methodology, data communications and data visualization to provide clients with best-in-class tools for improving customer service through real-time customer feedback.
