
Tighter limits

Many scientific studies are coming to the wrong conclusion simply because the standards for statistical significance are too low, argues a US statistician.

The finding may explain the growing number of studies that cannot be replicated, because, says Professor Valen Johnson of Texas A&M University, such studies may not have found a real result in the first place.

He adds that the problem is worst in the social and biological sciences.

"There's been quite an effort in psychology circles to explain the high number of studies that are not able to be replicated," says Johnson. "They've even been accusing scientists of making up data. But the problem isn't that, it's the threshold for determining significance is too weak."

Traditionally, scientists test an alternative hypothesis against the 'null hypothesis' - which represents the status quo, or lack of an effect.

When a result is obtained in an experiment, a decision is made whether to reject the null hypothesis. The null hypothesis is rejected, and the alternative adopted, only if the probability of getting a result at least as extreme, calculated under the assumption that the null hypothesis is true, is lower than a chosen threshold.

Often a 'P value' threshold of 0.05, or 'significance at the 5 per cent level', is used.

"The threshold of 0.05 was decided on arbitrarily by Fisher, Neyman and Pearson in the 1930s," says Johnson. "They were the leading statisticians of their day, so everyone just accepted it."

But people often misinterpret a P value of 0.05 for a test statistic as meaning that there is only a 5 per cent chance that the null hypothesis is true, says Johnson.

What the P value actually means is that there is a 5 per cent probability of getting results at least as extreme as these when there is really no effect in the experiment - when a drug has no effect, for example.
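That reading of the P value can be checked with a small simulation. The sketch below is illustrative only and is not taken from Johnson's study: it assumes a one-sided z-test on normally distributed noise with a known standard deviation, and the function name, sample size and number of trials are hypothetical choices.

```python
import random
from statistics import NormalDist, mean


def false_positive_rate(trials=4000, n=30, alpha=0.05, seed=1):
    """Fraction of no-effect experiments that a one-sided z-test calls significant."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    hits = 0
    for _ in range(trials):
        # The null hypothesis is true here: every observation is pure noise
        # with mean 0 and standard deviation 1.
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        z = mean(sample) * n ** 0.5          # z statistic, known sd = 1
        p_value = 1.0 - std_normal.cdf(z)    # one-sided P value
        if p_value < alpha:
            hits += 1
    return hits / trials


rate = false_positive_rate()
print(f"significant at 0.05 with no real effect: {rate:.3f}")
```

The printed rate should sit close to 0.05. It describes how often pure noise clears the bar, not how likely the null hypothesis is once a significant result has been seen.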

A marriage of methods

Classical statistical tests leading to P values are popular among scientists, but there's also another approach known as Bayesian statistics, he explains.

The Bayesian approach genuinely compares the null hypothesis with an alternative hypothesis and produces a Bayes factor - which must be high to favour the alternative hypothesis.

Johnson has married up some special Bayesian statistical tests with the classical ones, enabling a direct connection to be made between P values and the chance that the null hypothesis is wrong.

"The key idea of this research is that I defined Bayesian tests that reject the null hypothesis exactly when the classical tests do."

This enabled Johnson to work out just what a value of P=0.05 actually means in practice.

Analysing data from more than 800 experiments in two psychology journals, he calculated both P values and Bayes factors for the results.

"It turns out that when a classical test has a P value of 0.05, there is a 17 to 25 per cent chance that the null hypothesis is still likely to be true," he says.

"Based on this we are almost guaranteed to have 20 to 25 per cent of studies that are not reproducible because there was no real effect."
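The arithmetic that turns a Bayes factor into a probability for the null hypothesis can be sketched in a few lines. The Bayes factors of 3 and 5 below are illustrative values chosen to show how a range like 17 to 25 per cent can arise, and the equal prior odds (both hypotheses considered equally likely before the data) are an assumption; neither is stated in the article, and the function name is hypothetical.

```python
def posterior_null_probability(bayes_factor, prior_null=0.5):
    """Posterior probability of the null, given a Bayes factor favouring the alternative."""
    # Prior odds of the alternative against the null (1.0 when prior_null is 0.5).
    prior_odds_alt = (1.0 - prior_null) / prior_null
    # Bayes' rule in odds form: posterior odds = Bayes factor x prior odds.
    posterior_odds_alt = bayes_factor * prior_odds_alt
    return 1.0 / (1.0 + posterior_odds_alt)


for bf in (3, 5):  # illustrative Bayes factors, assumed for this sketch
    prob = posterior_null_probability(bf)
    print(f"Bayes factor {bf}: P(null | data) = {prob:.3f}")
```

Under these assumptions a Bayes factor of 3 leaves a 25 per cent chance that the null hypothesis is true, and a Bayes factor of 5 leaves about 17 per cent.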

Big cost to society

Johnson thinks results should be regarded as significant only at P values of 0.005 for "strong evidence" or 0.001 for "very strong evidence".

This is already happening in the physics community, he says.

"If a physicist makes a discovery, it's normal for the results to be replicated in another lab. They found that using a P value of 0.05 was much too low to accept a new result as being true."

The criterion used for the recent Higgs boson discovery was P = 0.0000003, he adds.
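To compare these thresholds on one scale, each P value can be converted into the number of standard deviations ("sigma") a result must lie from the null expectation. This is a minimal sketch assuming a one-sided test on a normally distributed statistic; the function name is hypothetical.

```python
from statistics import NormalDist


def sigma_for_p(p_value):
    """Standard deviations from the null that match a one-sided P value."""
    return NormalDist().inv_cdf(1.0 - p_value)


# Thresholds mentioned in the article, from weakest to strictest.
for p in (0.05, 0.005, 0.001, 0.0000003):
    print(f"P = {p:g} -> about {sigma_for_p(p):.2f} sigma")
```

On this scale the 0.05 threshold is roughly 1.6 sigma, while the Higgs criterion is close to the "five sigma" standard that particle physicists quote.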

"Biology and medical researchers [usually use] P=0.05 as the threshold for a new discovery," says Johnson.

"This can have a big effect in drug trials, for instance. They find an effect in a Phase 2 trial at P=0.05, but then it isn't there when they do larger clinical trials."

"The use of the 0.05 threshold is costing drug companies, and society in general, a lot of money because of false discoveries."

Raising the bar

"This study is suggesting a dramatic raising of the bar," says Professor Ian Marschner of Sydney's Macquarie University, who was not involved in the work.

"That's quite a reasonable recommendation and it's something that statisticians have thought of before. It's really a trade-off. If you raise the bar, studies are more expensive and tend to be bigger and take longer."

"I don't think this study proves that this is the way to go, but I certainly think it's a helpful contribution and it's a reasonable discussion for the scientific community to be having."

He says what this study adds to the debate is the theoretical marrying up of the Bayesian and classical approaches.
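The trade-off Marschner describes, in which a higher bar means bigger studies, can be put in rough numbers with a textbook sample-size formula. The sketch below is not from the study: it assumes a two-sided, one-sample z-test with known standard deviation, 80 per cent power, and a standardised effect size of 0.5, all of which are illustrative choices, as is the function name.

```python
from math import ceil
from statistics import NormalDist


def required_n(alpha, power=0.8, effect_size=0.5):
    """Observations needed to detect the given effect at significance level alpha."""
    z = NormalDist().inv_cdf
    z_alpha = z(1.0 - alpha / 2.0)  # critical value for a two-sided test
    z_power = z(power)              # quantile matching the desired power
    return ceil(((z_alpha + z_power) / effect_size) ** 2)


for alpha in (0.05, 0.005, 0.001):
    print(f"alpha = {alpha}: about {required_n(alpha)} observations")
```

Under these assumptions, moving the threshold from 0.05 to 0.005 raises the required sample from 32 to 54 observations, an increase of roughly 70 per cent.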