Multiple proposals about P values suggest science needs repairs

In the Twitterverse, science can stir up some vigorous debates. And they’re not all about the standard issues of climate change, vaccines and evolution. Some dueling tweets involve the scientific enterprise itself.

For instance, one recent tweet proclaimed “Science isn’t ‘self-correcting.’ Science is broken,” linking to a commentary about the well-documented problem that many scientific study results cannot be reproduced by follow-up experiments. To which an angry biologist/blogger replied: “No it’s not. Journalism is broken, as your clickbait-y title shows.”

Without taking sides (yet), it’s safe to say that part of the problem is that tweets don’t allow room for nuance. Whether science is broken or not depends on what you mean by “broken.” Maybe saying science is broken is not a fair assessment out of context. But nobody who has been paying attention could intelligently disagree that some aspects of scientific procedure are in need of repair. Otherwise it’s hard to explain why so many scientists are proposing so many major fixes.

Most such proposals have to do with one of the most notorious among science’s maladies: the improper use of statistical inference. One new paper, for instance, examines the use of statistics in medical clinical trials. Because of the flaws in standard statistical methods, even a properly conducted trial may reach erroneous conclusions, writes mathematician-biostatistician Leonid Hanin of Idaho State University. “Our main conclusion is that even a totally unbiased, perfectly randomized, reliably blinded, and faithfully executed clinical trial may still generate false and irreproducible results,” he writes in a recent issue of BMC Medical Research Methodology.

Clinical trials are not a special case. Many other realms of science, from psychology to ecology, are as messy as medicine. From the ill effects of pollutants to the curative power of medical drugs, deciding what causes harm or what cures ills requires data. Analyzing such data typically involves formulating a hypothesis, collecting data and using statistical methods to calculate whether the data support the hypothesis.

Such calculations generally produce a P value — the probability of obtaining the observed data (or results even more extreme) if there is no real effect (the null hypothesis). If that probability is low (by the usual convention, less than 5 percent, or P less than .05), most scientists conclude that they have found evidence for a real effect and send a paper off for publication in a journal. Astute critics of this method have long observed, though, that a low P value is not really evidence of an effect — it just tells you that you should be surprised to see such data if there is no effect. In other words, the P value is a statement about the data, not the hypothesis.
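To make that definition concrete, here is a minimal Python sketch of an exact one-sided P value for a made-up coin-flip experiment. The scenario (20 flips, 15 heads, a fair coin as the null hypothesis) and the function name are illustrative assumptions, not anything drawn from the studies discussed here.

```python
from math import comb

def one_sided_p_value(flips, heads):
    """Probability of seeing `heads` or more heads in `flips` tosses of a fair coin."""
    # Null hypothesis (assumed for illustration): the coin is fair, so every
    # sequence of tosses is equally likely.
    favorable = sum(comb(flips, k) for k in range(heads, flips + 1))
    return favorable / 2 ** flips

p = one_sided_p_value(20, 15)
print(f"P = {p:.4f}")  # P = 0.0207
```

That result falls below .05, so by convention it would count as significant. Yet the number says only how surprising 15 heads would be if the coin were fair. It does not say how likely the coin is to be biased.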

Scientists therefore often conclude they have found an effect when none actually exists. Such “false positive” results plague many fields, particularly psychology. Studies have shown that many if not most reported psychology findings are not reproduced when the experiment is repeated. But no scientific discipline is immune from this “irreproducibility” problem. Many scientists think it’s time to do something about it.

One recent paper, with 72 authors, proposes attacking the problem by changing the convention for a “statistically significant” P value. Instead of .05, the current convention, these authors suggest .005, so you could claim statistically significant evidence for an effect only if the chance of getting your result (or one more extreme) with no true effect was less than half a percent. “This simple step would immediately improve the reproducibility of scientific research in many fields,” the authors write. A P value between .005 and .05 should be labeled as merely “suggestive,” they say, not significant.

Such a tougher threshold no doubt would reduce the number of false positives. But this approach does not address the underlying problems that P values pose to begin with. P values would still be statements about the data, not the hypothesis. And while a tougher standard would reduce false positives, it would surely also increase the number of false negatives (that is, finding no effect when there really was one). In any case, changing one arbitrary standard to another would do nothing about the widespread misinterpretation and misuse of P values, or change the fact that a statistically significant P value can be calculated for an effect that is insignificant in practical terms.
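A rough way to see that trade-off is to compute both error rates exactly for a toy experiment. The sketch below assumes a hypothetical coin-flip study: 100 flips, a fair coin as the null hypothesis, and a true heads probability of 0.6 when an effect exists. Those numbers and all the function names are invented for illustration and do not come from the 72-author paper.

```python
from math import comb

def upper_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

def critical_count(n, alpha):
    """Smallest head count whose one-sided P value under a fair coin is below alpha."""
    for k in range(n + 1):
        if upper_tail(n, k, 0.5) < alpha:
            return k
    return n + 1  # no count qualifies; the test can never reject

def error_rates(n, alpha, true_p):
    """Return (false positive rate, false negative rate) at threshold alpha."""
    k = critical_count(n, alpha)
    false_positive_rate = upper_tail(n, k, 0.5)         # rejecting when the coin is fair
    false_negative_rate = 1 - upper_tail(n, k, true_p)  # missing a real bias
    return false_positive_rate, false_negative_rate

for alpha in (0.05, 0.005):
    fp, fn = error_rates(100, alpha, true_p=0.6)
    print(f"threshold {alpha}: false positives {fp:.3f}, false negatives {fn:.3f}")
```

In this toy setup the stricter threshold lowers the false positive rate and raises the false negative rate. How large each shift is depends entirely on the assumed sample size and effect.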

A second fix suggests not changing the P value significance threshold, but better explaining what a given P value means. One common misinterpretation is that a P value of .05 implies a 95 percent probability that the effect is real (or, in other words, that the chance of a false positive is only 5 percent). That’s baloney (and a logical fallacy as well). Gauging the likelihood of a real effect requires some knowledge of how likely such an effect was before conducting the experiment.

David Colquhoun, a retired pharmacologist and feisty tweeter on these issues, proposes that researchers should report not only the P value, but also how likely the hypothesis would have needed to be beforehand (its prior probability) to ensure a false positive risk of only 5 percent. In a recent paper available online, Colquhoun argues that the terms “significant” or “nonsignificant” should never be used. Instead, he advises, “P values should be supplemented by specifying the prior probability” corresponding to a specific false positive risk.

For instance, suppose you’re testing a drug with a 50-50 chance of being effective — in other words, the prior probability of an effect is 0.5. If the data yield a P value of .05, the risk of a false positive is 26 percent, Colquhoun calculates. If you’re testing a long shot, say with a 10 percent chance of being effective, the false positive risk for a P value of .05 is 76 percent.
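The arithmetic behind figures like these is Bayes' rule in odds form. The sketch below is not Colquhoun's own calculation, which works from the details of a specific statistical test. It assumes a single likelihood ratio for a P value of .05, back-derived here from the 26 percent figure quoted above, and shows that the same ratio reproduces the 76 percent figure for the long shot. The function and variable names are hypothetical.

```python
def false_positive_risk(prior, likelihood_ratio):
    """Chance that there is no real effect, given the data, via Bayes' rule in odds form."""
    prior_odds = prior / (1 - prior)                # odds of a real effect before the experiment
    posterior_odds = likelihood_ratio * prior_odds  # odds after seeing the data
    return 1 / (1 + posterior_odds)

# Assumption: how much more probable the observed data are under a real effect
# than under no effect, chosen so that a 50-50 prior yields 26 percent.
LR_AT_P_05 = (1 - 0.26) / 0.26  # about 2.8

for prior in (0.5, 0.1):
    risk = false_positive_risk(prior, LR_AT_P_05)
    print(f"prior {prior:.0%}: false positive risk {risk:.0%}")
# prior 50%: false positive risk 26%
# prior 10%: false positive risk 76%
```

A likelihood ratio under 3 means the data shift the odds only modestly, which is why the prior matters so much.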

Of course, as Colquhoun acknowledges, you never really know what the prior probability is. But you can calculate what the prior probability needs to be to give you confidence in your result. If your goal is a 5 percent risk of a false positive, you need a prior probability of 87 percent when the P value is .05. You’d already have to be pretty sure that the effect was real, rendering the evidence provided by the actual experiment superfluous. Colquhoun’s idea is therefore more like a truth-in-labeling effort than a resolution of the problem. Its main advantage is making the problem more obvious, helping to avoid the common misinterpretations of P values.
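Running Bayes' rule in reverse gives the prior needed for a chosen false positive risk. This sketch assumes a likelihood ratio of about 2.8 for a P value of .05, a figure back-derived from the 26 percent risk quoted earlier for a 50-50 prior rather than taken from Colquhoun's paper. The names are hypothetical.

```python
def prior_needed(target_risk, likelihood_ratio):
    """Prior probability of a real effect needed to reach a target false positive risk."""
    posterior_odds = (1 - target_risk) / target_risk  # odds of a real effect after the data
    prior_odds = posterior_odds / likelihood_ratio    # undo the update the data provided
    return prior_odds / (1 + prior_odds)

# Assumption: likelihood ratio implied by a 26 percent risk at a 50-50 prior.
LR_AT_P_05 = (1 - 0.26) / 0.26  # about 2.8

print(f"{prior_needed(0.05, LR_AT_P_05):.0%}")  # 87%
```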

Many other solutions to the problem of P values have been proposed, including banning them. Recognition of the problem is so widespread that headlines proclaiming science to be broken should not come as a surprise. Nor should such headlines be condemned because they may give aid and comfort to science’s enemies. Science is about seeking the truth, and if the truth is that science’s methods aren’t succeeding at that, it’s every scientist’s duty to say so and take steps to do something about it.

Yet there is another side to the “science is broken” question. The answer does not depend only on what you mean by “broken,” but also on what you mean by “science.” It’s beyond doubt that individual scientific studies, taken in isolation, are not a reliable way of drawing conclusions about reality. But somehow in the long run, science as a whole provides a dependable guide to the natural world, by far a better guide than any alternative way of knowing about nature. Sound science is the science established over decades of investigation by well-informed experts who take much more into account than just the evidence provided by statistical inference. Wisdom and judgment, not rote calculation, produce the depth of insight into reality that makes science the valuable and reliable enterprise it is. It’s the existence of thinkers with wisdom and judgment that makes science, in the biggest, truest sense, not really broken. We just need to get more of those thinkers to tweet.