Statistical (In-)Significance

In 2005 John Ioannidis famously declared that Most Published Research Findings Are False. How can this be? Ioannidis refers to studies that rely on statistical significance testing, which has become the norm in many fields, especially medicine; Ioannidis himself is an epidemiologist. Research involving only abstract reasoning (e.g. mathematics) or reliably repeatable mechanisms (e.g. engineering) is unaffected. But if you have any interest in following medical research, you should definitely be aware of the problem Ioannidis summarizes here:

the high rate of nonreplication (lack of confirmation) of research discoveries is a consequence of the convenient, yet ill-founded strategy of claiming conclusive research findings solely on the basis of a single study assessed by formal statistical significance, typically for a p-value less than 0.05.

Publications in medical research and similar fields frequently contain expressions such as “(p<0.05)”. This is the fabled p-value, and it has two very different interpretations: what it actually means, and what researchers think (or want you to think) it means. The real history and meaning of p-values are thoroughly explained in Regina Nuzzo’s 2014 Nature article Scientific method: Statistical errors.

The idea was to run an experiment, then see if the results were consistent with what random chance might produce. Researchers would first set up a ‘null hypothesis’ that they wanted to disprove, such as there being no correlation or no difference between two groups. Next, they would play the devil’s advocate and, assuming that this null hypothesis was in fact true, calculate the chances of getting results at least as extreme as what was actually observed. This probability was the P value. The smaller it was, suggested [UK statistician Ronald] Fisher, the greater the likelihood that the straw-man null hypothesis was false.
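To make the recipe in that quote concrete, here is a minimal sketch of one way to compute such a probability: a permutation test for a difference in means. This is only an illustration, not Fisher's original procedure. The function name `permutation_p_value`, the add-one correction and the two small data sets are all made up for this example. The null hypothesis is that the group labels don't matter, so the code shuffles the labels many times and counts how often chance alone produces a difference at least as extreme as the observed one.

```python
import random


def permutation_p_value(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided permutation test for a difference in group means.

    Null hypothesis: group labels are irrelevant, so any relabelling of the
    pooled data is as likely as the one actually observed.
    """
    rng = random.Random(seed)
    n_a = len(group_a)
    observed = abs(sum(group_a) / n_a - sum(group_b) / len(group_b))

    pooled = list(group_a) + list(group_b)
    n_b = len(pooled) - n_a
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # random relabelling under the null hypothesis
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b)
        if diff >= observed:
            at_least_as_extreme += 1

    # Add-one correction: the estimate is never exactly zero.
    return (at_least_as_extreme + 1) / (n_permutations + 1)


# Hypothetical measurements, invented purely for illustration.
treated = [5.1, 4.9, 6.2, 5.8, 6.0, 5.5, 6.1, 5.7]
control = [4.8, 5.0, 5.2, 4.7, 5.1, 4.9, 5.3, 5.0]

p = permutation_p_value(treated, control)
print(f"estimated p-value: {p:.4f}")
```

Note what the number does and does not say: it is the estimated probability of seeing such a difference if the labels were meaningless. It is not the probability that the null hypothesis is true.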

This testing method is fine for what it is, but it has serious limitations that unsound research publications would very much like you to ignore. They want you to believe that tests like p<0.05 establish important research findings. Unfortunately they do nothing of the sort.

Small p-values don’t indicate that any specific non-null hypothesis proposed by the research paper is actually true. They merely hint that some effect exists, whatever it might be. The researchers must still definitely rule out other causes, which is often difficult or impossible.

When an effect is noisy, p-values can vary dramatically from one sample to the next, even when exactly the same experiment is repeated. This is entertainingly visualized by Geoff Cumming’s Dance of the p-values.

The p-value by itself says nothing about the magnitude of the actual effect under consideration: given a large enough sample, even a trivially small effect produces a very small p-value. This is a notorious problem with medical science reporting which loves producing headlines such as “X increases risk of cancer” while never saying exactly by how much.
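The last two limitations are easy to reproduce in a small simulation. The sketch below rests on simplifying assumptions that are mine, not the cited authors': normally distributed data with a known standard deviation of 1, a two-sided z-test against a mean of zero, and arbitrarily chosen effect sizes and sample sizes. The helper names `z_test_p_value` and `simulate_p_values` are invented for this example. Real studies would typically use a t-test or something more elaborate.

```python
import random
from statistics import NormalDist


def z_test_p_value(sample, sigma=1.0):
    """Two-sided p-value for H0: population mean is 0, with known sigma."""
    n = len(sample)
    z = (sum(sample) / n) / (sigma / n ** 0.5)
    return 2 * (1 - NormalDist().cdf(abs(z)))


def simulate_p_values(true_effect, n, replications, seed=0):
    """Repeat the same experiment several times and collect the p-values."""
    rng = random.Random(seed)
    return [
        z_test_p_value([rng.gauss(true_effect, 1.0) for _ in range(n)])
        for _ in range(replications)
    ]


# 1. Same real effect, same design, 20 replications: the p-values "dance".
dance = simulate_p_values(true_effect=0.5, n=32, replications=20)
print(f"smallest p = {min(dance):.6f}, largest p = {max(dance):.6f}")

# 2. A tiny effect (2% of a standard deviation) in a huge sample
#    still yields a very small p-value.
tiny_effect_p = simulate_p_values(true_effect=0.02, n=100_000, replications=1)[0]
print(f"tiny effect, huge sample: p = {tiny_effect_p:.2e}")
```

In the first run nothing changes between replications except the random sample, yet the p-values spread over orders of magnitude. In the second run the p-value is impressively small although the effect itself is negligible, which is why an effect size should always be reported alongside it.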

Why is so much published research based on such a faulty measure? Quite simply, because it’s very easy to massage data sets into producing p-values considered low enough for publication – and advancing scientific careers requires lots of publications claiming important findings. Nuzzo cites Uri Simonsohn’s term “P-hacking” for this practice, which inevitably produces the large number of false results and “high rate of nonreplication” noted by Ioannidis. Any single paper basing its conclusions on p-values should be considered meaningless until widely replicated.
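One common form of massaging is to measure many outcomes and report only the one that happens to cross the threshold. The following sketch simulates that under the assumption that there is no real effect at all. The setup is hypothetical: 20 independent outcomes per study, 30 subjects, a z-test with known standard deviation, and the names `z_test_p_value` and `best_p_value` are invented for this example. With 20 independent tests at the 0.05 level, the chance of at least one false positive is 1 − 0.95^20, or about 64%.

```python
import random
from statistics import NormalDist


def z_test_p_value(sample, sigma=1.0):
    """Two-sided p-value for H0: population mean is 0, with known sigma."""
    n = len(sample)
    z = (sum(sample) / n) / (sigma / n ** 0.5)
    return 2 * (1 - NormalDist().cdf(abs(z)))


def best_p_value(n_outcomes, n_subjects, rng):
    """Test several outcomes that have NO real effect; keep the smallest p."""
    return min(
        z_test_p_value([rng.gauss(0.0, 1.0) for _ in range(n_subjects)])
        for _ in range(n_outcomes)
    )


rng = random.Random(0)
n_studies = 1000

# Honest study: one pre-specified outcome per study.
honest_rate = sum(best_p_value(1, 30, rng) < 0.05 for _ in range(n_studies)) / n_studies

# "Hacked" study: 20 outcomes measured, only the best one reported.
hacked_rate = sum(best_p_value(20, 30, rng) < 0.05 for _ in range(n_studies)) / n_studies

print(f"honest false-positive rate: {honest_rate:.2f}")
print(f"hacked false-positive rate: {hacked_rate:.2f}")
print(f"theoretical rate for 20 tests: {1 - 0.95 ** 20:.2f}")
```

Real p-hacking is usually subtler (dropping outliers, adding subjects until the threshold is crossed, trying different subgroups), but the mechanism is the same: every extra try is another chance for noise to look significant.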

This is not to say that p-values are completely useless, only that they are generally too weak to support published results on their own. As David Colquhoun puts it: “the function of significance tests is to prevent you from making a fool of yourself, and not to make unpublishable results publishable.” Researchers should use p-values internally, as a preliminary test for which hypotheses are worth pursuing. Positive results need stronger support, and you should remember that when reading scientific papers.

2015-04-17: Regina Nuzzo reports on Basic and Applied Social Psychology taking the drastic step of banning null hypothesis significance testing for publications, including p-values. They don’t seem to have settled on a definite replacement, although Bayesian analysis is being considered. Quoth psychologist Eric-Jan Wagenmakers at the University of Amsterdam,

What p-values do is address the wrong questions, and this has caused widespread confusion […] I believe that psychologists have woken up and come to the realization that some work published in high-impact journals is plain nonsense.