What does scientific reproducibility mean, anyway?

Reproducibility is one of the buzziest terms in science today. After all, science by its nature is not supposed to be a one-and-done affair.

But the reality is somewhat different. Half of preclinical research appears not to be reproducible. In psychology, that figure rises to 60 percent. A recent Nature survey of scientists found that more than half called the problem of reproducibility a “significant” crisis. That sense of urgency helps explain why the number of published studies of reproducibility in research has risen from about 100 per year in 1990 to more than 300 per year now.

But a new paper in Science Translational Medicine argues that the current movement to replicate results is crippled by a lack of agreement about the very nature of the word “replication” and its synonyms.

The heart of the confusion, the authors wrote, is that many scientists have come to think of reproducibility (or replicability, reliability, robustness, or generalizability) as an end in itself. But in fact, these terms, each of which has a subtly different meaning, end up being imprecise stand-ins for a broader term: “truth.”

Trouble is, they’re not particularly good proxies. At least that’s the argument of a group of researchers that includes John Ioannidis, a Stanford researcher whose essay claiming most clinical findings are false triggered a decade and counting of soul-searching.

What, for example, does “reproducibility” really mean? Does it mean that the methods of an experiment can be reproduced? The results of a subsequent experiment based on those methods? Or, maybe it means that two groups analyzing the same data would reach the same conclusions — which is far from a given. (The last, underappreciated piece, which the authors call “inferential” reproducibility, “might be the most important” component of replication in science, they note.)

Similarly, is there a difference between “replicability” and “reproducibility”? The National Science Foundation defines the former as “the ability of a researcher to duplicate the results of a prior study if the same procedures are followed but new data are collected.” Think of it as a cake recipe that must be followed to the letter to succeed, but with new flour, sugar, eggs, milk, and other ingredients. The point is not that we’re eating the same dessert; it’s that the cake would always (in theory) come out essentially the same way under the same conditions.

And some of what we’re talking about is in fact “generalizability,” which one might loosely define as the likelihood that findings in one population of animals or people will be meaningful in another population.

Even when scientists think they know what’s required to reproduce a finding, they are often mistaken. Take, for example, studies that gather huge amounts of data, such as those on gene activity. Such analyses are particularly finicky because they are prone to what scientists call “batch effects”: their results hinge on “exactly which samples were tested on which machine in what order and on what day, together with calibration data,” the authors wrote. “This level of detail is typically not provided in publications and is not always retained by the investigator.”

Indeed, maybe it’s time to add yet another category or term, one for “the researchers included a specific enough description of their materials and methods for other groups to work with.” That’s important, because the percentage of studies considered de facto “not reproducible” in some analyses includes these studies. It’s not that they’re false; we just have no way to tell. For this, the authors suggest the term “methods reproducibility.”

Another problem is that context may matter more in science than it has been given credit for. A recent paper in the Proceedings of the National Academy of Sciences found that factors like local culture appear to affect the likelihood of reproducing studies in psychology, although the size of the effect is likely small.

Ioannidis and his coauthors don’t tip their hand on whether they’re in the science-has-a-crisis camp. But they do argue that science won’t free itself from the Babel of incoherence without a much fuller sharing of data, methods, and everything else relevant to the experimental process.

“Such transparency,” they write, “will allow scientists to evaluate the weight of evidence provided by any given study more quickly and reliably and design a higher proportion of future studies to address actual knowledge gaps … rather than explore blind alleys suggested by research inadequately conducted or reported.”

In the tale of Chicken Little, reasonable farm animals could disagree about whether the sky was falling, but no one had any misconceptions about what the petite poulet was chirping about. The discussion of reproducibility needs its own lingua franca.