How to make a scientific result disappear

by Justin Esarey

Nathan Danneman (a co-author and one of my graduate students from Emory) recently sent me a New Yorker article from 2010 about the “decline effect,” the tendency for initially promising scientific results to get smaller upon replication. Wikipedia can summarize the phenomenon as well as I can:

In his article, Lehrer gives several examples where the decline effect is allegedly showing. In the first example, the development of second generation anti-psychotic drugs, reveals that the first tests had demonstrated a dramatic decrease in the subjects’ psychiatric symptoms. However, after repeating tests this effect declined and in the end it was not possible to document that these drugs had any better effect than the first generation anti-psychotics.

Experiments done by Jonathan Schooler were trying to prove that people describing their memories were less able to remember them than people not describing their memories. His first experiments were positive, proving his theory about verbal overshadowing but repeated studies showed a significant declining effect.

In 1991, Danish zoologist Anders Møller discovered a connection between symmetry and sexual preference of females in nature. This sparked a huge interest in the topic and a lot of follow-up research was published. In three years following the original discovery, 90% of studies confirmed Møller’s hypothesis. However, the same outcome was published in just four out of eight research papers in 1995, and only a third in next three years.

Why would a treatment that shows a huge causal effect in an experiment seem to get weaker when that experiment is repeated later on? “‘This was profoundly frustrating,’ he [Schooler] says. ‘It was as if nature gave me this great result and then tried to take it back.’”

Regression to the mean: As the number of data points increases, we expect the average values to regress to the true mean…and since often the initial work is done on the basis of promising early results, we expect more data to even out a fortuitously significant early outcome.

The file drawer effect: Results that are not significant are hard to publish, and end up stashed away in a cabinet. However, as a result becomes established, contrary results become more interesting and publishable.

These are common, well-known and well-understood phenomena. But as far as I know, no one’s really tried to formally assess the impact of these phenomena or to propose any kind of diagnostic of how susceptible any particular result is to these threats to inference.

Let’s start with a simple example. Suppose that the data generating process is $y = 0.5x + \varepsilon$, where $\varepsilon$ is mean-zero noise. Suppose further that we repeatedly generate data sets of size 1000 out of this DGP, run an appropriate linear model $y = \hat{\beta}_0 + \hat{\beta}_1 x$, and save only those estimated coefficients $\hat{\beta}_1$ that are statistically significant in a one-tailed test at $\alpha = 0.05$.

In short, we find that none of the statistically significant results are near the actual coefficient of 0.5. In fact, the statistically significant coefficients are biased upward (the mean coefficient is 1.008 in this simulation). This makes sense: only the largest slopes are capable of overcoming the intrinsic noise in the DGP and being detected at this sample size (1000).
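For readers who want to try this themselves, here is a minimal Python sketch of the exercise. It is not the code that produced the numbers in this post. The uniform draw for x, the noise standard deviation of 5, and the function name `simulate_significant_slopes` are all assumptions made for illustration, and the cutoff uses a normal approximation to the t distribution, which seems reasonable at n = 1000.

```python
import math
import random
import statistics


def simulate_significant_slopes(beta=0.5, noise_sd=5.0, n=1000, reps=1000, seed=0):
    """Fit y = b0 + b1*x by OLS on `reps` simulated data sets.

    Returns (all_slopes, significant_slopes), where the second list keeps only
    slopes that pass a one-tailed 5% test. noise_sd and x ~ U(0, 1) are
    illustrative assumptions, not values taken from the post.
    """
    rng = random.Random(seed)
    z_crit = statistics.NormalDist().inv_cdf(0.95)  # normal approximation to t
    all_slopes, significant = [], []
    for _ in range(reps):
        x = [rng.random() for _ in range(n)]
        y = [beta * xi + rng.gauss(0.0, noise_sd) for xi in x]
        x_bar, y_bar = statistics.fmean(x), statistics.fmean(y)
        sxx = sum((xi - x_bar) ** 2 for xi in x)
        sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
        slope = sxy / sxx
        intercept = y_bar - slope * x_bar
        rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
        se = math.sqrt(rss / (n - 2) / sxx)  # standard error of the slope
        all_slopes.append(slope)
        if slope / se > z_crit:  # the "publication gate"
            significant.append(slope)
    return all_slopes, significant


if __name__ == "__main__":
    # Fewer repetitions than the default keep the pure-Python demo quick.
    every, kept = simulate_significant_slopes(reps=200)
    print("mean of all slopes:        ", round(statistics.fmean(every), 3))
    print("mean of significant slopes:", round(statistics.fmean(kept), 3))
```

The two printed means are the point of the exercise: the first should sit near the true slope, while the second is computed only from estimates that cleared the significance gate.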

What does this mean? Well… the estimator is not itself intrinsically biased: if you plotted all the coefficients from our 1000 simulated samples, they would be normally distributed around the true mean of 0.5 with appropriate variance. But we’re not talking about the distribution of an estimator given a true value, $f(\hat{\beta} \mid \beta)$; we’re talking about the distribution of scientifically notable, publishable results, $f(\hat{\beta} \mid \beta, \hat{\beta} \text{ is statistically significant})$. This is the distribution of results we expect to see in journal articles and in the media. And that distribution is biased because the scientific review process requires that results reach a certain signal-to-noise ratio (viz., a p-value smaller than 0.05) before they deserve scientific attention.

In short: when you look at what’s in journal articles or scientific magazines, you’re looking at results that are biased. Not that this is a terrible thing: we are imposing this bias so as to avoid printing a stream of chance results or reams of uninformative non-effects (do we need to know that the price of tea in China is not related to rainfall in London?).

In fact, given the size of the underlying effect, I can tell you precisely how large we should expect that bias to be. The analysis below is for a normally distributed $\hat{\beta}$ with a standard error of 0.5.

So: the smaller the true coefficient, the more we expect a statistically significant (and positive) estimate of that coefficient to overstate it. The lesson is relatively straightforward: comparatively small relationships are very likely to be overestimated in the published literature, but larger relationships are more likely to be accurately estimated.
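One way to compute this kind of bias is with the mean of a truncated normal distribution. If $\hat{\beta}$ is normal around $\beta$ with standard error $s$, and only estimates above the one-tailed cutoff $z_{0.95} s$ survive, the expected overstatement is $s \, \phi(a) / (1 - \Phi(a))$ with $a = z_{0.95} - \beta / s$. The sketch below is an illustration under those assumptions (the function name `expected_bias` is hypothetical), not the code behind the plot discussed here.

```python
import math
from statistics import NormalDist


def expected_bias(beta, se=0.5, alpha=0.05):
    """E[beta_hat - beta | beta_hat passes a one-tailed test at level alpha].

    Assumes beta_hat ~ Normal(beta, se) and a known standard error.
    """
    z_crit = NormalDist().inv_cdf(1 - alpha)
    a = z_crit - beta / se  # standardized truncation point
    density = math.exp(-0.5 * a * a) / math.sqrt(2 * math.pi)
    tail = 0.5 * math.erfc(a / math.sqrt(2))  # P(Z > a), stable for large a
    return se * density / tail


if __name__ == "__main__":
    for beta in (0.0, 0.5, 1.0, 2.0, 3.0):
        print(f"true beta = {beta:.1f}   expected bias = {expected_bias(beta):.3f}")
```

Running the loop shows the bias shrinking as the true coefficient grows relative to its standard error, which is the pattern described above.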

The bias plot I just produced is neat, but the x-axis shows the true beta value—which, of course, we cannot know in advance. It explains the overall pattern of overestimated effects in journal articles and medical studies, but doesn’t give us a way to assess any particular result. It would be better, I think, for us to have some guess about the probability that the coefficient we are seeing in a journal article is biased upward. I think we can get that, too.

This block of code illustrates the gist of how I would proceed. I simulate a data set (of size 1000) out of a linear DGP, and superimpose the posterior distribution of $\beta$ (using the frequentist procedure, and therefore implying a flat prior) from that regression onto the bias curve computed for this application.

The idea is to use the posterior distribution of $\beta$ as our best guess about the underlying state of the world, then to infer the expected bias in the published literature on the basis of this distribution. In principle, this can be done with any appropriate data set for the study of $\beta$; for example, I could imagine collecting data, running an analysis, finding a statistically insignificant $\hat{\beta}$, and then constructing this plot to determine the distribution of this relationship that I will see published in extant or even future studies! But it’s probably more likely that someone will look at a published posterior, and then infer something about the likelihood that this published result overstates the actual effect.

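That is, we would calculate

$$E[\hat{\beta} - \beta] = \int \left( \int (\hat{\beta} - \beta) \, f(\hat{\beta} \mid \beta, \hat{\beta} \text{ is statistically significant}) \, d\hat{\beta} \right) f(\beta \mid \text{data}) \, d\beta.$$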

The inner integral is the expected bias of the estimated coefficient given its true value, and the outer integral calculates this expectation over the posterior.
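A rough numerical version of this double integral, in Python rather than R, might look like the following. It is a sketch under stated assumptions: the posterior is taken to be Normal($\hat{\beta}$, se), the inner integral uses the closed-form truncated-normal mean, and the outer integral is a plain trapezoid rule. The names `expected_bias` and `average_bias` and the standard error of 0.5 in the demo are hypothetical, so the output should not be expected to reproduce the figures quoted in this post.

```python
import math
from statistics import NormalDist


def expected_bias(beta, se, alpha=0.05):
    """Inner integral: E[beta_hat - beta | beta_hat significant, true beta]."""
    z_crit = NormalDist().inv_cdf(1 - alpha)
    a = z_crit - beta / se
    density = math.exp(-0.5 * a * a) / math.sqrt(2 * math.pi)
    tail = 0.5 * math.erfc(a / math.sqrt(2))
    return se * density / tail


def average_bias(beta_hat, se, alpha=0.05, grid=2001, width=6.0):
    """Outer integral: average expected_bias over a Normal(beta_hat, se) posterior.

    Integrates over beta_hat +/- width*se with the trapezoid rule.
    """
    posterior = NormalDist(beta_hat, se)
    lo = beta_hat - width * se
    step = 2 * width * se / (grid - 1)
    total = 0.0
    for i in range(grid):
        beta = lo + i * step
        weight = 0.5 if i in (0, grid - 1) else 1.0  # trapezoid end points
        total += weight * expected_bias(beta, se, alpha) * posterior.pdf(beta)
    return total * step


if __name__ == "__main__":
    # se = 0.5 is an assumed value for illustration only.
    for estimate in (0.895, 2.0):
        print(f"published estimate {estimate}: average bias {average_bias(estimate, 0.5):.3f}")
```

Passing a larger estimate with the same standard error gives a smaller average bias, since most of the posterior then sits well above the significance cutoff.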

I created some R code to calculate this expectation, and then applied it to the posterior distribution above.

The result: an average bias of 0.299. What this tells us is that, given the range of true values of $\beta$ with which this result is consistent, we would on average expect it to overstate the true value by about 33%.

Note again: the estimator itself is not biased, nor is anything about the data set wrong. What this tells us is that, given the “gating” process in publishing that is imposed by statistical significance testing, this particular published result is likely to overstate the true value of $\beta$. If all posteriors were equally likely to be published, the distribution of published estimates would be symmetric about $\beta$ and there would be no bias.

But not all results are equally susceptible: larger and more certain published results are more likely to accurately represent the actual value of $\beta$, because deviations of $\hat{\beta}$ that are smaller and larger than $\beta$ are equally likely to be published. We can see this by computing the expected bias for a coefficient of 2 instead of 0.895: the expected bias drops to 0.016.

So, why do effects weaken over time? As PZ Myers said, “However, as a result becomes established, contrary results become more interesting and publishable.” What that means is that the statistical significance threshold for publication disappears, and the publication distribution shifts from $f(\hat{\beta} \mid \beta, \hat{\beta} \text{ is statistically significant})$ to $f(\hat{\beta} \mid \beta)$; as we have already seen, the first distribution is going to exaggerate the size of effects whereas the second will reflect them accurately, and thus subsequent publications will indicate a weakened effect.
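The shift between the two publication regimes can be mimicked in a few lines of Python. This is only a toy illustration: it assumes every study draws its estimate from Normal(beta, se) with the same known standard error, that initial studies are published only if they pass a one-tailed 5% test, and that replications are published regardless of outcome. The function name `decline_effect` and all default values are hypothetical.

```python
import random
from statistics import NormalDist, fmean


def decline_effect(beta=0.5, se=0.5, studies=20000, alpha=0.05, seed=1):
    """Return (mean gated initial estimate, mean unfiltered replication estimate)."""
    rng = random.Random(seed)
    cutoff = NormalDist().inv_cdf(1 - alpha) * se  # smallest publishable estimate
    # Initial studies: only significant estimates make it into print.
    initial = [b for b in (rng.gauss(beta, se) for _ in range(studies)) if b > cutoff]
    # Replications: one per published finding, reported whatever the result.
    replications = [rng.gauss(beta, se) for _ in range(len(initial))]
    return fmean(initial), fmean(replications)


if __name__ == "__main__":
    first, later = decline_effect()
    print("mean published initial estimate:", round(first, 3))
    print("mean replication estimate:      ", round(later, 3))
```

In this setup the replications average out near the true coefficient while the gated initial estimates sit well above it, so the effect appears to decline even though nothing about the underlying relationship has changed.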

What I hope that I’ve done here is to create a diagnostic tool to identify possibly problematic results before the replication happens.

Comments and criticisms welcome! I’m still figuring out whether any of this makes any sense.

[Update 2/27/2013, 10:27 AM CST: I made a couple of very small edits to the text, source code, and pictures. The text edits were minor LaTeX errors and clarifications. I also changed the code so as to reflect the standard error of the estimated result, rather than a close-but-not-quite value, in the third plot. Blogging at midnight be hard, yo.]

22 Comments to “How to make a scientific result disappear”

I agree with everything here, but you should have based this off of a different scientific article, as Jonah Lehrer was fired from both the New Yorker and Wired rather recently for fabricating quotations, recycling content and plagiarism.

I thought about whether to reference Lehrer’s piece given the comparatively recent exposure of his transgressions. But as far as I can tell, the core arguments of this piece aren’t compromised by any of those shenanigans (nor was Lehrer the discoverer of “decline effect;” he just shone a public light on the phenomenon). I also just don’t have the heart for hard-core shunning of someone’s good work because they did some rather bad work later. Ultimately, I decided to use it.

If the piece is compromised by Lehrer’s ethical lapses, let me know and I’ll be happy to make a note of it in the post.

I think there’s been a lot of work in this vein already. For example, Ioannidis’s ‘Why Most Research Findings Are False’ seems to do something similar.

> Not that this is a terrible thing: we are imposing this bias so as to avoid printing a stream of chance results or reams of uninformative non-effects (do we need to know that the price of tea in China is not related to rainfall in London?).

You don’t think that being able to exclude large global weather patterns of precipitation is interesting or something we need to know?

> As PZ Myers said, “However, as a result becomes established, contrary results become more interesting and publishable.”

I’m not sure this is true at all. One of my side-projects is building up a meta-analysis of all studies on dual n-back and IQ: http://www.gwern.net/DNB%20FAQ#meta-analysis The evidence is mounting that most of the claimed IQ gain is a methodological artifact. So by Myers’s logic, weak or null results would become more interesting and publishable, right?

Actually, one researcher tells me in private that reviewers of their work gave them hell – since everyone knew that dual n-back increased IQ, their procedures or results must be wrong! (Also relevant to discussions of publication bias: one researcher refuses to give me the data from a study that was abandoned after dual n-back failed to increase IQ scores…)

>Ioannidis’s ‘Why Most Research Findings Are False’ seems to do something similar.

Ioannidis’s paper (titled “Why Most Published Research Findings are False” and available from PLoS Medicine at http://goo.gl/EDdeQ) is about the probability that a particular finding comes out of a null hypothesis’ sampling distribution rather than the distribution of a “true” effect. It’s much closer to a topic that I posted about earlier, http://goo.gl/AGYFx. It’s a great paper and one that I’ve read, and that probably influenced this post, but I wouldn’t say it’s all that similar except that it casts doubt on a naive interpretation of the literature as unvarnished truth. Ioannidis seems much more interested in the issue of bias in study formulation.

>You don’t think that being able to exclude large global weather patterns of precipitation is interesting or something we need to know?

I’m not that interested in how tea prices in China are *not* affected by London rainfall, as there are countless benign and unimportant reasons why they would not be.

>Actually, one researcher tells me in private that reviewers of their work gave them hell – since everyone knew that dual n-back increased IQ, their procedures or results must be wrong!

That is interesting, and somewhat discouraging if true (as it would suggest a persistently skewed, or even more skewed, publication distribution over time). But I do see the argument for exceptional claims requiring exceptional proof: once a result has been established many, many times in a literature, such that our prior is shifted toward the existence of an effect, then any one null result might be plausibly ascribable to chance. I’d have to think about what the trajectory of research over time would look like under different scenarios, and the tradeoffs in various types of error implied by various sorts of rules.

> Not that this is a terrible thing: we are imposing this bias so as to avoid printing a stream of chance results or reams of uninformative non-effects (do we need to know that the price of tea in China is not related to rainfall in London?).

Thanks very much for the paper suggestions; I’ve downloaded and printed them both. I wasn’t aware of them before (they’re not in my discipline) and they certainly look closely related to the overall point of this post.

In re: using statistical significance cutoffs: I have to agree that using a 5% or better significance rule for publishability results in very misleading results under the null. But the naive alternative of publishing analyses regardless of statistical significance would clog the journals; I don’t want to read everyone’s dead ends and false starts. They’d make it difficult to find the truly interesting and pathbreaking results in a journal. That is a real cost, and in the past it may have justified the cutoff rule or something like it.

In the modern age, there are alternatives that may make the best of both worlds: open-access and online-only journals of minor prestige that are easily searched and cost little to produce. That way, when someone sends some out-of-this-world result to Science or Nature, then the editors and reviewers can first look through this repository to see if this is the 5% of statistically significant noise before they proceed. It’s good for researchers, too: at least I could put a line on my CV for all that work that didn’t produce a breakthrough!

Absolutely nicely done!
BTW, do you know that there is a special issue about replicability in “Perspectives on Psychological Science” (open access!): http://pps.sagepub.com/content/7/6.toc . Anyway, I think your “diagnostic tool” is worth a publication.

One of my graduate students and I are working on a paper in re: the topic of interpreting hypothesis tests in published work, which has informed my last string of posts… hopefully it will be received well!

One wrinkle which you might want to spend more time on is that your analysis is on a standardized variable, Justin. Small doesn’t mean small in your conclusion, it just means small in the world of mean zero unit variance distributions.

Right, it means small relative to the variance of the coefficient distribution (which is itself a product of sample size and the intrinsic noisiness of the data set). The examples I worked on had relatively noisy DGPs, but nothing out of the ballpark of tons of data sets I’ve worked with. The “average bias” function I cooked up includes entries for both the coefficient value and its standard error, which standardizes the distribution.

which grabs the “significant” estimates for b_1. These are plotted and the interpretation is:
“In short, we find that none of the statistically significant results are near the actual coefficient of 0.5”

Now IF my logic is correct, the actual value is 1 and is contained within the estimates from the simulations.

Yo Dogg, I herd you like coding errors, so I put a coding error in my example so you can correct my attempt to correct the literature’s practice!

Sorry, I just had to work the internet meme in there somehow. :-)

Obviously, you’re completely correct; I originally had this programmed correctly and somehow screwed it up. The example still works, mutatis mutandis, once I fix the code to be what it should be (which I will do momentarily). Thanks very much for pointing out my error!

More details on claims of discoveries based on “statistical significance”, but still at a rather ‘popular’ level, are given in a presentation to Italian teachers last October (http://www.roma1.infn.it/~dagos/IdF2012, including slides and video in Italian), in which I also tried to promote R.