Thursday, November 28, 2013

What Can We Learn from the Many Labs Replication Project?

The first massive replication project in psychology has just
reached completion (several others are to follow). A large group of researchers,
which I will refer to as ManyLabs, has attempted to replicate 15 findings from
the psychological literature in various labs across the world. The paper is posted on
the Open Science Framework (along with the data) and Ed Yong has authored a
very accessible write-up. [Update May 20, 2014: the article is now out and is open access.]

What can we learn from the ManyLabs project? The results
here show the effect sizes for the replication efforts (in green and grey) as
well as the original studies (in blue). The 99% confidence intervals are for
the meta-analysis of the effect size (the green dots); the studies are ordered by
effect size.
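For readers who want to see how a pooled effect size and its 99% confidence interval can be computed from per-lab results, here is a minimal sketch. It assumes a simple fixed-effect (inverse-variance) model, which is not necessarily the model ManyLabs used, and the per-lab numbers are hypothetical, not taken from the ManyLabs data. The function name `pooled_effect` is my own.

```python
import math
from statistics import NormalDist


def pooled_effect(effects, variances, level=0.99):
    """Fixed-effect (inverse-variance) pooled estimate with a confidence interval.

    effects   -- per-lab effect sizes (e.g., Cohen's d)
    variances -- per-lab sampling variances of those effect sizes
    level     -- confidence level for the interval (0.99 gives a 99% CI)
    """
    # Labs with smaller sampling variance get more weight.
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    estimate = sum(w * d for w, d in zip(weights, effects)) / total
    # The standard error of the pooled estimate shrinks as labs are added.
    se = math.sqrt(1.0 / total)
    # Two-sided critical value from the standard normal distribution.
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return estimate, estimate - z * se, estimate + z * se


# Hypothetical data: four labs, made-up effect sizes and variances.
effects = [0.10, 0.30, 0.20, 0.25]
variances = [0.04, 0.04, 0.04, 0.04]
est, lo, hi = pooled_effect(effects, variances)
print(f"pooled d = {est:.3f}, 99% CI [{lo:.3f}, {hi:.3f}]")
```

With only four small hypothetical labs the interval is still wide; adding labs increases the total weight and narrows it, which is the sense in which many labs together gain evidentiary power.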

Let’s first consider what we canNOT learn from these data.
Of the 13 replication attempts (when the first four are taken together), 11
succeeded and 2 did not (in fact, at some point ManyLabs suggests that a
third one, Imagined Contact, also doesn’t really replicate). We cannot learn from
this that the vast majority of psychological findings will replicate, contrary to this
Science headline, which states that these findings “offer reassurance” about the
reproducibility of psychological findings. As Ed Yong (@edyong209) joked on
Twitter, perhaps ManyLabs has stumbled on the only 12 or 13 psychological
findings that replicate! Because the 15 experiments were not a random sample of
all psychology findings and it’s a small sample anyway, the percentage is not informative,
as ManyLabs duly notes.

But even if we had an accurate estimate of the percentage of
findings that replicate, how useful would that be? Rather than trying to
arrive at a more precise estimate, it might be more informative to follow up the
ManyLabs project with projects that focus on a specific research area or
topic, as I proposed in my first-ever post, as this might lead to theory advancement.

So what DO we learn from the ManyLabs project? We learn that for some experiments,
the replications actually yield much larger effects than the original studies,
a highly intriguing finding that warrants further analysis.

We also learn that the two social priming studies in the
sample, dangling at the bottom of the list in the figure, were resoundingly nonreplicated. One study found that
exposure to the United States flag increases conservatism among Americans; the
other study found that exposure to money increases endorsement of the current
social system. The replications show that there essentially is no effect
whatsoever for either of these exposures.

It is striking how far the effect sizes of the
original studies (indicated by an x) are from the rest of the experiments. There
they are, by their lone selves at the bottom right of the figure. Given that
all of the data from the replication studies have been posted online, it would
be fascinating to get the data from the original studies. Comparisons of
the various data sets might shed light on why these studies are such outliers.

We also learn that the online experiments in the project yielded
results that are highly similar to those produced by lab experiments. This does
not mean, of course, that any experiment can be transferred to an online
environment, but it certainly inspires confidence in the utility of online
experiments in replication research.

Most importantly, we learn that several labs working
together yield data that have enormous evidentiary power. At the same time,
it is clear that such large-scale replication projects will have diminishing
returns (for example, the field cannot afford to devote countless massive replication efforts
to not replicating all the social priming experiments that are out there).
However, rather than using the ManyLabs approach retrospectively, we can also
use it prospectively: to test novel hypotheses.

Here is how this might go.

(1) A group of researchers form a hypothesis (not by pulling
it out of thin air but by deriving it from a theory, obviously).

And so I agree with the ManyLabs authors when they conclude
that a consortium of laboratories could
provide mutual support for each other by conducting similar large-scale
investigations on original research questions, not just replications. Among
the many accomplishments of the ManyLabs project, showing us the feasibility of
this approach might be its major one.

9 comments:

Thanks for the post on this Rolf. What I was wondering when I read about the replications in Nature (http://www.nature.com/news/psychologists-strike-a-blow-for-reproducibility-1.14232) was whether or not these were really replications. Doesn't the very fact that they "combined tests from earlier experiments into a single questionnaire — meant to take 15 minutes to complete" mean that they did not, technically, "replicate" the original studies? They essentially created a new study (survey instrument), that contained items from prior studies. That, then, created a set of new contextual factors surrounding these questionnaire items.

Anyway, since you've thought a lot more about this issue, I'd be interested in your interpretation.

1. LITERAL. Exact, only the subjects and time change (e.g., in-lab replication).
2. OPERATIONAL. Reproduce the methods as best as possible.
3. CONSTRUCTIVE. Replicate the theoretical construct.

The scientific credibility awarded to a theory is largest for a successful constructive replication, followed by an operational one; a literal replication is the least impressive.

I think ManyLabs shows it is possible to conduct constructive type replications and therefore am not surprised to see variation.

However, I do wonder about the following: There were original studies that had a power of ~99% to detect the original effect, as well as the replicated effect... there's more to power than sample size!

By the way... The idea to use this for novel predictions... Where do I sign up? :)

If you subliminally prime a semantic category in order to examine its effect on the response latency in a lexical decision task so you can build a better computational model of reading in adults, I don't think anyone would call it social priming, but semantic priming or something similar.

If you study the effect of a prime that has acquired meaning as a symbol at the level of a nation, society or culture on the behaviour or attitudes of individuals that concern concepts that are meaningful with respect to a similar aggregate level of nation or society (voting behaviour, position in a political debate, attitude towards conservative or liberal), I think a lot of people would call that social priming.

The effect, by any other name, would still be 0 on average in this sample.

"The effect, by any other name, would still be 0 on average in this sample." :) Yes indeed! I was not arguing that point at all.

To take your reply and push the point further, if you allow any concept that has "acquired meaning as a symbol" to be called "social priming," then actually, your first example is social priming. In fact, all priming is social priming to the extent that language is a social activity wherein symbols come to acquire meaning within a "nation, society, or culture." That describes language perfectly and therefore any priming involving language is social priming.

One point I haven’t seen discussed but I’m wondering about: how come the effect sizes for the original studies mostly fall within a relatively narrow range? Much narrower than the range of effect sizes from the ManyLabs replication, but centered on about the same grand mean. Is that just happenstance? Is there some obvious explanation I’m missing?