The sample Gantt chart above nicely illustrates a typical scenario. Let's suppose we have a postdoc with 30
months' funding. Amazingly, she is not held up by patient recruitment issues
or ethics approvals, and everything goes according to plan, so 24 months in,
she writes up the study and submits it to a journal. At the same time, she may be applying for further funding or
positions. She may plan to start a family at the end of her fellowship.
Depending on her area of study it may take anything from two weeks to six
months to hear back from the journal*. The decision is likely to be revise and
resubmit. If she’s lucky, she’ll be able to do the revisions and get the paper
accepted to coincide with the end of her fellowship. All too often, though, the revisions
required are substantial. If she's very unlucky the reviewers may demand additional experiments, which
she has no funding for. If they just
want changes to the text, that's usually do-able, but often they will suggest
further analyses that take time, and she may only get to the point of
resubmitting the manuscript when her money runs out. Then the odds are that the
paper will go back to the reviewers – or even to new reviewers – who now have
further ideas of how the paper can be improved. But now our researcher might have
started a new job, have just given birth, or be unemployed and desperately
applying for further funds.

The thing about this scenario, which will be all too
familiar to seasoned researchers (see a nice example here), is that it is
totally unpredictable. Your paper may be accepted quickly, or it may get
endlessly delayed. The demands of the reviewers may involve another six months'
work on the paper, at a point when the researcher just doesn’t have the time. I’ve
seen dedicated, hardworking, enthusiastic young researchers completely ground
down by this situation, faced by the choice of either abandoning a project that
has consumed a huge amount of energy and money, or somehow creating time out of
thin air. It’s particularly harsh on those who are naturally careful and
obsessive, who will be unhappy at the idea of doing a quick and dirty fix to
just get the paper out. The paper that started out as their pride and joy,
representing their best efforts over a period of years, is now reduced to a
millstone around their neck.

But there is an alternative. I’ve recently, with a graduate
student, Hannah Hobson, put my toe in the waters of Registered Reports, with a
paper submitted to Cortex looking at an electrophysiological phenomenon
known as mu suppression. The key difference from the normal publication route
is that the paper is reviewed before the study is conducted, on the basis of an
introduction and protocol detailing the methods and analysis plan. This, of
course, takes time – reviewing always does. But if and when the paper is
approved by reviewers, it is provisionally accepted for publication, provided
the researchers do what they said they would.

One advantage of this process is that, after you have
provisional acceptance of the submission, the timing is largely under your own
control. Before the study is run, the introduction and methods are already
written up, and so once the data are in, you just add the results and
discussion. You are not prohibited from doing additional analyses that weren’t
pre-registered, but they are clearly identified as such. Once the study is written up, the
paper goes back to reviewers. They may make further suggestions for improving
the paper, but what they can’t do is to require you to do a whole load of new
analyses or experiments. Obviously, if a reviewer spots a fatal error in the
paper, that is another matter. But reviewers can’t at this point start
dictating that the authors do further analyses or experiments that may be
interesting but not essential.

We found that the reviewer comments on our completed study were
helpful: they advised on how to present the data and made suggestions about how
to frame the discussion. One reviewer suggested additional analyses that would
have been nice to include but were not critical; as Hannah was working to tight
deadlines for thesis completion and starting a new job, we realised it would
not be possible to do these, but because we have deposited the data for this
paper (another requirement for a Registered Report), the door is left open for
others to do further analysis.

I always liked the idea of Registered Reports, but this
experience has made me even more enthusiastic about the approach. I can imagine
how different the process would have been had we gone down the conventional
publishing route. Hannah would have started her data collection much sooner, as
we wouldn’t have had to wait for reviewer comments. So the paper might have
been submitted many months earlier. But then we would have started along the
long uncertain road to publication. No doubt reviewers would have asked why we
didn’t include different control conditions, why we didn’t use current source
density analysis, why we weren’t looking at a different frequency band, and whether
our exclusionary criteria for participants were adequate. They may have argued
that our null results arose because the study was underpowered. (In the
pre-registered route, these were all issues that were raised in the reviews of
our protocol, and so had already been incorporated into the study.) We would have been at
risk of an outright rejection at worst, or requirement for major revisions at
best. We could then have spent many months responding to reviewer recommendations
and then resubmitting, only to be asked for yet more analyses. Instead, we had a pretty clear idea of the
timeline for publication, and could be confident it would not be enormously
protracted.

This is not a rant against peer reviewers. The role of the
reviewer is to look at someone else’s work and see how it could be improved. My
own papers have been massively helped by reviewer suggestions, and I am on
record as defending the peer review system against attacks. It is more a rant
against the way in which things are ordered in our current publication system.
The uncertainty inherent in the peer review process generates an enormous
amount of waste, as publications, and sometimes careers, are abandoned. There
is another way, via Registered Reports, and I hope that more journals will
start to offer this option.

After the exchange between the Reproducibility Project team and its critics, the
folks in the media are confused and don't know what to think.

The bulk of debate has been focused on what exactly we mean
by reproducibility in statistical terms. That makes sense because many of the
arguments hinge on statistics, but I think that ignores the more basic issue,
which is whether psychology has a problem. My view is that we do have a
problem, though psychology is no worse than many other disciplines that use
inferential statistics.

The Reproducibility Project showed that many effects
described in contemporary literature are not so robust. But was it ever thus? I'd
love to see the Reproducibility Project rerun with psychology studies reported
in the literature from the 1970s – have we really got worse, or am I only aware of the
reproducible work just because that stuff has stood the test of time, while
other work is forgotten?

My bet is that things have got worse, and I suspect there
are a number of reasons for this:

1. Most of the phenomena I describe above were in
areas of psychology where it was usual to report a series of experiments that
demonstrated the effect and attempted to gain a better understanding of it by
exploring the conditions under which it was obtained. Replication was built
into the process. That is not common in many of the areas where reproducibility of
effects is contested.

2. It’s possible that all the low-hanging fruit has
been plucked, and we are now focused on much smaller effects – i.e., where the
signal of the effect is low in relation to background noise. That’s where
statistics assumes importance. Something like the phonological confusability
effect in short-term memory or a Müller-Lyer illusion is so strong that it can
be readily demonstrated in very small samples. Indeed, abnormal patterns of performance
on short-term
memory tests can be used diagnostically with individual patients. If you
have a small effect, you need much bigger samples to be confident that what you
are observing is signal rather than noise. Unfortunately, the field has been
slow to appreciate the importance of sample size and many studies
are just too underpowered to be convincing (see the power simulation sketched after this list).

3. Gilbert et al
raise the possibility that the effects that are observed are not just small but
also more fragile, in that they can be very dependent on contextual factors.
Get these wrong, and you lose the effect. Where this occurs, I think we should
regard it as an opportunity, rather than a problem, because manipulating
experimental conditions to discover how they influence an effect can be the key
to understanding it. It can be difficult to distinguish a fragile effect from a
false positive, and it is understandable that this can lead to ill-will between
original researchers and those who fail to replicate their finding. But the
rational response is not to dismiss the failure to replicate, but to first do
adequately powered studies to demonstrate the effect and then conduct further
studies to understand the boundary conditions for observing the phenomenon. To
take one of the examples I used above, the link between phonological awareness
and learning to read is particularly striking in English and less so in some
other languages. Comparisons
between languages thus provide a rich source of information for
understanding how children become literate. Another of the effects, the right-ear
advantage in dichotic listening, holds at the population level, but there
are individuals for whom it is absent or reversed. Understanding this
variability is part of the research process.

4. Psychology, unlike many other biomedical disciplines,
involves training in statistics. In principle, this is a thoroughly good thing, but
in practice it can be a disaster if the psychologist is simply fixated on
finding p-values less than .05 – and assumes that any effect associated with
such a p-value is true. I’ve blogged about this extensively, so won’t repeat
myself here, other than to say that statistical
training should involve exploring simulated datasets so that the student
starts to appreciate the
ease with which low p-values can occur by chance when one has a large
number of variables and a flexible approach to data analysis. Virtually all
psychologists misunderstand p-values associated
with interaction terms in analysis of variance – as
I myself did until working with simulated datasets. I think in the past
this was not such an issue, simply because it was not so easy to conduct
statistical analyses on large datasets – one of my early papers
describes how to compare regression coefficients using a pocket calculator,
which at the time was an advance on other methods available! If you have to put
in hours of work calculating statistics by hand, then you think hard about the
analysis you need to do. Currently, you can press a few buttons on a menu and
generate a vast array of numbers – which can encourage the researcher to just
scan the output and highlight those where p falls below the magic threshold of
.05. Those who do this are generally unaware of how problematic this is, in
terms of raising the likelihood of false positive findings.
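
To make that last point concrete, here is a minimal simulation sketch in Python
(using numpy and scipy; the sample size, number of outcome variables, and number
of simulated studies are arbitrary values chosen for illustration, not figures
from any real study). It generates pure-noise data for a study with many outcome
measures and counts how often at least one of them comes out "significant" at
p < .05 when every variable is scanned.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

n_participants = 40        # 20 per group
n_variables = 20           # outcome measures, all pure noise
n_simulated_studies = 2000

group = np.repeat([0, 1], n_participants // 2)
studies_with_a_hit = 0

for _ in range(n_simulated_studies):
    # Every variable is random noise: there is no real group difference anywhere.
    data = rng.normal(size=(n_participants, n_variables))
    pvals = [
        stats.ttest_ind(data[group == 0, j], data[group == 1, j]).pvalue
        for j in range(n_variables)
    ]
    if min(pvals) < 0.05:
        studies_with_a_hit += 1

# With 20 independent noise variables, roughly 1 - 0.95**20 (about 64%) of
# simulated studies contain at least one "significant" difference by chance.
print(f"Studies with at least one p < .05: {studies_with_a_hit / n_simulated_studies:.0%}")
```

Flexible analysis choices – trying different covariates, subgroups, or outlier
rules – push that figure even higher.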
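
And for the sample-size issue raised in point 2, a second sketch (same caveats:
the effect sizes and group sizes are illustrative choices of mine, not taken from
any particular study) estimates power by simulation for a large versus a small
effect in a simple two-group comparison.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2016)

def estimated_power(effect_size, n_per_group, n_sims=2000, alpha=0.05):
    """Proportion of simulated two-group studies reaching p < alpha
    when the true group difference is `effect_size` standard deviations."""
    hits = 0
    for _ in range(n_sims):
        control = rng.normal(0.0, 1.0, n_per_group)
        treated = rng.normal(effect_size, 1.0, n_per_group)
        if stats.ttest_ind(treated, control).pvalue < alpha:
            hits += 1
    return hits / n_sims

# A large effect (d = 0.8) is detected most of the time even with 20 per group;
# a small effect (d = 0.2) needs several hundred per group before detection
# becomes likely.
for d in (0.8, 0.2):
    for n in (20, 50, 400):
        print(f"d = {d:.1f}, n = {n:>3} per group: power = {estimated_power(d, n):.2f}")
```

A robust effect of the kind mentioned above shows up reliably even in small
samples; a small one is mostly missed unless the study is far larger than is
typical in the contested literature.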

Nosek et al have demonstrated that much work in psychology
is not reproducible in the everyday sense that if I try to repeat your
experiment I can be confident of getting the same effect. Implicit in the critique
by Gilbert et al is the notion that many studies are focused on effects that
are both small and fragile, and so it is to be expected they will be hard to
reproduce. They may well be right, but if so, the solution is not to deny we
have a problem, but to recognise that under those circumstances there is an
urgent need for our field to tackle the methodological issues of inadequate
power and p-hacking, so we can distinguish genuine effects from false
positives.