Sunday, 27 May 2018

Sowing seeds of doubt: how Gilbert et al’s critique of the reproducibility project has played out

In Merchants of Doubt, Erik Conway and Naomi Oreskes describe how raising
doubt can be used as an effective weapon against inconvenient science. On
topics such as the effects of tobacco on health, climate change and causes of
acid rain, it has been possible to delay or curb action to tackle problems by
simply emphasising the lack of scientific consensus. This is always an option,
because science is characterised by uncertainty, and indeed, we move forward by
challenging one another’s findings: only a dead science would have no
disagreements. But those raising concerns wield a two-edged sword: spurious and
discredited criticisms can disrupt scientific progress, especially if the
arguments are complex and technical, because people will be left with
a sense that they cannot trust the findings, even if they don’t fully
understand the matters under debate.

The parallels with Merchants of Doubt occurred to me as I
re-read the critique
by Gilbert et al of the classic paper by the Open Science Collaboration (OSC)
on ‘Estimating
the reproducibility of psychological science’. I was prompted to do so
because we were discussing the OSC paper in a journal club* and inevitably the
question arose as to whether we needed to worry about reproducibility, in the
light of the remarkable claim by Gilbert et al: ‘We show
that OSC's article contains three major statistical errors and, when corrected,
provides no evidence of a replication crisis. Indeed, the evidence is also
consistent with the opposite conclusion -- that the reproducibility of
psychological science is quite high and, in fact, statistically
indistinguishable from 100%.’

The Gilbert et al critique has, in turn, been the subject of
considerable criticism, as well as a response by a
subset of the OSC group. I summarise the main points of contention in Table
1. At times Gilbert et al seem to be making a defeatist argument that we don’t
need to worry because replication in psychology is bound to be poor: something I
have disputed.

But my main focus in this post is simply to consider the
impact of the critique on the reproducibility debate by looking at citations of
the original article and the critique. A quick check on Web of Science found
797 citations of the OSC paper, 67 citations of Gilbert et al, and 33 citations
of the response by Anderson et al.

The next thing I did, admittedly in a very informal fashion,
was to download the details of the articles citing Gilbert et al and code them
according to the content of what they said, as either supporting Gilbert et
al’s view, rejecting the criticism, or being neutral. I discovered I needed a
fourth category for papers where the citation seemed wrong or so vague as to be unclassifiable. I discarded any papers where the relevant information could
not be readily accessed – I can access most journals via Oxford University but
a few were behind paywalls, others were not in English, or did not appear to
cite Gilbert et al. This left 44 citing papers that focused on the commentary
on the OSC study. Nine of these were supportive of Gilbert et al, two noted
problems with their analysis, but 33 were categorised as ‘neutral’, because the
citation read something like this:

“Because of the
current replicability crisis in psychological science (e.g., Open Science
Collaboration, 2015; but see Gilbert, King, Pettigrew, & Wilson, 2016)….”

The strong impression was that the authors of these papers lacked
either the appetite or the ability to engage with the detailed arguments in the
critique, but had a sense that there was a debate and felt that they should
flag this up. That’s when I started to think about Merchants of Doubt: whether intentionally or not, Gilbert et al had created an atmosphere of uncertainty to suggest there is no consensus on whether or not psychology has a reproducibility problem - people are left thinking that it's all very complicated and depends on arguments that are only of interest to statisticians. This makes it easier for those who are reluctant to take action to deal with the issue.

Fortunately, it looks as if Gilbert et al’s critique has
been less successful than might have been expected, given the eminence of the
authors. This may in part be because the arguments in favour of change are
founded not just on demonstrations such as the OSC project, but also on logical
analyses of statistical practices and publication biases that have been known
about for years (see slides 15-20 of my presentation here). Furthermore, as evidenced in the footnotes to Table 1, social
media allows a rapid evaluation of claims and counter-claims that hitherto was
not possible when debate was restricted to and controlled by journals. The publication this
week of three more big replication studies just heaps on further empirical evidence that we have a problem that
needs addressing. Those who are saying ‘nothing to see here, move along’ cannot
retain any credibility.

Table 1

Criticism: ‘many of OSC’s replication studies drew their samples from different populations than the original studies did’

Rejoinder:
· ‘Many’ implies the majority. No attempt to quantify – just gives examples
· Did not show that this feature affected replication rate

Criticism: ‘many of OSC’s replication studies used procedures that differed from the original study’s procedures in substantial ways.’

Rejoinder:
· ‘Many’ implies the majority. No attempt to quantify – just gives examples
· OSC showed that this did not affect replication rate
· Most striking example used by Gilbert et al is given detailed explanation by Nosek (1)

Criticism: ‘How many of their replication studies should we expect to have failed by chance alone? Making this estimate requires having data from multiple replications of the same original study.’
Used data from pairwise comparisons of studies from the Many Labs project to argue a low rate of agreement is to be expected.

Rejoinder:
· Ignores publication bias impact on original studies (2, 3)
· G et al misinterpret confidence intervals (3, 4)
· G et al fail to take sample size/power into account, though this is crucial determinant of confidence interval (3, 4)
· ‘Gilbert et al.’s focus on the CI measure of reproducibility neither addresses nor can account for the facts that the OSC2015 replication effect sizes were about half the size of the original studies on average, and 83% of replications elicited smaller effect sizes than the original studies.’ (2)

Criticism: Results depended on whether original authors endorsed the protocol for the replication: ‘This strongly suggests that the infidelities did not just introduce random error but instead biased the replication studies toward failure.’

Rejoinder:
· Use of term ‘the infidelities’ assumes the only reason for lack of endorsement is departure from original protocol. (2)
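The rejoinder about sample size in Table 1 can be illustrated with a rough sketch of how the width of a confidence interval depends on N. It uses the standard Fisher z approximation for a correlation; the effect size (r = 0.30) and the sample sizes are hypothetical values picked for illustration, not figures from the OSC or Many Labs data, and this is not the analysis used by either side of the debate.

```python
import math


def fisher_ci(r, n, z_crit=1.96):
    """Approximate 95% CI for a Pearson correlation via the Fisher z transform."""
    z = math.atanh(r)                # transform r to the z scale
    se = 1.0 / math.sqrt(n - 3)      # standard error on the z scale
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


# Hypothetical values for illustration only.
r = 0.30
widths = {}
for n in (20, 80, 320):
    lo, hi = fisher_ci(r, n)
    widths[n] = hi - lo
    print(f"n={n:>3}: 95% CI [{lo:.2f}, {hi:.2f}], width {hi - lo:.2f}")
```

With the same observed effect, the interval from the smallest sample is much wider and includes zero, so whether one estimate falls inside another study's interval says as much about sample size as about agreement between the studies.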

*Thanks to the enthusiastic efforts of some of our grad
students, and the support of Reproducible
Research Oxford, we’ve had a series of Reproducibilitea
journal clubs in our department this term. I can recommend this as a great – and relatively cheap and easy – way of
raising awareness of issues around reproducibility in a department: something
that is sorely needed if a recent Twitter survey by Dan Lakens
is anything to go by.

1 comment:

Your "fourth category for papers where the citation seemed wrong or so vague" doesn't surprise me.

I commonly look at the references in the, mostly medical, journal articles I read if the author has written something surprising and usually find the reference cannot justify the article text. One typical example is the statement "hand washing saves lives" which referenced the "Marsden Manual of Nursing" - hardly a primary paper. No idea what the peer reviewers were supposed to be doing.