Month: April 2016

The replicability of psychological research is surprisingly low. Why? In this blog post I present new evidence showing that questionable research practices contributed to failures to replicate psychological effects.

Quick recap. A recent publication in Science claims that only around 40% of psychological findings are replicable, based on 100 replication attempts in the Reproducibility Project: Psychology (Open Science Collaboration, 2015). A few months later, a critical commentary in the same journal made all sorts of claims, including that the surprisingly low 40% replication success rate is due to replications having been unfaithful to the original studies’ methods (Gilbert et al., 2016). A little while later, I published an article in Psychonomic Bulletin & Review re-analysing the data from the 100 replication teams (Kunert, 2016). I found evidence that questionable research practices, rather than replications being unfaithful to the original methods, are at the heart of failures to replicate.

However, my previous re-analysis depended on replication teams having done good work. In this blog post I will show that even when just looking at the original studies in the Reproducibility Project: Psychology one cannot fail to notice that questionable research practices were employed by the original discoverers of the effects which often failed to replicate. The re-analysis I will present here is based on the caliper test introduced by Gerber and colleagues (Gerber & Malhotra, 2008; Gerber et al., 2010).

The idea of the caliper test is simple. The research community has decided that an entirely arbitrary threshold of p = 0.05 distinguishes between effects which might just be due to chance (p > 0.05) and effects which are more likely due to something other than chance (p < 0.05). If researchers want to game the system they slightly rig their methods and analyses to push their p-values just below the arbitrary border between ‘statistical fluke’ and ‘interesting effect’. Alternatively, they just don’t publish anything which came up p > 0.05. Such behaviour should lead to an unlikely number of p-values just below 0.05 compared to just above 0.05.
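In code, the counting step might look like the sketch below. It works on z-values (z = 1.96 corresponds to a two-sided p = 0.05) and assumes the caliper is a window of a given percentage of the threshold on either side. The function name, the default width and the example values are all made up for illustration; this is Python, not the R code used for the actual analyses.

```python
def caliper_counts(z_values, width=0.10, threshold=1.96):
    """Count z-values just over and just below the significance threshold.

    The caliper spans threshold * width on each side of the threshold,
    e.g. 1.764 < z < 1.96 versus 1.96 < z < 2.156 for width = 0.10.
    """
    half = threshold * width
    over = sum(1 for z in z_values if threshold < z < threshold + half)
    below = sum(1 for z in z_values if threshold - half < z < threshold)
    return over, below


# Hypothetical z-values for illustration only (not the project's data).
example_z = [1.50, 1.80, 1.90, 1.97, 2.00, 2.05, 2.10, 2.60, 3.20]
over, below = caliper_counts(example_z, width=0.10)
print(f"just over: {over}, just below: {below}")
```

If nobody nudges results across the border, the two counts should be roughly similar. A large surplus just over the threshold is what the caliper test flags.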

The figure below shows the data of the Reproducibility Project: Psychology. On the horizontal axis I plot z-values which are related to p-values. The higher the z-value the lower the p-value. On the vertical axis I just show how many z-values I found in each range. The dashed vertical line is the arbitrary threshold between p < .05 (significant effects on the right) and p > .05 (non-significant effects on the left).
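To make the relation between z-values and p-values concrete, here is a small sketch. It assumes the z-values are standard normal deviates and that p-values are two-sided, which is what the 1.96 threshold implies; the function name is my own.

```python
import math


def two_sided_p(z):
    """Two-sided p-value for a standard normal z-value."""
    # erfc(|z| / sqrt(2)) is the probability mass in both tails beyond |z|.
    return math.erfc(abs(z) / math.sqrt(2))


for z in (1.0, 1.96, 3.0):
    print(f"z = {z:.2f} -> p = {two_sided_p(z):.4f}")
```

A z-value of 1.96 lands almost exactly on p = 0.05, and larger z-values give smaller p-values.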

The independent replications in blue show many z-values left of the dashed line, i.e. replication attempts which were unsuccessful. Otherwise the blue distribution is relatively smooth. There is certainly nothing fishy going on around the arbitrary p = 0.05 threshold. The blue curve looks very much like what I would expect psychological research to be if questionable research practices did not exist.

However, the story is completely different for the green distribution representing the original effects. Just right of the arbitrary p = 0.05 threshold there is a surprising clustering of z-values. It’s as if the human mind magically leads to effects which are just about significant rather than just about not significant. This bump immediately to the right of the dashed line is a clear sign that original authors used questionable research practices. This behaviour renders psychological research unreplicable.

For the expert reader, the formal analysis of the caliper test is shown in the table below using both a Bayesian analysis and a classical frequentist analysis. The conclusion is clear. There is no strong evidence for replication studies failing the caliper test, indicating that questionable research practices were probably not employed. The original studies do not pass the caliper test, indicating that questionable research practices were employed.

| Caliper | Studies | Over caliper (significant) | Below caliper (non-sign.) | Binomial test | Bayesian proportion test | Posterior median [95% credible interval]¹ |
|---|---|---|---|---|---|---|
| 10% (1.76 < z < 1.96 versus 1.96 < z < 2.16) | Original | 9 | 4 | p = 0.267 | BF10 = 1.09 | 0.53 [-0.36; 1.55] |
| 10% (1.76 < z < 1.96 versus 1.96 < z < 2.16) | Replication | 3 | 2 | p = 1 | BF01 = 1.30 | 0.18 [-1.00; 1.45] |
| 15% (1.67 < z < 1.96 versus 1.96 < z < 2.25) | Original | 17 | 4 | p = 0.007 | BF10 = 12.9 | 1.07 [0.24; 2.08] |
| 15% (1.67 < z < 1.96 versus 1.96 < z < 2.25) | Replication | 4 | 5 | p = 1 | BF01 = 1.54 | -0.13 [-1.18; 0.87] |
| 20% (1.57 < z < 1.96 versus 1.96 < z < 2.35) | Original | 29 | 4 | p < 0.001 | BF10 = 2813 | 1.59 [0.79; 2.58] |
| 20% (1.57 < z < 1.96 versus 1.96 < z < 2.35) | Replication | 5 | 5 | p = 1 | BF01 = 1.64 | 0.00 [-0.99; 0.98] |

¹ Based on 100,000 draws from the posterior distribution of log odds.
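The frequentist column can be checked by hand from the counts in the table. The sketch below assumes an exact two-sided binomial test against a 50/50 split between "just over" and "just below" the threshold. It is written in Python rather than the R used for the original analyses, and the function name is my own. It does not attempt the Bayesian columns, since the prior behind them is not stated here.

```python
from math import comb


def binomial_two_sided_p(over, below):
    """Exact two-sided binomial test against a 50/50 split.

    Sums the probability of every outcome that is no more likely
    than the observed one.
    """
    n = over + below
    probs = [comb(n, k) / 2 ** n for k in range(n + 1)]
    observed = probs[over]
    # Small relative tolerance so that ties survive floating-point rounding.
    total = sum(p for p in probs if p <= observed * (1 + 1e-9))
    return min(1.0, total)


# (over caliper, below caliper) counts taken from the table.
counts = {
    "10% original": (9, 4),
    "10% replication": (3, 2),
    "15% original": (17, 4),
    "15% replication": (4, 5),
    "20% original": (29, 4),
    "20% replication": (5, 5),
}
for label, (over, below) in counts.items():
    print(f"{label}: p = {binomial_two_sided_p(over, below):.3f}")
```

With these counts the original studies come out at roughly p = 0.267, p = 0.007 and p < 0.001 for the three caliper widths, while every replication row gives p = 1.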

As far as I know, this is the first analysis showing that data from the original studies of the Reproducibility Project: Psychology point to questionable research practices [I have since been made aware of others, see this comment below]. Instead of sloppy science on the part of independent replication teams, this analysis rather points to original investigators employing questionable research practices. This alone could explain the surprisingly low replication rates in psychology.

Psychology failing the caliper test is by no means a new insight. Huge text-mining analyses have shown that psychology as a whole tends to fail the caliper test (Kühberger et al., 2013; Head et al., 2015). The analysis I have presented here links this result to replicability. If a research field employs questionable research practices (as indicated by the caliper test) then it can no longer claim to deliver insights which stand the replication test (as indicated by the Reproducibility Project: Psychology).

It is time to get rid of questionable research practices. There are enough ideas for how to do so (e.g., Asendorpf et al., 2013; Ioannidis, Munafò, Fusar-Poli, Nosek, & Lakens, 2014). The Reproducibility Project: Psychology shows why there is no time to waste: it is currently very difficult to distinguish an interesting psychological effect from a statistical fluke. I doubt that this state of affairs is what psychological researchers get paid for.

PS: full R-code for recreating all analyses and figures is posted below. If you find mistakes please let me know.

PPS: I am indebted to Jelte Wicherts for pointing me to this analysis.

Update 25/4/2016:

I adjusted the text to clarify that the caliper test cannot distinguish between the many different questionable research practices, following a tweet by Daniël Lakens.

The way science is currently funded is very controversial. During the last 6 months I was on a break from my PhD and worked for the organisation funding science in the Netherlands (NWO). These are 10 insights I gained.

1) Belangenverstrengeling

This is the first word I learned when arriving in The Hague. There is an anal obsession with avoiding (any potential for) conflicts of interest (belangenverstrengeling in Dutch). It might not seem a big deal to you, but it is a big deal at NWO.

2) Work ethic

Work e-mails on Sunday evening? Check. Unhealthy deadline obsession? Check. Stories of burn-out diagnoses? Check. In short, I found no evidence for the mythical low work ethic of NWO. My colleagues seemed to be in a perfectly normal, modern, semi-stressful job.

3) Perks

While the career prospects at NWO are somewhat limited, there are some nice perks to working in The Hague including: an affordable, good canteen, free fruit all day, subsidised in-house gym, free massage (unsurprisingly, with a waiting list from hell), free health check … The work atmosphere is, perhaps as a result, quite pleasant.

4) Closed access

Incredible but true, NWO does not have access to the pay-walled research literature it funds. Among other things, I was tasked with checking that research funds were appropriately used. You can imagine that this is challenging if the end-product of science funding (scientific articles) is beyond reach. Given a Herculean push to make all Dutch scientific output open access, this problem will soon be a thing of the past.

5) Peer-review

NWO itself does not generally assess grant proposals in terms of content (except for very small grants). What it does is organise peer-review, very similar to the peer-review of journal articles. My impression is that the peer-review quality is similar if not better at NWO compared to the journals that I have published in. NWO has minimum standards for reviewers and tries to diversify the national/scientific/gender background of the reviewer group assigned to a given grant proposal. I very much doubt that this is the case for most scientific journals.

6) NWO peer-reviewed

NWO itself also applies for funding, usually to national political institutions, businesses, and the EU. Got your grant proposal rejected at NWO? Find comfort in the thought that NWO itself also gets rejected.

7) Funding decisions in the making

In many ways my fears for how it is decided who gets funding were confirmed. Unfortunately, I cannot share more information other than to say: science has a long way to go before focussing rewards on good scientists doing good research.

8) Not funding decisions

I worked on grants which were not tied to some societal challenge, political objective, or business need. The funds I helped distribute are meant to simply facilitate the best science, no matter what that science is (often blue sky research, Vernieuwingsimpuls for people in the know). Approximately 10% of grant proposals receive funding. In other words, bad apples do not get funding. Good apples also do not get funding. Very good apples equally get zero funding. Only outstanding/excellent/superman apples get funding. If you think you are good at what you do, do not apply for grant money through the Vernieuwingsimpuls. It’s a waste of time. If, on the other hand, you haven’t seen someone as excellent as you for a while, then you might stand a chance.

9) Crisis response

Readers of this blog will be well aware that the field of psychology is currently going through something of a revolution related to depressingly low replication rates of influential findings (Open Science Collaboration, 2015; Etz & Vandekerckhove, 2016; Kunert, 2016). To my surprise, NWO wants to play its part in overcoming the replication crisis engulfing science. I arrived at a fortunate moment and presented my view of the problem and potential solutions to NWO. I am glad NWO will set aside money just for replicating findings.

10) No civil servant life for me

Being a junior policy officer at NWO turned out to be more or less the job I thought it would be. It was monotonous, cognitively relaxing, and low on responsibilities. In other words, quite different to doing a PhD. Other PhD students standing at the precipice of a burn-out might also want to consider this as an option to get some breathing space. For me, it was just that, but not more than that.

— — —

This blog post does not represent the views of my former or current employers. NWO did not endorse this blog post. As far as I know, NWO doesn’t even know that this blog post exists.