Why Gilbert et al. are missing the point

This (hopefully brief) post is yet again related to the replicability debate I discussed in my previous post. I just read a response by Gilbert et al. to the blog comments about their reply to the reply to the reply to the (in my view, misnamed) Reproducibility Project Psychology. I won’t go over all of this again. I also won’t discuss the minutiae of the statistical issues, as many others have already done so and will no doubt do so again. I just want to say briefly why I believe they are missing the point:

The main argument put forth by Gilbert et al. is that there is no evidence for a replicability crisis in psychology and that the “conclusions” of the RPP are thus unfounded. I don’t think that the RPP ever claimed anything of the kind one way or the other (in fact, I was impressed by the modesty of the claims made by the RPP study when I read it) but I’ll leave that aside. I appreciate what Gilbert et al. are trying to do. I have myself frequently argued a contrarian position in these discussions (albeit not always entirely seriously). I am trying to view this whole debate the same way any scientist should: by evaluating the evidence without any investment in the answer. For that reason, the debate they have raised seems worthwhile. They tried to estimate a baseline level of replicability one could expect from psychology studies. I don’t think they’ve done it correctly (for statistical reasons) but I appreciate that they are talking about this. This is certainly what we would want to do in any other situation.

Unfortunately, it isn’t that simple. Even if there were no problems with publication bias, analytical flexibility, and low statistical power (and we can probably agree that this is not a tenable assumption), it wouldn’t be straightforward to estimate how many psychology studies should replicate by chance. To know this you would need to know how many of the tested hypotheses are true, and we usually don’t. As Einstein said – or at least the internet tells me he did: “If we knew what it was we were doing, it would not be called research, would it?”
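To see why the unknown base rate matters, here is a toy calculation (my own sketch, not anything from the RPP): suppose originals are published only when significant, and that all studies share the same power and significance threshold. The expected replication rate then swings widely depending on the proportion of true hypotheses being tested – a quantity nobody knows.

```python
def expected_replication_rate(base_rate, power, alpha=0.05):
    """Toy model: expected fraction of published (i.e. significant)
    findings whose replications also come out significant, assuming
    identical power and alpha in original and replication studies."""
    # Probability that an original study is significant at all
    p_sig = base_rate * power + (1 - base_rate) * alpha
    # Of the published findings, how many reflect true hypotheses?
    p_true_given_sig = base_rate * power / p_sig
    # True effects replicate with probability = power,
    # false positives replicate with probability = alpha
    return p_true_given_sig * power + (1 - p_true_given_sig) * alpha

print(round(expected_replication_rate(0.5, 0.8), 2))  # ≈ 0.76
print(round(expected_replication_rate(0.1, 0.8), 2))  # ≈ 0.53
```

With identical methods and 80% power throughout, the expected replication rate still drops from roughly three quarters to barely half as the base rate of true hypotheses falls from 50% to 10% – which is why there is no obvious baseline against which to declare a “crisis.”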

One of the main points they brought up is that some of the replications in the RPP may have used inappropriate procedures to test the original hypotheses – I agree this is a valid concern but it also completely invalidates the argument they are trying to make. Instead of quibbling about what measure of replication rates is evidence for a “crisis” (a completely subjective judgement) let’s look at the data:

This scatter graph from the RPP plots effect sizes in the replications against the originally reported ones. Green points (referred to as “blue” by the presumably colour-blind art editors) are replications that turned out to be significant; red points are those that were not significant and thus “failed to replicate.” The separation of the two data clouds is fairly obvious: significant replication effects have a clear linear relationship with the original ones, while non-significant ones are uncorrelated with the original effect sizes.

We can argue until the cows come home about what this means. The red points are presumably at least in large part false positives. Yes, of course some – perhaps many – may be due to methodological differences or hidden moderators, etc. There is no way to quantify this reliably. Conversely, a lot of the green dots probably don’t tell us about any cosmic truths: while they replicate wonderfully, they may just be replicating the same errors and artifacts. All of these arguments are undoubtedly valid.

But that’s not the point. When we test the reliability of something we should aim for high fidelity. Of course, perfect reliability is impossible so there must be some scatter around the identity line. We also know that there will always be false positives so there should be some data points scattering around the x-axis. But do you honestly think it should be as many as in that scatter graph? Even if these are not all false positives in the original but rather false negatives in the replication, for instance because the replicators did a poor job or there were unknown factors we don’t yet understand, this ratio of green to red dots is not very encouraging.

Replicability encompasses all of the aforementioned explanations. When I read a scientific finding I don’t expect it to be “true.” Even if the underlying effects are real, the explanation for them can be utterly wrong. But we should expect a level of replicability from a field of research that at least maximises the trustworthiness of the reported findings. Any which way you look at it, this scatter graph is unsettling: if two thirds of the dots are red because of low statistical power and publication bias in the original studies, this is a major problem. But if they are red because the replications are somehow defective, that isn’t exactly a great argument either. What this shows is that the way psychology studies are currently done does not permit very reliable replication. Either way, if you give me a psychology study, I should probably bet against it replicating. Does anyone think that’s an acceptable state of affairs?

I am sure both of these issues play a role, but the encouraging thing is that it is probably the former – false positives – that dominates after all. In my opinion the best way anyone has looked at the RPP data so far is Alex Etz’s Bayesian reanalysis. It suggests that one of the main reasons the replicability in the RPP is so underwhelming is that the evidence for the original effects was weak to begin with. This speaks for false positives (due to low power, publication bias, and QRPs) and against unknown moderators being behind most of the replication failures. Believe it or not, this is actually a good thing – because the former problem is much easier to address than the latter.
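Etz’s reanalysis uses proper Bayesian model comparison; the sketch below is not his method but a much cruder stand-in, Wagenmakers’ BIC approximation to the Bayes factor for a one-sample t-test. Even this rough version illustrates the underlying point: a result that just scrapes under p < .05 typically carries only feeble evidence for the alternative hypothesis.

```python
import math

def bic_bayes_factor_10(t, n):
    """BIC approximation to the Bayes factor favouring the alternative
    over the null for a one-sample t-test with n observations:
    BF01 = sqrt(n) * (1 + t^2/df)^(-n/2), returned here as BF10."""
    df = n - 1
    bf01 = math.sqrt(n) * (1 + t * t / df) ** (-n / 2)
    return 1 / bf01

# A just-significant result (t(29) = 2.1, p ≈ .045) yields only
# BF10 of about 1.5 - barely favouring the alternative at all
print(round(bic_bayes_factor_10(2.1, 30), 2))  # ≈ 1.53
```

If many of the original RPP effects were of this just-significant kind, low replication rates are exactly what a Bayesian would have predicted before a single replication was run.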