Recent hand-wringing over failed replications in social psychology is largely pointless, because unsuccessful experiments have no meaningful scientific value.

Because experiments can be undermined by a vast number of practical mistakes, the likeliest explanation for any failed replication will always be that the replicator bungled something along the way […] Targets of failed replications are justifiably upset, particularly given the inadequate basis for replicators’ extraordinary claims.

I applaud Mitchell for taking a stand on this, and I respect his pugnacious style. I also agree with his ‘recommendations for moving forward’. However, I think he badly misses the mark: his piece strikes me as an aggressive defense of a naive position.

Mitchell starts by bravely giving some personal examples of how the simplest errors can wreck the most beautiful scientific plans:

I have, for instance, belatedly realized that a participant was earlier run in a similar pilot version of the experiment and already knew the hypotheses; I’ve inadvertently run analyses on a dozen copies of the same set of fMRI images instead of using different data for each subject; I have written analysis code that incorrectly calculated the time of stimulus onset; and on and on. I might be embarrassed by a full accounting of my errors, except for the fact that I’m in good company – every other scientist I know has experienced the same frequent failings…

This is all too true. But Mitchell then argues that any given null result might merely result from simple mistakes like this. As a result, null findings “have no meaningful evidentiary value”, and should “as a rule not be published” at all.

Whereas the replication movement sees a failure to find a significant effect as evidence that the effect being investigated is non-existent, Mitchell denies this, saying that we have no way of knowing if the null result is genuine or in error: “when an experiment fails, we can only wallow in uncertainty” about what it means. But if we do find an effect, it’s a different story: “we can celebrate that the phenomenon survived these all-too-frequent shortcomings [experimenter errors].”

And here’s the problem. Implicit in Mitchell’s argument is the idea that experimenter error (or what I call ‘silly mistakes’) is a one-way street: errors can make positive results null, but not vice versa.

Unfortunately, this is just not true. Three years ago, I wrote about these kinds of mistakes and recounted my own personal cautionary tale. Mine was a spreadsheet error, one even sillier than the examples Mitchell gave. But in my case the silly mistake created a significant finding, rather than obscuring one.

There are many documented cases of this happening and (scary thought) probably many others that we don’t know about. Yet the existence of these errors is the fatal spanner in the works of Mitchell’s whole case. If positive results can be erroneous too, if errors are (as it were) a neutral force, neither the advocates nor the skeptics of a particular claim can cry ‘experimenter error!’ to silence their opponents.

Mitchell skirts around this contradiction, admitting that positive results can be wrong but saying that

negative evidence can never triumph over positive evidence, [but] we can always bring additional positive evidence to bear on a question… I might assert that the observer is lying, or is herself deceived. I might identify faults in her method and explain how they lead to spurious conclusions.

But in saying this, Mitchell is simply sawing off the branch he stands on. He has just been arguing that we don’t even need to try to provide any evidence to show that negative results are the product of error – we can just assume it. Now he says that only hard evidence should ever convince us that a positive result isn’t true.

Firstly, imagine the paradox this would create if two scientists were to hold different hypotheses, and thus different criteria for ‘positive’ and ‘negative’! Yet even assuming everyone had the same hypotheses, this would only make sense if we believe that errors (almost) always make for null results, in other words that errors are well-behaved and predictable. They’re not.

This guy is apparently living in a world where Popper never existed. I hope we’re misinterpreting his position, because if not, the intellectual mistake that Mitchell is making borders on the absurd. Yes, lately there has been an ill-informed flare-up of scorn against the methods of social science, but *this* is not the way to answer the critics! I will now start the countdown until somebody posts a response to the following effect:

“If somebody could think something so utterly moronic about basic scientific methodology and still become a Harvard neuroscientist, that goes to show that his discipline is completely bereft of all rigor.”

(Of course, it is not. But dumb statements like this sure do make it look worse than it should.)

DS

OMG. Wishful thinking wins!

http://selfawarepatterns.com/ SelfAwarePatterns

Mitchell might have had a point if he had stuck to noting that a failed replication might be because of errors on the replicator’s part, that one failed replication shouldn’t automatically discount the original results. But asserting that failed replications shouldn’t be published, in effect saying that they shouldn’t even be attempted, seems pretty unscientific.

http://blogs.discovermagazine.com/neuroskeptic/ Neuroskeptic

Certainly one null finding isn’t decisive, but then one positive finding isn’t decisive either. Even with a huge sample size and excellent methods, it could be a silly mistake.

http://blogs.discovermagazine.com/neuroskeptic/ Neuroskeptic

Re: the cake analogy, if we assume or somehow know that the recipe is perfect, then we can ascribe any bad cakes to bad cooks.

But if we can’t assume that, then, faced with bad cakes, we’d have to consider the possibility that the recipe is bad, and the photo in the book was unrealistic, or was (by mistake) a photo of a different cake entirely.

Nick Brown

Agreed. But I want to emphasise my second point, which is that we can safely ascribe good cakes to good cooks, because the way in which entropy works in the baking situation is to degrade the quality of the cake, more or less by definition (a great-looking cake is a low-entropy situation, following the input of a lot of energy from outside the system); much the same applies to the silver nitrate/chloride experiment. In comparison, the difference in entropy between two spreadsheets or SPSS output files, one with a positive result and one with a negative result, is negligible (and/or completely stochastic).

http://blogs.discovermagazine.com/neuroskeptic/ Neuroskeptic

Right.

Although one might argue that most errors tend to abolish effects rather than create them, I’m not even sure if that is true.

Few errors have the effect of randomly rearranging data. Even so simple an error as duplicating the same subject and including them twice would (if that subject were above or below the mean for their own group) tend to enhance group differences, not diminish them.
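The duplication point is easy to check with a quick simulation (my own sketch, with made-up numbers: two groups of 20 drawn from the same normal distribution, so there is no true effect, and one randomly chosen subject in group A accidentally run twice):

```python
# Toy simulation: does duplicating one subject tend to inflate or shrink
# the (Welch) t statistic? Both groups are drawn from N(0, 1) -- no true effect.
import random
import statistics

def welch_t(a, b):
    """Welch's t statistic for two independent samples."""
    ma, mb = statistics.mean(a), statistics.mean(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    return (ma - mb) / ((va / len(a) + vb / len(b)) ** 0.5)

random.seed(1)
n_sims = 2000
inflated = 0
for _ in range(n_sims):
    a = [random.gauss(0, 1) for _ in range(20)]
    b = [random.gauss(0, 1) for _ in range(20)]
    t_before = abs(welch_t(a, b))
    a_dup = a + [random.choice(a)]   # the "silly mistake": one subject run twice
    t_after = abs(welch_t(a_dup, b))
    if t_after > t_before:
        inflated += 1

print(f"|t| inflated in {inflated / n_sims:.0%} of simulated duplication errors")
```

In a substantial share of runs the error makes the apparent group difference *bigger*, not smaller – which is the point: this kind of bungle is not a one-way street toward null results.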

theLaplaceDemon

“Although one might argue that most errors tend to abolish effects rather than create them, I’m not even sure if that is true.”

Even if this is true, publication bias may still mean there are a lot of false positives in the literature. If you do ten experiments, and silly errors lead to false negatives in nine of them and to a false positive in the tenth, chances are you file-drawer the first nine and publish #10. That’s still a false positive in the literature that needs to be weeded out via replication.
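A toy file-drawer simulation (my own sketch, not from the comment) makes this concrete: imagine 1000 labs each testing an effect that is truly zero, with only “significant” results leaving the drawer:

```python
# File-drawer sketch: 1000 labs test a null effect; only significant
# results get published. Under the null, the test statistic is z ~ N(0, 1).
import random

random.seed(0)
n_labs = 1000
published = 0
for _ in range(n_labs):
    z = random.gauss(0, 1)      # test statistic when the true effect is zero
    if abs(z) > 1.96:           # "significant" at p < .05, two-sided
        published += 1          # ...so it leaves the file drawer

print(f"{published} of {n_labs} null experiments reach the literature,")
print("and every one of them is a false positive.")
```

Roughly 5% of the null experiments get published, and the published record contains nothing but false positives – which only replication can weed out.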

http://blogs.discovermagazine.com/neuroskeptic/ Neuroskeptic

There’s also the problem of “selective debugging”.

You get a null result, so you take a look at your scripts, fix a bug, it’s still null. But fix another bug, and now there’s a result! So you stop looking for bugs (and don’t spot the bug that caused your positive result.)

M.W. Poe

Also, the way science progresses is by improving the recipe. At the origin of a field the recipe may be “mix flour, water, and yeast, then heat it up”, and follow-up research refines and operationalizes variables. I.e.: “Oh! Rye flour and wheat flour produce different results.” Then eventually those different results are predictable and known, perhaps splitting off a different field (“pastries vs cakes”). But the core assumption is that if you follow the recipe you get the same thing. If you can follow the recipe and get something different, then you aren’t following the scientific method.

Additionally, I take issue with the way he focuses on the results of unskilled cooks attempting to follow a recipe. In reality these are fellow experts in the field. This is not Dr. Mitchell trying to follow a Gordon Ramsay recipe. This is Gordon Ramsay trying to follow an Emeril Lagasse recipe.

http://autap.se Richard

I’ve had a go at some of the arguments that I think are most wrong here: http://autap.se/8

It’s a shame his essay wasn’t posted on a blog with a comment system…

Deborah Mayo

He is falling into the 4th irony I noted in my recent blogpost on the replication movement in psych:

The idea that “you can’t ‘prove’ a negative” is common lore, but incorrect. If we have a test with high power – a high capability to have detected an effect if it is genuine – and yet the effect fails to appear, then we have a strong warrant for its absence (e.g., Uri Geller’s ability to bend spoons).
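The power point is simple to quantify (a minimal sketch, mine rather than Mayo’s, for a one-sample two-sided z-test of an assumed standardized effect of d = 0.5):

```python
# Power of a one-sample two-sided z-test for a standardized effect size d.
# The higher the power, the stronger the warrant for absence when the test
# comes up null.
from statistics import NormalDist

def power_two_sided_z(effect_size, n, alpha=0.05):
    """Probability of rejecting H0 when the true standardized effect is effect_size."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    shift = effect_size * n ** 0.5
    # probability the test statistic lands in either tail of the rejection region
    return (1 - NormalDist().cdf(z_crit - shift)) + NormalDist().cdf(-z_crit - shift)

for n in (10, 50, 200):
    print(f"n={n:3d}: power to detect d=0.5 is {power_two_sided_z(0.5, n):.2f}")
```

With n = 10 the test would quite likely miss a genuine d = 0.5 effect, so a null result there tells you little; with n = 200 the power is essentially 1, and a null result is strong evidence the effect is absent.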

Moreover, as one or two others have noted, the question here is whether there is “positive” evidence for a genuine effect. We only attain this by being able to reliably bring it about (in this case statistically). Until we have done so, we don’t have evidence for a genuine effect.

All that said, I have my own serious caveats regarding the replication movement in psych.

Evidence does not have polarity. If you accept silly language like this you end up twisting yourself into some weird shapes.

Thalia Wheatley

All the evidence you provide for false positives being as likely as false negatives are post-processing spreadsheet errors. I don’t see this as the crux of the asymmetry argument.

http://autap.se Richard

In experiments where there is potentially flexibility in the design and many unidentified variables (i.e. a lot of psychology experiments), I would suggest that there is a high chance that an experiment may not be “bungled”, but that the unidentified variables may be very important. A positive result doesn’t tell you that your hypothesis is likely to be true, it tells you that your data were unlikely to be generated by a null distribution (assuming we’re doing null hypothesis testing), leaving the possibility that all sorts of unidentified variables could have influenced your results. Technically then false positives, given the precise details of the hypothesis, may well be very likely and not just the result of experimental errors.

A failure to replicate should be the first step in collaboration between the original testers and the replicators to agree on a protocol that would isolate the unidentified variables and run an experiment that both groups would be satisfied with – this is unfortunately not standard practice, it would seem.

The argument you provide here, a mirror image of Mitchell’s, is troubling. Mitchell argues that “the likeliest explanation for any failed replication will always be that the replicator bungled something along the way.”

You appear to be arguing that successes are equally likely to be caused by bungling. It is possible that successes can be caused by bungling, but there is not really evidence that it is equally or more likely. If we are to take it on faith that successes are equally likely to be caused by bungling, then pretty much every study could be garbage, right? There is no practicable way of verifying the integrity of the entire experimental process implicit in every published paper. This is another point made by Mitchell: there is not even a good way of writing down the total process of any experiment such that it can be reliably reproduced (though many are loath to admit it, this is a large part of why scientists across fields have a system of apprenticeship, as many sociologists of science have observed).

In any case, focusing on the question of whether it is possible for mistakes to cause false positives, or whether null findings have “no value” or “some value,” misses what I take to be the larger point of Mitchell’s essay, which is that focusing on what has been called “failed” replication by some can have the highly undesirable effect of constraining the possibility space of research. (I would not go as far as Mitchell, but I agree with him that there is grave danger, both for the field and to individuals, and believe that “failed” replications must be handled with great grace.) Mitchell’s example of the field of implicit prejudice is telling. If “direct” replications of early studies of implicit prejudice had “failed” and researchers didn’t pursue further study, science would have suffered a great loss (and no one would have known).

Arguing about whether it is in principle possible for mistakes to cause false positives, or whether it is in principle possible for a null finding to have some value may be fun and interesting—but the point is the health of the field and whether future possibilities are unproductively constrained or usefully expanded.

http://blogs.discovermagazine.com/neuroskeptic/ Neuroskeptic

“If we are to take it on faith that successes are equally likely to be
caused by bungling, then pretty much every study could be garbage,
right?”

Quite right. They could be. It is possible. But I’m not the one saying that we should use that possibility as a license to just ignore results. Mitchell is the one saying that.

Admittedly, he only says this about negative results – but as I have pointed out, he has no good reason to limit himself in that way, because any result might be bungled.

Mitchell may have a point about the dangers of non-replications to the field etc. but this has nothing to do with the question of whether they carry evidentiary value.

beau

If this is the case, then it truly matters whether or not bungling is more or less likely to cause a false positive vs. a false negative. This argument hasn’t yet been cashed out. Mitchell’s argument that there are a great many, let’s say, “bungle-attack-vectors” that can cause study failure even in the presence of a true effect speaks strongly to experience in the lab, and has only been countered by the observation that there are also some bungle-attack-vectors (spreadsheet errors, selective debugging—selective debugging is in fact a scary notion) that can cause false positives.

However, I still think this misses the larger point. It is dangerous to focus near-exclusively on the epistemological point of Mitchell’s at the expense of the ethical one. If you don’t find the extreme version of Mitchell’s argument compelling (i.e., non-replications have no evidentiary value), it is easy to see how a very gently moderated version of the argument is still quite strong, and has deep ethical ramifications for the field.

http://blogs.discovermagazine.com/neuroskeptic/ Neuroskeptic

A debate about the ethics of negative replications is important, but there are clearly two sides to that argument.

On the one hand, giving too much weight to negative results is harmful.

But giving them too little weight is also bad – it leads to cargo cult science and the harm of that is well known (wasted time, money, and careers, the scandal when it all eventually collapses.)

Where to draw the line is a very interesting question, and I don’t claim to know the answer, but I don’t think Mitchell does either – he doesn’t try to. He only provides one side of the equation.

Bob Calin-Jageman

* ” Because experiments can be undermined by a vast number of practical mistakes, the likeliest explanation for any failed replication will always be that the replicator bungled something along the way.” — this is why we use positive and negative controls. Controls provide evidence about experiment-bungling. The judicious use of controls is what enables billions of sales in science “kits” for amplifying DNA, measuring promoter activity, etc–a well-trained scientist can use the kit, check the positive and negative controls, and if these are successful have a reasonable degree of confidence in the results, whatever they are.

* “unsuccessful experiments have no meaningful scientific value” – this is factually wrong. Look at any meta-analysis in the clinical trials literature. Even where an effect is strong, some trials still fail to observe it and thus have ‘negative’ results. These still have value, however–they still refine the measurement of the true effect size.

* “I claim that some non-white swans exist” – yes, there is an asymmetry of evidence for these types of claims. But these are not the types of claims scientists make. They make claims like “Power improves performance” or “The hippocampus processes memories”. With claims of principle like this, the asymmetry is opposite to that described by Mitchell. The scientist making the claim must show it holds in a variety of contexts/settings – a truly generalizable scientific claim. Any quality negative result suggests ways in which the principle must be refined/qualified/moderated. Not only does this have value, it is the essential difference between real science and pseudoscience. Pseudoscience operates without boundary conditions (much like modern-day MRI); real science is interested in refining theory by specifying when/how/and to what degree different effects will be observed.

* The example given about the hippocampus is an argument in favor of negative results being publishable and useful. If it hadn’t been for some failed attempts to get memory tasks to light up the hippocampus, scientists would have stuck to a broad notion of hippocampal involvement in memory. We now know, however, that the hippocampus is important for only certain types of memory – a refinement of theory that wouldn’t have happened without publishing negative results.
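The meta-analysis point above is easy to sketch numerically (hypothetical effect sizes and standard errors, assumed purely for illustration): a fixed-effect inverse-variance pool shows that adding a “failed” trial both shifts and tightens the estimate of the true effect.

```python
# Minimal fixed-effect (inverse-variance) meta-analysis sketch.
# Each study is an (effect_estimate, standard_error) pair; made-up numbers.
def pool(estimates):
    """Inverse-variance pooled effect and pooled standard error."""
    weights = [1 / se ** 2 for _, se in estimates]
    pooled = sum(w * e for (e, _), w in zip(estimates, weights)) / sum(weights)
    pooled_se = (1 / sum(weights)) ** 0.5
    return pooled, pooled_se

positives = [(0.45, 0.15), (0.38, 0.20)]   # two "significant" trials
with_null = positives + [(0.05, 0.18)]     # plus one non-significant trial

print("positives only :", pool(positives))
print("with null trial:", pool(with_null))
```

Including the null trial shrinks the pooled effect toward its likely true value and reduces the pooled standard error – exactly the sense in which a negative result still refines the measurement.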

Who values replications? Those interested in the cumulative knowledge gained by science, and in refining theories until they precisely capture reality. Those who don’t value replications are playing some other game entirely.

beau

In the field of science there is widespread concern about the risks of false positives, but false negatives also carry profound risks. I think in your response you are under-weighting those risks. Characterizing this as an argument about the fate of “silly priming effects to do with messy rooms” underscores the point – it is entirely possible that the study of implicit prejudice could have been dismissed too soon using this rationale. It is not simply a matter of the resources consumed during the research process; it’s a big-picture question of what researchers are or are not able to further investigate. This is a large-scale social problem with no simple quantitative solution and real consequences.

One point of Mitchell’s essay is that the process of finding something that is “interesting, true, useful, and reliable” can be very, very easily knocked off course by a null, perhaps “failed,” replication that looks strong but is in fact a false negative produced by a hidden bungle. That the relative probabilities of different kinds of bungles are very difficult or impossible to determine matters very much, because it means we can’t just sit down and crunch the numbers about this. This cannot be a purely epistemological discussion. Many of the commentators on this topic have been trying to push the discussion in as quantitative a direction as possible because it simplifies the issue. But the epistemological issue and the ethical issue are linked; there is no way to make this simple and reducible.

(I would allow for a more complicated epistemological picture than Mitchell; if there is a huge flaw in his essay it’s that it offers extreme statements that people can attack without addressing any of the other arguments or the relations between them.)

Nick Brown

If a theoretical effect is so important to humanity that we must avoid false negatives, the original study should either be sufficiently powered so as to make that risk negligible (e.g., .999 power) — that is, humanity should put up the money to find out — or else the study should be reported as “preliminary” and “exploratory”, thus positively *inviting* replications.

I don’t think there are too many cases like that in the contemporary literature. I do, however, see plenty of N<=50 studies whose discussion sections and (especially) subsequent press releases and articles in the Economist, NY Times, Daily Mail, etc claim to have discovered something fundamental about how "people" (sic; not "psychology majors aged 19-22 at a medium-sized Mid-West US university") function. Yet, when someone fails to reproduce that "fundamental finding" with the exact same materials using psychology majors aged 19-22 at a California university, suddenly it's because the effect is incredibly fragile (despite having been universal, fundamental, and often very strong in the original write-up). I see no fundamental difference between that process and Uri Geller's claim that God has given him psychokinetic abilities, which then turn out only to be useful for bending spoons (and only when Geller himself provides the spoon).

beau

Like a lot of the discussion about replication, the argument presented here is hyperbolic and leaves a large excluded middle.

The worry is not that we will miss any single theoretical effect that will be vital to humanity because a study will be underpowered, and the solution is not adding more power. The worry is that determining how effects work requires inter-study and (perhaps) inter-researcher cooperation, and that declaring a direction of research “failed” will close the door on any number of research questions with unknowable value. Adding power helps fix the quantitative part of the problem, but doesn’t necessarily help fix the social part.

Qualitative evaluation of multiple studies is necessary to evaluate what makes an effect work or not work under different circumstances—i.e., what the processes of nature are that create an effect, not just how we can model it using statistics or algorithms (“fragile” I agree is maybe not a great word). But we can’t even get to the point of having multiple studies with different views of an effect to evaluate if our standard is .999 power or get out.

http://blogs.discovermagazine.com/neuroskeptic/ Neuroskeptic

“The worry is that declaring a direction
of research “failed” will close the door on any number of research
questions with unknowable value.”

OK, but can’t we encourage people to do replications but discourage people from drawing premature conclusions, either positive or negative, from individual replications?

beau

Yes. This is very hard to do, apparently!

The way to do it probably cannot just be by making comments in discussion sections. We will probably need to change the way research is conducted and communicated to the public in order to convincingly suggest the ideas are still forming and not yet finished.

Mitchell makes some brief suggestions (e.g., avoid “direct” replications, because they can never be truly “direct”) but they are probably not enough.

d1onys0s

Neuroscience is notoriously hard to replicate because there is so much software and human interpretation (bias) affecting the perceived results. Neurobabblers are among the worst in interpretive “science”, far exceeding the errors of Psychology, since they believe they have objective measures.

Even without p-value hacking or data-fishing, the original finding could be correct, or a false positive (perhaps enhanced by repeating the experiment with minor variations until it succeeded and not reporting all the failures because they were not fully verified or followed up – an all-too-common situation), or there could be a methodological error. In all cases, Mitchell is of course completely wrong that a failed replication has no value. It suggests either that the recipe is really sensitive (assuming it was performed correctly after corresponding with the original authors to clarify the Methods), or that one of the two errors above occurred. An adequately powered and performed non-replication could (and should) be the stimulus for a fresh replication by the original lab. After that, whether it is followed up depends on the importance of the question and the available resources (it could take a long time: http://scholarlykitchen.sspnet.org/2014/03/26/reproducible-research-a-cautionary-tale/).

Suresh Krishna

And the part about “The asymmetry between positive and negative evidence” (response to rejoinder #3) confuses logical reasoning (in a noise-free situation) with statistical inference (in the presence of noise). Asking critics to provide methodological or conceptual explanations for false positives (and not just fail to replicate the result) ignores the possibility that the original result could emerge by chance from noisy data. It is also curious that Mitchell says critics can never know enough about the experimental procedure to replicate it, yet they should still be asked to find the flaw in the procedure if they think the result is erroneous.


About Neuroskeptic

Neuroskeptic is a British neuroscientist who takes a skeptical look at his own field, and beyond. His blog offers a look at the latest developments in neuroscience, psychiatry and psychology through a critical lens.