John Bargh, a psychologist at Yale University, has published a scathing attack on a paper that failed to replicate one of his most famous studies. His post, written on his own blog at Psychology Today, mixes critiques of the science in the paper with personal attacks against the researchers, against PLOS ONE, the journal that published it, and against me, for covering it. I’m going to take a closer look at Bargh’s many objections.

John Bargh and his colleagues found that infusing people’s minds with the concept of age could slow their movements (PDF). The volunteers in the study had to create sentences from groups of scrambled words. When these words related to being old, the volunteers walked more slowly when they left the laboratory. They apparently didn’t notice anything untoward about the words, but their behaviour changed nonetheless.

Surprisingly, this prominent result has seldom been replicated. There have been two attempts, but neither stuck closely to the original experiment. This prompted Stephane Doyen and colleagues to try to repeat Bargh’s study. They tried to match the original set-up, but they made some tweaks: they timed volunteers with infrared sensors rather than a stopwatch; they doubled the number of volunteers; and they recruited four experimenters who carried out the study, but didn’t know what the point of it was. As I wrote:

This time, the priming words had no impact on the volunteers’ walking speed. They left the test room neither more slowly nor more quickly than when they arrived. Doyen suspected that Bargh’s research team could have unwittingly told their volunteers how they were meant to behave… Perhaps they themselves moved more slowly if they expected the volunteer to do so. Maybe they spoke more languidly, or shook hands more leisurely… Maybe they were responsible for creating the very behaviour they expected to see.

To test that idea, Doyen repeated his experiment with 50 fresh volunteers and 10 fresh experimenters. The experimenters always stuck to the same script, but they knew whether each volunteer had been primed or not. Doyen told half of them that people would walk more slowly thanks to the power of priming, but he told the other half to expect faster walks.

…He found that the volunteers moved more slowly only when they were tested by experimenters who expected them to move slowly… Let that sink in: the only way Doyen could repeat Bargh’s results was to deliberately tell the experimenters to expect those results.

Was this possible? In Bargh’s study, an experimenter had packed envelopes with one of two different word tasks (either elderly-related or neutral words). When each volunteer arrived, the experimenter chose an envelope at random, led the volunteer into a test room, briefed them, and then left them to finish the task.

Doyen thinks that, during this time, the experimenter could have seen which set of tests the volunteer received, and tuned their behaviour accordingly. This was not a deliberate act of manipulation, but could easily have been an unconscious one. He wrote, “This possibility was in fact confirmed informally in our own study, as we found that it was very easy, even unintentionally, to discover the condition in which a particular participant takes part by giving a simple glimpse to the priming material.”

In his new post, Bargh dismisses Doyen’s experiments on two technical points, and on several more personal ones. Let’s consider each in turn.

Bargh’s objections – blinding

First, he says that “there is no possible way” that the experimenter in his study could have primed the volunteers with his own expectations. He says that the experimenter “was blind to the study hypotheses” (meaning that he didn’t know what the point of the experiment was). Bargh adds, “The person who had actual contact with the participants in the elderly priming study never saw the priming manipulation… and certainly did not know whether the participant was in the elderly priming or the control condition.”

Could the experimenter have known what the experiment was about, even though Bargh asserts that they were blind? In the comments section of Bargh’s post, another psychologist, Matt Craddock, notes that the experimenter was also responsible for pre-packaging the various tasks in their packets, and so had ample time to study the materials. (This is the first of several inconsistencies in Bargh’s interpretation of his own study – more on that later.)

Could the experimenter have primed the volunteers? It’s not clear. This hinges on what actually happened in the test room, and we only have Bargh’s word on this. There is precious little in the way of description in the actual paper (here it is as a PDF; let me know if I’ve missed something). As such, the paper does not seem to be at odds with Doyen’s vision of what happened, although it does not provide evidence for it either.

Bargh’s objections – differences between the two studies

Bargh’s second objection (in many parts) is that Doyen’s study had differences from his own, which would have eliminated the elderly-priming effect. However, in all of these cases, Craddock and other commenters have pointed out inaccuracies in his statements.

For example, he says that after the test, Doyen instructed his volunteers to “go straight down the hall when leaving” (his quotes), while he “let the participant leave in the most natural way”. This is important because drawing someone’s attention to an automatic process tends to eliminate that effect. But Doyen did nothing of the sort, and his paper nowhere contains the words that Bargh quoted. Instead, Doyen wrote, “Participants were clearly directed to the end of the corridor”. It is not clear how this differs from Bargh’s own study, where “the experimenter told the participant that the elevator was down the hall”.

Bargh also says that Doyen used too many age-related words in his word task. The volunteers might have noticed, cancelling out the effect of the priming. But this contradicts what Bargh says in his own methods paper, where he says that if there are too many primes, volunteers would be more likely to perform as expected. By that reasoning, Doyen’s volunteers should have shown an even stronger effect.

Bargh says that priming depends on there being something to prime. Volunteers would only walk more slowly if they associated old age with infirmity. He says, “Doyen et al. apparently did not check to make sure their participants possessed the same stereotype of the elderly as our participants did.” However, neither did Bargh. His original study says nothing about assessing stereotypes. [Update: actually, I note that Doyen et al chose their priming words by using the most common answers in an online student survey where people reported adjectives related to old age; that’s at least a tangential way of assessing stereotypes.]

“To adapt the items, we conducted an online survey (80 participants) in which participants had to report 10 adjectives related to the concept of old age. Only the most frequent responses were used as replacement words.” (i.e., as primes)

Bargh says that Doyen used the same experimenter who administered the test to time how slowly the volunteers walked down the hall. This is also false – they used infrared sensors.

What do Doyen’s team have to say about Bargh’s criticisms? They support Craddock’s analysis. And one of the authors, Axel Cleeremans, says:

“The fact is that we failed to replicate this experiment, despite having twice as many participants and using objective timing methods. Regardless of the arguments one may come up with that explain why his study worked and ours did not, this suggests that unconscious behavioural priming is not as strong as it is cast to be. If the effect were truly robust, it shouldn’t depend on minute differences. The fact that we did manage to replicate the original results when both experimenters and participants were appropriately primed suggests interesting avenues for further research and should be taken as an opportunity to better delineate the conditions under which the effect is observed.”

Bargh’s objections – er, the other stuff

As stated before, Bargh also directs personal attacks at the authors of the paper (“incompetent or ill-informed”), at PLoS (“does not receive the usual high scientific journal standards of peer-review scrutiny”), and at me (“superficial online science journalism”). The entire post is entitled “Nothing in their heads”.

Yes, well.

I’ve dealt with the scientific aspects of the critique; I think we’re all a bit too old to respond to playground tactics with further puerility. The authors certainly aren’t rising to it. In an email to me, Doyen wrote, “This entire discussion should be about the reasons that best explain the differences between his findings and ours, but has somehow turned into something else that unhelpfully confuses personal attacks with scientific disagreement as well as scientific integrity with publishing politics.” And PLoS publisher Peter Binfield has already corrected Bargh’s “several factual errors” about their journals.

For my part, I’m always happy to correct myself when I’ve screwed up in my reporting. Here, I believe I did my due diligence. Contrary to accusations at the time, I read both the Bargh and Doyen papers. I contacted other psychologists for their view, and none of them spotted egregious technical flaws. More importantly, I sent the paper to Bargh five days before the embargo lifted and asked for a comment. He said, “There are many reasons for a study not to work, and as I had no control over your [sic] attempt, there’s not much I can say.” The two-page piece he has now posted would seem to falsify that statement.

After some reflection, I largely stand by what I wrote. I can’t see much in the original study or in Bargh’s critique that would have caused me to decide not to cover it, or to radically change my approach. There is one thing, though. Someone (on Twitter; sorry, I can’t find the link) noted that a single failure to replicate doesn’t invalidate the original finding, and this is certainly true. That’s something I could have made more explicit in the original post, maybe somewhere in the fourth paragraph. Mea culpa.

There is a wider issue here. A lack of replication is a large problem in psychology (and arguably in science, full stop). Without it, science has lost a limb. Results need to be checked, and they gain strength through repetition. On the other hand, if someone cannot repeat another person’s experiments, that raises some serious question marks.

Scientists get criticised for not carrying out enough replications – there is little glory, after all, in merely retreading old ground rather than breaking new ground. Science journals get criticised for not publishing these attempts. Science journalists get criticised for not covering them. This is partly why I covered Doyen’s study in the first place.

In light of this “file drawer problem”, you might have thought that replication attempts would be welcome. Instead, we get an aggressive and frequently ill-founded attack at everyone involved in such an attempt. Daniel Simons, another noted psychologist, says, “[Bargh’s] post is a case study of what NOT to do when someone fails to replicate one of your findings.”

Others have suggested that the Bargh study has many positive replications, but this is in question. In his post, Bargh speaks of “dozens if not hundreds of other conceptual replications”. He says that the “stereotype priming of behavior effect has been widely replicated”, and cites the well-established “stereotype threat” effect (which I have also written about). He implores responsible scientists and science journalists to not “rush to judgment and make claims that the entire phenomenon in question is illusory”.

I’m not sure which scientists or science journalists he is referring to. Neither Doyen nor I implied that the entire concept of priming was illusory. I specifically said the opposite, and quoted two other psychologists who did the same. The issue at stake is whether Bargh’s results from that one specific experiment could be replicated. They could not.

If there’s an element to this farrago that heartens me, it’s that the comments in Bargh’s piece allowed various parties to set the record straight. In concluding his piece, Bargh says, “I’m worried about your ability to trust supposedly reputable online media sources for accurate information on psychological science.” Well, dear professor, this is the era of post-publication peer review. I’m not that worried.

Honestly, I feel the very thing Bargh criticized when he attacked you and your lack of credentials is a boon to science. Finally there is someone outside of the old boys’ network who polices the results of the peer review process with all its biases and problems. Quis custodiet ipsos custodes? Ed Yong on a good day. Rock on!

It’s quite sad that a distinguished scientist resorts to personal attacks to cover potential flaws in his work…
I think you’ve hit on a serious issue with the lack of replication in science. Researchers at all levels, especially PhDs and post-docs (i.e. when most of the work is actually done!) are always encouraged to come up with novel ideas, novel methods, novel results… This is good to some extent, but I agree with you that we’re going way too far in that direction. Especially for empirical, “not exactly rocket science” fields dealing with complex systems, like psychology, ecology, ethology etc. Sadly, nowadays it’s really difficult to get funding and publications to repeat “stuff that’s already been done”. As a result, we get gazillions of novel but, on their own, useless findings.

Bargh’s responses seem to be factually incorrect, based on faulty assumptions, misunderstandings, and slanted by preconceived biases. That would be a red flag as it doesn’t speak well for his ability to put aside these critical thinking shortcomings when it comes to setting up, running and interpreting results from an experiment.

As you noted though, a failure to replicate an experiment doesn’t invalidate it. Bargh will probably take the critiques and do a new experiment. He can then validate his first experiment, or be among the first to invalidate it. You don’t lose face or respect when you change your mind due to evidence, but you certainly lose the reputation as a good researcher if you try and defend your experiment against mounting evidence against it. I’m sure we’ll be seeing new evidence from Bargh this year.

As a very general aside, the use of stop-watches by biology/ecology researchers to time various events (e.g. the amount of time critters like fish or planaria spend in different treated locations) is considered a bit suspect due to the recognized biases/errors that affect such measurements. Instead, filming and/or sensors are used (often or even most of the time anyway – this is even encouraged in a second-year ecology class at the local university during their animal behaviour labs).

Researchers at all levels, especially PhDs and post-docs (i.e. when most of the work is actually done!) are always encouraged to come up with novel ideas, novel methods, novel results… This is good to some extent, but I agree with you that we’re going way too far in that direction.

Agreed. Nature in the Feb 16th issue had an article on how to apply for grants. In it they said don’t follow up on your supervisor’s research. Go your own direction to show you can think independently.

Little replication is done in many areas. I keep running across old ecology studies that are “gospel”, but finding surprisingly little follow-up work or replication done of them. I sometimes wonder if our ecology knowledge base is sound enough to continue to build theories upon it. My thesis was based upon answering questions raised by a project that had been done once and not replicated (technically it was based upon over 30 years of data so a pattern had emerged so it’s not really as iffy as it sounds–still, I wonder if similar patterns would emerge if it was done all over again or in a different region).

One very prominent ecologist who used to teach at University of British Columbia (among other places) has said the field of ecology is decades behind the times and hasn’t advanced very much in the last 50 years…he’s been working in it for over 50 years himself. I know he’d agree there’s not enough replication done in ecology especially in regards to field research.

Mr Yong, reading your blog is always a pleasure – in case of this particular post it’s because of your perfect attitude. Everything wise I could say about the situation you said already. Thank You. (I’m a psychologist, my M.A. thesis was based partly on Bargh’s work, but fortunately not on the experiment in question)

The professor’s comment “there are many reasons for a study not to work,” seems very strange to me. What does it mean for a study “not to work?” If a study is carefully constructed and run, and it has an unexpected result, does that mean the researcher says “drat, it didn’t work,” throws out the results and tries again? Is that how science is done? Is that how you find out the truth? I don’t think so. But then, I’m not a scientist.

Jen Deland: The phrase “there are many reasons for a study not to work” is meant to mean exactly that. It could be slightly different (but consequential) procedures, a different population (that could differ in an unknown number of ways), random chance, the effect isn’t real, etc. Whenever you find no results (or null results) it is difficult to tell exactly why that is the case. Sometimes that could mean trying again with different procedures or a different population, or sometimes it means moving on to the next idea. It depends on why you think the study didn’t work and whether you have plausible ideas to fix it going forward. This isn’t necessarily bad; it’s just a problem when it is all hidden from view, so that other researchers do not know the intricacies of the back story of the finding.

When I read John Bargh’s attack (and it was just that) on Ed Yong and his medium I thought, “Boy have you poked the wrong bear!” This even-handed, well-constructed response to what was nothing less than an unfair and unfounded academic flame proves beyond doubt that Yong is the better person and, in all likelihood, the better scientist. Ed, you da man!

Jen, there are unexpected but legitimate results, but more commonly anomalous results are due to methodological problems. Obviously one cannot assume that whenever you get results that are unexpected it is because something went wrong with the experiment, but usually that is what happened. The point of experimentation is to make causal inferences at some level which requires some degree of control over “demonic intrusion” (yes that’s a real term). The more established a particular effect is (that is based on a lot of prior work) the more likely it is that a contradictory result is a screw-up, rather than something that gives important information about the phenomenon of interest.

Has no-one come in yet with a rant about “soft-science”? Well, allow me, then.

The whole imbroglio just proves that many of the criticisms of soft-science are completely justified. The original study was clearly tendentious, the whole protocol only makes sense as an attempt to confirm the experimenter’s belief. Sub-conscious influencing of the result is then almost pre-programmed.

To generalise the problems of a study of this kind to all of science seems like a desperate attempt to seek refuge amongst the crowd of legitimate scientists.

Lack of replication is very frequent in social psychology, not in psychology. Why? One reason is that the methods are always underspecified. Another reason is that, well, social psychologists are good at social networking, and that is how their papers get published in high-profile journals.

The kind who naturally understand people well. – All work for big commerce.
The kind who naturally understand people poorly. – All work in ‘research’.
The 8 kinds who specialise in emotional currency speculation. – 1. Research, 2. Market, 3. ????, 4. Profit.

Ed’s too classy to say it, but I’m not: “superficial online science journalism” is an odd thing for a Psychology Today blogger to say about a Discover Magazine blogger.

You know, especially when the Discover Magazine blogger is keeping up his publication’s fine tradition of responsible journalism that presents the issues at hand in layman’s terms without talking down to his readers, while the Psychology Today blogger is, if anything, keeping up his publication’s fine tradition of bad behavior.

I suppose I might be the only detractor here, but I think everyone involved is being petty. Bargh made some unfortunate factual errors in describing PLoS One’s publication model, but as for his scathing review, that at least seemed to be motivated. After giving you the brush-off over email, you proceeded to write a post where you liken Bargh’s 1996 study to the Clever Hans story. I would view that as incredibly insulting myself. There are many reasons Bargh might not have replied to your request for a comment in more detail, not the least of which is that people are busy and it may have been unclear if there was much more to say about it at the time, given the paper had just come out. But after your post, yeah sure, I bet he read it more closely and decided he should weigh in. Frankly, I don’t think the tone of his post was all that surprising given the insinuation in your post that he had fallen prey to Clever Hans.

Although I read a lot of blog posts and generally think post-pub commentary is a good thing, this is a debate that would have been better conducted in the journals, as people would be more careful to fact-check (i.e., Bargh and PLoS One) and spend more time crafting better-reasoned arguments with a careful reading of the literature (citing and discussing the replications, direct and conceptual, in more detail). And an editor could have served as moderator to keep the vitriol to a minimum. We have journals that publish this kind of thing (Perspectives on Psych Sci); people should use them! Although, admittedly, APS journals are still not open access, which remains unfortunate given APS’s status as a non-profit dedicated to promoting psychological science… But that’s another fight.

Another example of shoddy journalism. To back up your claim that Bargh’s result has not been replicated (and it has- over and over) you cite an anonymous blog comment? If people were not convinced by the first article that you don’t know what you are talking about (though that’s quite obvious), they should be now.

Bargh didn’t react this way because someone failed to replicate his study. There are hundreds of reasons why studies don’t replicate, and the majority of them have to do with bad methods by the people who are attempting to replicate (in this case, Doyen et al.). Bargh reacted so negatively because one science “journalist” took one simple failure to replicate and used that to push an agenda forward and make it seem like the original study was invalid. One failure to replicate (and this wasn’t even a complete failure to replicate) does not disprove the original study, and this is NOT something you should have just mentioned “in the fourth paragraph”.

To back up your claim that Bargh’s result has not been replicated (and it has- over and over) you cite an anonymous blog comment?

And to back up your claim that it has “over and over”, you cite… nothing. And you dismiss an anonymous blog comment (actually, it’s pseudonymous) when you yourself are a Yale University psychology student using a pseudonym.

And once again, we get personal attacks in lieu of actual critique. Sigh.

Just to add support for PLoS One, and how their approach is really one that serves to uphold the core values of science. From their website: “Unlike many journals which attempt to use the peer review process to determine whether or not an article reaches the level of ‘importance’ required by a given journal, PLoS ONE uses peer review to determine whether a paper is technically sound and worthy of inclusion in the published scientific record.”

While many journals and institutions place greater (over) emphases on novelty, replication is the true gold standard of science. PLoS One is an amazing peer-reviewed option for the necessary dialogue required to test the validity of scientific theories.

I was talking about conceptual replication – and it has been. Any quick search of PsycINFO will show you that (and no, I’m not going to do your job for you and give you those cites – clearly, you should have done that in the first place). In fact, the results are conceptually replicated within the same paper. So to compare Bargh’s findings with the Clever Hans story is insulting and misleading, considering that studies 1 and 3 of that paper could not be explained by Doyen et al.’s findings (the paper is explicit about the experimenter being blind to condition in these cases).

It is clearly you who is making this personal. I notice you didn’t take the time to dig into the backgrounds of any of your supportive commentators and let people know their backstory- just mine. What does that say about your ability to be unbiased as a journalist?

Furthermore, I wasn’t objecting to the fact you cited an anonymous blog comment, I was objecting to the fact you cited a blog comment at all to support a scientific claim about the lack of replications.

Richard Feynman’s Cal Tech commencement address “Cargo Cult Science” is on point, especially his description of an experiment with rats in mazes. You can read it here (and no doubt many other places): http://www.lhup.edu/~DSIMANEK/cargocul.htm.

What is a “conceptual” replication? It is difficult enough to replicate an experiment, but when it succeeds it is also sufficient.* As people note here, all Bargh has to do is to replicate successfully to cast doubt on Doyen et al.

While I wait for a testable definition, I amuse myself with imagining what it can be. Is showing infectiousness of Y. pestis a “conceptual replication” of infectiousness of other bacteria? Or maybe even of viruses? Maybe it is a “conceptual replication” of prey-predator relationships observations!

But I note that your anonymous blog comment mentions that researchers have used the term before (Stapel) and it is put in contrast with [direct] replication.

———
* For the purpose of being a replication and its consequences. As Yong notes, you would need a replication of a negative replication to make further progress.

As a social psychologist I thought it was worth taking a minute to clarify the difference between a direct replication and a conceptual replication. The value of conceptual replication is so deeply ingrained among social psychologists that we do not recognize that it is not as intuitive an approach to most everyone else, particularly those in the “hard sciences”.

In reading Ed’s response, as well as in some of the comments above, there seems to be a misunderstanding about whether Bargh’s research has been replicated “seldom” or “over and over”. The answer is that both are true. There have seldom been “direct” replications of Bargh’s experiments; that is, not many people have tried to use Bargh’s exact methods to see if the same results would obtain. This is the model of replication that is most common in science and for good reason. However, what has been done “over and over” since Bargh’s paper came out in 1996 is conceptual replication. Conceptual replication tests the basic concept underlying the original research (in Bargh’s case, roughly, that linguistically priming a concept, such as elderly people, can lead to an increase in behavior associated with that concept) using a different set of independent or dependent variables, or both. For example, you could conceptually replicate Bargh’s study by measuring how early people want to eat dinner after being primed with the elderly stereotype, or even how many bananas people eat after being primed with monkeys. The underlying hypothesis being tested would remain the same: priming a concept linguistically leads to increased stereotype-related behavior.

This distinction between direct and conceptual replication helps to explain why a psychologist isn’t particularly concerned whether Bargh’s finding replicates or not. The question of whether priming the elderly stereotype leads people to walk more slowly or not, or by how much, is of secondary importance. What matters is the underlying concept – and that has been replicated many times. For example, in 1998, Ap Dijksterhuis and Ad van Knippenberg, found that priming participants with the stereotype of a professor made them score higher on an intelligence test, while priming them with the soccer hooligan stereotype made them perform worse. The finding, conceptually, is very reliable.

This explanation, however, raises a second question: if we don’t really care about the specific stereotype about the elderly or how fast it makes people walk, why did Bargh choose to measure it in the first place? And the answer to that is: it’s interesting; it’s surprising; it makes people take notice of the study and think, “wow, I wouldn’t have expected that!” Some scientists, psychologists among them, may frown on this sort of approach – what we’re doing here is science, not running a carnival sideshow. But there’s a long tradition in social psychology of experiments as parables, and we owe a lot to this approach both as a science and as a culture.

Some of the first studies you learn about in your introductory college course are precisely this sort of research. Stanley Milgram’s famous obedience experiments were far from rigorously designed; they didn’t even have a control condition! But we certainly learned a lot from them, and they led to reams of follow-up research. Another favorite example of mine is the research on bystander (non-) intervention. These studies illustrate the fact that we are more likely to intervene in an emergency when we are alone than when there are others present. What happens is that when there are others around there is a “diffusion of responsibility” and we assume that someone else will take care of the problem. We also interpret other people’s lack of response as an indication that perhaps there’s nothing wrong. And while I don’t mean to imply that these studies weren’t rigorous, they certainly were not done in a manner that’s easy to dutifully replicate. They brought Columbia undergrads into lab rooms, alone or with another person (actually a confederate of the experimenter) and then let smoke into the room from a vent, and measured whether, and how quickly participants inquired to see if something was wrong.

Clearly, though, it’s the underlying concept we are interested in: does the presence of others make us less likely to act in an emergency (we don’t care as much about the specific question of smoke in particular). It turns out, it does matter. So much so that when I recently attended an infant CPR class before the birth of my son, one of the take home messages was that in case of an emergency, do not simply yell, “someone call 911.” Instead, point to someone in particular and say, “You. Call 911.” I’m not sure if anyone ever directly replicated the original smoke in the room experiments, but the conceptual lessons we’ve drawn from them are tremendously important and have made their way into the mainstream understanding of psychology.

(For anyone interested, the Dijksterhuis & van Knippenberg studies were published in the Journal of Personality and Social Psychology in 1998, Vol. 74, No. 4, pp. 865-877. The smoke-in-the-room studies were done by Bibb Latane and John Darley, and they were published in the same journal in 1968, Vol. 10, No. 3, pp. 215-221.)

I’m a card-carrying social psychologist who deals in the kinds of priming methods Bargh has used in his studies. I’d love to write a really long response, but let’s just say: priming methods like these fail to replicate all the time (frequently in my own studies), and the news that one of Bargh’s studies failed to replicate is not surprising to me at all.

There are so many subtle influences that can, at times, unintentionally strengthen or diminish a priming effect. The Doyen group suggests experimenter foreknowledge is one of those unintentional influences, and that is certainly true, but I think Bargh rightly bristles at the notion that his original result is completely explained by experimenter foreknowledge. His reaction is understandable, especially since the researchers in his original studies certainly took steps to minimize that influence.

In the meantime, I will continue to teach these studies in my social-personality psychology course.

For the record, I’ve been talking to Dave via email and asked him to put forward his thoughts – I find them very valuable in framing this discussion.

While I understand the concept of conceptual replication, it worries me. Superficially, it looks like a good thing – it would seem to strengthen the existence of an effect by demonstrating it under multiple conditions. As Dave writes, the concept becomes more reliable.

But surely that only holds if each conceptual replication is independently strong. Is that the case? It seems to me that positive conceptual replications would be especially prone to publication bias. They would be more likely to appear in the literature than negative conceptual replications, because we know that positive results are more likely to be published than negative ones. They would also be more likely to appear than any form of direct replication – positive or negative – because they add a fresh, exciting, counter-intuitive “parable” to the mix.
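
The worry can be made concrete with a toy simulation (a sketch only; the sample sizes and thresholds here are invented for illustration): even when the true effect is exactly zero, a literature that publishes only the positive, significant studies will still display a healthy-looking average effect.

```python
import random
import statistics

random.seed(1)

def simulate_published_effects(true_effect=0.0, n_per_study=20,
                               n_studies=10_000, crit=1.96):
    """Run many studies of a (possibly null) effect and keep only those
    that come out positive and 'significant' -- i.e. the published ones."""
    published = []
    for _ in range(n_studies):
        sample = [random.gauss(true_effect, 1.0) for _ in range(n_per_study)]
        mean = statistics.mean(sample)
        se = statistics.stdev(sample) / n_per_study ** 0.5
        if mean / se > crit:  # only positive, significant results are published
            published.append(mean)
    return published

published = simulate_published_effects()
print(f"fraction of studies published: {len(published) / 10_000:.3f}")
print(f"mean published effect size:    {statistics.mean(published):.2f}")
```

Only a few percent of the null studies clear the significance bar, but the published ones average an effect of roughly half a standard deviation, so a reader of the journals alone sees a small but consistent "effect" that does not exist.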

If this is a norm within the field, it seems to be one that makes it easier for weak results to bolster other weak results.

To clarify, I’m definitely not suggesting that this is what’s happened with priming, but I do get an uneasy house-of-cards vibe from this. Surely this has been discussed before?

@Dave Nussbaum
Hear hear! Thanks for clarifying the distinction between direct and conceptual replications. And kudos for citing examples.

My own 2 cents: this whole situation is about something much larger than Bargh vs. [whoever]. Attempted direct replications should definitely happen more often. My hope is that the publication standards that PLOS currently upholds will become more and more the norm, hopefully coupled with post-publication review.

Anyone here aware of any attempts at meta-analyses of the various conceptual replications of linguistic priming effects on behavior?

If I understand the distinction correctly (and I may not), conceptual replication is done all the time in “hard” sciences without a need for a special term for it. Instead of replicating experiment X exactly, you do an experiment Y that is a follow-up on X and relies on X being correct. If your experiment Y does not work, only then do you start doubting X and perhaps try to replicate it exactly and completely. In most cases, the results of your experiment Y are consistent with the findings of experiment X, so experiment Y did two things: it confirmed that X is probably right, and it added more data. In the good old days, before E. O. Wilson misused the word, this was called ‘consilience’.

Seems like the number of citations Bargh has received for his study has gone to his head a bit. It would be too much to say that just because his study couldn’t be replicated it’s invalid, but the fact that several attempts to reproduce the outcome have failed should at least make Bargh curious to find out why, and maybe get involved himself if he feels so passionate about it.

@Michael Kraus – Thanks for the comment. When you say, “priming methods like these fail to replicate all the time (frequently in my own studies)”, can you clarify whether you mean (a) “and therefore it’s no big deal that this (any?) replication attempt failed”, or (b) “and that’s something we should be concerned about”? Judging from the last line, I’m guessing (a) but it’s ambiguous, and I want to check.

In contrast to some of the posts above, I contend that conceptual replication is a very poor way to validate and advance science, for the following reasons.

1. It is a subjective and assumption-bound label that presupposes commonality in cognitive processes

A conceptual replication only holds if the experimental measures in different studies probe the same underlying neurocognitive process. But how certain can we be that this is the case? The priming literature is vast and multifaceted, and you couldn’t pick any two examples at random and then conclude securely that they “conceptually replicate” each other. It is questionable whether this would even hold within the priming of automatic behaviours, as in Bargh’s study. It therefore becomes an opaque exercise in subjectivity as to whether two methods or conclusions are sufficiently close in flavour for the results to be considered “replicated”.

Imagine the following two studies. In the first, Smith asks her volunteers to identify a letter on a computer screen as quickly as possible. The letter can appear on either the left or the right side, and the volunteers keep their eyes fixated on the centre throughout. Crucially, Smith presents a small flash (or cue) immediately before each letter, at random on the left or right. She finds that when the cue occurs on the same side as the letter, people are faster and more accurate at identifying the letter than when it occurs on the opposite side. Smith is happy because she’s just discovered that shifts of attention can be triggered reflexively without moving the eyes.

Then along comes Jones. Like Smith, he asks his volunteers to identify a letter as rapidly as possible, but this time he manipulates the colour of the letter (say, blue or green). He also presents a cue, but the cue now occurs in the middle of the screen as a blue or green colour patch. Jones finds that when the colour of the patch is the same as that of the upcoming letter, people are faster and more accurate at identifying the letter. He concludes that automatic priming of features, such as colour, can influence behaviour.

Does the Jones study replicate the Smith study? In a direct sense, of course not. They used very different methods. But then they did both measure how visually-guided behaviour can be influenced by a prior stimulus (the cue). So perhaps some would argue that, in this sense, Jones “conceptually replicates” the more general notion that perception can be primed automatically by prior information? (see what I mean above about subjectivity?)

Let’s suppose that many in the research community decide that the similarity crosses whatever subjective threshold is necessary for it to count as a conceptual replication. But then there’s a twist: a few years later, along comes Brown. Brown compares the two kinds of preparatory processing (spatial vs. colour) and finds that they are produced by different psychological mechanisms. He also finds that they activate different networks in the brain, and that stimulating those parts of the brain with electric currents can interfere with the spatial cueing but not the colour cueing.

Suddenly we have a problem.

If the spatial and colour effects discovered by Smith and Jones are produced by different mechanisms, how could one of them be thought to “conceptually” replicate the other? Do we admit that, in fact, Jones didn’t replicate Smith after all? So a study can go from being replicated to not being replicated?

Or do we shift the goal posts: “Oh, well, it’s still a conceptual replication of ‘attention’, it’s just that attention can no longer be accurately characterized as a single concept – it’s now more of an umbrella term used to describe a lot of different processes supported by different mechanisms”.

On this basis, the label “conceptual replication” is arguably a black hole of meaninglessness: a misleading oversimplification that presupposes that two different observations obtained using different methods stem from the same cause.

2. It is prone to publication bias and confirmation bias

As Ed rightly notes above, conceptual replications are highly prone to publication bias. Looked at differently, I suspect that for any published finding in psychology, it would be possible to find a published study, somewhere, that could be considered a conceptual replication.

There is also a troubling confirmation bias that produces an obvious double standard. Suppose two studies draw subjectively similar conclusions using different methods. The second study could then be said to conceptually replicate the first. But suppose the second study drew a very different conclusion. Would it be seen to conceptually *falsify* the first study? Of course not. Researchers would immediately point to the differences in methodology and use them as a ‘get out of jail free’ card.

3. It substitutes and devalues direct replication

Direct replication is vital for science. I really hope nobody disagrees with this point. In his blog (http://social-brain.blogspot.com/2012/03/replication-issues-and-solution.html), Matt Lieberman makes an excellent suggestion for testing the replicability of key results: that a short list of the most important novel findings could be generated annually and would then form the basis of a coordinated series of student projects in the following year. This idea could be rolled out more generally across psychology and cognitive neuroscience, and need not be limited to social psychology.

However I take issue with Matt’s argument that conceptual replication trumps direct replication: “At a certain level it does not matter whether the exact primes Bargh used produce a change in walking speed over the exact distance he measured.”

These details do matter because the edifice of any body of work is built wholly and without exception on the foundations of specific experiments. In my mind, Matt’s argument is tantamount to eyeing up a brick wall and saying “That brick isn’t needed” before shaking it loose. The wall might keep standing but how many bricks would need to be removed before you would hesitate to stand next to it?

Direct replication of specific experiments is vital yet grossly undervalued. In psychology, it feels as though novelty has trumped reproducibility, and certainty has given way to the ‘wow’ factor. My colleagues and I recently had a paper rejected from the Journal of Cognitive Neuroscience because the first of two experiments replicated (using a very similar method) the results of a previous study. Our second experiment was more novel. And the finding we replicated in the first experiment had itself never been replicated before.

One of the reviewers said: “The methods are strong, but the outcome produced results that aren’t particularly groundbreaking.”

The editorial letter said:

“Our decision is based on both the narrative reviews, which appear further along in this letter, and priority scores assigned by the reviewers in terms of “Significance”, “Originality”, and “Experimental Design and Quality of Data”. The overall conclusion that these lead to is that this work, although methodologically sound, does not make a sufficiently novel contribution to make it into the journal.”

The message? Don’t bother replicating. We aren’t interested and we don’t ask reviewers to assess papers on these grounds. This encourages cheeseburger science at its worst, and is frankly a very bad message to be sending to graduate students.

Ed’s analogy to a house of cards is apt. In my opinion we need to jettison the amorphous concept of “conceptual replication” and instead return to valuing the principle of direct replication that has upheld science for centuries. As reviewers of papers, we should be vocal in praising a manuscript if it directly replicates a previous result, rather than doing what the journals want and offering the faint praise that “it isn’t very exciting”. Well, guess what, we aren’t in this job to get “excited”. We’re supposed to be advancing knowledge.

We really do need to take a long hard look at this issue, otherwise psychological research risks becoming an enormous and trivial game of Jenga – and we know how that ends.

Dave Nussbaum – “Stanley Milgram’s famous obedience experiments were far from rigorously designed; they didn’t even have a control condition!”

I’m not sure what control condition would have been appropriate.

Milgram ran many variants on the classic experiment, using different procedures – e.g. the experimenter wore a lab coat vs. casual clothes, or the experiment took place in a crappy private office rather than a Yale lab – many of which had large effects on the obedience rate.

As I recall, the variant with the lowest obedience was when the experimenter wasn’t actually in the room, he just told them to shock the victim via telephone. Almost no-one did.

Whether those are “controls” I’m not sure, AFAIK Milgram didn’t call them that. But I don’t know that Milgram’s was a study that needed controls. Not all studies do. If I invent a new cognitive task and I test it on 1,000 random people to see what the average performance is, I don’t need controls.

Milgram’s study was a lot like that, except the task wasn’t cognitive, it was social. It wasn’t an intervention.

@corturnix, I don’t disagree with you entirely, but most in the hard sciences will go ahead and replicate X themselves before starting Y, but just don’t bother to write it up. There’s often too much money, time and mental strength at stake to NOT know for sure X is valid before progressing to Y. Journals like PLoS ONE and Scientific Reports offer an excellent home for still getting a publication out of that initial “sanity check” of X, but I think most scientists like to move on quickly to Y, believing writing up the direct replication of X is a waste and takes precious time away from doing the science that could yield even greater rewards. It’s not just the state of current science culture, but a natural inclination as well. Things are slowly changing though…

While agreeing with Chris Chambers that “conceptual replication” becomes pretty subjective pretty quickly, relying on direct replication is also problematic.

A good example of this from my undergrad days is the word-length effect. Baddeley et al. (1975) found that people were better at remembering lists of two-syllable words if the words were shorter in spoken duration. Baddeley interpreted this as evidence that short-term memory has a time-based capacity – you can only remember as much stuff as you can say in a fixed amount of time.

There are umpteen direct replications of this study. But other studies that have looked for the same effect using different sets of words have pretty much all failed to replicate the key finding.

Because these studies had also varied in other ways from Baddeley’s original study, Neath et al. (2003) ran four experiments in which the conditions were exactly the same apart from the stimulus set. Like everyone before them, they replicated the word-length effect using the original Baddeley et al. stimuli. But they also replicated the null findings (and even reverse word-length effects) of other studies using other stimulus sets. In other words, they showed that the precise set of words chosen was critical. Baddeley et al.’s findings are extremely robust, but if his model was correct, it should have generalised to other stimuli. The fact that it didn’t suggests that the model is wrong.

Chris is right. Direct replication is essential and completely undervalued. But even if direct replications do “succeed” (as was the case with the word length effect), it still needs to be shown that the effect generalises. If only faithful replication attempts succeed, then chances are there’s something uncontrolled in the original design and it doesn’t mean what we think it means.

I think Chris Chambers pretty much nails it above. If you “conceptually replicate” a study but come out with contrary results, then the argument will always be it’s because you didn’t directly replicate.

Incidentally, I’ve had a similar experience to his with the Journal of Cognitive Neuroscience with a different journal – had an article rejected from JEP:HPP in part because one of the conditions was a replication of a condition in a previous paper of mine, and thus the paper as a whole wasn’t novel enough.

Jon Brock also makes an excellent point regarding the generalizability of the results. However, I would still tend to think of a study that used different stimuli but otherwise the same paradigm as a direct replication – after all, you’re using different participants! I’d raise here a rather dry technical and statistical point, which I may fudge, so be prepared to Google. The point is that we tend to forget that our stimuli are often random samples as well. Typically we do statistical tests (well, ANOVAs at least) treating participants as a random factor and our experimental manipulations as fixed factors. Effectively, this means we treat our stimuli as being the whole range of possible stimuli rather than a random sample, and thus our stats are only telling us how generalizable our effect is across participants with *these stimuli*. One way people get round this is to report separate by-participants and by-items analyses, the logic being that if you get the same effects for both, it suggests the effect is not specific to the stimuli you used. It’s not perfect, but it’s what I’ve done in the past.
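
For concreteness, here is a minimal sketch of the two aggregations (in Python, with entirely invented data; the numbers, names, and the simple paired t statistic are mine, not from any study discussed here). The same subject-by-item table of reaction times is collapsed over items for the by-participants test and over subjects for the by-items test.

```python
import random
import statistics

random.seed(0)

# Hypothetical data: one RT per (subject, item, condition) cell,
# with a built-in 20 ms "priming" slowdown for illustration.
subjects, items = range(8), range(6)
rt = {(s, i, c): random.gauss(500 + (20 if c == "primed" else 0), 30)
      for s in subjects for i in items for c in ("primed", "neutral")}

def paired_t(xs, ys):
    """Paired t statistic for two equal-length lists of means."""
    diffs = [x - y for x, y in zip(xs, ys)]
    return statistics.mean(diffs) / (statistics.stdev(diffs) / len(diffs) ** 0.5)

def cell_means(by):
    """Collapse the table: by='subject' averages over items (F1 analysis),
    by='item' averages over subjects (F2 analysis)."""
    units = subjects if by == "subject" else items
    out = {c: [] for c in ("primed", "neutral")}
    for c in out:
        for u in units:
            cells = [rt[(s, i, c)] for s in subjects for i in items
                     if (s if by == "subject" else i) == u]
            out[c].append(statistics.mean(cells))
    return out

f1 = cell_means("subject")  # generalizes across participants
f2 = cell_means("item")     # generalizes across stimuli
print("t1 (by participants):", round(paired_t(f1["primed"], f1["neutral"]), 2))
print("t2 (by items):       ", round(paired_t(f2["primed"], f2["neutral"]), 2))
```

If the by-participants statistic is large but the by-items one is not, the effect may be carried by a handful of stimuli – which is essentially what the word-length example above revealed.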

I wonder if part of the problem is that in much of psychology, and in social psychology in particular, we are often restricted in how much control we have in laboratory situations. Leonid Kruglyak ribbed conceptual replication on twitter with “We know chemicals kill cancer cells; who cares if we can reproduce this drug killing this cell line.” Unlike in microbiology, we can’t remove the cell from the organism and study it in a controlled situation. We try to isolate thoughts and behaviors, only introducing a single manipulation (chemical) but we are often still flummoxed by the complexity of the organism that produces the behavior.
I would agree that direct replication is undervalued, but also that the difference between direct replication and conceptual replication is often hazy. You can’t use the same people, the same lab, and the same experimenters, so what you have to do is replicate the methods section as best you can, choosing those details which (you think) are relevant to keep the same, and which aren’t. This requires using the “concepts” which are understood thus far.

As far as the house of cards metaphor goes, I think social psych (and to some degree psych in general) is more vulnerable to this criticism in part because of the publishing conventions mentioned above, but also in part due to the nature of the phenomena we are studying. This reminds me of this description of the science of scurvy. First, James Lind finds that citrus fruits cure scurvy. But (to skip to the end) the relevant concept here is not “citrus” or “lime” but vitamin C. What helped us get there? We went forward and back, “undiscovering” the causes of scurvy, because we didn’t know how vitamin C decayed in fresh fruits and meats when they were stored, or how it differed between lemons and limes. In social psych, I am not sure how easy it is to get to the “vitamin C” of unconscious social behavior. We are doomed to keep testing lemons, limes, and so on. Should we do more direct replication, making sure that this one kind of lime really does in fact cure scurvy in a certain population under certain conditions? Absolutely. But we also need to search for the bounds of the concepts (with lemons, limes, oranges, sailors, college students, etc.) using the unfortunately labeled “conceptual replication”, which I see as just another word for converging evidence.

But ultimately, the strength of social psych (or, I would maintain, most psych) is not a brick wall built on a solid foundation, but a complex web of interconnected concepts and studies. Do people really have different personalities? I could be wrong, but I daresay there is no single foundational study of the Big Five that should be replicated every year. But there is nonetheless a boatload of evidence – through factor analysis, other correlations, and lab manipulations – that some people are more “conscientious” than others. It might look like a house of cards from the outside, but upon closer inspection, I think there are a few bricks there, a few pieces of wood, and the occasional steel girder.

This is a really interesting conversation, I wish it didn’t have to start out this way. I think Bargh could be a much more effective champion of social psychology if he participated in this, rather than being impatient and defensive as he has.

I am a professional research psychologist at a major university and I have heard from more than one lab that couldn’t replicate the original study. These labs didn’t publish the results because the costs of publishing a failure to replicate are high (read: pushback from the authors of the original paper) and more importantly the rewards are low: most of the top-tier for-profit journals are not interested in null results, even when they are failures to replicate.

Alex – a worthwhile message to push from your pulpit: granting agencies should reward researchers who publish null results and researchers who submit to journals that do not discriminate against null results (such as PLoS ONE, where this study was published).

@neuroskeptic: Interestingly, in the Milgram studies there actually was a control group; what was missing was the comparison group, which was supposed to be Germans. Milgram initially hypothesized (as did everyone else) that he would get very little compliance from American participants, who were supposed to stand up to authority and refuse to administer shocks. The Germans were supposed to be the ones willing to obey authority. Of course this never happened, because rates of obedience were so high in the US sample. This isn’t a criticism of Milgram; it’s only to say that his initial studies (yes, there were follow-ups) don’t compare his findings to any objective standard other than our expectations.

And since we’re talking about conceptual replications, another interesting note is that Milgram was, in essence, conducting a (critical) conceptual replication himself. Many of you may be familiar with Solomon Asch’s studies of compliance with unanimous group pressure. (If you’re not, they’re fantastic too; here are links to the Asch studies http://www.youtube.com/watch?v=TYIh4MkcfJA and the Milgram studies http://www.youtube.com/watch?v=W147ybOdgpE.) Milgram thought it was unsurprising that people would go along with something as meaningless as the estimation of a line’s length, but that if administering shocks to another human being were the dependent variable, you would see much less compliance. Asch himself was conducting a critical conceptual replication of Muzafer Sherif’s autokinetic effect studies, but now we may be getting off on too much of a tangent. However, it is worth noting, in response to some worthwhile critiques of conceptual replication by @Ed Yong and @Chris Chambers, that conceptual replication, in addition to building external validity, drives discovery as well.

@Ed Yong: I think you make a good point about conceptual replication’s dangers, particularly when the balance tips too far toward conceptual replication alone. Certainly I agree that conceptual replications should be strong – at least as strong as a non-conceptual replication would have to be to merit publication. Still, as you note, psychology has many publication biases that hopefully get corrected in the long run, but are certainly problematic. As Matt Lieberman addresses in his post, we don’t value these nearly enough. What often happens, however, is that direct replications often occur as the foundations of other studies. As @coturnix explains, Study Y builds on Study X and therefore must replicate it in the process (these can be direct or conceptual replications, but are often direct). Of course, as @Michael Kraus tells us, sometimes we fail to replicate, and we don’t necessarily publish or even informally disseminate these results, so in some sense we’re back to square one. On the other hand, if an effect consistently fails to replicate then that also becomes apparent over time.

@Chris Chambers: you make some very interesting points, but I don’t necessarily see things the same way. In psychology we operationalize variables all the time. While there may be an objective quantity such as 1 mmol of HCl, or 48 kelvin, there’s no such thing as 2.3 Milgrams of conformity. Therefore we, as researchers, need to operationalize variables and make a convincing case that these operationalizations are good ones. And I think that @Jon Brock makes a fantastic case that relying solely on direct replication, rather than conceptual replication, does a disservice because we focus too narrowly on the original operationalization of a variable. That is, if all we ever did was re-run Bargh’s walking-to-the-elevator study, we would be missing a huge part of the picture.

Having said that, I wholeheartedly agree that there aren’t enough direct replications, and that the field’s incentive structure is designed in a way that doesn’t reward them nearly enough. This is clearly a problem. I appreciate Matt Lieberman’s suggestion to have first and second year graduate students replicate selected studies, but that, too, is not an uncomplicated undertaking. For one thing, a successful replication requires competent execution of the experiment. I was a second year graduate student once too, and I don’t know what inferences would have been appropriate to draw from studies I ran back then that failed to replicate previous findings. I know that Brian Nosek is currently trying to replicate a large body of work with the help of “crowdsourcing” of this sort, and while I applaud the effort, I think it will have exactly these weaknesses. We will have not-fully-mature (to be polite) researchers failing to replicate findings, and this may muddy the waters as much as it clarifies them.

Certainly I am very receptive towards these endeavors, and I also eagerly anticipate the continued evolution of journals such as PLOS. I think psychology is still finding its way in some respects, and is also slowly making a transition out of the habits it has developed over the past few decades (such as the publication process and what is valued when professors are seeking tenure and promotions), and I hope discussions such as this one help steer the field in a positive direction.

“Eric R.” left a comment on the original post, that I thought was worth duplicating here, just to make sure everything’s in the same place:

“Chris wrote: “I have heard from more than one lab that couldn’t replicate the original study.” This points to another reason why it is important that non-replications get peer-reviewed and published. Of course it is essential that researchers get a balanced view on the robustness and scope of an effect. But another thing is that as long as non-replications do not get published, researchers cannot judge the quality of those studies, nor can they try to systematically test and rule out alternative explanations based on those studies. Non-replications that do not get published are doomed to an existence as persistent rumours, and rumours are not science.”

Am I the only one who thinks that Ed’s post should have been titled “Barghing up the wrong tree”? Sorry, had to write this to get it out of my head. Your well behaved tone was of course much better, Ed.

I guess when people talk about a direct replication, what they usually mean is keeping the stimuli constant (because that’s easy) but testing new participants (because you usually have to). Really, though, we should be looking for generalization across stimuli as well as participants. If you don’t get both then the original finding is not something to go building a theory on.

1. You clearly have sour grapes about the fact that others have failed to replicate your findings.

2. By claiming that Doyen et al.’s failure to replicate your results is due to differences from your own study that would have eliminated the elderly-priming effect, you are also claiming that your own study lacked replicability and external reliability: if Doyen et al. were unable to replicate your study and get the same results, then there were clearly details missing from your report that would have enabled them to replicate it exactly.

I’m wondering about the “concepts” and “stereotypes” that the researchers are aiming to evoke, and using to explain their results. There seems to be an assumption that the researchers and the experimental subjects share the same concepts and stereotypes. That they belong to exactly the same culture. But can this be assumed? How do you know that the subjects stereotype the old as slow? What DO they think about the old? Would subjects from different cultures/sub-cultures/social groups react differently? Fail to react?

At the largest conference in the HCI field (which began with many psych people in the 80s), we are in the process of setting up a sub-conference venue, repliCHI (the conference is CHI), to publish and discuss replications, our ability to perform them, and the differences between studies. It’s our attempt to do something like PsychFileDrawer, but allowing the studies to be published as archived extended abstracts at the conference, starting May 2013. We’re rather hoping it presents a forum to effectively discuss these things, rather than write angry blogs. Interesting to see we are not the only ones doing this. Click my name to have a look at the panel from last year, which is continuing this year too.

Why would a single failure to replicate — assuming no error in how the study was conducted — not invalidate the initial result?

I thought the essence of science was falsification. If I throw an apple out the window and it does not fall — even once — it would seem to falsify most of what we think we know about gravity. If something is a fact of nature, it’s always true.

There are so many points raised in these various posts it is not possible for me to comment on all of them. I just wanted to state three simple things:

1) The original paper in question did include a direct replication. Having found this remarkable result, the team’s first instinct was to replicate the finding by running the exact same experiment again. I am not sure what more people would want from this research team. Should they replicate it every decade because other scientists have been unable to replicate it? They performed a direct replication and then moved on to conceptual replications, of which there are, as one person stated, many. Attacking that person for failing to cite these replications seems odd. It is just well known in the field that there are many conceptual replications. Who would think to cite them in a blog comment and in a forum such as this? But you can read the review of the field in my social cognition textbook if you are interested. And many more examples have been published since the release of my book 7 years ago.

2) Bargh was not contradicting himself when he made reference to the number of items used to “prime” the stereotype. It is not an insult, I presume, to tell people who are not experts in a field that they lack expertise. Bargh appears to be accused of being contradictory by someone who simply does not have the expertise to recognize that he is not. He is saying two very different things: a) IF one is not aware of the priming materials, or is at least not aware of their potential impact at the time behavior is being performed, then more primes lead to a stronger effect. b) But there is a danger in using MORE primes, and that danger is that the IF mentioned above will be removed. When people become aware of the prime’s influence on them, which can happen when attention is drawn to the primes, the effect disappears, and in fact an opposite effect can appear. I published a paper with Ian Skurnik in 1999 (in our field’s top journal) detailing the parameters by which priming effects of the sort Bargh describes will emerge, and when they will be reversed. And drawing attention to the primes, as presenting many items will, can eliminate the effect. It is a fact. Why call Bargh names or paint him as a hypocrite for correctly pointing to a reason these studies would fail to replicate? Again, these issues are reviewed in my Social Cognition textbook if you or anyone else wishes to review them. But Bargh seems perfectly justified to question whether we should expect the Doyen et al. studies to produce the same finding given their methodology. This is not a “subtle” effect that can so easily be thrown off by slight alterations to the procedure. This is Doyen et al. using a procedure that does the exact thing that whole lines of research have established will prevent them from finding the effect they are looking for – overt awareness of primes leads the effect Bargh was studying to be consciously controlled.

3) Bargh also correctly noted that one important consideration is to check if the stereotype in question even exists among the research participants of this failed study. The entire point of the Bargh et al (1996) paper is that a stereotype is (unknowingly) primed and this impacts behavior. If there is no stereotype, or if it is weakly held, one cannot expect an effect. This is not a failure to replicate, but a failure to even test the idea in question. It is extremely important and central to the very enterprise, not a minor methodological issue.

The only response to this I saw was the reporter stating (and I paraphrase) “well, Bargh never checked to see if the stereotype was held, so why should anyone else”. Bargh should have (and may have and just not reported it), but in the end the data suggests he did not need to. I worked at NYU and published one study with Bargh, so you all may think I’m biased. But, we did lots of pretesting across many labs at NYU during that time, and there was a wealth of information regarding attitudes toward various issues and stereotypes toward various groups that was collected every semester. I feel confident there were strong stereotypes regarding the elderly in that community at that time, and likely have data of my own buried somewhere from 20 years ago to prove it. But as I said, Bargh would not have found his results if the stereotype was not there. We cannot say the same about this failed replication, and it seems important for them to test this before we call this a failure to replicate, as opposed to a failure to test anything at all.

Failure to replicate is an important and serious issue, and if it is happening repeatedly with this experiment then it is certainly worth discussing. But my understanding was that the original reporting was about one lab’s failure to replicate, and because of this one failure the work of Bargh and colleagues was called into question, perhaps even with the implication that it should be dismissed. If others have since come forward with other failures, I am not sure this justifies the sort of accusations that were initially directed at Bargh based on the failures from one lab. Bargh may have responded inappropriately by questioning the skill of the researchers, but he was provoked, it seems, by an accusation based on limited data (if I am wrong, and the reporting was based on multiple failed replications across many labs, I apologize for that mistake).

Personally, I find the data reported about how experimenter expectations impact the effect compelling. I was never a strong believer that the behavioral effects reported in several of these papers are unmediated. I have always contended that goals of the participants mediate the effect — something like priming the stereotype of the elderly leads one to prepare for interaction with the elderly and to desire a smooth and easy interaction (all of this being unconscious). Thus, one unconsciously adjusts one’s own behavior to be compatible with this unconscious goal. I have no evidence, just my belief that there could be mediators of this effect that are not yet discovered. So, I am not sticking up for anyone or anything. The idea of limitations to the effect is not threatening to me. I am just saying that 1) Bargh et al. did directly replicate, and 2) if the original blog relied solely on a single failed replication (or limited evidence of failures), it seems premature to attack the original research when so many possible reasons for the failure could exist, not the least of which is the possibility that the stereotype is not strongly held (if held at all), or that their methods used a priming procedure that would eliminate the effect.

The issue moving forward seems to be the many failed replications that are now being made known as a result of that initial exchange, and what to make of them.

Great points, Gordon. For those interested in the “preparation to interact” hypothesis, I’d recommend checking out the Cesario, Higgins and Plaks 2006 JPSP paper (study 2). People who hold negative implicit attitudes toward the elderly _speed up_ when primed with words related to the elderly, while people who hold positive implicit attitudes toward the elderly _slow down_ when primed with words related to the elderly – in theory, so that people who dislike the elderly can avoid interacting with them, while people who like the elderly will be better able to interact with them.

It’s very important to measure the existence of the stereotype and attitudes toward the group in question. Regardless of whether Bargh had pre-existing information about the stereotype, or was fortunate in that this stereotype and positive associations with the elderly existed in the original sample, we now have good evidence that implicit attitudes can moderate this relationship.

While it’s important to understand why failures to replicate studies occur, ideally these replications would take into account the latest developments in the field.

I don’t know about the US, but in the UK, the average life expectancy has increased by approximately 3.75 years in the 15–16 years between these 2 studies (10 years in 4 decades). I am certain (but cannot back this up) that (social) views of the elderly have changed even more profoundly in that time. I know mine have!

Perhaps the reason that there does not seem to be any priming occurring in this more recent study is that the participants’ expectations and understandings of old age have changed?

Also, nothing has been said about the ages (or views) of the participants themselves.

In fact, you could even argue that the 2 studies looked at very different populations from which they drew their samples, and are therefore not exactly comparable. In which case you should not expect to repeat the results.

In reply to Gordon Moskowitz: as the guy who made the early running in pointing out the contradictions in Bargh’s post, of course I’m not offended by it being pointed out that I’m not an expert in social psychology: after all, I’m not an expert in social psychology.

1) Of course, it’s great that Bargh replicated the effect in the same paper. However, any methodological criticism that applied to the first study applies just as much to the replication. Which leads to your next point.

2) The majority of what I pointed out was discrepancies between what Bargh was saying about his previous work and what his previous work actually said. I’m not sure that requires much expertise in social psychology. The first of these was that his description of the blinding procedure in his original study differed from that given in the paper. The account in the paper does not exclude the possibility of experimenter bias, so this concern extends to both Exp 2a and 2b.

Bargh then went on to detail several points where he felt Doyen’s paper had important differences to his own. The problem is, even if these are reasonable explanations for why Doyen did not replicate his results, they do not appear to differ from his original paper. It’s one thing to point out that their paper had flaws; it’s another to claim that the original paper did not share them.

Bargh claimed that Doyen drew participants’ attention to the exit whereas he did not. The paper says differently, so even if it were responsible for eliminating the effect in the Doyen paper, it is no different from Bargh’s study. Secondly, I’d point out that in the weather example, it is supposedly drawing attention to the weather that eliminates the effect the weather has. Drawing attention to the exit eliminating the effect of an elderly prime would be more like drawing attention to life satisfaction ratings eliminating the effect of the weather.

Regarding the number of primes: I would point out that Bargh’s account appears to differ from his paper, and his methods chapter does not quite make the points he says it makes. It does not recommend, for example, that people use only 10-12 primes. It does suggest not to use primes in every sentence since this may produce “active effects”, which presumably includes those you mention – reducing or eliminating priming effects – as well as demand effects (which may include producing the expected effect), and which the chapter specifically notes as the “most notorious”. In other words, it’s actually a nice, reasonable, and flexible explanation you can use to support any interpretation. The critical point is that from reading the two papers, it’s not clear that they differ on the number of primes. Only Bargh’s new account makes this claim – that his original study did not have this flaw. This is the inconsistency I pointed out.

I’d note that both the example Bargh gives and your paper with Skurnik are primarily concerned with contrast effects (i.e. after reading about Hitler, Ted Bundy doesn’t seem quite so bad – hey, Married with Children made me laugh once or twice). This appears to me to be a subtly different argument – I am not sure I see where contrast effects come into the Doyen and Bargh studies aside from being proof of principle that priming effects can go in multiple directions – but I am not an expert. And again the Bargh and Doyen studies don’t seem to differ on this point, so I’m not sure why it would apply to one experiment and not the other.

And so, to the final point:

3) Your argument that he found an effect and therefore his participants had the right stereotype is a prime example of affirming the consequent. That is: if they have the stereotype, we’ll find the effect; we found the effect, so they have the stereotype. This is a logical fallacy. Here’s a quote from Bargh’s blog: “Without taking the appropriate methodological steps to make sure participants hold these stereotypic beliefs in the first place, one can’t assume the external primes have anything in the participants’ minds to activate.”

I would go further and say that without taking the appropriate methodological steps to make sure participants hold these stereotypic beliefs, one can’t assume that a difference in performance has anything to do with the priming of these stereotypes. In hindsight, the interpretation of the original study is an affirmation of the consequent.

So, to summarise; the reasons given may be fine and reasonable explanations why Doyen found a null result, but don’t appear to pinpoint any major differences between Bargh’s and Doyen’s studies. It is primarily the claim that they did – that the original did not suffer from any of these apparent flaws – that is objectionable. And if the original did suffer these flaws, the question might well then be how come it found the effects?

For me, a flaw in the Doyen paper is that they didn’t also have manual recording of walking speed in the first experiment to compare it to automatic recording; the link between the first experiment and the second experiment, in which experimenter effects were demonstrated, is thus a little weak.

There’s an even simpler confound that occurred to me earlier on. What if the people in Bargh’s elderly groups just walked slower anyway?

Anyway, Ron R also raises some interesting possibilities for why the effect was null.

I see that there are one or two mentions of the problems with the timing method, pointing out that the modern experimenter would “film” the subjects walking down the corridor, and take the timing from that film, rather than relying on a human with a stopwatch.

That’s a quite inexpensive solution these days. In 1991, when the experiment was conducted, a video camera was neither cheap nor inconspicuous.

I’d be a little wary about how that change might affect the replication, because we’re dealing with people, who might notice things, but using current measurement methods is a pretty sensible idea. And it won’t invalidate anything until somebody goes and does it.

The video describes and shows Williams priming the subjects with warm or iced coffee and, later, the same Williams prompting the subjects for impressions of his colleague Randy, which were hypothesized (by Williams) to be “warmer” for subjects primed with warmth.

The published article makes it clear that the experiment was blinded, and that what the BBC aired was a dramatization. It’s a shame the BBC didn’t find a way to portray the blindedness of the experiment.

It’s certainly true that much of the research in psychology is not rocket science. In the core of physics and engineering (including rocket science) there is substantive, cumulative research. That means one piece of research builds on another. If the piece it is built on is erroneous or unreliable, the attempt to build on it caves in. So in a sense the findings are replicated every time, as the cumulative house of cards keeps growing successfully. (In other words, as long as it is not really just a house of cards, but reliable experiment and valid theory.)

This is not true in the periphery of research (in any field), where hit-and-run studies are done just in order to get a publication or a PhD and no one (except maybe the referees or the thesis committee) cares, because no one is going to try to build anything on top of the “findings.” The house of cards does not cave in, because it stops with the bottom layer.

Unfortunately, a lot of experimental psychological research is like that: hit-and-run studies, or a temporarily fashionable run of parallel rather than cumulative studies that lead nowhere and peter out after a while. And worse, there are sometimes studies whose popularity is notional, based on an extravagant interpretation, in which case it is the interpretation that perdures, sometimes for years or even decades, rather than the “effect,” which no one bothers to try to replicate, because it is the interpretation that everyone is interested in. Then, once the interpretation has become part of the canon, in textbooks as well as journal article citations, if someone does venture to try to replicate, rather late in the day, the negative result is dismissed as going against a vast body of positive results!

I am not saying this is the case with Bargh’s finding. But I do agree with the others who say that what is needed here is several systematic, concerted efforts to replicate — and if they fail, to stop the special pleading and concede that this was yet another fragile outcome of psychology’s plethora of Type I errors (accepting chance findings as real).
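To make the Type I error point concrete, here is a minimal Python sketch (using invented numbers, not data from either study). It simulates many small two-group walking-speed experiments in which priming has no real effect, and counts how often a conventional significance cutoff is crossed anyway — roughly 5% of the time, purely by chance.

```python
import random
import statistics

random.seed(1)

def null_study(n_per_group=15):
    """One simulated study with NO true priming effect: both groups
    draw walking times from the same distribution (mean 7s, sd 1s)."""
    control = [random.gauss(7.0, 1.0) for _ in range(n_per_group)]
    primed = [random.gauss(7.0, 1.0) for _ in range(n_per_group)]
    diff = statistics.mean(primed) - statistics.mean(control)
    # Welch-style standard error of the difference in means
    se = (statistics.variance(primed) / n_per_group +
          statistics.variance(control) / n_per_group) ** 0.5
    return abs(diff / se)

# Run many null studies and count how many exceed |t| > 2,
# roughly the two-sided p < .05 threshold at these sample sizes.
n_studies = 10_000
false_positives = sum(null_study() > 2.0 for _ in range(n_studies))
rate = false_positives / n_studies
print(f"False-positive rate: {rate:.3f}")  # typically close to 0.05
```

One lab running one such study and hitting the cutoff cannot be distinguished, on its own, from one of these chance hits — which is why several concerted replication attempts, rather than a single success or failure, are what settle the question.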

To me, replication – its failure and its success – is the only way to study the most interesting thing on the planet, our own human minds. Yes, we want to understand the world as clearly as possible, but to me, the treasure is that we get to know ourselves through the process.
Smiles,
Robin

Who We Are

Phenomena is a gathering of spirited science writers who take delight in the new, the strange, the beautiful and awe-inspiring details of our world. Phenomena is hosted by National Geographic magazine, which invites you to join the conversation.

Ed Yong is an award-winning British science writer. Not Exactly Rocket Science is his hub for talking about the awe-inspiring, beautiful and quirky world of science to as many people as possible.