The (non-)replicability of soft science

Since last night the internet has been all atwitter about a commentary* by Dan Gilbert and colleagues about the recent and (in my view) misnamed Reproducibility Project: Psychology. In this commentary, Gilbert et al. criticise the RPP on a number of technical grounds, asserting that the sampling was non-random and biased, and that the conclusion of a replicability crisis in psychology, drawn in particular in the coverage by science media and the blogosphere, is essentially unfounded. Some of their points are rather questionable to say the least and some, like their interpretation of confidence intervals, are statistically simply wrong. But I won’t talk about this here.

One point they raise is the oft-repeated argument that the replications differed in some way from the original research. We’ve discussed this ad nauseam in the past and there is little point going over it again. Exact replication of the methods and conditions of an original experiment tests the replicability of a finding. Indirect replications, which loosely test similar hypotheses, instead inform us about the generalisability of the idea, which in turn tells us about the robustness of the processes we posited. Everybody (hopefully) knows this. Both are important aspects of scientific progress.

The main problem is that most debates about replicability go down that same road, with people arguing about whether the replication was of sufficient quality to yield interpretable results. One example by Gilbert and co is that one of the replications in the RPP used the same video stimuli as the original study, even though the original study was conducted in the US while the replication was carried out in the Netherlands, and the dependent variable was related to something that had no relevance to the participants in the replication (race relations and affirmative action). Other examples like this were brought up in previous debates about replication studies. A similar argument has also been made about the differences in language context between the original Bargh social priming studies and the replications. In my view, some of these points have merit, and the example raised by Gilbert et al. is certainly worth a facepalm or two. It does seem mind-boggling how anyone could have thought it valid to replicate a result about a US-specific issue in a liberal European country whilst using the original stimuli in English.

But what this example illustrates is a much larger problem, and in my mind it is actually the crux of the matter: psychology, or at least most forms of more traditional psychology, does not lend itself very well to replication. As I am wont to point out, I am not a psychologist but a neuroscientist. I do work in a psychology department, however, and my field obviously has considerable overlap with traditional psychology. I also think many subfields of experimental psychology work in much the same way as other so-called “harder” sciences. This is not to say that neuroscience, psychophysics, or other fields do not also have problems with replicability, publication bias, and other concerns that plague science as a whole. We know they do. But the social sciences, the more lofty sides of psychology dealing with vague concepts of the mind and psyche, in my view have an additional problem: they lack the lawful regularity of effects that scientific discovery requires.

For example, we are currently conducting an fMRI experiment in which we replicate a previous finding. We are using the approach I have long advocated: in order to replicate, you should design experiments that both replicate a previous result and address a novel question. The details of the experiment are not very important. (If we ever complete this experiment and publish it you can read about it then…) What matters is that we very closely replicate the methods of a study from 2012, and that study closely replicated the methods of one from 2008. The results are pretty consistent across all three instances of the experiment. The 2012 study provided a somewhat alternative interpretation of the findings of the 2008 one. Our experiment now adds more spatially sensitive methods to yet again paint a somewhat different picture. Since we’re not finished with it I can’t tell you how interesting this difference is. It is however already blatantly obvious that the general finding is the same. Had we analysed our experiment in the same way as the 2008 study, we would have reached the same conclusions they did.

The whole idea of science is to find regularities in our complex observations of the world, to uncover lawfulness in the chaos. The entire empirical approach is based on the idea that I can perform an experiment with particular parameters and repeat it with the same results, blurred somewhat by random chance. Estimating the generalisability allows me to understand how tweaking the parameters affects the results and thus to determine the laws that govern the whole system.

And this right there is where much of psychology has a big problem. I agree with Gilbert et al. that repeating a social effect in US participants with identical methods in Dutch participants is not a direct replication. But what would be? They discuss how the same experiment was then repeated in the US and found results weakly consistent with the original findings. But this isn’t a direct replication either. It does not suffer from the same cultural and language differences as the replication in the Netherlands did but it has other contextual discrepancies. Even repeating exactly the same experiment in the original Stanford(?) population would not necessarily be equivalent because of the time that has passed and the way cultural factors have changed. A replication is simply not possible.

For all the failings that all fields of science have, this is a problem my research area does not suffer from (and to clarify: “my field” is not all of cognitive neuroscience, much of which is essentially straight-up psychology with the brain tagged on, and also while I don’t see myself as a psychologist, I certainly acknowledge that my research also involves psychology). Our experiment is done on people living in London. The 2012 study was presumably done mainly on Belgians in Belgium. As far as I know the 2008 study was run in the mid-western US. We are asking a question that deals with a fairly fundamental aspect of human brain function. This does not mean that there aren’t any population differences, but our prior for such things affecting the results in a very substantial way is pretty small. Similarly, the methods can certainly modulate the results somewhat, but I would expect the effects to be fairly robust to minor methodological changes. In fact, whenever we see that small changes in the method (say, the stimulus duration or the particular scanning sequence used) seem to obliterate a result completely, my first instinct is usually that such a finding is non-robust and thus unlikely to be meaningful.

From where I’m standing, social and other forms of traditional psychology can’t say the same. Small contextual or methodological differences can quite likely skew the results because the mind is a damn complex thing. For that reason alone, we should expect psychology to have low replicability and the effect sizes should be pretty small (i.e. smaller than what is common in the literature) because they will always be diluted by a multitude of independent factors. Perhaps more than any other field, psychology can benefit from preregistering experimental protocols to delineate the exploratory garden-path from hypothesis-driven confirmatory results.
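The dilution argument above can be made concrete with a toy simulation (my own sketch; the effect size and factor counts are made-up numbers, not from any study): hold the raw mean difference between two conditions fixed, and let each additional independent contextual factor add noise variance. The standardised effect size shrinks accordingly.

```python
import numpy as np

rng = np.random.default_rng(42)

def observed_d(true_effect=0.5, n_context_factors=0, n=10_000):
    """Cohen's d for a fixed raw mean difference when each extra
    contextual factor contributes independent noise variance."""
    noise_sd = np.sqrt(1 + n_context_factors)  # independent variances add
    a = rng.normal(true_effect, noise_sd, n)
    b = rng.normal(0.0, noise_sd, n)
    pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    return (a.mean() - b.mean()) / pooled_sd

for k in (0, 3, 9):
    print(f"{k} extra context factors: observed d = {observed_d(n_context_factors=k):.2f}")
```

The raw difference never changes; only the unexplained variance grows, yet the standardised effect (and with it, the power of any fixed-n replication) drops steadily.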

I agree that a direct replication of a contextually dependent effect in a different country and at a different time makes little sense but that is no excuse. If you just say that the effects are so context-specific it is difficult to replicate them, you are bound to end up chasing lots of phantoms. And that isn’t science – not even a “soft” one.

Then again, all fields of science are “soft”

* At first I thought the commentary was due to be published by Science on 4th March and embargoed until that date. However, it turns out to be more complicated than that, because the commentary I am discussing here is not the Science article but Gilbert et al.’s reply to Nosek et al.’s reply to Gilbert et al.’s reply to the RPP (confused yet?). It appeared on a website and then swiftly vanished again. I am not comfortable posting it because the authors evidently didn’t want it to be public, but I don’t think having the article itself is essential to understanding my post.


34 thoughts on “The (non-)replicability of soft science”

I’m not sure why there is all this hating towards Dutch students using English materials. First, Dutch people tend to have excellent levels of English and a surprisingly good understanding of other countries’ cultural issues. Second “Dutch” students might well include a substantial number of non-Dutch people. Third, many Dutch university courses are taught in English anyway (not least because the non-Dutch people will have problems with materials in Dutch).

Furthermore, I’m not sure how representative the average American undergraduate’s experience of race relations and affirmative action is (American undergraduates, it seems to me, are overwhelmingly White or East Asian and middle class in origin). I’m guessing that both American and Dutch 19-year-olds are mostly “basically anti-racist” but not especially well-informed about the matter, and neither group probably has a lot of Black friends.

Now, let me guess what the reaction from the original researchers would have been if, outside the context of the replication project, a group from Amsterdam had called them up and said “Hey, great news – we found the same effect here too!”. Would the original authors have scolded the Dutch researchers for the obvious cross-cultural problems and cautioned them to be very careful in their interpretation? There’s quite a bit of “eating the cake while retaining possession of the cake” going on, I think.

Thanks for your comment. I think you miss the point in the beginning. Everything you say about the English skills of Dutch students and the cultural awareness of US students etc. is doubtless true, but that’s not really the problem. Whether or not the students are racist is largely irrelevant (in fact the “anti-racistness” could be the driving factor for all we know).

I don’t think it’s defensible to say that there cannot be a difference. This isn’t visual search but a complex human behaviour. Maybe you’re right and there is no difference in this behaviour between Dutch and US subjects but it’s entirely reasonable to question that assumption.

And therein lies the problem. You can always question the context. So unless the researchers make very clear what should or should not generalise and why, they are chasing phantoms or – in statistical terms – overfitting the noise. Your final hypothetical scenario illustrates this nicely. I think you’re right: if the Amsterdam experiment had replicated the effects, they probably wouldn’t have complained about the contextual difference.

And this is often the way in these debates: the differences between studies are only a problem if they fail to replicate. That makes sense, because the failure to replicate may mean that there is a moderating factor – but (as I have said many many times) in that case you need to do more experiments to demonstrate that.

I wasn’t trying to suggest that there isn’t a difference; clearly there is. I only mentioned the case of the Dutch students because I’m not sure that it’s worth a full-on facepalm. Presumably the replicators thought about this and decided that the differences were reasonable (to my shame, I haven’t read either the original article or the replication).

I haven’t read them either :P. I should say the facepalming is something I base solely on Gilbert and Co’s description, but provided that this is reasonably accurate I think it’s justified. I wouldn’t call this a direct replication. And this is part of the problem I’m discussing: by every standard definition it obviously *is* a direct replication! They used the same stimuli and closely replicated the methods, even though that was probably not appropriate here.

There is a catch-22 here: you can replicate the method exactly but thus make it an inappropriate test for the hypothesis or you can adapt the methods to make it appropriate to the context but then it definitely isn’t an exact replication any more. There is no way to win.

I think that Sam raises some very good points here. My take is a little more hopeful than his, though. I think that psychology does (hopefully) generate “the lawful regularity of effects that scientific discovery requires”, but that those laws, since they deal in social, contextual variables, are often phrased in terms of socially and contextually dependent things. As Sam says, these can be woolly and slippery to get hold of. But not impossible. The key is the level of description.

So there are several ways to describe the study we ran. You could say
(a) When Stanford students hear someone saying something potentially offensive about race relations and Stanford admission, they will fixate a Black American who can hear the remarks
(b) people will look at the target of potentially offensive language when the target can hear the remarks, but not when he is visible on screen yet cannot hear them.

It seemed to me that the replication experiment tried to replicate at the first level of description. They tried to generalise this, to see if non-Stanford students would fixate people in the same way, in response to the same video. Unsurprisingly, they didn’t. But the conclusions we wanted to put forward have nothing to do, really, with Stanford or race relations. So whether or not the new participants were Dutch, American, racist, liberal or not, is beside the point. We weren’t trying to conclude anything about those things in particular.

Our conclusion – our generalisable law (hopefully!) – is all about (b), the second level of description. It concerns how listeners respond to offensive language, and the sort of social cognition that they engage in. The key experimental manipulation is whether the target can hear the remarks.

The previous post here has imagined that a Dutch group did replicate our results, has imagined our response, and become angry that I am eating cake. I don’t really know what to say about that…

So perhaps I can just say this… I would be happy with any experiment that replicated at the right level of description: one with remarks that the listener knew were potentially offensive, a bystander target who would plausibly be offended, and experimental control over whether or not the target heard the remarks. If all those things were in place, I’d be very happy to call it a replication. Of course, there can be debate (at a conference or in a review or on a blog) about whether and how those conditions hold or not. That’s the tricky (but not impossible) thing about social psych. But if the conditions could reasonably be said to hold, and the effect was not found, then I would be very concerned about the generalisability of our effect, and happy to say that we had gotten things wrong. I would not touch any cake.

As an aside, I went into these issues *at length* with the people involved in the first replication. I suggested that they should do exactly what Sam advocates here: replicate and extend. Replicate with our stimuli if you want, but more importantly with stimuli that are tailored to your participants, or that look at sexism or ageism, or any other type of potentially offensive language. Then we might learn something useful about the scope and limits of the effect. But that whole discussion seems to have gotten… lost… in the replication attempt, and now it has blown up into something else.

My own feelings about the replication project are decidedly mixed. I’m still a strong supporter of replication attempts, and am trying to make them a foundational part of our undergrad education. I also see the merits of pre-registration and other changes in journal publication practices.

Anyway, this might all be an egocentric error on my behalf. Maybe it’s not my paper! Happy to say more when I see the commentary…

Hi Daniel, and thanks for your comment! I should probably have clarified my statement about lawful regularities a bit: I do agree that psychology can discover laws but it’s much harder than in many other fields. The reason for this is two-fold:

1. The effects are fuzzier and noisier than in many other scientific results because they are governed by a complex mix of interacting factors. Context is far more important than in many other fields. For instance, a simple visual search experiment will produce pretty equivalent results wherever you test it, whoever you test, and – within certain limits – even regardless of what stimulus you use. Social psychology is by definition much more susceptible to context and boundary conditions. This is what my post is mainly about.

2. The natural laws in social (etc) psychology are also more diffuse and complex themselves. That doesn’t mean there aren’t any but they aren’t as clear cut as in other fields. Perhaps I’m wrong but I doubt social psychology will ever see the equivalent of Newtonian mechanics, Relativity, or even psychophysical principles like the Weber-Fechner Law.

About the commentary, I agree it’s a bit daft that it’s not public yet. Also, the example may even be in a commentary by Gilbert and co about the reply to their reply to the RPP, thus making this even more confusing. I’m not sure they will even publish it at all now (but I can send it to you if you wish…)

I found myself very sympathetic to Daniel’s point here, even though I can be counted among those who are skeptical about replicability in science. The issue is often more complex than we might think it is. For example, I am of the opinion that in order for a finding to be treated seriously, it should be possible to replicate it with methods that can be communicated through a methods section and possibly some extra commentary. However, it must also be the case that the people reading that methods section and building the replication have the background to properly interpret it, and from Daniel’s description it seems like this might not have been the case with his study.

Consider as an extreme example someone who is given the complete specs for the Large Hadron Collider, lacks the background to truly understand it, and yet is tasked with replicating it. Such a person may build a metal tubular structure, hook up some electricity, and get nothing. In such a case, no one would propose that this was a valid replication attempt, even though it had been made in good faith.

My point is that the same thing might have occurred in some of these studies, though obviously to a less extreme degree. If one misunderstands the construct or its cultural confounds, then it’s possible to completely miss the crucial point of the experiment while adhering to the letter of the methods. This does not necessarily mean that the methods are inadequate. We all write methods with the expectation that the reader starts with a relatively similar set of assumptions and knowledge as we do.

For example, I do not specify that my experiments are run indoors rather than out in the middle of a field on a sunny day (which would have a HUGE effect on the ability to perceive a computer display). I just expect the reader to know certain things. It sounds as if Daniel did his level best to impart his knowledge to the replicators, but they perhaps lacked some crucial contextual information. I haven’t read the studies in question so I can’t offer a firm opinion, but it’s very plausible.

Hi Brad, I largely agree. Of course we have had all these discussions many times before (in fact I’ve used the LHC example myself several times). My main point here is though that instead of worrying too much about replicability of individual findings it really seems more important to study the robustness of larger hypotheses. You can fail to replicate the Bargh experiments over and over and argue ad infinitum what that means but I think it’s not getting us anywhere. Instead I want to see some more concerted effort to actually put that whole underlying theory to the test (and most likely, to rest).

Daniel, I apologise if you found my remarks about hypothetical cake (etc.) to be inappropriate. I should have written them more carefully to direct them at the critics of the replication project, who have indeed created what Sam referred to as a Catch-22 situation.

I have now read your article, and the replication report (which is available here: https://osf.io/nkaw4/). Interestingly, Gilbert et al. did not mention that as well as in Amsterdam, the replication was also performed in Worcester, Massachusetts, with the same (null) results. The replicators also conducted a separate replication with a specially-prepared version of the video that had references to Stanford removed (perhaps so that nobody could say that the reference to another school caused participants to dismiss the video as irrelevant to them). They appear, to my minimally-trained eye, to have been rather thorough in documenting the choices they made.

Hi Nick, I thought Gilbert et al. did mention the US replications – they claim that this one was successful (albeit only barely significant). Is this the same one you’re talking about? I already discussed this in my post.

Anyway, I’d say this is irrelevant really. While the difference is presumably not as strong as between Amsterdam and Stanford, there are still a billion reasons why you could argue that they are different. That’s the crux here: with the broader questions about human behaviour studied by traditional psychology (in particular social psychology) there are *always* context-dependencies you can argue about. There can of course be context/parameter-dependencies in all fields of science, but I think it’s delusional to think that this isn’t much more severe in psychology.

It isn’t a criticism of psychology. I think it has a lot to contribute as a field but I think psychologists need to be much more wary of this catch-22 issue. If this discussion continues to be between Camp ‘Nonreplicable’ and Camp ‘Context’ it is never going to move forward. At some point a crisis in psychology turns into its stagnation.

I hope this reply gets posted in the right place: it’s to Sam’s comment about “I thought Gilbert et al. did mention the US replications – they claim that this one was successful (albeit only barely significant).”

First, my bad, they did mention this. I was skimming the Science article prior to replying and didn’t find a mention of it, but that’s because I was assuming that the 6 problematic replications they discussed included this one, and they don’t – this replication is only mentioned in the press release. (So I choose to blame you, Sam, for bringing it up! :-))

That said, the press release says that Amsterdam researchers “basically replicated the original result” in their U.S. sample. I guess that’s one interpretation of a p=.078 result on the original ANOVA and a p=.033 t test that was not even performed in the original. And that was only in the second bite at the cherry, once the video had been edited to remove references to Stanford (which was an initiative of the replicators). It seems like the replicators worked hard to get a significant result somehow. Hmmm, does that sound familiar?

Thanks Nick for clarifying. I must say I got very confused. What I discuss in my post is the reply to the reply to their reply to the RPP. When I first read it I thought *this* was the Science article to be published tomorrow, but it apparently is not. I didn’t think it was the press release though (it didn’t read like one but like a paper). They removed the public link so maybe it won’t be published at all now. I will change the footnote to reflect this…

Anyway, I think there are two issues. I don’t think we can quantify whether something replicates by looking at whether or not the replication was significant. Sadly this is what everyone talked about with the RPP, even though they looked at the data from all sorts of angles. Without any further information, p=0.07ish could easily mean a positive replication, albeit of a much smaller true effect (which could be smaller due to publication bias). This is why we should use replication Bayes Factors (cue Alex Etz… ;).
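The point that a p≈0.07 replication is compatible with a real but smaller effect is easy to simulate (a generic sketch with made-up numbers, not a replication Bayes factor calculation): if only significant originals get published, the published effect size is inflated, and a faithful replication of the true, more modest effect will often land above the significance threshold.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

true_d, n = 0.25, 60            # a modest real effect; per-group n (arbitrary)
published_d, replication_p = [], []

for _ in range(5_000):
    # original study: only significant positive results get published
    a = rng.normal(true_d, 1, n)
    b = rng.normal(0.0, 1, n)
    t, p = stats.ttest_ind(a, b)
    if p < 0.05 and t > 0:
        published_d.append(a.mean() - b.mean())
        # a faithful replication of the same modest true effect
        ra = rng.normal(true_d, 1, n)
        rb = rng.normal(0.0, 1, n)
        replication_p.append(stats.ttest_ind(ra, rb)[1])

print(f"true effect: {true_d}, mean published effect: {np.mean(published_d):.2f}")
print(f"replications with p > 0.05: {np.mean(np.array(replication_p) > 0.05):.0%}")
```

In this setup most replications are non-significant even though the effect is perfectly real; the failure is baked in by the selection filter on the original, not by anything the replicators did wrong.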

The second issue, which I still think is actually a greater problem and which I have already mentioned in the post, is that the same problem I discuss also affects the US replications. They wouldn’t have the potential cultural and language differences but there may be a multitude of other differences. Even going back to Stanford may not work because the culture could have changed, subjects might be able to deduce the hypothesis because they have heard about it, or the effect may depend in other ways on the subject sample.

The only thing you can conclude from these non-replications is that the effect isn’t very general. This is precisely what I’m talking about. In general psychology it seems to me a very justified assumption that many effects are extremely fickle, much more so than, say, slopes in conjunction search, the dependency of the tilt illusion on eccentricity or surround orientation, or (more high level?) the Stroop effect. This is a challenge that, even if not unique to social/personality psychology, is likely to be much stronger there than in other parts of cognitive science, and certainly stronger than in biology or physics.

You say “this is a problem my research area [cognitive neuroscience] does not suffer from”, in reference to the context dependency thing. You give one example of an unspecified effect that is apparently robust. But you also say “whenever we see that small changes in the method (say, the stimulus duration or the particular scanning sequence used) seem to obliterate a result completely, my first instinct is usually that such a finding is non-robust and thus unlikely to be meaningful.” So clearly there ARE fragile, context-dependent effects. Sure, it’s possible that most effects in cog neuro are of the first kind, and not of the second kind. Your comment implies that you think this is the case. But frankly, without systematic replication data to back this up, I’m skeptical.

One thing we do know is that the typical analysis pipeline in neuroimaging research is so loaded with researcher degrees of freedom that it’s probably very, very often possible to find some combination of analysis parameters that bring out a result that could be interpreted as the hypothesized effect — which result, of course, probably won’t replicate under slightly different conditions or even under the same conditions in a future sample (because it’s essentially the result of overfitting). Again, until there’s some systematic replication data assuring that this is NOT a widespread problem in cog neuro research, the self-congratulatory attitude is probably unwarranted.
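Jake's overfitting worry can be illustrated with a deliberately generic simulation (nothing fMRI-specific; all numbers are arbitrary): treat each defensible analysis variant as one of several correlated outcome measures under the null, and the chance of finding "significance" somewhere climbs well above the nominal alpha.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

n_subjects, n_variants, n_studies = 20, 5, 4_000
r = 0.5  # the analysis variants reuse the same data, so they are correlated

# each column stands in for one defensible analysis pipeline under the null
cov = np.full((n_variants, n_variants), r) + (1 - r) * np.eye(n_variants)

hits = 0
for _ in range(n_studies):
    y = rng.multivariate_normal(np.zeros(n_variants), cov, size=n_subjects)
    p = stats.ttest_1samp(y, 0).pvalue  # one p-value per analysis variant
    hits += p.min() < 0.05              # report whichever variant "worked"

print(f"nominal alpha: 0.05, rate of 'significance' somewhere: {hits / n_studies:.2f}")
```

Because the variants are correlated the inflation is milder than with independent tests, but it is still several times the nominal rate, and crucially, whichever variant "worked" would not be expected to survive in a fresh sample.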

Thanks for your comment Jake. It’s impossible to have a coherent discussion on twitter. I think you misunderstand what I’m saying (which is usually because I’m not expressing myself clearly enough):

1. I never said all cognitive neuroscience effects are clearer than social psychology effects. They definitely aren’t. But I also think in cognitive neuroscience, especially within my own field (which is also partly systems neuroscience), you wouldn’t get very far with a fuzzy hypothesis like those that seem commonplace in much of social psych (e.g. social priming). If an effect lacks robustness so badly that I am worried that the same stimulus won’t work on a different sample then I wouldn’t put much money on it. But that is exactly what Gilbert is asking us to do.

2. My post also isn’t about publication bias, researcher degrees of freedom and all the rest. Most scientific fields suffer from these issues and they undoubtedly reduce replicability of findings. This is an important discussion worth having but it’s a separate issue. I’m talking about the reduction in replicability that results from having noisier measures and vaguer laws governing them. There are simply more diluting factors in something like walking speed after unscrambling words than there are in the tilt aftereffect.

3. Researcher degrees of freedom in fMRI analysis are also a red herring. This may be true when you talk about blobology fMRI experiments without any prior hypothesis. Let’s call that the dead salmon activation. But that is actually not remotely as common as some would like us to believe. I never did unconstrained blobology experiments in my research. And I think I have a pretty good handle on the robustness of the kinds of effects we do study in the face of changing methods. The kind of data we work with in the lab is usually virtually the same regardless of the scanner, pulse sequence, amount of data collected, stimulus used or the participant. This is not to say that everything is always rosy and perfect. Obviously it isn’t. Some things are noisy, some experiments are vastly underpowered and suffer from all sorts of problems. Nobody is denying that. Some results therefore probably don’t replicate and some may be dependent on subtle parameters. But this has nothing to do with how much replicability we can expect a priori from this field of research. More importantly, what I’m saying here is that you can’t paint all of fMRI with the same brush.

4. Speaking of brushes, you also can’t paint all cognitive neuroscience with the same brush. As I said on twitter I don’t doubt that a lot of cog neuro results are fragile. I can’t speak for all of cognitive neuroscience some of which is no doubt overlapping with social and personality psychology. I can only talk about my own field and I think by and large what I said holds true there.

5. The example I gave is just that. I know I left it vague because the details don’t matter in my opinion. I’m happy to send you the links to the two previous studies so you can look at them but that’s not the point (I was going to blog about those soon anyway actually but not sure I will). The point is though that we could essentially use the same stimuli, same length of data collection etc and produce similar results. Nobody in their right mind would say “But you are not testing this on Americans! Your stimuli are inappropriate!” This would make no sense. We are studying something about visual object perception that I would suspect would work in a macaque monkey, it sure as hell should generalise to the normal human brain! The whole point of my post is that social psychology can’t say the same in many situations. I’m sure it can for some effects but in a large number of cases the context and boundary conditions matter a great deal. That makes it difficult to interpret the results and to plan a meaningful replication. Difficult, not impossible.

As I said this is just an example and I don’t think the details matter. There are plenty more and just off the top of my head here are a few (and this would have been the topic of my blog post about replication). Both of these are basic neuroimaging findings that have been replicated 100 times the world over:

I really don’t see how anything you’ve said distinguishes cognitive neuroscience from social psychology, Sam. You’re basically cherry-picking favorable examples from the former and comparing them to crummy examples of the latter. You could have just as easily gone the other way. For example, many people study reading with fMRI. Imagine you took a study originally done on Japanese readers, observed that word recognition is associated with TPJ activation, and then said, “hey, we should replicate this with English readers here at Stanford”–and then proceeded to show the American undergraduates the same Kanji characters used in the original study. Would it surprise anyone if the effect didn’t replicate? Would it then be sensible for you to say that reading effects “lack robustness so badly that I am worried that the same stimulus won’t work on a different sample”? No. You would say, “hey, maybe you shouldn’t present stimuli that are only meaningful to one group to a very different group.”

Similarly, I’m willing to bet that, in vision or psychophysics experiments, there are all kinds of things that are very important to you but would probably seem quite trivial to someone not in your field. Does the pop-out effect really _always_ happen, irrespective of the number of stimuli in the display, or the starkness of the contrast between the target and the distractors? Does the size of the dot matter in a dot-probe task? How about the lighting in the room? Will it change anything if you move the target five degrees away from fixation? When it’s your field, these things don’t seem so trivial, and I bet I could find any number of studies in which people explain differences between supposedly “basic” findings by appeal to such moderators. Is it hard to imagine a personality psychologist thinking, “dude, I can get the same five dimensions of personality to show up in every single culture I’ve looked at, and yet you can’t even get this same effect to show up if the room’s dark in one case and light in another?”

Of course, none of this is to deny that some domains or research questions feature more law-like behavior than others. But as Jake pointed out, unless you have systematic evidence to back up the claim, it’s not at all obvious that the particular fields you’ve chosen to contrast differ appreciably. And it’s not clear why you have to go there at all; you could have made that point without any tendentious comparisons. I’ll be the first to agree that many (most?) psychology results are ill-defined to the point that operationalizing the notion of replication is difficult… but one can make that point without putting another field that suffers from essentially the same issues on a pedestal. I shudder to think what, say, a chemist or physicist reading this would think. “Oh, you think cognitive neuroscience is a law-like, ‘hard’ science? Let me buy you a drink.”

Hi Tal, I think we should get away from this notion that I’m talking about cognitive neuroscience at all. I am not. I never mentioned it in my post. I honestly am not even sure what cognitive neuroscience really is other than that it involves neuroimaging. A lot of cognitive neuroscience suffers from the very same problems I discuss about psychology – because that is what it is. If this is what your point is, you won’t get much of an argument from me.

Also, let’s get away from this erroneous notion (again, probably my terrible communication to blame) that I’m putting anything on a pedestal. I am saying that any lofty research question about complex human behaviour is difficult to study – and the way that research is done should reflect that (including how replications are done and how they are interpreted). I really don’t give a damn if it’s social psychology, personality psychology, cognitive neuroscience or whatever. The problem is not this rather artificial categorisation but what the research question is. The reading example you mention is a perfect illustration of that.

Although on further thought I should add that the reading example is actually still much more constrained than a lot of “proper” psychology research, so in that sense it isn’t a good illustration. Sure, testing Kanji on Americans without any knowledge of Japanese is unlikely to replicate the same effect – but this is an easily testable effect. It lends itself to being adapted to English writing. It lends itself to learning studies. It allows for fairly simple control experiments. I don’t think any of these can be said about the racism experiment or many other social psych experiments. So no, I don’t think I buy your example.

And furthermore, I think your examples about lighting or moving your eyes are also strawmen. Why would these be trivial to an ‘outsider’ to the field? Rather, they are trivial to an expert! If I set the background lighting so bright that I can’t see the screen, you should not be surprised if the experiment doesn’t produce the same result. The same applies to moving your eyes so that the stimulus falls in the periphery. However, I would also bet that in such cases you can tell much more easily that the results are just useless. Not always, of course, but I suspect you can tell better than in most social psych experiments.

Anyway, I will stop bombarding you with my thoughts now :P. My whole point is that, on the whole (there are always exceptions), fields like physics, chemistry, biology and psychophysics have more constrained and well-defined regularities than traditional psychology. And the whole thing falls on a spectrum. The reading studies are flakier than studying the tilt aftereffect or basic biological relationships (e.g. what neurons look like in condition A), let alone studying gravitational waves or particle physics or whatnot. But they are definitely still more regular than studying complex human behaviour in action (as that racism study did).

You didn’t really respond to my central point. What I’m saying is that you can take almost any effect in neuroscience–including systems and molecular neuroscience–and get it to go away by changing variables that, to a non-expert, would seem completely trivial. And in a huge number of cases, effects _don’t_ consistently replicate from study to study, often for reasons that are just as unclear as they are in psychology. So it simply isn’t true that this is a problem your research area doesn’t suffer from. I mean, it may be true that some of the _specific_ problems you’re working on don’t suffer from that problem; but if the idea is that most findings in systems or cognitive neuro behave in a law-like way, that’s something you’ll need to provide empirical evidence of, because it doesn’t follow from the single anecdote you provide. (You also seem to beg the question when you say that “whenever we see that small changes in the method … seem to obliterate a result completely, my first instinct is usually that such a finding is non-robust and thus unlikely to be meaningful.” What basis is there for assuming that inconsistent results must reflect error rather than unmodeled complexity when it’s systems neuro, but that it could be either when it’s psychology?)

The opposite is also true, of course: a huge number of effects in psychology are highly robust to all kinds of variations in context. Your suggestion that in “social and other forms of traditional psychology … Small contextual or methodological differences can quite likely skew the results because the mind is a damn complex thing” is clearly not true of all psychology. Take a look at the results for the anchoring effects tested in the first Reproducibility Project if you want a robust effect, for example. Anchoring is, on its face, a completely absurd effect that completely violates almost everyone’s intuitions about how a cognitive agent should behave. It’s the product of a complex psyche; yet it’s quite regular, and can be systematically taken advantage of in a wide range of cases (e.g., by telling your real estate agent to list your house at a higher price).

To reiterate what I said in my last comment, I don’t doubt that some domains are more regular than others. And it wouldn’t even surprise me if some hypothetical averaging of all possible systems neuroscience questions revealed that the behavior of systems-level neural processes was more stable on average than the average of all processes studied by psychologists. But those between-discipline differences seem to me quite minute in comparison to the massive variation we observe within a discipline, or even a specific research area. Plenty of people work on questions in systems neuroscience where the effects are weak and context appears to matter a whole lot, and plenty of people work on questions in psychology where effects are huge and don’t seem to vary much within the range of tested contexts. If you want to draw a principled distinction between psychology and neuroscience in terms of reproducibility and regularity, I think you need to do better than provide one anecdote of an effect you’ve worked on.

(Re: your last comment: sure, there’s a gradient. I don’t think anyone would dispute that. But you didn’t say “psychology is a little fuzzier than systems neuroscience, which in turn is a little fuzzier than molecular biology, which in turn is fuzzier than chemistry”. You said “this is a problem my research area does not have”. And for any definition of “my research area”, I’m sure I can find plenty of cases of mysterious non-replication and context-dependency with a few minutes of googling.)

“What I’m saying is that you can take almost any effect in neuroscience–including systems and molecular neuroscience–and get it to go away by changing variables that, to a non-expert, would seem completely trivial. And in a huge number of cases, effects _don’t_ consistently replicate from study to study, often for reasons that are just as unclear as they are in psychology.”

I think you are still missing the point. Yes, you can make any effect go away by changing variables, but that’s not the issue here. What do you mean by “seem completely trivial” to non-experts? If I lack the expertise to understand why an effect may depend on a particular variable, then you shouldn’t take my failure to replicate seriously. It’s the LHC example from above (and before) all over again. If you run a visual psychophysics experiment on a laptop in the sunny Nevada desert and fail to produce an effect, you don’t even need to be an expert to see why that is rubbish. If you grow a cell culture and screw up the medium and nothing happens, then surely that’s your problem, not that of the original authors.

This is where you seem to miss the point. In (much of) psychology that is not the same. The Amsterdam–Stanford comparison is a perfect example of why. For all intents and purposes, the Amsterdam experiment was a direct replication. They used the same stimuli and addressed the same question in a different sample. Surely that meets the definition of a direct replication. But it may or may not be the appropriate replication to do. In psychophysics, single-neuron physiology, cell biology, or particle physics for that matter, you have a built-in assumption that the effects should generalise pretty widely. That does not mean they can’t be obliterated by poorly done experiments or that there aren’t sometimes moderating factors, but in those cases the moderators usually are trivial! In psychology they usually are not.

“I’m sure I can find plenty of cases of mysterious non-replication and context-dependency with a few minutes of googling.”

Again, careful that we aren’t talking about different things. I think all of science (perhaps physics less so) has problems with replicability, but I think these are caused not inherently by the research questions but by methodological issues like low power, publication bias, and analytical flexibility. And personally, I’d doubt that you could find lots of “mysterious non-replication and context-dependency”. I am sure you could find context-dependency and non-replication, but it’s the mysterious part I don’t believe.

Take the Boekel et al. replication of the structural brain–behaviour correlations. Just looking at the VBM ones, there we have a clear non-replication but – according to Ryota at least – also a clear parameter-dependency, in that the replication used too sharp a spatial prior for the cluster. There are also other differences that may play a role: they used a sequence with lower SNR, a 3T scanner that likely has more spatial distortion, and a presumably less accurate form of spatial normalisation. (And then there is the issue of a priori power and post-hoc strength of evidence, but that’s really a red herring in this discussion.) None of these things conclusively tells us whether the original effects are real, and you are welcome to have low prior odds on that.

But all of these context-dependencies are easily testable! They are clear-cut. This is the difference. In the Amsterdam–Stanford–Massachusetts comparisons nothing is clear-cut. Someone on Twitter summarised my argument as saying that psychology effects are snowflakes. That’s exactly right. These kinds of effects have a context-dependency inherently built in, and no, I don’t think that is true of most of the effects in my field. If you call cognitive neuroscience my field then perhaps there are some, but I would call those parts of it psychology.

Again, this has nothing to do with pedestals. I think psychology is a science and it is an important one at that. I just believe psychologists need to think more about the challenges they are facing. Challenges that are perhaps not unique to psychology, but particularly strong in it.

(See my earlier post – I’m one of the authors of the paper that was the subject of a replication attempt Gilbert et al. are discussing. At least I think so – I haven’t read any of the embargoed stuff. Sam – if there is anything you can send me, please do!)

I just wanted to follow up on one thing…

One can make a very simple statement like ‘a direct replication attempt is one where the same stimulus is used with a different population’. In some cases, that’s trivially false. For example, if you replicate an English psycholinguistic experiment in Japan with the original English stimuli, then obviously you won’t get the same result, particularly with subjects who don’t speak English…

So I think we would all agree that you can’t use exactly the same stimulus in a replication here. You’d need a translation, and that’s not easy, because word meanings can be subtle and contextual and so on. And when you are studying social phenomena in different populations, you also need to translate the stimulus – not word for word, but in a way that captures the same social phenomenon in a culturally and contextually appropriate way.

Ramscar talked about the classic Bargh study where people were primed with the concept of the elderly by presenting them with semantic associates of the word ‘old’. Afterwards, they walked more slowly out of the room. (In)famously, this was the subject of one of the first high-profile non-replication studies, when a lab in Belgium comprehensively failed to get the same result.

But this is Ramscar’s point… What the researchers wanted to do is present ‘the same stimulus’ to participants as Bargh did. So they translated Bargh’s English stimuli into their French equivalents. But here’s the thing: priming works because of the statistical properties of a particular language, the patterns between words and so on (this is what Ramscar studies). And in French, those translated stimulus items *do not* have the same statistical properties and subtle associations with the concept of ‘old’ as in English. Indeed, since the first study was published in 1991, even in US English those properties have shifted. So even if you believed that priming works, and believed the original effect, you still wouldn’t really expect the Belgian replication to work.

I bring this up not because I think Bargh’s original effect needs defending (for other reasons, I think there are serious grounds to be skeptical about that particular experiment). I simply wanted to illustrate (with work that isn’t mine) that the principle that ‘a replication has to use the same stimulus’ can be surprisingly non-straightforward.

And perhaps if you use mainly psychophysical stimuli in your experiments, it’s pretty easy to overlook this problem, because you can replicate elsewhere just by emailing the images to another lab. But we’re playing a different game, and different things are important.

I make this comment having been on both sides of the equation, BTW, since I began in vision science and ended up working in social psych. I’ve been in the position of trying to explain to vision scientists why it matters that the stimulus is culturally relevant to a population, and also of trying to explain to social psychologists why it has to matter that two stimuli are isoluminant in a saccadic task, for example. The social psychologist’s comment (which echoes some of the comments here) was ‘surely your effect isn’t so weak that these changes in lighting make a difference’…

Hi Daniel, are you the same Daniel Richardson in my department? 🙂 In that case I can send you the files by email…

Anyway, thanks for pointing out the Ramscar post. I blogged about it in passing before and meant to mention it here as well but then forgot. There are a lot of things to criticise in his argument, but I agree with you that the general point stands. It is a good illustration of just how context-dependent you can argue social psych effects to be (whether they actually are isn’t even the point – there is plenty of room to argue about it).

Talking about the vision science example again: as I said, these parameters are much easier to define. If stimuli are not isoluminant, you are adding a lot of factors you should be controlling. This isn’t a “trivial” difference, and if a non-expert called it trivial, they clearly lack the expertise to judge the merits of the experiment. On the other hand, if subtle differences in stimulus parameters alter an effect in visual psychophysics, you had better show me that this modulator is robust, because otherwise most people will suspect it is a fluke.

My point is that the situation isn’t as easy in social (and many other fields of) psychology. In vision science you can very easily replicate the same methods and usually be fairly confident that they are valid. In social psych, both in your own case and in the Bargh experiment, the situation is simply more complicated than saying “I used the same stimulus”.

Yep, that’s me – hi Sam! I’ve been trying to find the commentary but all the Twitter links I found were dead – perhaps because the authors realised they were breaking the embargo. Thanks for sharing. And I entirely agree with your comments about vision science and the level of expertise required to evaluate it.

Stepping back a little, perhaps one could say that a lack of critical expertise at some stage of review is where many of these problems start (on the part of reviewers of the original study, reviewers of the replication attempt, etc.). Indeed, scanning through Twitter, there are strong claims being made that the Gilbert commentary itself makes a basic error in understanding what a confidence interval is – and that this escaped Science’s peer review process.

I don’t know how to react to this stuff. Are we as a science getting worse at peer review and the critical expertise needed to properly evaluate studies and stats? Or is it like when crime statistics go up – maybe (because of Twitter and blogs?) people are just getting better at spotting and sharing problems. I don’t know…

Yes, they got the confidence interval wrong. That’s in the supplement (I’m not sure if that’s the supplement to the Science article or to the commentary I’m talking about here – it’s very confusing). It could be they took it offline because of the embargo, or perhaps they saw the responses about the CI issue and decided to correct it before going live…

I’ll leave you all with this quote from the Nosek et al. reply to Gilbert et al.’s Science article:

“What counts as a replication involves theoretical assessments of the many differences expected to moderate a phenomenon.”

This is precisely the point. Of course, Tal and Jake are right that all sorts of parameters can modulate or even destroy an effect in psychophysics, neuroimaging studies, or studies of sensory cortex. Nobody is questioning that. But the problem is that in most of these cases I would not put much faith in effects that are not robust to such changes. Whenever we replicate something, we always discuss whether particular changes to the design, stimulus, analysis protocol, etc. are justified. Obviously it is usually best to aim for minimal changes. However, there are many choices where I would say that if they affect the result in a substantial way, I would regard the effect as fickle and meaningless.

Using the example of our replication that I discussed in the post: we closely replicated the stimulus, but there are obviously still discrepancies – the TR of the scans and the voxel size, for instance. If I thought these things mattered significantly, then I wouldn’t have much confidence in the effect.

My point is that in social psychology this isn’t true. Some effects are fickle presumably because the underlying phenomena are fickle, and you should consider that when you study them, both as an originator and as a replicator.

Agreed. But we have people taking these fickle results out into the world and getting 6- or 7-figure deals for self-help or business books based on the “robust truths” that they have discovered. They get tame journalists to provide stories for the Economist, or they have their own op-eds in the New York Times, which policy wonks read and use as the basis of changes to legislation. Consultants pick up on these papers and turn them into worthless education enhancement programmes to sell to schools because “Scientists have proven that your brain works like this”. Once those boulders start rolling, a non-replication with a 10x greater sample is not going to stop them, especially if the original researchers are allowed to cry “hidden moderator!” or “inaccurate replication!” (while not explaining how their effect still manages to transfer perfectly from the lab to a business or school setting).

Yes, and that relates to a much bigger problem we have with science publishing and mainstream media coverage. I really think this needs to change. I try my best when I talk to the media, but if they decide to cut the 10 minutes I spend talking about the limitations and leave in the 10-second soundbite that sounds cool, there isn’t much I can do about it. The changes must be more wide-ranging than how scientists communicate with the media.

Anyway, I think psychology studies of this kind, the ones that are chasing snowflakes, must find better ways to deal with this problem. I don’t think the book deal gold mine will stick around for much longer otherwise. (Or maybe I’m just bitter that I can’t get in on that party? :P)

I don’t think the book deal gold mine will ever go away, unfortunately. There is an insatiable appetite for information about how to improve ourselves that seems to be largely unaffected by the information’s truth value.