More on replication crisis

The replication crisis in social psychology (and science more generally) will not be solved by better statistics or by preregistered replications. It can only be solved by better measurement.

Let me say this more carefully. I think that improved statistics and preregistered replications will have very little direct effect on improving psychological science, but they could have important indirect effects.

Why no big direct effects? Because if you’re studying a phenomenon that’s tiny, or one so variable that any main effects will be swamped by interactions, with the interaction changing in each scenario where it’s studied, then better statistics and preregistered replications will just reveal what we already know: that existing experimental results say almost nothing about the size, direction, and structure of these effects.

I’m thinking here of various papers we’ve discussed here over the years, examples such as the studies of political moderation and shades of gray, or power pose, or fat arms and political attitudes, or ovulation and vote preference, or ovulation and clothing, or beauty and sex ratios, or elderly-related words and walking speed, or subliminal smiley faces and attitudes toward immigration, or ESP in college students, or baseball players with K in their names being more likely to strike out, or brain scans and political orientation, or the Bible Code, or . . .

Let me put it another way. Lots of the studies that we criticize don’t just have conceptual problems, they have very specific statistical errors—for example, the miscalculated test statistics in Amy Cuddy’s papers, where p-values got shifted below the .05 threshold—and they disappear under attempted replications. But this doesn’t mean that, if these researchers did better statistics or routinely replicated, they’d be getting stronger conclusions. Rather, they’d just have to give up their lines of research, or think much harder about what they’re studying and what they’re measuring.

It could be, however, that improved statistical analysis and preregistered replications could have a positive indirect effect on such work: If these researchers knew ahead of time that their data would be analyzed correctly, and that outside teams would be preparing replications, they might be less willing to stake their reputations on shaky findings.

Think about Marc Hauser: had he been expected ahead of time to make all his monkey videotapes available for the world to see, he would’ve had much less motivation to code them the way he did.

So, yes, I think the prospect of reanalysis of existing data, and replication of studies, concentrates the mind wonderfully.

But . . . all the analysis and replication in the world won’t save you, if what you’re studying just isn’t there, or if any effects are swamped by variation.

That’s why, in my long blog conversations with the ovulation-and-clothing researchers, I never suggested they do a preregistered replication. If they or anyone else wants to do such a replication, fine—so far, I know of two such replications, neither of which found the pattern claimed in the original study but each of which reported a statistically significant comparison on something new, i.e., par for the course—but it’s not something I’d recommend, because then I’d be recommending they waste their time. It’s the same reason I didn’t recommend that the beauty-and-sex-ratio guy gather more samples of size 3000. When your power is 6%, or 5.1%, or 5.01%, or whatever, gathering more data and looking for statistical significance is at best a waste of time, at worst a way to confuse yourself with noise.
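To put rough numbers on the power argument, here’s a back-of-the-envelope calculation in Python. The standardized effect size and sample sizes are made up for illustration; they’re not taken from any of the studies above:

```python
import math

def normal_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def power_two_sample(d, n_per_group, z_crit=1.96):
    """Approximate power of a two-sided two-sample z-test at the 5% level,
    for standardized effect size d with n observations per group."""
    noncentrality = d * math.sqrt(n_per_group / 2)
    return (1 - normal_cdf(z_crit - noncentrality)) + normal_cdf(-z_crit - noncentrality)

# A tiny effect (d = 0.02, hypothetical) with n = 100 per group:
print(round(power_two_sample(0.02, 100), 3))   # barely above the 5% false-positive rate
# Even with n = 3000 per group, power stays dismal:
print(round(power_two_sample(0.02, 3000), 3))
```

When power is this close to the significance level, nearly every “significant” result is just a lucky draw, which is the sense in which gathering more data and hunting for p less than .05 is a way to confuse yourself with noise.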

So . . . as I wrote a few months ago, doing better statistics is fine, but we really need to be doing better psychological measurement and designing studies to make the best use of these measurements:

Performing more replicable studies is not just a matter of being more careful in your data analysis (although that can’t hurt) or increasing your sample size (although that, too, should only help); it’s also about putting real effort into design and measurement. All too often I feel like I’m seeing the attitude that statistical significance is a win or a proof of correctness, and I think this pushes researchers in the direction of going the cheap route, rolling the dice, and hoping for a low p-value that can be published. But when measurements are biased, noisy, and poorly controlled, even if you happen to get that p less than .05, it won’t really be telling you anything.

With this in mind, let me speak specifically of the controversial studies in social priming and evolutionary psychology. One feature of many such studies is that the manipulations are small, sometimes literally imperceptible. Researchers often seem to go to a lot of trouble to do tiny things that won’t be noticed by the participants in the experiments. For example, flashing a smiley face on a computer screen for 39 milliseconds, or burying a few key words in a sham experiment. In other cases, manipulations are hypothesized to have a seemingly unlimited number of interactions with attitudes, relationship status, outdoor temperature, parents’ socioeconomic status, etc. Either way, you’re miles away from the large, stable effects you’d want to be studying if you want to see statistical regularity.

If effects are small, surrounded by variability, but important, then, sure, research them in large, controlled studies. Or go the other way and try to isolate large effects from big treatments. Swing some sledgehammers and see what happens. But a lot of this research has been going in the other direction, studying tiny interventions on small samples.

The work often “succeeds” (in the sense of getting statistical significance, publication in top journals, Ted talks, NPR appearances, etc.) but we know that can happen, what with the garden of forking paths and more.

So, again, in my opinion, the solution to the “replication crisis” is not to replicate everything or to demand that every study be replicated. Rather, the solution is more careful measurement. Improved statistical analysis and replication should help indirectly in reducing the motivation for people to perform analyses that are sloppy or worse, and reducing the motivation for people to think of empirical research as a sort of gambling game where you gather some data and then hope to get statistical significance. Reanalysis of data and replication of studies should reduce the benefit of sloppy science and thus shift the cost-benefit equation in the right direction.

Piss-poor omnicausal social science

One of my favorite blogged phrases comes from political scientist Daniel Drezner, when he decried “piss-poor monocausal social science.”

By analogy, I would characterize a lot of these unreplicable studies in social and evolutionary psychology as “piss-poor omnicausal social science.” Piss-poor because of all the statistical problems mentioned above—which arise from the toxic combination of open-ended theories, noisy data, and huge incentives to obtain “p less than .05,” over and over again. Omnicausal because of the purportedly huge effects of, well, just about everything. During some times of the month you’re three times more likely to wear red or pink—depending on the weather. You’re 20 percentage points more likely to vote Republican during those days—unless you’re single, in which case you’re that much more likely to vote for a Democrat. If you’re a man, your political attitudes are determined in large part by the circumference of your arms. An intervention when you’re 4 years old will increase your earnings by 40%, twenty years down the road. The sex of your baby depends on your attractiveness, on your occupation, on how big and tall you are. How you vote in November is decided by a college football game at the end of October. A few words buried in a long list will change how fast you walk—or not, depending on some other factors. Put this together, and every moment of your life you’re being buffeted by irrelevant stimuli that have huge effects on decisions ranging from how you dress, to how you vote, to where you choose to live, your career, even your success at that career (if you happen to be a baseball player). It’s an omnicausal world in which there are thousands of butterflies flapping their wings in your neighborhood, and each one is capable of changing you profoundly. A world that, if it truly existed, would be much different from the world we live in.

A reporter asked me if I found the replication rate of various studies in psychology to be “disappointingly low.” I responded that yes it’s low, but is it disappointing? Maybe not. I would not like to live in a world in which all those studies are true, a world in which the way women vote depends on their time of the month, a world in which men’s political attitudes are determined by how fat their arms are, a world in which subliminal messages can cause large changes in attitudes and behavior, a world in which there are large ESP effects just waiting to be discovered. I’m glad that this fad in social psychology may be coming to an end, so in that sense, it’s encouraging, not disappointing, that the replication rate is low. If the replication rate were high, then that would be cause to worry, because it would imply that much of what we know about the world is wrong. Meanwhile, statistical analysis (of the sort done by Simonsohn and others), and lots of real-world examples (as discussed on this blog and elsewhere), have shown us how it is that researchers could continue to find “p less than .05” over and over again, even in the absence of any real and persistent effects.

The time-reversal heuristic

A couple more papers on psychology replication came in the other day. They were embargoed until 2pm today which is when this post is scheduled to appear.

I don’t really have much to say about the two papers (one by Gilbert et al., one by Nosek et al.). There’s some discussion about how bad the replication crisis in psychology research is (and, by extension, in many other fields of science), and my view is that it depends on what is being studied. The Stroop effect replicates. Elderly-related-words priming, no. Power pose, no. ESP, no. Etc. The replication rate we see in a study-of-studies will depend on the mix of things being studied.

Having read the two papers, I pretty much agree with Nosek et al. (see Sanjay Srivastava for more on this point), and the only thing I’d like to add is to remind you of the time-reversal heuristic for thinking about a published paper followed by an unsuccessful replication:

One helpful (I think) way to think about such an episode is to turn things around. Suppose the attempted replication experiment, with its null finding, had come first. A large study finding no effect. And then someone else runs a replication under slightly different conditions with a much smaller sample size and finds statistical significance under non-preregistered conditions. Would we be inclined to believe it? I don’t think so. At the very least, we’d have to conclude that any such phenomenon is fragile.

From this point of view, what the original claim has going for it is that (a) statistical significance was obtained in an uncontrolled setting, (b) it was published in a peer-reviewed journal, and (c) this paper came before, rather than after, the attempted replication. I don’t find these pieces of evidence very persuasive. (a) Statistical significance doesn’t mean much in the absence of preregistration or something like it, (b) lots of mistakes get published in peer-reviewed journals, to the extent that the phrase “Psychological Science” has become a bit of a punch line, and (c) I don’t see why we should take the apparently successful result as the starting point in our discussion, just because it was published first.

Hi Ben,
What always bothered me about the glass-half-full representation was that a study counted as a replication even if the effect size was in the opposite direction from the original (because of small samples and thus LARGE prediction intervals). Sure, the prediction intervals of the original studies are indeed large (because of small n), and thus many effect sizes in a replication are statistically consistent with the originally observed effect. But if your definition of replication includes contradictory results, I’d say that your definition of replication needs tuning.

I want to make sure I understand what you’re saying. I think you’re saying:

> Preregistered studies are the real arbiter of whether hypotheses are true. Preregistered replications, therefore, can reveal whether non-preregistered studies have discovered effects that exist in the world.

> However, you have strong priors that certain kinds of effects are trivial relative to the noise in any particular implementation, including, e.g., the effect of power-poses on certain real-world outcomes. Since we have strong beliefs in advance that the noise (except in extremely large or homogenous samples) will dominate the effect size, any result unlikely to appear under a null effect would be due to that unlikely sample having been drawn. We know in advance that if we were to get a significant result, it would be because we drew an unlikely sample.

> Therefore, there is basically no information that will be gained by doing a pre-registered replication, since it will be non-significant with 95 percent probability, and if it isn’t, it’ll be because of that 5 percent sampling probability on the tails.

(1) Is that what you’re saying?

If so, (2) How many things do we have such strong priors on a trivial effect size that we’re in a situation like this? How do we know when we’re in that situation? I wouldn’t have been surprised if power poses worked, before the failed replications. Could I have known somehow that I was in one of these situations with trivial effect size where there is nothing to be gained from attempting replication?

In some settings such as sex ratios and voting, we do have strong priors from the literature: it’s very hard to imagine that the proportion of girls born among beautiful parents is more than, say, 0.003 more than the proportion of girls born among ugly parents; and it’s very hard to imagine that more than a few percent of voters were changing their views during the 2012 general election—so it’s hard to believe that a football game could swing voters by 2 percentage points, or that votes would vary by 20 percentage points based on your time of the month. In other settings such as the ovulation-and-clothing example, some simple calculations can show that, with any plausible underlying effect, observed correlations would have to be pretty low.
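The sex-ratio arithmetic can be made concrete with the usual normal-approximation sample-size formula. The baseline probability and the difference below are illustrative assumptions in the spirit of the 0.003 figure above, not estimates from any particular dataset:

```python
def n_per_group(p, diff, z_alpha=1.96, z_beta=0.84):
    """Rough per-group sample size to detect a difference `diff` in two
    proportions around baseline p, at the 5% two-sided level with 80% power."""
    return 2 * (z_alpha + z_beta) ** 2 * p * (1 - p) / diff ** 2

# Baseline Pr(girl birth) ~ 0.485; suppose the true difference is 0.003:
print(round(n_per_group(0.485, 0.003)))  # hundreds of thousands per group
```

Against a required sample in the hundreds of thousands per group, samples of size 3000 are hopeless, which is why gathering more of them is a waste of time.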

In other settings, sure, who knows? Power pose could conceivably have a large effect, or it could be large and positive for some people and large and negative for others. The plethora of possible large interactions suggests that any particular interaction one might study is likely to be small. It’s just not possible for all these large interactions to coexist. Still, yes, it’s possible that Cuddy or someone else could luck out and find something. And, then, sure, maybe a replication is the best way to go to resolve the issue. But, if you’re a researcher who wants to study the topic (rather than, say, a policymaker who wants to adjudicate an existing dispute), then I think you’re much better off thinking hard about what you’re trying to study, taking careful measurements, implementing large changes, and so forth—rather than giving small interventions that you design so that the experimental participants aren’t even aware of them. That’s one thing that bugs me: these researchers sometimes seem to be working so hard to make their interventions imperceptible—a recipe for disaster.

In my area of neuroscience, investigators spend huge amounts of energy coming up with new measurements that are in some sense ‘better’ than existing methods. These new measurement techniques are justified because they either a) seem more sophisticated and accurate within the current investigative environment, or b) give smaller p-values for contrasting risk groups (age, disease, etc.). I don’t find this effort to be particularly productive, and I don’t think that dwelling on measurements will resolve the bigger problems of non-replication and pseudo-science.

When I took my PhD in social science and went to waste it in marketing research (in the opinion of my chief faculty advisor) I discovered that I now had actual variables. Sure, you had to operationally define sales, and gross profit, and GRPs, but they were pretty close to real things — not like attitudes, for example.

Mass production offers some lessons, or at least analogies here. You produce thousands of nearly identical parts, and you examine their properties (e.g., length). It’s not enough to reject parts that are out of tolerance. You want to see whether the mean and variance are deviating from their expected values. If say the mean drifts, it may well be within the tolerance range for any individual part, but still it’s time to look into fixing the production system to bring the mean back to the intended value.

This approach is much more effective than simply rejecting out-of-tolerance parts, since it corrects production deterioration before it leads to large numbers of rejects.

It seems to me that in the case of so many of the kinds of papers that have been discussed here, we’re trying to reject parts rather than to correct the process so it doesn’t produce many rejects. Using p-values is like rejecting or accepting parts based on quality control metrics. Correcting the production line is like making sure your experiments and analyses have the power to detect “real” results vs mere noise. Or even better, making sure that publications cause that to happen.
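The commenter’s control-chart idea can be sketched as a minimal Shewhart-style check; the target, tolerance, and batch values below are invented for illustration:

```python
import statistics

def mean_drift_alarm(parts, target, sigma, k=3):
    """Shewhart-style check: flag when the batch mean drifts more than
    k standard errors from target, even if every individual part is
    within its own +/- k*sigma tolerance."""
    standard_error = sigma / len(parts) ** 0.5
    return abs(statistics.fmean(parts) - target) > k * standard_error

# Hypothetical batch: every part within a +/-0.45 tolerance of 10.0
# (sigma = 0.15), but the whole batch sits high:
batch = [10.2, 10.3, 10.1, 10.25, 10.15, 10.2, 10.3, 10.1, 10.2, 10.25]
print(mean_drift_alarm(batch, target=10.0, sigma=0.15))  # True: the mean has drifted
```

No single part here would be rejected, yet the chart on the mean catches the process drift, which is the analogy: fix the research process, don’t just accept or reject individual studies by p-value.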

This is why some of us have proposed replication as a way of improving _research practice_, or how scientists go about doing research, as opposed to shooting a single study down, or making new substantive contributions.

Scientists are in the business of manufacturing inferences. And it is the manufacturing process that is broken.

What seems different here, and might enhance some progress, is the obvious disconnect with common background knowledge about the claims (say, compared to clinical treatments) and the realization that more critical abduction (hypothesis generation) is needed—i.e., if “it’s just not possible for all these large interactions to coexist” [given what we think we know], then that is not a good choice of something to study. We might be wrong, but it’s still not a good choice unless we have good reasons to think we are wrong.

From 1989: “After all, the ultimate objective is to be able to answer the question or in some cases go on to another question we may be better able to answer. To do otherwise is to ensure inefficient use of the scarce resources available for research.” I should have added “[and flood the literature with distractions!]”

Thank you for saying what needs to be said. Numerous colleagues defend our field by saying that although the statistics is bad, everything else is just fine. The findings are sexy. This may be a language issue, but I can’t comprehend the concept of sexy findings. Silly findings or funny effects seem more appropriate.

Once again people are confused about the meaning of replication. Is it re-analyzing data, or doing the experiment again in exactly the same way using a sample from exactly the same population, or from a different population, or doing things differently to assess robustness?

It’s a semantic quagmire. And it matters.

For example, the media is atwitter about replicating an experiment about affirmative action in the Netherlands. But what if the original study had claimed that the effects are universal? Then it becomes legit to vary the population. An external validity replication.

In general I am no fan of conceptual replication, but as Dan Gilbert should know, such replications are common in psychology. I am glad they are now arguing against them.

More generally this is a waste of time. The whole notion of success/fail replication is suspect. Basically you are partitioning a continuous measure into a binary one. There are so many different ways to do this that, ex post, I can take the same OSF data and, by careful choice of test, prove any point I want.

Agree—and one could assess whether the studies are individually moving the prior in the same direction toward the posterior (or compare the individual likelihoods, looking for a common ranking of the probability of the data given location in the parameter space).

Woah. This critique has serious flaws: they claim that “If all 100 of the original studies examined by OSC had reported true effects, then sampling error alone should cause 5% of the replication studies to “fail” by producing results that fall outside the 95% confidence interval of the original study”. Suppose the original CI is [0.9,1.2]: claiming (as they do) that 95% of the replicated experiments should fall in [0.9,1.2] is of course a complete misunderstanding of the frequentist CI. It’s a bit like saying that the frequentist p-value is the probability of the null being true (a statement which makes no sense in frequentist statistics).
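That misunderstanding is easy to check by simulation: even when both studies estimate the same true effect with equal precision, the replication estimate falls inside the original study’s 95% confidence interval only about 83% of the time, not 95%. A quick sketch, assuming normal sampling distributions (all numbers here are illustrative):

```python
import random

random.seed(1)

def capture_rate(se_orig, se_rep, true_effect=0.0, z=1.96, sims=100_000):
    """Fraction of replication estimates landing inside the ORIGINAL
    study's 95% CI, when both studies estimate the same true effect."""
    hits = 0
    for _ in range(sims):
        orig_est = random.gauss(true_effect, se_orig)
        rep_est = random.gauss(true_effect, se_rep)
        if abs(rep_est - orig_est) < z * se_orig:
            hits += 1
    return hits / sims

# Equally precise studies: the capture rate is ~0.83, not 0.95,
# because BOTH estimates are noisy, not just the replication's.
print(round(capture_rate(1.0, 1.0), 2))
```

So even under the rosiest scenario of all-true effects, Gilbert et al.’s 5% figure is wrong; the expected “failure” rate by their criterion is closer to 17%, and higher still when the original studies are noisier than the replications.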

Gilbert is not making such an argument. But if an experiment is deeply embedded in a social setting (“how to feel about Stanford’s affirmative action policies”), then moving it wholesale to Italy seems like a recipe for nonreplication.

I was not able to get to anything more than a vague summary of the Gilbert et al critique last night and was responding to the rather sloppy characterization of Gilbert’s points by the journalist, but someone was kind enough to post the actual critique after that.

I agree with some of the criticisms such as attempting to replicate a study on racial attitudes in the U.S. on a sample in Italy and arguing that it is not a replication. Though I find it rather odd that a study in which the inferences are confined to Stanford would even be accepted for publication.

The argument that a replication must sample from the same population seems deeply flawed to me. Not in principle, but in practice: in psychology, you rarely see a clearly stated sampling frame. Most experiments are done on university students, but the conclusions are usually general.

This way, you can always play the “not the same population” card if the results do not replicate, even though the original study did not define a population and hastily generalized the conclusion.

In the case of the racial bias study with an Italian sample, they found a similar effect! Should it count as a successful replication or not, given that the sample came from a different country?

As Nosek et al. noted, there’s never such a thing as an exact replication. Recall that when the ovulation-and-clothing researchers failed in their very own replication, they declared victory by saying that the conditions of their replication were different because of a change in the outdoor temperature. Bargh declared replications successful even though new interactions were brought in, in each new study.

As we’ve discussed from time to time on this blog, one problem is that, when a study fails to replicate, the original researchers tend to emphasize how specific their finding is to their experimental conditions. But published papers tend to imply generality. So, on one hand we’re told not to trust a replication because it’s done on a different group of people or under different conditions, but on the other hand we’re told that an experiment on a bunch of college students in a lab or Mechanical Turk participants on the internet generalizes just fine. (Search this blog for “freshman fallacy.”)

“It’s an omnicausal world in which there are thousands of butterflies flapping their wings in your neighborhood, and each one is capable of changing you profoundly.”

I don’t quite see why this sentence is so unbelievable that you have to automatically reject it. The butterfly effect is real. Minor events are likely to have minor effects, but these effects can indeed snowball (a series of shark attacks makes voters upset, voters express frustration by voting for the opposition party, the opposition party gains power in a close election, etc.). The thought may be rather depressing, because it implies that randomness and chaos play a major role in our day-to-day lives, and that there is little rhyme or logic in our world. But there’s no reason to believe that there IS rhyme or logic in our world.

We should obviously fight against bad stats (no to ESP), but that is not to say that the omnicausal world is false. It may be even more understandable if we reduce the butterflies down to hundreds instead of thousands (especially because we might begin to get a handle on them).

Are we to assume that lowered worker productivity is just some fun footnote? That it wouldn’t lead to anything worthwhile other than in a question for Trivial Pursuit? Not all butterflies are to be dismissed.

1. I believe that lots of little things can change you in lots of little ways and occasionally in big ways. What I dispute is the claim that lots of little things can each change you in big ways, consistently. That’s what’s being claimed in much of the social psychology and evolutionary psychology literature: that various imperceptible interventions have large, consistent, and predictable effects.

It was perhaps a mistake for me to bring up the butterflies. The point of the butterfly effect is that the effects are unpredictable, which is the opposite of what is claimed by Bargh, Cuddy, Kanazawa, etc.

2. No, I don’t think that lowered worker productivity is just some fun footnote.

I am surprised you emphasized better measurement and not better thinking more broadly, although this is covered in the body of your post to some degree. I think this idea that most psych effects worth studying are small effects requiring huge samples is just not true. Our practices have led to a perversion of the whole research process, including thinking about what is worth testing. Supposedly all the effects are small, but that’s because with QRPs one can go after such ephemeral effects and have a good chance at a publishable result.

With better practices, we’ll have to think a lot longer and harder, and we may find that there are psychological effects worth studying that are not only novel but also nontrivial in size.

I agree about better thinking more broadly, but I worry that this advice would be too general. After all, I’m sure the researchers of the unreplicated studies already are thinking hard about their theories. I’m thinking of measurement as a necessary bridge between theory and data. But, yes, another important step is for researchers to think more seriously about effect sizes. Rather than to just gather a bunch of data and then shake and shake until something statistically significant comes out. Or, even worse, to engage in a desperate rearguard defense of unreplicated and unreplicable work, instead of taking this as an opportunity for reflection and reassessment.

We also need to stop being so damn attached to our theories. If you base your entire career on power poses or embodied cognition or priming then you become so invested in particular findings that you stop being a scientist. We teach our undergraduates about the importance of disinterestedness but seldom practice it ourselves. Sometimes I think of designing a class around how to be a heartless empiricist.

I’d like to see more concern about measurement and design, but it’s not entirely clear to me how this will improve reproducibility. I do think the more we reflect on these aspects of the research process, the more we recognize how much of a role luck probably plays when measurement/design hasn’t been carefully considered. This may in turn cause people to rethink what they invest their time/energy/money in measuring. Maybe this is what you are getting at.

Paying more attention to measurement and design is important in its own right, as part of good science. It will probably improve reproducibility, but can never eliminate the element of “luck” (or as I would say, chance), because variation is just part of nature. In other words, some lack of reproducibility is just in the nature of things, even if attention to measurement and design are good.

I think it’s this: “After all, the ultimate objective is to be able to answer the question or in some cases go on to another question we may be better able to answer. To do otherwise is to ensure inefficient use of the scarce resources available for research.”

See the linked Sanjay Srivastava post for more on this. Short answer is that I disagree with them, and I’m disappointed they didn’t run their article by someone knowledgeable such as Uri Simonsohn or Brian Nosek who could’ve explained a bunch of things to them.

The original Science replication study tried to “estimate the reproducibility of psychological science” (that’s the title). But, considering their sampling procedure, their article simply couldn’t answer such a question. “The population of interest was the research literature of psychological science. The authors of OSC-2015 began by defining a non-random sample of that population: the 488 articles published in 2008 in just three journals that represented just two of psychology’s subdisciplines (i.e., social psychology and cognitive psychology) and no others (e.g., neuroscience, comparative psychology, developmental psychology, clinical psychology, industrial-organizational psychology, educational psychology, etc.). They then reduced this non-random sample not by randomly selecting articles from it, but by applying an elaborate set of selection rules to determine whether each of the studies was eligible for replication—rules such as “If a published article described more than one study, then only the last study is eligible” and “If a study was difficult or time-consuming to perform, it is ineligible” and so on. Not only did these rules make a full 77% of the non-random sample ineligible, but they also further biased the sample in ways that were highly likely to influence reproducibility (e.g., researchers may present their strongest data first; researchers who use time-consuming methods may produce stronger data than those who do not; etc.). After making a list of selection rules, the authors of OSC-2015 permitted their replication teams to break them (e.g., 16% of the replications broke the “last study” rule, 2 studies were replicated twice, etc.).

Then it got worse. Instead of randomly assigning the remaining articles in their non-random sample to different replication teams, the authors of OSC-2015 invited particular teams to replicate particular studies or they allowed the teams to select the studies they wished to replicate. Not only did this reduce the non-random sample further (a full 30% of the articles in the already-reduced sample were never accepted or selected by a team), but it opened the door to exactly the kinds of biases psychologists study, such as the tendency for teams to accept or select (either consciously or unconsciously) the very studies that they thought were least likely to replicate. As the authors of OSC-Reply remind us, even casual bystanders in a prediction market can tell beforehand which effects will and will not replicate—and yet, armed with exactly those insights, replication teams were given a choice about which studies they would replicate. Then, in a final blow to the notion of random sampling, the already reduced non-random sample was non-randomly reduced one last time by the fact that 18% of the replications were never completed.”

Even PLOS ONE shouldn’t publish a study with such an exceptionally poor sampling procedure. I am grateful that Gilbert et al. have made people aware of these grave errors.

I think Gilbert et al. continue to miss the point. They write, “Why does the fidelity of a replication study matter? Even a layperson understands that the less fidelity a replication study has, the less it can teach us about the original study.” Again, they’re placing the original study in a privileged position. There’s nothing special about the original study, relative to the replication. The original study came first, that’s all. What we should really care about is what is happening in the general population.

I think the time-reversal heuristic is a helpful tool here. The time order of the studies should not matter. Imagine the replication came first. Then what we have is a preregistered study that finds no evidence of any effect, followed by a non-preregistered study on a similar topic (as Nosek et al. correctly point out, there is no such thing as an exact replication anyway, as the populations and scenarios will always differ) that obtains p less than .05 in a garden-of-forking-paths setting. No, I don’t find this uncontrolled study to provide much evidence.
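The garden-of-forking-paths point can be made concrete with a small simulation (purely illustrative; the number of outcomes, sample sizes, and seed are invented): if a researcher with null data has five plausible outcomes to test and reports whichever one reaches p < .05, the chance of a “significant” finding is far above the nominal 5%.

```python
import math
import numpy as np

rng = np.random.default_rng(0)

def two_sided_p(a, b):
    """Two-sided p-value for a difference in means (normal approximation)."""
    se = math.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
    z = abs(a.mean() - b.mean()) / se
    return 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))

n_sims, n, k = 2000, 50, 5   # k = outcomes the researcher could have tested
false_pos = 0
for _ in range(n_sims):
    treat = rng.normal(size=(k, n))   # the null is true: no effect anywhere
    ctrl = rng.normal(size=(k, n))
    if min(two_sided_p(treat[i], ctrl[i]) for i in range(k)) < 0.05:
        false_pos += 1   # some comparison "worked," and that one gets reported

print(round(false_pos / n_sims, 3))  # near 1 - 0.95**5 ≈ 0.23, not 0.05
```

The point is not that researchers literally run five tests; it’s that the data-dependent choice among many possible analyses has the same effect on the p-value even when only one analysis is ultimately performed.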

Gilbert et al. are starting with the published papers—despite everything we know about their problems—and treating the reported statistical significance in those papers as solid evidence. That’s their problem. The beauty of having replications is that we can apply the time-reversal heuristic.

1) I disagree that “Revealing replication problems in a bunch of studies does seem to me to be valuable” is a strong argument in judging the value of Gilbert et al.’s work.

2) One can’t claim to “reveal a replication problem” with a single replication (due to the reasons you mentioned above).

3) The time-reversal heuristic carries the implicit assumption that one should treat the fidelity of the replication and the original as equal, since neither can really be known. While I would have been willing to accept this had I not known any better, after taking a look at the experimental design of the replications I don’t completely agree. From a purely qualitative standpoint, several replications were not judicious in picking a representative sample, while the original was.

1. I have no interest in assigning an overall value to the Gilbert et al. paper. If it contributes to the discussion in some way (perhaps by making researchers aware that published studies are typically not as general as claimed—remember that paper that wrote about “upper body strength” but what they actually measured was arm circumference?), that’s a good thing.

2, 3. I do not take the replication and original studies as equal! Original studies are typically uncontrolled which makes their p-values close to meaningless (not even counting cases like Cuddy’s where the p-values themselves are not calculated correctly). Remember that researcher degrees of freedom don’t just come from the choice of formal test to run, they also arise in rules for data exclusion, what interactions and comparisons to consider, and so forth.
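The data-exclusion point can be sketched with another toy simulation (the rules, sample sizes, and seed are all invented for illustration): even with a single pre-chosen test, the freedom to try a few “outlier removal” rules and report the one that reaches significance inflates the false-positive rate above its nominal level.

```python
import math
import numpy as np

rng = np.random.default_rng(1)

def two_sided_p(a, b):
    # two-sided p-value for a difference in means (normal approximation)
    se = math.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
    z = abs(a.mean() - b.mean()) / se
    return 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))

# Hypothetical exclusion rules a researcher might try after seeing the data:
# keep everyone, or drop observations beyond 2.5, 2.0, or 1.5 SDs.
cutoffs = [math.inf, 2.5, 2.0, 1.5]

n_sims, n, hits = 4000, 40, 0
for _ in range(n_sims):
    a, b = rng.normal(size=n), rng.normal(size=n)   # the null is true
    ps = [two_sided_p(a[abs(a) < c], b[abs(b) < c]) for c in cutoffs]
    if min(ps) < 0.05:   # report under whichever rule reaches significance
        hits += 1

print(round(hits / n_sims, 3))   # noticeably above the nominal 0.05
```

The inflation here is milder than with independent outcomes because the trimmed analyses are highly correlated, but it is still there, and it compounds with every other degree of freedom.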

In contrast, replications are often preregistered. Not always perfectly, but typically with much more control than the original published studies. In the cases I’ve seen (such as power pose and embodied cognition), I’d judge the replications to be more careful than the originals. These things do have to be judged on a case-by-case basis. One good thing about the replication project is its openness, in contrast to published studies that often don’t give key details.

Sure, it is plausible that originals are less controlled than replications, for the reasons you mentioned.
However, as you said, one ought to judge these on a case-by-case basis instead of ascribing generalizations to the whole class of original studies.

Since the only material we have is the OSF data released by Nosek, one can only try to judge the efficacy of Nosek’s study.
That is precisely what Gilbert et al. do, citing concrete examples of where the replications made errors.

Even if you disregard any criticism that compares the original to the replication, other criticisms like the sampling issue stand completely on their own.

Regardless, Gilbert et al. can at most show that Nosek’s study is flawed, not whether or not a replication problem truly exists.

Next, when the experimental conditions are modified so greatly, how can one be sure to identify the reason why the replication failed? Was it because the replication was done on a sample that was not representative of the population in which the measured effect is intended to be observed? Or was it because the original study was less controlled and introduced bias? The fact that one can’t isolate the reason is itself an indication that the experiment is not conducive to determining the replicability of the original study. It is a qualitative case-by-case judgement at the end of the day, but one cannot deny that factors like these add confounding explanations for why the replication failed.

Even if we assume that the sample chosen in the replication was as representative as the sample in the original, there were other criticisms of the paper (how the replication studies were chosen, etc.).

Lastly, regardless of Gilbert’s tweets or even his opinion, the most one could ever show by criticizing the original study is simply that it does not prove there is a crisis. Whether or not there actually is a crisis would still be unproven. I think it is important that people realize this distinction.

Direct quote from Gilbert’s paper to illustrate the extent of the problem:
—

“For example, many of OSC’s replication studies drew their samples from different populations than the original studies did. An original study that measured Americans’ attitudes toward African-Americans (3) was replicated with Italians, who do not share the same stereotypes; an original study that asked college students to imagine being called on by a professor (4) was replicated with participants who had never been to college; and an original study that asked students who commute to school to choose between apartments that were short and long drives from campus (5) was replicated with students who do not commute to school. What’s more, many of OSC’s replication studies used procedures that differed from the original study’s procedures in substantial ways: An original study that asked Israelis to imagine the consequences of military service (6) was replicated by asking Americans to imagine the consequences of a honeymoon; an original study that gave younger children the difficult task of locating targets on a large screen (7) was replicated by giving older children the easier task of locating targets on a small screen; an original study that showed how a change in the wording of a charitable appeal sent by mail to Koreans could boost response rates (8) was replicated by sending 771,408 e-mail messages to people all over the world (which produced a response rate of essentially zero in all conditions).”

If a study’s authors are arguing that their study demonstrates some generalizable effect that does not depend on the context of the measures used, nor even on the measures themselves, then whether the sample or the context is the same is somewhat beside the point: the problem would then be the non-representativeness of the sample or the non-generalizability of the study despite its claims.

The typical problem is that there is often not enough information in either the original study or the attempted replication to discern whether this is true. It is also typical that inferences are generalized well beyond the sample at hand and descriptions of the sample are often vague.

On the specific point of representativeness, I do agree with you (and Gilbert et al.) that the results of the Nosek et al. study cannot be taken to represent a general rate of reproducibility in psychological science. As Gilbert et al. correctly point out, to estimate such a rate, one would first want to define a population that represents “psychological science” and then try to study a representative sample of such studies. Sampling might not be so hard, but nobody has even really defined a population here, so, yes, it’s not clear what population is being represented.
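To see why non-random selection of studies matters for estimating a field-wide rate, here is a toy simulation (every number in it is invented): if teams can partly anticipate which findings are shaky, as the prediction-market results suggest, and preferentially replicate those, the observed replication rate lands well below the population rate.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy population of 1,000 published findings, each with some true chance of
# replicating (drawn uniformly here purely for illustration, so mean ~0.5).
p_rep = rng.uniform(0, 1, size=1000)

# Suppose teams can partly guess which findings are shaky and preferentially
# choose those: selection weight rises as the replication chance falls.
weights = (1 - p_rep) ** 2
weights /= weights.sum()
chosen = rng.choice(1000, size=100, replace=False, p=weights)

outcomes = rng.random(100) < p_rep[chosen]   # run the 100 replications
print(round(p_rep.mean(), 2), round(outcomes.mean(), 2))
# the observed rate sits well below the population rate (~0.25 vs ~0.5 here)
```

This cuts both ways: without a defined population and a random sample from it, the observed rate can be biased in either direction, which is exactly why it shouldn’t be read as “the” reproducibility rate of psychology.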

Regarding the issue of representativeness, I disagree with Gilbert et al.’s statement that “it is difficult to see what value their findings have.” Revealing replication problems in a bunch of studies does seem to me to be valuable, especially given the attitudes of people like Cuddy, Bargh, etc., who refuse to give an inch when their own studies are not replicated.

If there are some terms (such as “Creator”) that PLOS ONE will never publish, then PLOS ONE should do us a favor by publishing that list of forbidden words. But . . . they can’t publish that list, because it contains the words they won’t publish!