Failure Is Moving Science Forward

The replication crisis is a sign that science is working.

While I paced around the green room at a recent TEDx event in Colorado, one of the other speakers offered the rest of us some advice on how to ease our nerves. “Raise your arms up in the air and make yourself big — it will help you feel powerful!” It was scientifically proven, she told us (she’d seen it in a TED talk), that adopting a so-called power pose — shoulders wide, arms strong — could raise your testosterone levels, lower your stress hormones, and make you feel more confident and commanding.

Like everyone else, I was nervous. This wasn’t my usual kind of speech; it was a performance — a scripted story that wasn’t supposed to sound scripted, told with no notes and no cues. I knew my lines by heart, but I also knew that one moment of doubt was all it would take for me to draw a blank up on stage. So just before I walked through the curtains, I took a deep breath and raised my arms overhead as if signaling victory. I don’t know if the power pose helped me, but it didn’t seem to hurt.

What I didn’t say back in the green room was that although one highly touted study had shown how adopting a power pose could alter your hormone levels and make you bolder, another group of researchers had tried to repeat the study and found no such effect. It’s possible that the power pose phenomenon was nothing more than a spurious result.

Power poses aren’t the only well-publicized finding called into question by further research. Psychology, biomedicine and numerous other fields of science have fallen into a crisis of confidence recently, after seminal findings could not be replicated in subsequent studies. These widespread problems with reproducibility underscore a problem that I discussed here last year — namely, that science is really, really hard. Even relatively straightforward questions cannot be definitively answered in a single study, and the scientific literature is riddled with results that won’t stand up. This is the way science works — it’s a process of becoming less wrong over time.

The roots of the reproducibility crisis

As science grapples with what some have called a reproducibility crisis, replication studies, which aim to reproduce the results of previous studies, have been held up as a way to make science more reliable. It seems like common sense: Take a study and do it again — if you get the same result, that’s evidence that the findings are true, and if the result doesn’t turn up again, they’re false. Yet in practice, it’s nowhere near this simple.

There are good reasons why real effects may fail to reproduce, and in many cases we should expect replications to fail even if the original finding is real. It may seem counterintuitive, but initial studies have a known bias toward overestimating the magnitude of an effect for a simple reason: They were selected for publication because of their unusually small p-values, said Veronica Vieland, a biostatistician at the Battelle Center for Mathematical Medicine in Columbus, Ohio.

Imagine that you were looking at the relationship between height and college majors. You collected data from math majors in a small class that had a couple of unusually tall students and compared it with a similar-sized philosophy class that happened to have one unusually short person in it. When you compare the two averages, the difference seems large: math majors are taller than philosophy majors (and perhaps the unusual difference between these two particular classes is what caught your attention in the first place). But most of that difference was a fluke, and when you repeat the study you’re unlikely to see such an extreme difference between the two majors, especially if the second study has a larger sample. If you’re trying to figure out the true height differences, this “regression to the mean” is a good thing, because it gets you closer to the true averages.

But regression to the mean also implies that even if the initial results are correct, they may not be replicated in subsequent studies. The Reproducibility Project: Psychology (RP:P) attempted to replicate 100 studies, 97 of which had produced results with a “significant” p-value of 0.05 or less. By selecting so many positive studies, the group set itself up for a regression-to-the-mean phenomenon, and that’s what it found, said Steven Goodman, co-director of the Meta-Research Innovation Center at Stanford (he was not involved in the RP:P project).
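A small simulation can make this selection effect concrete. The sketch below is an illustration, not the RP:P analysis. It assumes a modest true difference (0.2 standard deviations), 20 people per group, and a journal that “publishes” only studies whose rough z statistic clears 1.96 in the expected direction. Those numbers, the normal approximation and the function names are all assumptions chosen for the example.

```python
import random
import statistics


def simulate_study(true_diff, sd, n_per_group, rng):
    """Run one two-group study; return the observed difference and a rough z statistic."""
    treated = [rng.gauss(true_diff, sd) for _ in range(n_per_group)]
    control = [rng.gauss(0.0, sd) for _ in range(n_per_group)]
    diff = statistics.fmean(treated) - statistics.fmean(control)
    # Standard error of the difference, using each group's sample variance.
    se = (statistics.variance(treated) / n_per_group
          + statistics.variance(control) / n_per_group) ** 0.5
    return diff, diff / se


def winners_curse(true_diff=0.2, sd=1.0, n_per_group=20, n_studies=5000, seed=1):
    """Compare the average effect in all studies with the average in 'published' ones."""
    rng = random.Random(seed)
    all_diffs, published = [], []
    for _ in range(n_studies):
        diff, z = simulate_study(true_diff, sd, n_per_group, rng)
        all_diffs.append(diff)
        # Assumed publication filter: "significant" in the expected direction,
        # judged with a normal approximation rather than a proper t-test.
        if z > 1.96:
            published.append(diff)
    return statistics.fmean(all_diffs), statistics.fmean(published), len(published)


mean_all, mean_published, n_published = winners_curse()
print(f"true effect: 0.2, mean of all studies: {mean_all:.2f}")
print(f"mean of the {n_published} 'published' studies: {mean_published:.2f}")
```

Under these assumptions, the average across all simulated studies should sit near the true value, while the average among the “published” ones lands well above it. A faithful replication of one of those published results would therefore be expected to find a smaller effect, even though the effect is real.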

How should we think of replications?

Researchers in RP:P did everything possible to duplicate the original studies’ methods and materials — they even contacted original authors and asked for advice and feedback on their replication plans. Even so, there could have been differences between the studies that explained why their results weren’t similar.

For instance, Elizabeth Gilbert, a graduate student at the University of Virginia, attempted to replicate a study originally done in Israel looking at reconciliation between people who feel like they’ve been wronged. The study presented participants with vignettes, and she had to translate these and also make a few alterations. One scenario involved someone doing mandatory military service, and that story didn’t work in the U.S., she said. Is this why Gilbert’s study failed to reproduce the original?

For some researchers, the answer is yes: even seemingly small differences in methods can cause a replication study to fail. In a commentary published March 4 in Science, Daniel Gilbert (no relation to Elizabeth), Gary King, Stephen Pettigrew and Timothy Wilson argue that methodological differences between the original studies and RP:P’s replications led the RP:P authors to underestimate how many replication studies would fail by chance. They also take issue with some of the sampling and statistical methods used in the RP:P analysis and conclude that “the reproducibility of psychological science is quite high.”

“Individually, each of these problems would be enough to cast doubt on the conclusion that most people have drawn from this study, but taken together, they completely repudiate it,” according to a statement attributed to Gilbert and King in a Harvard University news release. “Psychology might have a replication problem, but as far as I can see, nothing in [the RP:P] article provides evidence for this conclusion,” Gilbert told me. “I learned nothing.”

The debate between these two groups is highly technical and difficult to parse without a deep grasp of statistics and research methodology. It’s also critically important. This is more than just a dispute about these particular research projects; it’s a fundamental argument about how scientific studies should be conducted and assessed.

When 29 research teams working with the best intentions (and fully aware that their work will be scrutinized and compared with that of the other teams) can come up with such a wide range of answers, it’s easy to imagine that similarly earnest efforts to replicate existing studies might also produce different results, whether or not the original finding is correct. The takeaway is clear — methods matter.

Years ago, someone asked John Maddox how much of what his prestigious science journal Nature printed was wrong. “All of it,” the renowned editor quickly replied. “That’s what science is about — new knowledge constantly arriving to correct the old.” Maddox wasn’t implying that science was bunk; he was saying that it’s only as good as the current available evidence, and as more data pours in, it’s inevitable that our answers change.

When studies conflict, which is right?

When considering the results of replication studies, what we really want to know is whether the evidence for a hypothesis has grown weaker or stronger, and we don’t currently have an accurate metric for measuring that, Vieland said. P-values, which are commonly (and, statisticians say, erroneously) used to assess how likely it is that a finding happened by chance, don’t measure the strength of the evidence, even if they’re often treated as if they do. You also can’t take a study showing that a drug reduced blood pressure by 30 percent, add it to a study that suggested that the treatment increased blood pressure by 10 percent, and then conclude that the actual effect is a 20 percent reduction in blood pressure. Instead, you have to look at the evidence in total and carefully consider the methods used to produce it.

Consider the power pose concept, which began with a 2010 study that became a sensation via a TED talk and subsequent media storm. The results were exciting. The study suggested that briefly standing in a power pose, such as the “Wonder Woman stance,” could “configure your brain” to boost your testosterone levels, reduce the output of the stress hormone cortisol, and make you act less cautious and more confident, as Harvard psychologist Amy Cuddy explained in a speech that’s become the second-most-viewed TED talk of all time.

But in a study published last spring, Eva Ranehill at the University of Zurich and her colleagues set out to confirm and extend the effects of power poses found in Cuddy’s 2010 experiment. Ranehill told me that her team was so sure that the original finding would replicate that they’d created a whole research plan around it, hoping to see if power posing could help close gender gaps. Using a design similar to Cuddy’s, the researchers found that power posing had no effect on testosterone, cortisol or financial risk-taking. People still felt good, though — the study reproduced the self-reported feelings of power among participants who did the poses.

It’s hard to know for sure why the results of the second study didn’t mirror the first one. In a response to the Ranehill paper, Cuddy’s team spelled out 12 differences between the two studies, including their gender ratios (the original study had a higher proportion of women), the length of time participants spent in the power pose (it was three times longer in the replication), and what participants were told before the experiment began (people in the original study were given a cover story to obscure the study’s purpose, and those in the replication weren’t deceived about the study’s focus).

Illustration by Shout

But the most striking difference between the two studies was the disparity in their sample sizes. The original study involved just 42 people, and the replication had 200 participants.[3] Studies with small sample sizes generally have low statistical power, meaning that they’re unlikely to distinguish an effect, even if it’s present, said Michèle Nuijten, who studies statistics and data manipulation at Tilburg University in the Netherlands. Small studies also tend to contain a lot of noise, and as a result, she said, the effect size estimates they produce can vary wildly, ranging from severe underestimations to severe overestimations. Whether they’re first findings or replications, large studies are generally more trustworthy than small ones, Nuijten said.[4]
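To get a feel for how much noisier a small study is, here is a rough simulation. It is not a reanalysis of either power pose study. It assumes two equal groups (21 per group for a 42-person study, 100 per group for a 200-person one), a hypothetical true difference of 0.3 standard deviations, and normally distributed measurements. The function name and all parameter values are made up for illustration.

```python
import random
import statistics


def estimate_spread(n_per_group, true_diff=0.3, sd=1.0, n_studies=4000, seed=7):
    """Return the central 95% range of effect estimates across many simulated studies."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_studies):
        treated = [rng.gauss(true_diff, sd) for _ in range(n_per_group)]
        control = [rng.gauss(0.0, sd) for _ in range(n_per_group)]
        diffs.append(statistics.fmean(treated) - statistics.fmean(control))
    diffs.sort()
    # Empirical 2.5th and 97.5th percentiles of the simulated estimates.
    low = diffs[int(0.025 * n_studies)]
    high = diffs[int(0.975 * n_studies) - 1]
    return low, high


for label, n in (("42-person study", 21), ("200-person study", 100)):
    low, high = estimate_spread(n)
    print(f"{label}: 95% of estimates fall between {low:.2f} and {high:.2f}")
```

With these assumed numbers, the small study’s estimates should range from below zero (the wrong direction) to far above the true value, while the large study’s estimates cluster much more tightly around it.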

So we have one small study suggesting that power poses can alter your hormones and also your behavior and another, much larger one suggesting that they don’t. What now?

In her statement to me, Cuddy suggested that her work had been targeted because other researchers disagreed with her results. “It’s important for replicators to not ‘target’ findings because they don’t pass the replicator’s own personal gut check,” she wrote, adding that she would like to see replication efforts begin with a “collaborative conversation” with “no resentment or envy or nastiness.” Ranehill told me her group hadn’t contacted Cuddy’s team but said they had no ill will toward the researchers, and she seemed puzzled that their well-intended effort to reproduce the original study would be greeted as a threat.

Cuddy is hardly the only scientist who’s reacted defensively toward someone else’s failure to replicate. “Findings become like possessions,” said Brian Nosek, who led the RP:P project and told me that he was “taken aback” by the tone of the Harvard news release about Gilbert’s commentary on his work. “I want to adopt a stance of humility and assume that there are errors and that’s why I need to be cautious in my conclusions,” he said. The ultimate goal should be reducing uncertainty, not being able to say nah, nah, nah, I’m right.

Goodman argues that the replication framework is the wrong criterion by which to judge studies, because focusing specifically on replication implies that the first experiment has a special claim on truth. Instead, he said, “We should just be looking at an accumulating evidence paradigm, where we’re getting closer and closer to truth.”

It’s easy to imagine that some of the pushback Cuddy has experienced stems from envy at the way her power pose study exploded into the spotlight and brought her fame and a nice book deal. But at the moment, the evidence supporting her contention that power poses can provoke hormonal changes seems shaky at best, and her public messaging about her results glosses over the uncertainty that remains. She’s in a tight spot, of course — who wants to walk back the speech that made you famous by saying, I still believe in the result, but the science is less settled than I originally thought?

The thing to keep in mind is that no single study provides definitive evidence. The more that science can bake this idea into the way that findings are presented and discussed, the better. Indeed, problems replicating studies have led some researchers to look for intentionally nonconfrontational approaches to weeding out false results.

One such method is “pre-publication independent replication” (PPIR), in which results are replicated before first publication. A team led by psychologist Eric Luis Uhlmann at the Insead business school in Singapore recently published one such project. A total of 25 research groups conducted replications of 10 studies that Uhlmann and his collaborators had “in the pipeline” as of August 2014. As they explain in their paper, six of the findings replicated according to all predetermined criteria, and the others failed at least one major replication criterion. “One strength of PPIR is that it makes sure that what comes out in the journal is quite robust,” Uhlmann said. “If something fails the PPIR, then you could avoid the media fanfare.”

Columbia University statistician Andrew Gelman, meanwhile, has proposed what he calls the “time-reversal heuristic”: “Pretend that the unsuccessful replication came first, then ask what you would think if a new study happened to find [that] a statistically significant interaction happened somewhere.”

Elsewhere, Nosek and his colleagues at the Center for Open Science are pushing for a whole new paradigm — transparency, data sharing and a move toward “registered reports,” in which researchers commit to a design and analysis plan in advance of the study. This strategy is meant to prevent researchers from exploiting “researcher degrees of freedom” — decisions about how to collect and analyze data — in a way that leads to p-hacking and other questionable research practices.

In a 2011 paper, psychologists Uri Simonsohn, Joseph Simmons and Leif Nelson demonstrated how easy it is to fiddle with researcher degrees of freedom to produce almost any result, and a new paper by Richard Kunert at the Max Planck Institute for Psycholinguistics concludes that questionable research practices (like p-hacking, sampling until a “significant” result is obtained and hypothesizing after a result is known) are at the heart of many replication failures. The implication is clear: If scientists want to make their results more reproducible, they need to make their methods more rigorous.

In her statement to me, Cuddy wrote, “we are trying to do something quite difficult here — predict human behavior and understand subjective experiences. Psychology may not be a hard science,” she writes, but it is certainly a difficult one. It shouldn’t be so surprising that psychology studies don’t always replicate — the field faces an inherent challenge. Rather than measuring molecules or mass, it examines human motives and behavior, which are frustratingly hard to isolate.

What some have interpreted as a “terrifying” unraveling of psychology, others see as a sign of gathering strength. “I trust the public to recognize that this conversation about our methods is healthy and normal for science,” Simine Vazire wrote on her blog Sometimes I’m Wrong. “Science progresses slowly,” wrote the University of California, Davis, psychologist. “I don’t think we’re going to look that bad if psychologists, as a field, ask the public for some patience while we improve our methods.”

I’m skeptical that the power pose I did before my TEDx talk gave me a hormone boost and not just affirmation, but I’m also open to new data.[6] We’ll learn more about power poses this fall when the journal Comprehensive Results in Social Psychology publishes its special issue on the topic. The journal is among a growing number that use a registered reports format in which researchers submit their hypothesis, methods and intended analyses in advance of the study and then the journal sends it out for peer review and accepts articles based on the experiment’s methods and rigor, rather than the results. None of the individual studies will provide the final word, but taken together, they should get us closer to the truth.

[3] Ranehill’s group initially consisted of 100 participants, a number they’d calculated to be sufficiently large to find an effect if it was there, and they were so surprised by the null result that they decided to recruit 100 more people.

[4] A small sample size is a weakness of the original power pose study, but small studies are not unusual in psychology. In a 2012 paper titled “The Rules of the Game Called Psychological Science,” psychologist and statistician Marjan Bakker, at the University of Amsterdam at the time, and her colleagues calculated the likelihood of obtaining statistically significant results (as measured by p-values) and showed that it’s easier to meet this goal by doing, say, five studies with 20 participants each, rather than one study with 100 people. The smaller sample size increases the likelihood of a false positive, but it also increases the chances of publication if you’re submitting to a journal that favors positive, novel results (as is mostly the norm).
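The arithmetic behind the false-positive side of that trade-off can be sketched in a few lines. This is not Bakker’s calculation. It is a simplified illustration that assumes there is no true effect at all, splits each study into two equal groups, and uses a z-test with a known standard deviation so that each individual test has a 5 percent false-positive rate. The function names and settings are hypothetical.

```python
import random
import statistics


def significant(n_per_group, true_diff, rng, sd=1.0, z_crit=1.96):
    """Simulate one two-group study and report whether it reaches |z| > z_crit."""
    treated = [rng.gauss(true_diff, sd) for _ in range(n_per_group)]
    control = [rng.gauss(0.0, sd) for _ in range(n_per_group)]
    # Known-sd z-test: a simplification that keeps the per-test error rate at 5%.
    se = sd * (2.0 / n_per_group) ** 0.5
    z = (statistics.fmean(treated) - statistics.fmean(control)) / se
    return abs(z) > z_crit


def chance_of_a_hit(n_studies, n_per_group, true_diff=0.0, trials=4000, seed=3):
    """Estimate how often at least one of n_studies comes out 'significant'."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        if any(significant(n_per_group, true_diff, rng) for _ in range(n_studies)):
            hits += 1
    return hits / trials


five_small = chance_of_a_hit(n_studies=5, n_per_group=10)  # five studies of 20 people
one_large = chance_of_a_hit(n_studies=1, n_per_group=50)   # one study of 100 people
print(f"five small studies: {five_small:.3f}, one large study: {one_large:.3f}")
```

With no real effect, a single test should come up “significant” about 5 percent of the time, while five independent tries give at least one hit with probability 1 − 0.95^5, or roughly 23 percent. The simulation should land close to both figures.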

[5] Through Jim Aisner, with media affairs at Harvard Business School, Cuddy sent me a link to a post on her Facebook page pointing to what she called “a clear conceptual replication of our hormone findings.” The item she linked to was a brief story about a study by Purdue University consumer sciences researcher Christopher Kowal purporting to show that “body position while using smartphones is related to stress-inducing hormone levels in the body.” When I contacted Kowal to ask for a copy of the study, he told me he hadn’t finished writing it. He sent me a draft of the manuscript, but the analysis was still incomplete.

[6] My friend Rosemerry spoke before me and missed the green room power pose chatter. She told me that she scrunched herself into a ball and flexed all her muscles to prep for her talk. Which posture is better? Someone should do a study!

Christie Aschwanden (@cragcrest) is FiveThirtyEight’s lead writer for science and the author of “Good to Go: What the Athlete in All of Us Can Learn from the Strange Science of Recovery.”