Online Bettors Can Sniff Out Weak Psychology Studies

Psychologists are in the midst of an ongoing, difficult reckoning. Many believe that their field is experiencing a “reproducibility crisis,” because they’ve tried and failed to repeat experiments done by their peers. Even classic results—the stuff of textbooks and TED talks—have proven surprisingly hard to replicate, perhaps because they’re the results of poor methods and statistical tomfoolery. These problems have spawned a community of researchers dedicated to improving the practices of their field and forging a more reliable way of doing science.

But if those critiques are correct, then why is it that scientists seem to be remarkably good at predicting which studies in psychology and other social sciences will replicate, and which will not?

Consider the new results from the Social Sciences Replication Project (SSRP), in which 24 researchers attempted to replicate 21 social-science studies published between 2010 and 2015 in Nature and Science—the world’s top two scientific journals. The replicators ran much bigger versions of the original studies, recruiting around five times as many volunteers as before. They did all their work in the open, and ran their plans past the teams behind the original experiments. And ultimately, they could reproduce the results of only 13 of the 21 studies—62 percent.

As it turned out, that finding was entirely predictable. While the SSRP team was doing their experimental re-runs, they also ran a “prediction market”—a stock exchange in which volunteers could buy or sell “shares” in the 21 studies, based on how reproducible they seemed. They recruited 206 volunteers—a mix of psychologists and economists, students and professors, none of whom were involved in the SSRP itself. Each started with $100 and could earn more by correctly betting on studies that eventually panned out.

At the start of the market, shares for every study cost $0.50 each. As trading continued, those prices soared and dipped depending on the traders’ activities. And after two weeks, the final price reflected the traders’ collective view on the odds that each study would successfully replicate. So, for example, a stock price of $0.87 would mean a study had an 87 percent chance of replicating. Overall, the traders thought that studies in the market would replicate 63 percent of the time—a figure that was uncannily close to the actual 62-percent success rate.
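The price-to-probability reading described above can be sketched in a few lines of Python. This is a minimal illustration, not the SSRP's own analysis: it assumes a share pays out $1 if the study replicates (which is what makes an $0.87 price read as 87 percent), and the four prices in `example_prices` are made up for the example rather than taken from the project.

```python
def implied_probability(price, payout=1.0):
    """Read a final share price as the market's probability of replication.

    Assumes a share pays `payout` dollars if the study replicates and
    nothing otherwise (an assumption made for this sketch).
    """
    if not 0.0 <= price <= payout:
        raise ValueError("price must lie between 0 and the payout")
    return price / payout


def mean_forecast(prices):
    """Average the implied probabilities across all studies in the market."""
    if not prices:
        raise ValueError("need at least one price")
    return sum(implied_probability(p) for p in prices) / len(prices)


# Hypothetical final prices for four studies; NOT the SSRP's data.
example_prices = [0.87, 0.50, 0.25, 0.90]

forecast = mean_forecast(example_prices)

# Figure reported in the article: 13 of 21 studies replicated.
observed = 13 / 21

print(f"market forecast: {forecast:.0%}, observed rate: {observed:.0%}")
```

Comparing the average forecast with the observed rate is only a coarse check. It says nothing about whether the market ranked individual studies correctly.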

The traders’ instincts were also unfailingly sound when it came to individual studies. Look at the graph below. The market assigned higher odds of success for the 13 studies that were successfully replicated than the eight that weren’t—compare the blue diamonds to the yellow diamonds.

“If researchers can anticipate which findings will replicate, or fail to, it makes it harder to sustain dismissive claims about the replications or the replicators,” says Brian Nosek from the Center for Open Science, who was part of the SSRP.

What clues were the traders looking for? Some said that they considered a study’s sample size: Small studies are more likely to produce false positives than bigger ones. Some looked at a common statistical metric called the P value. If a result has a P value that’s less than 0.05, it’s said to be statistically significant, or positive. And if a study contains lots of P values that just skate under this threshold, it’s a possible sign that the authors committed “p-hacking”—that is, they futzed with their experiment or their data until they got “positive” but potentially misleading results. Signs like this can be ambiguous, and “scientists are usually reluctant to lob around claims of p-hacking when they see them,” says Sanjay Srivastava from the University of Oregon. “But if you are just quietly placing bets, those are things you’d look at.”

Beyond statistical issues, it strikes me that several of the studies that didn’t replicate have another quality in common: newsworthiness. They reported cute, attention-grabbing, whoa-if-true results that conform to the biases of at least some parts of society. One purportedly showed that reading literary fiction improves our ability to understand other people’s beliefs and desires. Another said that thinking analytically weakens belief in religion. Yet another said that people who think about computers are worse at recalling old information—a phenomenon that the authors billed as “the Google effect.” All of these were widely covered in the media.

When Nosek reads studies like these, he asks himself whether he would care at all if the results were negative. In many cases, the answer would be no. Some of the traders relied on similar judgments. “I did a sniff test of whether the results actually make sense,” says Paul Smeets from Maastricht University. “Some results look quite spectacular but also seem a bit too good to be true, which usually means they are.”

Prediction markets could help social scientists decide which classic studies to focus on replicating, given limited time or resources. They could tell researchers or funding agencies whether they stand to waste time and money building on work that others deem to be shaky. But everything hinges on who takes part in the markets.

Dreber suspects that the 206 traders were probably invested in the reproducibility debate, and have spent more time considering these issues than most. Perhaps they were especially good at discerning unreliable studies from reliable ones. “It’s not clear to me that if we had run the markets 10 years ago, people would have been as good,” says Dreber.

Alison Ledgerwood from the University of California at Davis agrees. In the wake of the replication crisis, rather than automatically thinking that any published or statistically significant finding is true, “researchers are instead looking more carefully at various aspects of a study,” she says. “If that’s what’s going on, it’s great news. When a new study comes out, we need to think of it as one brick in a larger structure we are trying to build, and we also need to evaluate how strong each brick is likely to be before putting a lot of weight on it.”

But Ledgerwood notes that the prediction markets worked because they relied on a crowd of people making judgements as a collective. “These findings don’t mean we can each individually forecast with a crystal ball whether a given study result will replicate,” she says. “It would be a mistake to conclude that individuals can predict scientific truths with great accuracy based on their gut.”

Dreber also cautions that the SSRP only looked at 21 studies, and can’t say much about whether prediction markets can more broadly gauge the reliability of social-science studies. But this is the third time that such markets have been successfully used in this way—once for psychology and a second time for economics—and several more attempts are coming up. Evidence is mounting, and confidence in the markets is growing.

The same could be said about big projects in which psychologists work together to replicate past studies. Six such projects, including the SSRP, have now been completed. Between them, they’ve successfully replicated just 87 out of 190 studies, for an overall rate of 46 percent. “This is not acceptable,” says Simine Vazire from UC Davis.
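The two success rates quoted in this article follow directly from the counts. A short Python check, using only the figures reported here (the function name `replication_rate` is an illustrative choice):

```python
def replication_rate(replicated, attempted):
    """Return the share of attempted replications that succeeded."""
    if attempted <= 0:
        raise ValueError("attempted must be positive")
    if not 0 <= replicated <= attempted:
        raise ValueError("replicated must lie between 0 and attempted")
    return replicated / attempted


# Counts as reported in the article.
pooled = replication_rate(87, 190)  # six large replication projects combined
ssrp = replication_rate(13, 21)     # the SSRP alone

print(f"pooled: {pooled:.0%}")  # prints "pooled: 46%"
print(f"SSRP:   {ssrp:.0%}")    # prints "SSRP:   62%"
```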

The 62-percent success rate from the SSRP, though higher, is still galling to Vazire, since the project specifically looked at the two most prestigious journals in the world. “We should not treat publication in Science or Nature to be a mark of a particularly robust finding or a particularly skilled researcher,” she says. These journals “are not especially good at picking out really robust findings or excellent research practices. And the prediction market adds to my frustration because it shows that there are clues to the strength of the evidence in the papers themselves.”

If prediction-market participants could collectively identify reliable results, why couldn’t the scientists who initially reviewed those papers, or the journal editors who decided to publish them? “Maybe they’re not looking at the right things,” says Vazire. “They probably put too little weight on markers of replicability, and too much on irrelevant factors, including the prestige of the authors or their institution.”

Fortunately, there are signs of progress. The number of pre-registered experiments—in which researchers lay out all their plans beforehand to obviate the possibility of p-hacking—has been doubling every year since 2012. The number of psychology journals that have adopted transparency policies has risen from zero in 2013 to at least 24 now; such policies require scientists to make their work as open as possible so that others can more easily check for errors or replicate it.

The very act of replication has become normalized. “Five years ago, when we reached out to an original author about our replication projects, it wasn’t terribly uncommon for people to say, ‘Don’t you trust me?’ or, ‘Did I do something wrong?’ That just doesn’t even come up now,” says Nosek. More often than not, researchers are cooperating with the replicating teams. Even when they disagree with the results of the replications, they’re providing reasoned explanations.

“It’s the ordinary process of scientists with different points of view debating about what the evidence might be, rather than recriminations or concerns about reputations,” says Nosek. “That’s an important signal of the reformation that’s happening.”
