For more than a decade, psychology has been contending with some of its research findings going up in smoke. Widely publicized attempts to replicate major findings have shown that study results scientists and the public took for granted might be no more than statistical flukes. We should, for example, be primed for skepticism when studying priming. Power posing may be powerless.

A recent piece in the New York Times recalled the success of a popular study on powerful poses, and how efforts to replicate the research failed. The article detailed how the collapse of the research behind power poses took place in the increasingly common culture of public critique and infighting in social psychology. Some of that fighting comes from efforts not to tear down, but to build up the field with better scientific rigor and statistics.

Those efforts are well-intentioned. But have they had any impact? Studies published earlier this year tackle that question, first with a survey of social scientists and then with an analysis of recently published papers. The results show that psychology remains plagued by small sample sizes, pessimism and a strong pressure to publish. But they also suggest that psychologists really are trying to turn their field around, and that online fights might even be part of the solution.

“My general research interest is trying to figure out how people can talk about things that they care about without yelling at each other,” says Matt Motyl, a psychologist at the University of Illinois in Chicago. “Usually that means politics or religion.” But heated op-eds and scathing rebuttals in scientific journals, along with raging debates about scientific practice on prominent psychology blogs, soon drew his attention. “The vitriol and personal attacks I saw people making seemed to go beyond what the data actually said,” Motyl says. “Show me the data. Is the field getting better or is it getting worse?”

Motyl and his colleague Linda Skitka, also a psychologist at the University of Illinois in Chicago, created a survey to try to understand the views behind the public disagreements. The goal was to find out just how replicable social scientists think the work in their field is, and whether it is better — or worse — than it was 10 years ago. Motyl and Skitka sent the survey to the memberships of three social and personality psychology societies — the Society for Personality and Social Psychology, the European Society for Social Psychology and the Society for Australasian Social Psychologists — and publicized it on Twitter. They got completed surveys from more than 1,100 people, almost 80 percent of whom were social psychologists.

Half of the scientists who filled out the survey felt that their field is producing more dependable results now than it did 10 years ago. Respondents also estimated that less than half of the studies published 10 years ago yielded conclusions that could be replicated, and for recent studies, that figure is about half, Motyl and Skitka reported in the July Journal of Personality and Social Psychology.

Survey participants were also asked if they engaged in questionable research practices such as reporting only experiments that produced positive results, dropping conditions from experiments or falsifying data. Then, they were asked to justify when or if those practices were acceptable.

Unsurprisingly, faking data was deemed never acceptable. But opinions on the other practices were more variable, and many scientists provided explanations to justify when they had used practices such as deciding to collect more data after looking at their results or reporting only the experiments that produced the desired effects. But more than 70 percent of the scientists stated they would be less likely to engage in questionable practices now that they’ve become aware of the problems they cause.

“It’s heartening to see some evidence social and personality psychologists are incorporating better research practices into their work,” says Alison Ledgerwood, a psychologist at the University of California at Davis (and one of the self-identified reviewers of the paper).

The survey participants were very clear about what drove their problematic practices. “The worst practices were external to the researchers themselves,” Skitka says: “It’s the publish-or-perish business” that’s driving bad practice. Pressure to publish findings that appeared to support a hypothesis (and even editor and peer-review requests to do so) drove 83 percent of the scientists to selectively report only the studies that turned out well. Among scientists who dropped conditions from their studies, 39 percent said they did so out of publication pressure. And 57 percent said the same pressure drove them to report unexpected findings as expected, in the interest of telling a more compelling story — some noting that editors and reviewers wanted them to do it.

“As an untenured junior faculty member, there’s a lot of pressure,” Motyl says. “There’s pressure to make things as publishable as possible.” Success in a research career depends on impressive findings, tight storylines and publishing in important journals. Under such pressure, bad behavior can be glossed over or even rewarded.

“It’s disappointing, but not surprising to me, that researchers are still reporting pressure to selectively report studies that ‘work’ and to leave out ones that don’t,” Ledgerwood says. Taking the pressure off positive publication, she notes, is going to have to come from the top. “I think our field — like many other scientific fields — is still waiting for reviewers and editors to change.”

It’s important to finally see a data-driven account of the state of psychology, she says. Often scientists declare a study on the state of the field as valid or not based on whether or not it matches their intuition. “Scientists are human too, and I think we often fall prey to this very human but biased way of thinking,” Ledgerwood notes. Ironically, psychologists themselves study that behavior, and call it motivated reasoning.

Survey vs Studies

While it’s nice to know that psychologists want to improve their methods, a survey can’t tell anyone if the science itself is actually getting better. So Motyl and Skitka also set out to compare papers in four major psychology journals published in 2003 and 2004 with the same number of studies published 10 years later, in 2013 and 2014. They conducted a series of reproducibility analyses on more than 540 articles containing over 1,500 experiments.

Motyl and Skitka assigned the papers to trained scientists, who combed through the papers by hand, searching for the statistics that each study used to determine if their main hypothesis was supported or not. “It’s probably the most extensive hand-coding of studies I’m aware of,” says Ulrich Schimmack, a psychologist at the University of Toronto.

The scientists analyzed the statistics in the papers using different measures to assess how replicable the results might be. The gold standard to determine if a study is replicable is, of course, to try to conduct it again, with the goal of achieving the same results. Barring that expensive and time-consuming effort, there are mathematical tests that scientists can perform to determine if the results from a set of studies might reflect questionable research practices such as publication bias or P-hacking — mining data to uncover significant differences.

Motyl and his colleagues focused on eight tests for replicability and questionable practices. These included tests for how much variance surrounded the main statistics of the studies, the sample size, the likelihood of replication, and the estimated power of the tests before and after the data was collected.
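The “power after the data was collected” idea — often called observed or post-hoc power — can be sketched from a study’s test statistic alone. A minimal illustration, assuming a simple two-sided z test (the function name and framing are mine, not the paper’s):

```python
from statistics import NormalDist

def observed_power(z_obs, alpha=0.05):
    # Post-hoc ("observed") power: if the true effect were exactly as large
    # as the observed z statistic suggests, how often would an identical
    # replication reach p < alpha?
    nd = NormalDist()
    crit = nd.inv_cdf(1 - alpha / 2)  # two-sided critical value (1.96 for alpha = .05)
    # Chance a replication's z statistic lands beyond the critical value, either tail
    return nd.cdf(-crit - z_obs) + 1 - nd.cdf(crit - z_obs)

print(round(observed_power(1.96), 2))  # a just-significant result: power is about 0.50
print(round(observed_power(3.0), 2))   # a stronger result replicates far more often
```

One consequence worth noticing: a result that just scrapes past significance has only about a coin flip’s chance of doing so again, which is why fields full of barely significant findings replicate poorly.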

Finally, the group used two tests called the P-curve and Z-curve. The P-curve looks at the distribution of P values across a series of experiments or studies. If the P values cluster between 0.00 and 0.01, it suggests that the observed effects are real. If they clump between 0.02 and 0.05, however, that may be a red flag that the findings are not robust. It could even hint at sloppy practices, such as hacking at the data until the results were just significant enough to publish. A Z-curve is a similar analysis based on Z scores, which measure how many standard deviations a result falls from what chance alone would produce.
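The logic behind the P-curve can be seen in a small simulation — a sketch, not the researchers’ actual code, with the z-test setup, effect size and sample size chosen purely for illustration. When a real effect exists, significant P values pile up near zero; when there is no effect, they spread out evenly:

```python
import math
import random

random.seed(1)

def one_sided_p(z):
    # P value for a one-sided z test
    return 0.5 * math.erfc(z / math.sqrt(2))

def simulate_p(effect, n):
    # z test of a sample mean against zero, known sd = 1
    xs = [random.gauss(effect, 1) for _ in range(n)]
    z = (sum(xs) / n) * math.sqrt(n)
    return one_sided_p(z)

def p_curve(effect, n=20, trials=20000):
    # Share of significant P values (p < .05) falling in each bin:
    # (0, .01], (.01, .02], ... (.04, .05]
    sig = [p for p in (simulate_p(effect, n) for _ in range(trials)) if p < 0.05]
    bins = [0] * 5
    for p in sig:
        bins[min(int(p / 0.01), 4)] += 1
    return [round(b / len(sig), 2) for b in bins]

print("true effect:", p_curve(effect=0.6))  # right-skewed: most P values near 0
print("no effect:  ", p_curve(effect=0.0))  # flat: P values uniform under the null
```

P-hacking tends to produce the opposite of the right-skewed pattern — a bulge of P values just under 0.05, because data mining stops as soon as a result crosses the publishable threshold.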

The different tests had highly variable results, but overall, most studies from both 2003–2004 and 2013–2014 appeared to have some real (though weak) evidence to support their findings. Sample sizes did appear to get larger over the 10-year period, which was a positive sign. And the results of the P-curves suggested that about 95 percent of the recent studies, and of those from a decade earlier, had a large enough effect size, or magnitude of the difference being studied. But experiments in both groups had problems with low power — that is, the studies were too small, and thus likely to miss a difference even when an effect actually exists.
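What low power means in practice can be shown with a quick simulation — a sketch under assumed numbers (a “medium” effect of half a standard deviation and the sample sizes below are illustrative values, not figures from the analyzed studies):

```python
import math
import random
from statistics import NormalDist

random.seed(7)

def two_group_power(effect, n_per_group, alpha=0.05, trials=10000):
    # Fraction of simulated two-group studies (normal data, sd = 1) in which
    # a two-sided z test on the difference in means reaches p < alpha.
    crit = NormalDist().inv_cdf(1 - alpha / 2)  # critical value (1.96 for alpha = .05)
    se = math.sqrt(2 / n_per_group)             # standard error of the mean difference
    hits = 0
    for _ in range(trials):
        a = [random.gauss(0, 1) for _ in range(n_per_group)]
        b = [random.gauss(effect, 1) for _ in range(n_per_group)]
        diff = sum(b) / n_per_group - sum(a) / n_per_group
        if abs(diff / se) > crit:
            hits += 1
    return hits / trials

# A medium effect (d = 0.5) with 20 people per group is missed roughly
# two times in three; around 64 per group are needed for 80 percent power.
print(two_group_power(0.5, 20))
print(two_group_power(0.5, 64))
```

The design choice here — brute-force simulation rather than a power formula — mirrors how power actually plays out: run the same underpowered study many times and count how often it finds the effect that is genuinely there.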

“My interpretation is that nothing showed significant evidence that we were getting worse,” Motyl says. “There’s some evidence things are getting better in terms of things like sample size.” Motyl, Skitka and their colleagues report their analysis in a paper also published in the July Journal of Personality and Social Psychology.

The likelihood of an effect being replicable — a score called the replicability index — went down over the 10-year period, from around 0.62 to 0.52. Motyl and his group described the result in the study as “rotten,” but Schimmack — who developed the replicability index — is not so pessimistic. “It’s not great, but it’s not something that we would say ‘oh my God, nothing in psychology can be replicated and everything is P-hacked,’” he says. “It came in at a D. It just passed but it’s not great.”

With the number of metrics the researchers looked at, and the different outcomes each found, the results are wide open to interpretation, says Michael Inzlicht, a social psychologist at the University of Toronto. “You can read into them what you like, so if you think the field is great or not so bad, you might focus on one metric that flatters the field. If you’re more pessimistic like myself, you can focus on those metrics that make it look not so great.” Regardless, the study was an ambitious attempt to represent the field as it stands, he notes.

But — in a scene that is perhaps characteristic of the current stage of psychology research — the results attracted immediate controversy.

The creators of the P-curve, one of the tests used in the study, claim that the analysis Motyl and his colleagues conducted is problematic. At issue is not the statistics behind the technique, but how Motyl and his group selected the P values they put into it. “You have to select the right one otherwise the inferences are false,” says Joe Simmons, a psychologist at the University of Pennsylvania. Simmons and his colleagues Leif Nelson at the University of California, Berkeley, and Uri Simonsohn, a psychologist at the University of Pennsylvania, published a critique of Motyl and Skitka’s use of the P-curve on their blog Data Colada.

Motyl and Skitka acknowledge that there will be a correction issued about the use of the P-curve, but add that it will not change the outcome of the study. Initially, the researchers were using code they wrote that wasn’t compatible with code from the P-curve website. They have since remedied that issue and, Motyl says, “the conclusion is still the same.”

The debate over this paper — much like the reproducibility crisis itself — still rages on. But the results showed one thing for sure, Schimmack says. Power is still too low across the board, resulting in findings that probably aren’t very trustworthy. “If we want to improve replicability, we need to put more effort and resources into increasing statistical power,” he notes. “That’s been the message for decades and nothing happened.”

So a paper critiquing methods in the field of psychology is being taken to task for its methods. The irony is not lost on the scientists involved. But it also shows the process of science, now playing out in the public eye and in real time on blogs and social media. “We’re doing something in a couple of weeks that in a peer review system would take years,” Schimmack says.

The findings are not definitive, but no paper’s findings ever are. “The nice contribution of the paper is that it is likely to spawn many additional efforts that will try to improve on the methodology,” says Brian Nosek, a psychologist at the University of Virginia in Charlottesville and the executive director at the Center for Open Science. The blog posts, corrections and replications flying back and forth are a sign of “productive dialog and skepticism.” And they’re a sign that as psychology research becomes more open, maybe it really is getting a little better.

“I can’t say that it’s fun to be critiqued on blogs that haven’t been peer-reviewed to the same degree we have,” Skitka says. “On the other hand, are we losing sleep about this? No. We made good decisions, we did the best we can, and if we made an error, we’ll go back and fix it.”

Editor’s note: This article was updated November 1, 2017, to correct the nature of Motyl and Skitka’s reanalysis. The errors were not in the P-curve code, as initially described, but resulted from an incompatibility between scripts from the P-curve website and those written by the researchers.