Again: Let’s stop talking about published research findings being true or false

Coincidentally, on the same day this post appeared, a couple people pointed me to a news article by Paul Basken entitled, “A New Theory on How Researchers Can Solve the Reproducibility Crisis: Do the Math.”

This is not good.
First, math (or even statistics) won’t solve the reproducibility problem. All the math in the world won’t save you if you gather noisy data and study ill-defined effects. Satoshi Kanazawa could be as brilliant as Carl Friedrich Gauss, squared, and it wouldn’t matter cos there’s no blood that can be squeezed from the stone of those sex-ratio studies. Similarly for the ovulation and voting paper, or the ovulating-women-are-three-times-as-likely-to-wear-red paper. Dead on arrival, all of ’em. Too much noise. Going for statistical significance won’t work in those “power = .06” studies, cos if you do get lucky and find statistical significance, it tells you just about nothing anyway. That’s why I don’t go around recommending that people do preregistered replications of these sorts of studies. Why bother? I’m not gonna tell people to waste their time.
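To see why a “significant” result in a power = .06 study tells you just about nothing, here’s a minimal simulation sketch (pure Python; the true effect and standard error are hypothetical values chosen so that power comes out near 6%): among the estimates that clear the significance bar, the magnitude is wildly exaggerated and the sign is often wrong.

```python
import random

random.seed(1)
TRUE_EFFECT = 0.3  # hypothetical true effect, in standard-error units
SE = 1.0           # chosen so power at alpha = .05 is roughly 6%

estimates = [random.gauss(TRUE_EFFECT, SE) for _ in range(100_000)]
significant = [e for e in estimates if abs(e) / SE > 1.96]

power = len(significant) / len(estimates)
exaggeration = sum(abs(e) for e in significant) / len(significant) / TRUE_EFFECT
wrong_sign = sum(e < 0 for e in significant) / len(significant)

print(f"power: {power:.2f}")                      # close to 0.06
print(f"type M: significant estimates average {exaggeration:.1f}x the true effect")
print(f"type S: {wrong_sign:.0%} of significant estimates point the wrong way")
```

Conditioning on significance in a noisy study just selects the lucky draws, so the winning estimates overstate the effect severalfold and a nontrivial fraction get the direction backwards.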

The other problem is this: “More useful, Mr. Fournier said, would be a practice in which yes-or-no declarations would be replaced in journal articles by more specific estimates of how likely it is that a particular research observation did not just randomly occur: such as 1 in 20, or 1 in 100, or 1 in 1,000.” This is wrong for all the reasons discussed here.

So, no, there’s no “new theory on how researchers can solve the reproducibility crisis.” To the extent there is a new theory, it’s an old theory, which is that scientists should focus on scientific measurement (see here and here).

In some sense, the biggest problem with statistics in science is not that scientists don’t know statistics, but that they’re relying on statistics in the first place.

Just imagine if papers such as himmicanes, air rage, ages-ending-in-9, and other clickbait cargo-cult science had to stand on their own two feet, without relying on p-values—that is, statistics—to back up their claims. Then we wouldn’t be in this mess in the first place.

I’m not saying statistics are a bad idea. I do applied statistics for a living. But I think that if researchers want to solve the reproducibility crisis, they should be doing experiments that can successfully be reproduced—and that involves getting better measurements and better theories, not rearranging the data on the deck of the Titanic.

P.S. Just to be clear: I’m not criticizing Basken’s article, which brings up a bunch of issues that people are talking about. I’m just bothered by what I see as a naive attitude that some people have, that statisticians and statistical education will fix our scientific replication problems. As McShane and Gal have pointed out, lots of statisticians don’t understand some key principles in this area, and I worry that a focus on statistics, preregistration, etc., will distract researchers from the real problem: that crappy measurement is standard in some fields of research. Remember, honesty and transparency are not enough.

16 Comments

ironically, the inability of many scientists to understand your point stems from the fact that they don’t know enough about statistics. so, without providing education first, it’s not even possible to communicate this point to them.

> from the fact that they don’t know enough about statistics
I think it’s more that they don’t know enough about science, what it is and how it should work, and instead mistake it for largely rote application of experimental methods followed by use of the correct statistics, interpreted as is commonly done in their field.

I did not quite grasp the advantage of type S and type M errors until I thought of them in the context of how research is actually done in most fields: the cheapest study you can get away with to test perhaps-not-that-thoughtful hypotheses (theory is hard and time consuming), with a focus almost exclusively on confidence intervals that don’t overlap zero, or not by much, forgetting the rest.

In such a selected subset of intervals, an unfortunately high percentage will be in the wrong direction, and those that aren’t will have centres that are exaggerated. But this is the world view that is being created.
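That selection effect is easy to sketch in a few lines (pure Python; the effect size and noise level here are made-up numbers, just for illustration): simulate noisy studies of a small true effect, keep only the 95% intervals that exclude zero, and look at what the resulting “published” record shows.

```python
import random

random.seed(2)
TRUE, SE = 0.5, 1.0  # hypothetical: small true effect, noisy estimate

kept = []  # centres of 95% intervals that exclude zero
for _ in range(100_000):
    centre = random.gauss(TRUE, SE)
    if centre - 1.96 * SE > 0 or centre + 1.96 * SE < 0:
        kept.append(centre)

wrong_direction = sum(c < 0 for c in kept) / len(kept)
mean_exaggeration = sum(abs(c) for c in kept) / len(kept) / TRUE

print(f"{len(kept)} of 100000 intervals exclude zero")
print(f"wrong direction: {wrong_direction:.0%}")
print(f"surviving centres average {mean_exaggeration:.1f}x the true effect")
```

The surviving intervals are exactly the unrepresentative ones: a noticeable share point the wrong way, and the rest have centres several times the true effect.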

Not understanding statistics is part of it, but another part is that people—applied researchers and also many professional statisticians—want statistics to do things it just can’t do. “Statistical significance” satisfies a real demand for certainty in the face of noise. It’s hard to teach people to accept uncertainty. I agree that we should try, but it’s tough, as so many of the incentives of publication and publicity go in the other direction.

“For a p-value of .05, as is typical, a study’s finding will be deemed significant if researchers identify a 95-percent chance that it is genuine.”

I know the argument gets brought up here that if we got rid of p-values, researchers would just hop to some other statistical summary and compare it to an arbitrary threshold, and so this problem with p-values is really a problem with hypothesis testing. And I agree with that to an extent, but I suspect that the problem wouldn’t be so bad if the statistical summary everyone used wasn’t nearly universally misinterpreted as giving far more information than it actually gives. It’s no wonder so many people ignore us when we tell them not to read too much into “p < 0.05" – they think there is really a 95% chance that their hypotheses are correct and that we're just nagging them over a 5% chance they aren't.
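The gap between “p < 0.05” and “95% chance the hypothesis is correct” can be demonstrated with a toy simulation (the base rate of real effects and the power are made-up numbers, purely for illustration): even with everyone playing fair, the share of significant results that reflect real effects can sit nowhere near 95%.

```python
import random

random.seed(3)
BASE_RATE = 0.10  # hypothetical: 10% of tested hypotheses are real effects
POWER = 0.50      # hypothetical: power when the effect is real
ALPHA = 0.05      # false-positive rate when it is not

real_hits = null_hits = 0
for _ in range(200_000):
    real = random.random() < BASE_RATE
    if random.random() < (POWER if real else ALPHA):
        if real:
            real_hits += 1
        else:
            null_hits += 1

share_real = real_hits / (real_hits + null_hits)
print(f"share of significant results that are real effects: {share_real:.0%}")
```

Under these assumptions, “significant” means roughly a coin flip that the effect is real, not anything like 95% certainty.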

I haven’t seen examples of that (yet), but it wouldn’t surprise me at all. In my (admittedly) limited experience, non-statisticians I’ve talked to who use p-values in their research misinterpret the p-value this way 100% of the time. There are problems with p-values that go beyond misinterpreting them, but most people can’t get past this basic level. And I don’t blame them, because once you understand what a p-value really is, the next obvious question is “so how is this useful?” I don’t think many researchers are too keen on the idea that the main method which they and all their peers use to gather “statistical evidence” is vastly weaker than they’ve been led to believe.

This is a huge problem among journalists. You can tell that many science journalists have asked “experts” to explain p values to them, then scratched their heads and said, “Huh. So it kinda means [the inverse probability fallacy] then?” Experts: “Um, yeah, kind of.” Then the fallacy goes into the newspaper or magazine.

“Just imagine if papers such as himmicanes, air rage, ages-ending-in-9, and other clickbait cargo-cult science had to stand on their own two feet, without relying on p-values—that is, statistics—to back up their claims. Then we wouldn’t be in this mess in the first place.”

This is what I think about when I see the p-value defended on the grounds that it provides error control and a defense against mistaking noise for signal. The people who make this defense typically say something like “yes, we know that p-values can be abused, but we’d be even worse off without them because then we’d be losing this layer of protection.” I wonder how often p-values end up having the opposite effect in practice – convincing people that there is signal in what would otherwise just look like noise. It’s tough not to get excited when you see that little asterisk in your regression output.