Scientists’ grasp of confidence intervals doesn’t inspire confidence

Sometimes it’s hard to have confidence in science. So many results from published scientific studies turn out to be wrong.

Part of the problem is that science has trouble quantifying just how confident in a result you should be. Confidence intervals are supposed to help with that. They’re like the margin of error in public opinion polls. If candidate A is ahead of candidate B by 2 percentage points, and the margin of error is 4 percentage points, then you know it’s not a good bet to put all your money on A. The difference between the two is not “statistically significant.”

Traditionally, science has expressed statistical significance with P values, P standing (loosely) for the probability of getting a result at least as extreme as the one you observed if nothing but chance were at work. P values have all sorts of problems, which I’ve discussed here and here. Consequently many experts have advised using confidence intervals instead, and their use is becoming increasingly common. While there are some advantages in that, confidence intervals sadly are not what they are commonly represented to be either.

That shouldn’t be surprising, because confidence intervals are a complicated concept. You could probably find more scientists who understand quantum mechanics than understand confidence intervals. And even when you do understand them, it’s very cumbersome to explain them, so most people just fall back on inaccurate shortcuts. (I have been guilty. Writing this blog post is my punishment.)

Ordinarily you might see a confidence interval expressed something like this:

The average weight loss for people on the new miracle drug was 4.6 pounds, with a 95 percent confidence interval of 2.2 pounds to 6.9 pounds.

You’d get such a result from taking a sample of people from a population, giving them the drug, recording the weight loss for each individual and then calculating the average (and the variation around the average) to compute the confidence interval. (In real life you should have a placebo group for comparison and stuff like that, but let’s keep things simple for now.) Supposedly you can then conclude that the average weight loss you’d see by giving the drug to everybody in the population would be between 2.2 and 6.9 pounds, with 95 percent confidence.
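For readers who want to see the arithmetic, here is a minimal sketch of that calculation in Python. Everything specific in it is an assumption made for illustration: the 60 simulated participants, the 9-pound spread and the function name `mean_confidence_interval` are invented, and the interval uses the normal approximation (a multiplier of about 1.96) rather than the Student's t multiplier a statistician would usually prefer for a small sample.

```python
import math
import random
import statistics


def mean_confidence_interval(sample, confidence=0.95):
    """Return (mean, low, high) for the sample mean.

    Assumption: normal approximation, so the half-width is
    z * (sample standard deviation / sqrt(n)).
    """
    n = len(sample)
    mean = statistics.fmean(sample)
    std_err = statistics.stdev(sample) / math.sqrt(n)
    # Two-sided critical value; about 1.96 for 95 percent.
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean, mean - z * std_err, mean + z * std_err


# Hypothetical data: weight loss in pounds for 60 simulated participants.
rng = random.Random(0)
weight_loss = [rng.gauss(4.6, 9.0) for _ in range(60)]

mean, low, high = mean_confidence_interval(weight_loss)
print(f"mean = {mean:.1f} lb, 95% CI = {low:.1f} to {high:.1f} lb")
```

The interval is nothing more than the sample average plus or minus a multiple of its standard error, which is why a different sample produces a different interval.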

But what does that really mean? Some people will say it means if you did the experiment 100 times, the average (mean) you got would be within that 2.2–6.9 range in about 95 of the trials. Or that the true average for the whole population would fall within that range with 95 percent probability. Wrong wrong wrong. Think about it. Suppose you did the experiment a second time, and got an average of 5.9 pounds with a confidence interval of 4.1 to 8.8 pounds. Would you be 95 percent confident in both of the reported ranges? And you’d get different confidence intervals every time you did the experiment. How can they all give you the right range of 95 percent confidence?
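A quick simulation makes the first mistake concrete. The sketch below runs many pairs of make-believe experiments and counts how often the second experiment's average lands inside the first experiment's interval. The population (true mean 4.6 pounds, spread 9 pounds, 50 people per experiment) is invented for illustration, the helper names are hypothetical, and the interval again uses the normal approximation.

```python
import math
import random
import statistics

Z95 = statistics.NormalDist().inv_cdf(0.975)  # about 1.96


def run_experiment(rng, n=50, true_mean=4.6, sd=9.0):
    """Simulate one experiment; return (mean, low, high) of its 95% interval."""
    sample = [rng.gauss(true_mean, sd) for _ in range(n)]
    mean = statistics.fmean(sample)
    half_width = Z95 * statistics.stdev(sample) / math.sqrt(n)
    return mean, mean - half_width, mean + half_width


def replication_capture_rate(pairs=10000, seed=1):
    """Fraction of replications whose mean falls inside the original interval."""
    rng = random.Random(seed)
    captured = 0
    for _ in range(pairs):
        _, low, high = run_experiment(rng)        # the original study
        second_mean, _, _ = run_experiment(rng)   # an independent replication
        if low <= second_mean <= high:
            captured += 1
    return captured / pairs


rate = replication_capture_rate()
print(f"replication mean inside original interval: {rate:.1%}")
```

Under these assumptions the rate should come out well short of 95 percent, somewhere in the neighborhood of 83 percent, because both the original interval and the replication's average bounce around from sample to sample.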

In actual statistical fact, a confidence interval tells you not how confident to be in the answer, but how confident to be in your sampling. In other words, if you repeated the experiment (on different samples from your population) a gazillion times, the confidence intervals you computed would contain the true value in 95 percent of the trials. That merely tells you how often your confidence range will be valid over the course of many repetitions of the experiment. As statistician and political scientist Andrew Gelman expresses it, “Under repeated sampling, there is a 95 percent probability that the true mean lies between the lower and upper bounds of the interval.”
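The repeated-sampling idea is easy to check by brute force. This sketch draws thousands of samples from a made-up population whose true mean we get to choose, builds a 95 percent interval from each one, and counts how many intervals contain the truth. The population values (true mean 4.6, spread 9, 50 people per sample) and the function name `coverage_rate` are assumptions for illustration, and the intervals use the normal approximation, so the count should land close to 95 percent rather than exactly on it.

```python
import math
import random
import statistics

Z95 = statistics.NormalDist().inv_cdf(0.975)  # about 1.96


def coverage_rate(repeats=10000, n=50, true_mean=4.6, sd=9.0, seed=2):
    """Fraction of simulated 95% intervals that contain the true mean."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(repeats):
        sample = [rng.gauss(true_mean, sd) for _ in range(n)]
        mean = statistics.fmean(sample)
        half_width = Z95 * statistics.stdev(sample) / math.sqrt(n)
        # Each repetition yields a different interval; the truth stays fixed.
        if mean - half_width <= true_mean <= mean + half_width:
            hits += 1
    return hits / repeats


print(f"intervals containing the true mean: {coverage_rate():.1%}")
```

Note what is random here and what is not: the true mean never moves, while the interval's endpoints change with every sample. The 95 percent describes the procedure's long-run hit rate, not any single interval it spits out.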

But people who use confidence intervals when reporting on scientific research typically don’t understand all this. One recent study found, for instance, that psychology researchers didn’t understand the correct meaning of a confidence interval any better than students with no training in statistics.

“Both researchers and students in psychology have no reliable knowledge about the correct interpretation of confidence intervals,” Rink Hoekstra of the University of Groningen in the Netherlands and collaborators wrote in Psychonomic Bulletin & Review (published online in January).

Hoekstra and colleagues presented 118 researchers and 476 students with a simple scenario: A professor analyzes an experiment and reports that the 95 percent confidence interval for the mean is 0.1 to 0.4. Then the students and researchers took a simple true-false test, with six statements about that report, such as “There is a 95 percent probability that the true mean lies between 0.1 and 0.4.” All six of the statements were false, but on average the researchers reported 3.45 of the six statements to be true — about the same performance as first-year students (3.51). Only eight of the students and three of the researchers correctly labeled all six statements as false.

The correct statement, not included in the test, is: “If we were to repeat the experiment over and over, then 95 percent of the time the confidence intervals contain the true mean.”

To be sure, this paper provoked some criticism, and not all statisticians agree on how to use the word “confident.” If you have a lot of free time you can read the comments on Gelman’s blog. Gelman himself makes the point most clearly, though, that a 95 percent probability that a confidence interval contains the mean refers to repeated sampling, not to any one individual interval. “Once you put in the actual values 0.1 and 0.4, it becomes a conditional statement and it is not in general true,” he writes.

But rather than let Gelman (or me) have the last word, it makes more sense to justify this interpretation by citing the statistician Jerzy Neyman, who developed the idea of confidence intervals in the 1930s. In a 1937 paper, for instance, Neyman worked out the mathematical foundations for confidence intervals in detail, based on probability theory. He noted in particular that any given confidence level, say 95 percent, produces intervals that will contain the true mean that percentage of the time “in the long run.”

“Consider now the case when a sample … is already drawn and the calculations have given, say, [low end of interval] = 1 and [high end of interval] = 2,” he wrote. “Can we say that in this particular case the probability of the true value of [the mean] falling between 1 and 2 is equal to [95 percent]? The answer is obviously in the negative. The parameter [the mean] is an unknown constant and no probability statement concerning its value may be made.”