Summary

In this post I work through a recent homework exercise that illustrates why you shouldn’t compare means by checking for confidence interval overlap. I calculate the type I error rate of this procedure for a simple case. This reveals where our intuition goes wrong: namely, we can recover the confidence interval heuristic by confusing standard deviations and variances.

Checking confidence intervals for overlap

Sometimes you may want to check if two (or more) means are statistically distinguishable. Since proper inferential procedures can be a bit of a pain, you might be tempted to use a shortcut: plot the (\(\alpha\)-level) confidence intervals for the means and check if they overlap.

This might seem especially useful when someone sends you plots like:

In this case you might say, oh hey, the confidence intervals for the carb and qsec coefficients overlap, so those coefficients must not be statistically different. Or you might get a plot like:

and do something similar. While this procedure is intuitively satisfying, you should avoid it, because it doesn’t work.

Type I error rate for a simple case

To demonstrate what can go wrong, we’ll explicitly calculate the type I error rate for a simplified case. In particular, we’ll assume that we have two samples \(x_1, ..., x_n\) and \(y_1, ..., y_n\), both of size \(n\), from a \(\mathrm{Normal}(\mu, \sigma^2)\) distribution with known \(\sigma^2\). Here the true mean for both samples is the same value, \(\mu\), and we’d like to know the probability that we reject the null of equal means using the overlapping confidence interval heuristic. That is, we want

\[
P \left( \text{the confidence intervals for } \color{blue} \bar x \color{black} \text{ and } \color{blue} \bar y \color{black} \text{ do not overlap} \right)
\]

There are two ways the confidence intervals can fail to overlap. If \(\color{blue} \bar x \color{black} < \color{blue} \bar y\), the upper confidence bound of \(\color{blue} \bar x\) can be less than the lower confidence bound of \(\color{blue} \bar y\). The other situation is symmetric, and since \(P(\color{blue} \bar x \color{black} < \color{blue} \bar y \color{black}) = 0.5\), we can calculate the probability of a type I error for the \(\color{blue} \bar x \color{black} < \color{blue} \bar y\) case and multiply by two to get the overall probability of a type I error.
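Before grinding through the math, we can sanity-check the heuristic by simulation. Here is a minimal Python sketch; the helper name `overlap_rejects` and the particular choices of `n`, `sigma`, and the seed are mine, not part of the derivation:

```python
import random
import statistics
from statistics import NormalDist

def overlap_rejects(n, mu, sigma, alpha, rng):
    """Draw two samples with the SAME mean and report whether their
    alpha-level (known-sigma) confidence intervals fail to overlap,
    i.e. whether the heuristic 'rejects' equal means."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    xbar = statistics.fmean(rng.gauss(mu, sigma) for _ in range(n))
    ybar = statistics.fmean(rng.gauss(mu, sigma) for _ in range(n))
    half_width = z * sigma / n ** 0.5  # CI half-width with known sigma
    lo, hi = sorted([xbar, ybar])
    return lo + half_width < hi - half_width  # intervals are disjoint

rng = random.Random(27)
n_sims = 100_000
rejections = sum(
    overlap_rejects(n=10, mu=0.0, sigma=1.0, alpha=0.05, rng=rng)
    for _ in range(n_sims)
)
print(rejections / n_sims)  # roughly 0.005 to 0.006, far below the nominal 0.05
```

Even before any algebra, the simulated rejection rate lands nowhere near the \(\alpha = 0.05\) we might naively expect.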

Let’s translate the rejection condition into a mathematical statement. In particular, we reject when:

\[
\color{blue} \bar x \color{black} + \color{purple} z_{1 - \alpha / 2} \color{black} \cdot \frac{\sigma}{\sqrt n} < \color{blue} \bar y \color{black} - \color{purple} z_{1 - \alpha / 2} \color{black} \cdot \frac{\sigma}{\sqrt n},
\qquad \text{or equivalently,} \qquad
2 \cdot \color{purple} z_{1 - \alpha / 2} \color{black} < \frac{\color{blue} \bar y \color{black} - \color{blue} \bar x \color{black}}{\sigma / \sqrt n}
\]

and at this point you may note that the right hand side looks suspiciously like a pivot that should be standard normal. Let’s work out the distribution of \(\color{blue} \bar y \color{black} - \color{blue} \bar x\) to see if this is the case. Recall that both \(\color{blue} \bar y\) and \(\color{blue} \bar x\) are \(\text{Normal}(\mu, \color{green} \sigma^2 / n \color{black})\) random variables, and they are independent.

Adding two independent normals together (here, \(\color{blue} \bar y\) and \(-\color{blue} \bar x\)) gives us a new normal, and all that’s left to do is calculate the mean and variance of the new normal. Using linearity of expectation and the fact that variances of independent random variables add:

\[
\mathbb{E}[\color{blue} \bar y \color{black} - \color{blue} \bar x \color{black}] = \mu - \mu = 0,
\qquad
\mathrm{Var}(\color{blue} \bar y \color{black} - \color{blue} \bar x \color{black}) = \color{green} \frac{\sigma^2}{n} \color{black} + \color{green} \frac{\sigma^2}{n} \color{black} = \frac{2 \sigma^2}{n}
\]

So \(\color{blue} \bar y \color{black} - \color{blue} \bar x \color{black} \sim \mathrm{Normal}(0, 2 \sigma^2 / n)\), which means \(\frac{\color{blue} \bar y \color{black} - \color{blue} \bar x \color{black}}{\sigma / \sqrt n}\) is \(\sqrt 2 \cdot Z\) for a standard normal \(Z\), not a standard normal itself. The rejection probability for the \(\color{blue} \bar x \color{black} < \color{blue} \bar y\) case is then \(P(\sqrt 2 \cdot Z > 2 \cdot \color{purple} z_{1 - \alpha / 2} \color{black}) = 1 - \Phi(\sqrt 2 \cdot \color{purple} z_{1 - \alpha / 2} \color{black})\), and doubling covers both cases.

The \(\sqrt 2\) doesn’t cancel, so we’re left with the expression \(2 \cdot (1 - \Phi(\sqrt 2 \cdot \color{purple} z_{1 - \alpha / 2} \color{black}))\), which is a bit hard to parse on its own. To get an idea of what it looks like, I’ve plotted \(\alpha\) (used to construct the confidence intervals for the means) against the actual type I error rate:

This plot indicates that the actual type I error rate is always lower than the nominal rate used to construct the confidence intervals. That is, the overlapping confidence interval heuristic is far too conservative, and the resulting test is systematically underpowered.
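The values behind that plot are easy to reproduce with the Python standard library. A small sketch, where the function name `actual_type_i_error` is my own:

```python
from statistics import NormalDist

std_normal = NormalDist()

def actual_type_i_error(alpha):
    """Type I error of the overlap heuristic for the known-sigma,
    equal-n case: 2 * (1 - Phi(sqrt(2) * z_{1 - alpha/2}))."""
    z = std_normal.inv_cdf(1 - alpha / 2)
    return 2 * (1 - std_normal.cdf(2 ** 0.5 * z))

for alpha in (0.01, 0.05, 0.10, 0.20):
    print(f"alpha = {alpha:.2f} -> actual type I error = {actual_type_i_error(alpha):.4f}")
    # e.g. alpha = 0.05 gives about 0.0056
```

Every value comes out well below the corresponding \(\alpha\), matching the plot.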

What went wrong

At first, your intuition may suggest that this confidence interval thing is a reasonable testing procedure, but clearly something is wrong with it. Where is our intuition leading us astray?

The overlap heuristic rejects only when \(\color{blue} \bar y \color{black} - \color{blue} \bar x\) exceeds the sum of the two interval half-widths, which implicitly treats the standard deviation of the difference as \(\mathrm{sd}(\color{blue} \bar y \color{black}) + \mathrm{sd}(\color{blue} \bar x \color{black}) = 2 \cdot \sigma / \sqrt n\). But standard deviations don’t add; variances do, and the true value is \(\mathrm{sd}(\color{blue} \bar y \color{black} - \color{blue} \bar x \color{black}) = \sqrt{\sigma^2 / n + \sigma^2 / n} = \sqrt 2 \cdot \sigma / \sqrt n\). So our missing factor of \(\sqrt 2\) appears if we forget that confidence intervals work on the standard-deviation scale, and we accidentally apply our variance-scale intuition to the problem. Minkowski’s inequality tells us that, for this particular setup, our mistake will always overestimate the true standard deviation of \(\color{blue} \bar y \color{black} - \color{blue} \bar x\), and thus we have a systematically underpowered test.
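To make the two scales concrete, here is a tiny sketch; the values of `n` and `sigma` are arbitrary:

```python
# The overlap heuristic implicitly adds standard errors, while the
# correct test combines variances first and then takes a square root.
n, sigma = 10, 1.0
se = sigma / n ** 0.5

heuristic_sd = se + se                # 2 * sigma / sqrt(n): adds sds (wrong scale)
correct_sd = (se ** 2 + se ** 2) ** 0.5  # sqrt(2) * sigma / sqrt(n): adds variances

print(heuristic_sd / correct_sd)  # sqrt(2), about 1.414: the missing factor
```

The ratio is \(\sqrt 2\) regardless of `n` and `sigma`, which is exactly the factor we lost above.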

Conclusion

There are a bunch of heuristics for determining if means are different based on confidence interval overlap. You shouldn’t take them too seriously. People have written great papers on this, but I seem to have misplaced my references at the moment.

For an interesting comparison of some of the many correct ways to compare means, check out Andrew Heiss’ recent blog post. You may also enjoy David Darmon’s very similar discussion on confidence interval procedures. He ends with the following thought-provoking call to action:

Left as an exercise for the reader: A coworker asked me, “If the individual confidence intervals don’t tell you whether the difference is (statistically) significant or not, then why do we make all these plots with the two standard errors?” … Develop an answer that (a) isn’t insulting to non-statisticians and (b) maintains hope for the future of the use of statistics by non-statisticians.

In future posts I plan to build on this idea and explore statistics as a primarily sociological problem.