Here’s a post with a Super Bowl theme.

Kevin Lewis pointed me to an article in the Journal of the American Medical Association, using the email subject line, “Not statistically significant, but close.” The article in question, by Atheendar Venkataramani, Maheer Gandhavadi, and Anupam Jena, is called, “Association Between Playing American Football in the National Football League and Long-term Mortality,” and it reports:

In this retrospective cohort study of 3812 NFL players who debuted between 1982 and 1992, there was no statistically significant difference in the risk of long-term all-cause mortality among career NFL players compared with NFL replacement players who participated in the NFL during a 3-game league-wide player strike in 1987 (adjusted hazard ratio, 1.38; 95% CI, 0.95-1.99). Career participation in NFL football, compared with participation as an NFL replacement player, was not associated with a statistically significant difference in the risk of all-cause mortality.
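As an aside, the "close" in that email subject line is easy to check. Assuming the reported 95% CI is symmetric on the log hazard-ratio scale (the usual convention, though the paper does not say so explicitly), one can back out the approximate z-score and two-sided p-value from the three reported numbers:

```python
from math import log, sqrt, erf

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + erf(x / sqrt(2)))

# Back-of-the-envelope: recover the approximate two-sided p-value from the
# reported hazard ratio and 95% CI, assuming a symmetric CI on the log scale.
hr, lo, hi = 1.38, 0.95, 1.99
se_log = (log(hi) - log(lo)) / (2 * 1.96)   # implied standard error of log(HR)
z = log(hr) / se_log
p = 2 * (1 - norm_cdf(abs(z)))
print(f"z = {z:.2f}, p = {p:.2f}")  # roughly z = 1.7, p = 0.09
```

A p-value around 0.09: not statistically significant, but close, just as the subject line said.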

They report:

The study sample consisted of 2933 career NFL players and 879 NFL replacement players . . . A total of 144 deaths occurred among career NFL players . . . and 37 deaths occurred among NFL replacement players. . . . The mortality risk by study end line was 4.9% for career NFL players and 4.2% for NFL replacement players, and the unadjusted absolute risk difference was 0.7% . . . Adjusting for birth year, BMI, height, and position group, the absolute risk difference was 1.0%.

They also report that “Four career NFL players were excluded from the sample because they died while actively employed on an NFL roster.” Including them would bump up the mortality rate from 4.9% to 5.0%.
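As a quick arithmetic check, the reported rates follow directly from the death counts and group sizes in the abstract:

```python
# Counts taken from the paper's abstract.
career_deaths, career_n = 144, 2933   # career NFL players
repl_deaths, repl_n = 37, 879         # NFL replacement players

career_rate = career_deaths / career_n   # ~0.049, the reported 4.9%
repl_rate = repl_deaths / repl_n         # ~0.042, the reported 4.2%
risk_diff = career_rate - repl_rate      # ~0.007, the unadjusted 0.7% difference

# Adding back the 4 career players excluded because they died on a roster:
adj_rate = (career_deaths + 4) / (career_n + 4)   # ~0.050
```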

The real story, though, is that the sample is not large enough to reliably detect the difference between a 4% mortality rate and a 5% mortality rate.
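To see why, here is a rough power calculation for a two-sided two-proportion z-test using the normal approximation (a sketch, not anything from the paper; the 4% and 5% rates and the group sizes are taken from the numbers above):

```python
from math import sqrt, erf

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + erf(x / sqrt(2)))

# Approximate power of a two-sided two-proportion z-test at alpha = 0.05.
p1, n1 = 0.05, 2933   # career NFL players
p2, n2 = 0.04, 879    # replacement players
se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)  # SE under the alternative
z_alpha = 1.96
power = norm_cdf(abs(p1 - p2) / se - z_alpha)
print(f"approximate power: {power:.2f}")  # roughly 0.25
```

Power of roughly 25%, far below the conventional 80%: with these sample sizes, a true 1-percentage-point difference would usually fail to reach statistical significance.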

From the standpoint of statistical communication, the challenge is how to report such a finding.

I think the authors did a pretty good job of not overselling their results. To start with, their title is noncommittal. They do report, “not associated with a statistically significant difference,” which I don’t think is so helpful. But, flip it around: what if the numbers had been very slightly different and the 95% interval had excluded a null effect? Then we wouldn’t want them blaring that they’d found a clear effect.

And the paper includes a strong section on Limitations:

This study has several limitations. First, the residual differences between career NFL players and NFL replacement players that may bias the results could not be addressed. Additional research that collects data on medical comorbidities, lifestyle factors, and earnings across the 2 groups could further address this bias. A complementary approach might focus on groups of NFL career and replacement players who are more similar in their duration of NFL exposure. For example, comparing career NFL players with shorter tenures vs NFL replacement players may avoid healthy worker bias from players who are able to persist in the league because of better unobserved health. . . .

Second, the estimates were based on a small number of events: 181 deaths, of which only 37 occurred among NFL replacement players. Consequently, the present analysis could be underpowered to detect meaningful associations. Additional analyses with longer-term follow-up may be informative.

Third, the data were drawn from online databases. While these have been used in other analyses, it is possible that some information was misreported . . .

That’s how you do it: report what you found and be clear about what you did.

The corresponding author works at the University of Pennsylvania. Eagles fan?

I would think that the NFL replacement players had more adult football experience than the 3 games they played during the strike, as well as exposure to steroid use, etc. If we are interested in how football affects mortality, this is an odd choice of control group. We might instead compare NFL players to MLB or NBA players, perhaps with race as an independent variable, though there may be other demographic (SES) differences between the sports.

I’ve discussed this general point several times on the blog, for example here (2009), here (2011), here (2012), and here (2013). The question comes up a lot!

The short answer is that the population of interest is not just all NFL players during these years. At the very least we are interested in current and future NFL players, and we’re also interested in the many more people who have played or will play college football and high school football.

I’m sorry I haven’t read all the comments on all those posts, but I didn’t see any justification in the primary posts themselves. The authors of this paper themselves wrote, “Fifth, the results may be specific to NFL players who debuted between 1982 and 1992. It is possible that cohorts who debuted at other times faced different mortality risks.” Not only that, but the point here is to see whether two groups are different, not how well a model fits one group. One might argue that the two groups are as if chosen randomly, but the paper gives evidence they were not “as if” (see the section “Sample Characteristics”). So I don’t see how you can justify a P-value for the difference between the groups for the question of interest.

Of course the authors can do pure descriptive statistics, but the reason for being interested in this study at all is for its general implications about the risks of playing football.

Regarding your last sentence: I don’t really care about the p-value. Any two groups will be different. So I disagree with your statement that “the point here is to see whether two groups are different.” I think the point of the study is to (indirectly) assess the health risks of playing football.

I was responding to your sentence, “The real story, though, is that the sample is not large enough to reliably detect the difference between a 4% mortality rate and a 5% mortality rate.” It seems to me that you are referencing statistical significance and, thus, P-values (by comparing two groups).

Otherwise, I agree with what (in this comment) you say is the point, although there are such serious problems with this study that it does not make any useful contribution–at least, not by using P-values.

To elaborate, let me put it this way: “the sample is not large enough to reliably detect the difference between a 4% mortality rate and a 5% mortality rate, even in the best possible case which would be that the groups represent simple random samples of the population of interest.”

OK, I guess that’s fine. But suppose they had the entire populations of interest and not a (random or not) sample–even better than the best possible case. I assume this does not include the future (as not being possible). Would there still be a story about how to report the differences found?

If they had the entire population of interest, that would be it. Such problems do arise, for example in legal cases that are narrowly focused on some set of transactions. But in this case the interest is in football more generally. (Actually, here, even if we only cared about these particular players, there’s still an implicit sample defined based on counterfactuals. For example, if one of the replacement players had died as a passenger in a plane crash, this wouldn’t really be relevant to a question about the health risks of football.)

Just to expand upon the above points: Sure, it’s easier to analyze a random sample than to extrapolate from the past to the present and the future. Just as it’s easier to analyze a fully formulated probability model (for example, ideal roulette wheels) than to do something like reconstructing climate from tree rings. In the problem of extrapolating about future football players, or the problem of inference about the climate, we need to make assumptions. In simple random sampling or the idealized roulette wheel, there are assumptions too, but they’re baked into the data collection. Real surveys of people are not random samples because of nonresponse.

Thank you for covering our paper and for your comments. I’ve always learned a lot from your blog – and never imagined my work would be a subject of it! And thanks to the rest of you for the comments. To address some of these:

1. Mortality is one metric – and the statistical power to detect differences will grow over time – but I agree with the first comment that there are definitely other metrics of interest. We didn’t have those data (and they would require fielding a survey of some sort), but we think future work should look into this. Mainly, we wanted to start a conversation around a better comparison group for professional football players, and this is why we arrived at replacement players. I am sure that going forward, people will find (even) better comparisons. That is what we want!

2. The replacement player comparison group means that the active margin of policy relevance is professional football versus not. So the reply above is correct that we really cannot comment on the health consequences of football at other levels at all. Our view is that it will be important to find natural experiments for these other levels of exposure. Some papers that might be relevant (that I just learned about):

3. I appreciate the comments on how to discuss the findings. I trained as an economist, and we have very different conventions around thinking about p-values (typically, not making discrete changes in interpretation at specific thresholds), but I am also a physician, and medicine and medical journals have specific conventions on reporting findings based on p-values. These differences have played out interestingly in the press – some outlets report our findings as suggestive of a mortality difference and others flat out say no difference. From our view, the best we can say is that the results are what they are, and we encourage people to interpret them as Bayesians.