Cloak and dagger

I saw this JAMA Pediatrics article [by Julia Raifman, Ellen Moscoe, and S. Bryn Austin] getting a lot of press for claiming that LGBT suicide attempts went down 14% after gay marriage was legalized.

The heart of the study is a comparison of suicide attempt rates (in the last 12 months) before and after exposure, i.e., gay marriage legalization in their state. For LGBT teens, this rate dropped from 28.5% to 24.5%.

To test whether this drop was just part of an ongoing downward trend in LGBT suicide attempts, they do a placebo test, checking whether rates had already dropped 2 years before legalization. In the text of the article, they simply state that there is no drop.

But then you open up the supplement and find that about half of the drop in rates — 2.2% — already came 2 years before legalization. However, since 0 is contained in the 95% confidence interval, it’s not significant! Robustness check passed.
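To make the arithmetic concrete, here is a quick normal-approximation check. The point estimates (a 4-point total drop, a 2.2-point placebo drop) are from the article and its supplement; the standard error is a hypothetical value chosen so the placebo test comes out around p = 0.08, and is not from the paper.

```python
from math import erf, sqrt

def normal_ci_and_p(estimate, se, z_crit=1.96):
    """Two-sided 95% CI and p-value under a normal approximation."""
    ci = (estimate - z_crit * se, estimate + z_crit * se)
    z = abs(estimate) / se
    p = 2 * (1 - 0.5 * (1 + erf(z / sqrt(2))))
    return ci, p

total_drop = -4.0    # 28.5% -> 24.5% after legalization (from the paper)
placebo_drop = -2.2  # drop already present 2 years before legalization
se = 1.26            # HYPOTHETICAL standard error, not from the paper

ci, p = normal_ci_and_p(placebo_drop, se)
print(f"placebo drop = {placebo_drop}, 95% CI = ({ci[0]:.2f}, {ci[1]:.2f}), p = {p:.2f}")
print(f"share of total drop already present: {placebo_drop / total_drop:.0%}")
```

The point: a confidence interval that straddles zero is not the same as evidence of no pre-trend; here more than half the headline effect precedes the treatment, yet the check "passes."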

In figure 1 of the article, they graph suicide attempts before legalization to show the trend is flat, but they don't show LGBT-specific rates, even though they have those data for some of the states.

Very suspicious to me, what do you think?

My reply: I wouldn’t quite say “suspicious.” I expect these researchers are doing their best; these are just hard problems. What they’ve found is an association which they want to present as causation, and they don’t fully recognize that limitation in their paper.

Here are the key figures:

And from here it’s pretty clear that the trends are noisy, so that little differences in the model can make big differences in the results, especially when you’re playing the statistical significance game. That’s fine—if the trends are noisy, they’re noisy, and your analysis needs to recognize this, and in any case it’s a good idea to explore such data.

I also share Elan’s concern about the whole “robustness check” approach to applied statistics, in which a central analysis is presented and then various alternatives are presented, with the goal of showing the same thing as the main finding (for perturbation-style robustness checks) or showing nothing (for placebo-style robustness checks).

One problem with this mode of operation is that robustness checks themselves have many researcher degrees of freedom, so it’s not clear what we can take from these. Just for example, if you do a perturbation-style robustness check and you find a result in the same direction but not statistically significant (or, as the saying goes, “not quite” statistically significant), you can call it a success because it’s in the right direction and, if anything, it makes you feel even better that the main analysis, which you chose, succeeded. But if you do a placebo-style robustness check and you find a result in the same direction but not statistically significant, you can just call it a zero and claim success in that way.
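To put a number on that asymmetry, here is a toy simulation (all quantities are illustrative; nothing is from the paper): suppose a robustness check re-estimates a real effect with modest power, say a true z-score of 1.5. A large share of such re-analyses land in exactly the ambiguous zone described above: right direction, but p > 0.05.

```python
import random

random.seed(0)

# Toy simulation: each robustness check yields a z-statistic drawn from
# N(1.5, 1), i.e., a real effect estimated with modest power. How often
# does the check land in the zone "right direction, but not significant"?
N = 100_000
ambiguous = sum(0 < random.gauss(1.5, 1) < 1.96 for _ in range(N)) / N
print(f"right direction but not significant: {ambiguous:.0%}")
```

Roughly 60% of the simulated checks come out “same direction, not significant” — an outcome that can be claimed as a success by a perturbation-style check and as a clean zero by a placebo-style one.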

So I think there’s a problem in that there’s a pressure for researchers to seek, and claim, more certainty and rigor than is typically possible from social science data.

If I’d written this paper, I think I would’ve started with various versions of the figures above, explored the data more, then moved to the regression line, but always going back to the connection between model, data, and substantive theories. But that’s not what I see here: in the paper at hand, there’s the more standard pattern of some theory and exploration motivating a model, then statistical significance is taken as tentative proof, to be shored up with robustness studies, then the result is taken as a stylized fact and it’s story time.

There’s nothing particularly bad about this particular paper, indeed their general conclusions might well be correct (or not). They’re following the rules of social science research and it’s hard to blame them for that. I don’t see this paper as “junk science” in the way of the himmicanes, air rage, or ages-ending-in-9 papers (I guess that’s why it appeared in JAMA, which is maybe a bit more serious-minded than PPNAS or Lancet); rather, it’s a reasonable bit of data exploration that could be better. I’d say that a recognition that it is data exploration could be a first step to encouraging researchers to think more seriously about how best to explore such data. If they really do have direct data on suicide rates of gay people, that would seem like a good place to look, as Elan suggests.

20 Comments

I know it’s hard to answer in the abstract, but what would you suggest as an alternative to the “robustness check” method? I’ve seen papers use this check for model specification or even for a particular measure of a difficult-to-measure quantity like wealth inequality within a group.

Journals should consider: a) a section for data exploration papers so interesting stuff that people spent time on doesn’t have to be mashed into simplistic “this causes that” presentation and b) a section for negative results because sometimes those are interesting and lots of people have spent lots of time on stuff that came out “nope”. Given digital publishing, why not issue “specials” of a and b? Medical journals have at least had case studies that are meant to inform treatment by other doctors and it seems both a and b could have a similar effect for researchers by telling them what has been done in associative but not causative work and what has been done that was “nope”.

I agree with Michael; this is pretty standard, much better than the typical social science paper. Absent a journal of exploratory studies — god knows where that would take us — some conclusion must be stated. Granted, uncertainty. What we desperately need is independent replication. So, back to previous discussions.

Whether this is standard in social science or not, do you think that it in any way proves what it claims? Do you have any concern that there are hundreds of press articles out right now stating with confidence that gay marriage bills saved hundreds of vulnerable teen lives, all because a placebo test came back p = 0.08 rather than p = 0.04?

What’s the harm? Serious question.
I don’t yet see how this leads to much harm. I mean you probably agree it’s unlikely to have increased suicides, might well be null, media should report uncertainty better etc. But I’d like to see the decision analysis that shows publicising it like this was particularly harmful all things considered. I could see it if it had come out the other way, but I’d actually trust the researchers to be good Bayesians and show a bit more care in interpreting and publicising this counter-intuitive result.

I think it is not the authors’ fault that the media does not understand that the carefully phrased “providing empirical evidence for an association between same-sex marriage policies and mental health outcomes” does not refer to actual lives lost (just attempted but unsuccessful suicides) or to a cause-and-effect relationship.

I don’t generally think robustness checks of this sort prove anything other than that X conclusion holds under these alternatives. The extent to which one cherry-picks those examples is a matter of ethics, ability, exposure, and other researcher degrees of freedom. That said, you’re bemoaning the current state of affairs when people are asking how to improve it.

The “press” can seize on anything — that’s not our concern. We should do the best we can with what we’ve got. The problem here is that the best we can do is below Andrew’s standard. I don’t disagree with Andrew, just agree with Michael that I’d like to discuss feasible alternatives for presenting findings.

About the point that the change started happening before the intervention date: that is actually something that has been pretty widely discussed in the literature on the impact of changes in law. For example, if a new law goes into effect on January 1, does that mean that no one in December is changing their behavior? I’d bet that places that decriminalized marijuana in the last round of voting saw changes in behavior right after the election, even if the change itself did not go into effect until later. Police may have changed their behavior in anticipation also. Other times people don’t know about the change until later. In other words, the idea of a “clean date” is not very useful in the social world. Of course, as we have discussed here before, the places that had the early change may have been the places most open to change or where attitudes had changed the most, and this can explain some of it as well. As a researcher dealing with the media and policy makers, your job is to spell out all these caveats.
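One standard way to accommodate this is to add a “lead” indicator for the years just before adoption, alongside the usual post-law dummy. Here is a minimal sketch on made-up state-year panel data (the two-year lead window, the effect sizes, and every number here are hypothetical, not from the JAMA analysis):

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up state-year panel: the outcome drops 1.0 point in the two years
# BEFORE the law takes effect (anticipation) and 2.0 points after it.
n_states, n_years, adopt_year = 30, 10, 6
rows = []
for s in range(n_states):
    state_fx = rng.normal(0, 2)  # time-invariant state-level difference
    for t in range(n_years):
        lead = 1.0 if adopt_year - 2 <= t < adopt_year else 0.0
        post = 1.0 if t >= adopt_year else 0.0
        y = 28.0 + state_fx - 1.0 * lead - 2.0 * post + rng.normal(0, 0.5)
        rows.append((s, t, lead, post, y))
data = np.array(rows)

# OLS with the anticipation "lead" dummy next to the post-law dummy,
# plus state fixed effects (state 0 is the baseline).
X = np.column_stack([
    np.ones(len(data)),                                      # intercept
    data[:, 2],                                              # 2-year lead
    data[:, 3],                                              # post-law
    (data[:, 0:1] == np.arange(1, n_states)).astype(float),  # state dummies
])
beta, *_ = np.linalg.lstsq(X, data[:, 4], rcond=None)
print(f"anticipation effect: {beta[1]:.2f}, post-law effect: {beta[2]:.2f}")
```

If the lead coefficient comes back large, the “clean date” assumption is doing real work, and a simple before/after comparison will misattribute part of the change.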

Yes, robustness checks: what else does one do? Add more comparisons, try different models or time frames, increase the length of follow-up. The idea that people shouldn’t be pushing against their results is just strange to me. You try to think of every argument that a reviewer or reader will make (wrong specification, it’s all MA) and to respond to the ones they have made. I would think that the big concerns are the assumption that state-level differences are time-invariant, that they didn’t model legislative versus court changes, and that they didn’t use a hierarchical model; that’s what I would have asked about if I were a reviewer, and it would not have been fishing for them to try it.

A Bayesian mixture model of N different plausible likelihoods with priors over the different models. At its heart, Bayes is one big robustness check… (i.e., if you move in this direction in explanation space, does it explain better (higher posterior) or worse (lower posterior)?)
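A minimal sketch of that idea on made-up data: two fixed candidate likelihoods with equal prior weight, where the posterior model probabilities fall straight out of Bayes’ rule. The data-generating mean, the candidate models, and all numbers here are hypothetical.

```python
import random
from math import exp, log, pi

random.seed(2)

def loglik_normal(data, mu, sigma=1.0):
    """Log-likelihood of the data under a fixed N(mu, sigma^2) model."""
    return sum(-0.5 * log(2 * pi * sigma**2) - (x - mu)**2 / (2 * sigma**2)
               for x in data)

# Made-up data from N(0.8, 1); two fixed candidate models, equal priors.
data = [random.gauss(0.8, 1) for _ in range(200)]
models = {"no effect (mu=0)": 0.0, "effect (mu=1)": 1.0}

# With no free parameters, each model's marginal likelihood is just its
# likelihood, so posterior model weights follow directly from Bayes' rule.
logliks = {name: loglik_normal(data, mu) for name, mu in models.items()}
m = max(logliks.values())            # subtract the max for numerical stability
weights = {name: exp(ll - m) for name, ll in logliks.items()}
total = sum(weights.values())
posterior = {name: w / total for name, w in weights.items()}
for name, prob in posterior.items():
    print(f"P({name} | data) = {prob:.3f}")
```

In a real analysis each candidate model would have its own parameters, and the weights would use marginal likelihoods (or stacking) rather than plug-in likelihoods, but the “robustness as posterior comparison” logic is the same.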