Nevertheless, she persisted

by Danielle Navarro, 16 Jan 2019

[Author note 3] I’ve added and removed this post so many times now, and it upsets me that I feel as scared as I do about this topic. What I’ve realised is that the reason I’m so scared is that the original essay included some (I think mild) criticisms of Jesse Singal’s reporting on transgender issues. Like many transgender people, I’m not a fan, and I really wish he would just stop writing about us at all. Regardless of anything else, at this point the well is so thoroughly poisoned that all he’s really accomplishing by continuing to act in this way is causing more pain. But I can’t stop him - and apparently I can’t seem to convince my colleagues that there is something amiss here. I’m disappointed, admittedly, but I’m so tired of trying to explain, so I’m giving up. I’ve removed that part of the essay - this is now solely a discussion of the Steensma et al (2013) paper.

[Author note 2] Ah, here I am digging this essay out again in January 2019. Sigh. Some things never change, do they?

[Author note 1] This essay is one I wrote a long time ago, sometime around March 2018 I think, though I’ve edited it to reflect a few recent updates. I was prompted to dig it out this morning when I saw this cross my twitter feed:

Important: Dutch researcher Thomas Steensma has come forward to say his study on trans kids is being misused and was never designed to measure desistance in the first place. This is the same study that Jesse Singal says he “misinterpreted” back in March. https://t.co/AQK7of6y0K

As a scientist, I genuinely feel sorry for Steensma here. The actual paper itself is very clear about the intended scope of the work, and it’s been consistently misinterpreted in the mainstream press. As a transgender person, though, I want to say thank you to him for taking the time to clarify. He shouldn’t have needed to do so, but it’s still important that he did!

The essay

This paper by Steensma et al (2013) seems to have been the source of some rather intense public discussion.

There’s been quite a lot written on this paper, and I was quite curious to know what it was about. From what I could work out from the media coverage, the paper investigated the incidence of “desistence” among children and adolescents with gender dysphoria, and was (and still is??) the largest such study to date. It featured prominently in this article by Jesse Singal published in mid-2016, though just the other day he published this piece saying that the paper was confusing and that he’d misunderstood what it was saying. The paper is defended by James Cantor here, who argues that it provides strong evidence that most children with gender dysphoria turn out not to be transgender. It’s also been critiqued fairly extensively. Julia Serano discusses it in this article, in which she points out a number of reasons to be careful about interpreting the results. Brynn Tannehill discusses it here. It seems to pop up all over the place.

What is the paper about?

Like many scientists, I’ve learned not to trust journalists when they attempt to summarise academic research. It’s not that I think they’re malicious, it’s that they’re rarely well versed in scientific and statistical methods, and as a consequence the work is inevitably distorted by the time it gets reported in the media. Based on what I’d read, I was expecting the paper to ask how often it is that kids with gender dysphoria turn out not to be transgender. It doesn’t. In retrospect I shouldn’t have been surprised. After all, it’s in the title:

Factors Associated With Desistence and Persistence of Childhood Gender Dysphoria

Scientific articles aren’t usually supposed to be mystery novels. If the authors had been intending to write a paper measuring the rate of desistence, they would have called it something like this:

Estimates of the Rate of Desistence and Persistence of Childhood Gender Dysphoria

The distinction matters. The paper reports a study measuring the correlates of desistence, not the incidence of desistence. If you try to read it as a study measuring incidence, it does indeed come across as rather incompetent. There are many, many reasons why I would not want to rely on Steensma et al (2013) as an estimate of the frequency with which gender dysphoric children turn out not to be transgender (and for the most part, my thoughts on that point are pretty similar to those made by Julia Serano and Brynn Tannehill). However, the paper doesn’t appear to make any claim about this frequency. On my reading, the authors actually appear to have gone out of their way not to make any such claims, and focus almost entirely on the correlates of persistence. To some extent, I wonder whether people are reading more into the study than the paper itself claims.

What does the paper conclude?

Steensma et al (2013) explicitly state their conclusions in the abstract:

“Intensity of early GD [gender dysphoria] appears to be an important predictor of persistence of GD. Clinical recommendations for the support of children with GD may need to be developed independently for natal boys and for girls, as the presentation of boys and girls with GD is different, and different factors are predictive for the persistence of GD”

There are two claims here. First, the stronger the gender dysphoria symptoms at the time the child is first referred to the clinic, the more likely they are to continue with gender transition and return for further treatment. That’s not entirely surprising, but it’s useful to know.

Second, the claim is that there are sex differences, and that the factors that predict how gender dysphoria unfolds are different for natal boys and natal girls. That’s a little more curious, but it wouldn’t be all that remarkable if true — it would basically amount to a suggestion that we shouldn’t generalise too strongly from trans girls to trans boys and vice versa. That being said, after reading the paper I have some quibbles about their analysis and I am not convinced that this claim is supported by their data.

Again… you’ll notice that the conclusions do not include any claims about the “true rate” with which gender dysphoria resolves spontaneously without transition. I don’t think that’s an accident. My guess is that the authors know perfectly well that their data don’t have a lot to say on that topic, and have been appropriately circumspect.

Curiously, there’s something that does pop up in their discussion that is absent from the abstract — after doing quite a lot of analyses, the single best predictor of whether a child persists with gender transition is simply the child’s beliefs about themselves. Gender identity, it turns out, is the best predictor of gender transition. Weird, huh?

Anyway, let’s take a look at the paper in some detail.

The sample

Any time you do applied research — where you don’t have the luxury of nice experimental control — you are going to run into issues with how people get selected into a sample. Real life is messy. Here’s what Steensma et al did to obtain their sample:

Between 2000 and 2008, 225 children (144 boys, 81 girls) were consecutively referred to the clinic. From this sample, 127 adolescents were selected who were 15 years of age or older during the 4-year period of follow-up between 2008 and 2012. Of these adolescents, 47 adolescents (37%, 23 boys, 24 girls) were identified as persisters. They reapplied to the clinic in adolescence, requested medical treatment, were diagnosed again with GID, and considered eligible for treatment

To be eligible for inclusion, a child had to be referred to the clinic between 2000 and 2008, and old enough at the time of follow-up between 2008 and 2012. That seems like a reasonable practical decision, and while I can certainly think of a few ways this would produce weirdness in the data (e.g., there will be a spurious correlation between age and cohort induced by this selection into the follow-up group), from the perspective of the stated goals of the study (i.e., finding predictors of which children will persist with transition) it doesn’t seem especially problematic. I’m loath to criticise researchers for doing the best they can given the constraints on applied research.

Measuring persistence

The second thing they mention in this passage is the operational definition of persistence they use. Specifically, a child is classified as a persister if they reapplied to the clinic, were again diagnosed with gender dysphoria, and so on; 47 of the 127 children met this definition. At this point, it’s hugely important to remember the stated goal of the study:

If the goal of the study was to estimate the rate with which gender dysphoria resolves on its own, this is a very dangerous error that will lead to a massive overestimate. Very few people persist with transition if they aren’t actually transgender (there are some: detransitioners do exist), so the false positive rate is very low. However, people might “desist” in the sense of not returning to the clinic for many reasons. It might be that the dysphoria has resolved harmlessly, but it might also be that living as a transgender person in a hostile world is a pretty scary prospect and they decided to live with the gender dysphoria. It could be something else too. In other words, the false negative rate here is likely to be much higher. Using “does not return to the clinic” as a proxy for “not really transgender” is not a good idea.
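To make that asymmetry concrete, here is a back-of-envelope sketch in Python. Every number in it is invented for illustration (none come from the paper), but it shows how a proxy with a low false positive rate and a high false negative rate inflates the apparent desistance rate:

```python
# All numbers here are invented for illustration; none come from the paper.
n_total = 127        # children referred to the clinic
n_truly_trans = 70   # hypothetical number who are "really" transgender

# Proxy: "did not return to the clinic" is read as "desisted".
# Assume a low false positive rate (few non-trans kids pursue transition)
# and a sizeable false negative rate (trans kids who never come back,
# e.g. because transitioning in a hostile world is a scary prospect).
p_return_if_trans = 0.65
p_return_if_not_trans = 0.05

n_return = n_truly_trans * p_return_if_trans \
           + (n_total - n_truly_trans) * p_return_if_not_trans

true_not_trans_rate = 1 - n_truly_trans / n_total   # ~45%
apparent_desistance = 1 - n_return / n_total        # ~62%

print(f"true 'not trans' rate:    {true_not_trans_rate:.0%}")
print(f"apparent desistance rate: {apparent_desistance:.0%}")
```

With these made-up inputs the “did not return” proxy overstates the true rate by a wide margin, and the gap only grows as the false negative rate goes up.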

If the goal of the study is to determine the factors that predict whether a child will return to the clinic, as per the authors’ description of their study, then this is a perfectly sensible measure of persistence, because that’s the actual thing you’re trying to predict. In other words, the measure of persistence in Steensma et al (2013) is perfectly appropriate to address the stated goals of the study. It’s just poorly suited to the goals that other people have projected onto the research.

That being said… the authors do not do themselves any favours in this part of the paper. Immediately after defining the measure of persistence, they write the following:

As the Amsterdam clinic is the only gender identity service in the Netherlands where psychological and medical treatment is offered to adolescents with GD, we assumed that for the 80 adolescents (56 boys and 24 girls), who did not return to the clinic, that their GD had desisted, and that they no longer had a desire for gender reassignment (emphasis added)

Hmm… I’m not sure about this. As other people have argued, that’s not a reasonable assumption. If we were talking about getting antibiotics to treat an infection, and the Amsterdam clinic were the only place you could get it, this might be reasonable. However, we’re talking about gender dysphoria, and the context is rather different. The world is rather hostile to transgender people, and there are serious downsides to transitioning. At most, all you can say is that it seems likely that those 80 adolescents decided not to transition at this time (i.e., desisted); you cannot reasonably infer that they no longer had a desire to do so. If some of these kids transition 20 years later, then yes they are certainly “desisters” in the sense used by Steensma et al, but it does not seem reasonable to assume that they no longer had any issues with gender dysphoria. As with a lot of my comments about the paper, this isn’t exactly a criticism of what the authors did so much as a caution against overinterpreting the results, though I do feel the wording of this particular passage was a poor choice.

Measuring childhood dysphoria

The paper reports several measures of the level of gender dysphoria for each child when they were first referred to the clinic, the extent to which the child had begun social transition, and a variety of demographic measures. Because these measures were all taken at the time the children were first referred to the clinic, data are available for all 127 children. I’ve only just started reading this literature, so I don’t have a lot of comments on the measures themselves, but on superficial inspection they seem perfectly ordinary psychometric instruments and clinical assessment measures. I’m sure that there are some problematic things in there somewhere, but behavioural research is always tricky to do well and I’m always sympathetic to researchers trying to use the tools they have.

Measuring adolescent dysphoria

Because this is a longitudinal study, Steensma et al (2013) attempted to follow up with all 127 participants in adolescence, even those who had not returned to the clinic. Here’s the relevant passage:

a set of questionnaires, assessing information on current GD, body image, and sexual orientation was mailed. All 47 persisters participated in the study. Of the 80 desisters, 46 adolescents sent back the [questionnaires]

Not surprisingly, all 47 people who were still attending the clinic and pursuing transition filled out the questionnaire. Of the 80 people who had not returned to the clinic, 46 returned the questionnaire, and those who did reported few symptoms of dysphoria. Again, not surprising — the “desisters” were the kids who arrived at the clinic showing comparatively low levels of dysphoria and (as I’ll comment on later) hadn’t actually started social transition in any sense. As Table 2 illustrates, they had low levels of dysphoria in childhood, and lo and behold they “desisted”, and when they returned their questionnaires in adolescence (see Table 4) they still turned out to have low levels of gender dysphoria. Remarkable!

Anyway, it should be noted that care is required with respect to the adolescent data. A response rate of 57.5% isn’t all that bad, but it’s hard to know how to interpret the follow-up data. It beggars belief to presume that the data are missing at random, so if you wanted to make strong claims about “desisters” in general, then you cannot ignore the non-responders or make simplistic assumptions about them. My recollection is that people with depression are far less likely to respond to surveys, so if there were adolescents with high levels of dysphoria among the desisters, they’re very likely to also be nonresponders (i.e., if you’re too miserable to bother following up with the clinic, you’re probably also too miserable to bother filling out the survey). As it happens, there is a statistical literature on non-ignorable non-response models, but you probably wouldn’t be all that surprised to learn that this is a hard problem and the authors didn’t go down that path!
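The dysphoria-predicts-nonresponse worry can be illustrated with a deterministic toy calculation. All the numbers below are invented (nothing here comes from the paper); the point is just that when the probability of returning the questionnaire drops with dysphoria, the responders’ average understates the group average:

```python
# All numbers here are invented for illustration; none come from the paper.
# Two hypothetical subgroups of "desisters": a majority with low dysphoria
# who usually return the questionnaire, and a minority with high dysphoria
# who usually don't.
groups = [
    {"n": 60, "dysphoria": 2.0, "p_respond": 0.8},
    {"n": 20, "dysphoria": 7.0, "p_respond": 0.3},
]

# Mean dysphoria across all desisters (what we'd like to know)
true_mean = (sum(g["n"] * g["dysphoria"] for g in groups)
             / sum(g["n"] for g in groups))

# Mean dysphoria among responders only (what the survey actually sees),
# weighting each subgroup by its expected number of responders
weights = [g["n"] * g["p_respond"] for g in groups]
observed_mean = (sum(w * g["dysphoria"] for w, g in zip(weights, groups))
                 / sum(weights))

print(f"all desisters:   mean dysphoria {true_mean:.2f}")
print(f"responders only: mean dysphoria {observed_mean:.2f}")
```

The miserable subgroup is underrepresented among responders, so the observed mean drifts downward; that is the non-ignorable non-response problem in miniature.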

In any case, again I feel the obligation to defend Steensma et al. Remember, they aren’t writing a paper about the “true rate of desistance” and the paper doesn’t make strong claims based on the adolescent data. They’re writing a paper on which variables predict whether a child referred to a gender identity clinic will return for further treatment. For the question as stated, this issue is largely irrelevant. It’s only because people seem to persist (if you’ll pardon the pun) in treating this as a paper about something else entirely that this issue of the nonresponders keeps coming up.

Just to highlight the fact that this paper has been consistently misinterpreted by many people, here is Steensma himself:

Steensma stands by the study’s methodology. But interestingly, he added that citing these findings as a measure of desistance is wrongheaded, because the study was never designed with that goal in mind.

“Providing these [desistance] numbers will only lead to wrong conclusions,” he said.

Rather, he says, the researchers wanted to see if they could find predictors of persistence. Which they did: The study found that transgender children who were older, born female, and reported more intense gender dysphoria were more likely to stick with their transgender identity than younger children, natal boys and those with less pronounced gender dysphoric traits.

I don’t know what more he can say at this point. The paper was perfectly clear back in 2013. Steensma is perfectly clear about it in 2018. This paper cannot be used to estimate the rates of desistence. It never could be.

Who pursues transition?

Right. Sorry about that … let’s get back to the paper.

As I’ve mentioned numerous times already, this paper does not focus on the rate of desistence. However, it does report descriptive statistics (in Table 1) for all 127 participants based on the childhood data, broken down by persistence group.

A quick look at the table makes it very clear where the analysis is going — those kids who arrived at the clinic partly or completely transitioned usually persisted, whereas those who hadn’t started social transition at the time of referral tended not to “persist” with transition later. Not entirely surprising, I suppose. Anyway, the table organises the data in a way that obscures the persistence rate — again, not unreasonably, because that’s explicitly not the point of the paper! — but it’s not too hard to convert the percentages to frequencies and then work out the numbers. A quick calculation suggests there must have been 12 natal boys who came in partially or fully socially transitioned to living as a girl, and 10 of those persisted (a persistence rate of 83%, or if you really must frame it this way, a desistence rate of 17%). There were 25 natal girls who came in at least partly transitioned to living as a boy, and of those 14 persisted (56% persistence). In contrast, among those children who had not taken any steps towards transitioning when they first arrived, most did not later transition! (Shocking stuff, right?) For the natal boys, there were (.565 x 23) + (.964 x 56) = 67 who came in with no transitioning, and among those only 13 persisted (19% persistence). For natal girls, 23 came in with no transition steps taken, and 10 of those persisted (44% persistence).
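If you’d like to check that arithmetic, here’s the same back-calculation in code, with the counts reconstructed from the Table 1 percentages exactly as described above:

```python
# Natal boys: 23 persisters and 56 desisters overall (Table 1). Per the
# back-calculation above, .565 of the persisters and .964 of the desisters
# arrived with no social-transition steps taken.
boys_persist, boys_desist = 23, 56
persist_no_steps = round(0.565 * boys_persist)  # 13
desist_no_steps = round(0.964 * boys_desist)    # 54

boys_no_steps = persist_no_steps + desist_no_steps              # 67
boys_transitioned = ((boys_persist - persist_no_steps)
                     + (boys_desist - desist_no_steps))         # 12

rate_no_steps = persist_no_steps / boys_no_steps                # ~19%
rate_transitioned = (boys_persist - persist_no_steps) / boys_transitioned  # ~83%

# Natal girls, straight from the counts in the text: 25 arrived at least
# partly transitioned (14 persisted); 23 arrived with no steps (10 persisted)
girls_rate_transitioned = 14 / 25   # 56%
girls_rate_no_steps = 10 / 23       # ~44%

print(f"boys, transitioned at intake: {rate_transitioned:.0%} persisted")
print(f"boys, no steps at intake:     {rate_no_steps:.0%} persisted")
```

Nothing fancy, but it makes the structure of the table hard to miss: children who arrived already transitioning mostly persisted, and children who arrived without any transition steps mostly did not.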

As a gender diverse person, none of this seems remarkable to me. There’s a big difference between wanting to transition (…because you are trans) and being uncomfortable with one’s assigned gender (…because gender roles are pretty stupid). Kids who are referred to the clinic after they start socially transitioning are much more likely to be transgender, so they “persist”. Kids who are referred to the clinic because they aren’t trans but have difficulties with gender roles aren’t likely to have started transitioning, nor are they likely to come back to the clinic to pursue transition. Similarly, those people will likely have fewer symptoms of dysphoria, and… well, you get the idea, right? Pretty sure this is exactly what you’d expect to see?

Some results

Okay, next up we’ll take a look at Table 2. Because the outcome variable (persistence) is binary valued (i.e., they return to the clinic or they don’t), the analyses are based on logistic regression. Table 2 examines each of the childhood measures (i.e., at time of referral) for all 127 children separately. Natal boys were more likely to desist, with an odds ratio of .41. There’s actually a useful point to make about this calculation, which you can reproduce from Table 1.

natal boys persisting = 23

natal boys desisting = 56

So the odds of persisting for natal boys is 23 / 56 = .41. For the natal girls the numbers are

natal girls persisting = 24

natal girls desisting = 24

and the odds here are 24 / 24 = 1.0. Putting these two together gives you the odds ratio of .41 / 1.0 = .41. This turns out to be a statistically significant effect.
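If you want to reproduce that number yourself, the calculation is a one-liner, and (as a small extension of my own, not something reported in the paper) you can tack on a normal-approximation Woolf confidence interval for the odds ratio:

```python
import math

# Persistence counts by natal sex, back-calculated from Table 1
boys_persist, boys_desist = 23, 56
girls_persist, girls_desist = 24, 24

odds_ratio = (boys_persist / boys_desist) / (girls_persist / girls_desist)  # ~0.41

# Woolf 95% CI on the log odds ratio -- my own addition for illustration,
# not an interval reported in the paper
se = math.sqrt(1 / boys_persist + 1 / boys_desist
               + 1 / girls_persist + 1 / girls_desist)
lo = math.exp(math.log(odds_ratio) - 1.96 * se)
hi = math.exp(math.log(odds_ratio) + 1.96 * se)

print(f"OR = {odds_ratio:.2f}, 95% CI ({lo:.2f}, {hi:.2f})")
# the interval sits entirely below 1, consistent with the significant
# effect reported in the paper
```

The interval excluding 1 is just the normal-approximation version of the significance claim in Table 2.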

It’s hard to know how to interpret it though.

On the dangers of naive interpretation

Suppose it really were the case that all persisters are genuinely transgender, and all desisters aren’t (doubtful, but bear with me). That would imply that there were “really” 23 trans girls (natal boys) and 24 trans boys in the initial sample. If the intake at the clinic does not impose any particular bias, this suggests that “actually being transgender” has roughly the same incidence rate in males and females. Yet there were many more natal boys being referred to the clinic in the first place: there were 56 desisters among the natal boys and only 24 among natal girls. Under our very dubious assumption that none of the desisters are transgender, that would suggest that the biggest issue here is that there are “too many” natal boys being referred to the clinic.

(Again, I really cannot stress enough that this is not a wise way to look at the data; I’m going way beyond the scope of the paper and making some silly assumptions, in order to highlight how things can go awry if you’re not careful)

So why is there this gender imbalance in the data? There are a number of possibilities. One possible explanation would be that masculinity in girls is seen as more socially acceptable than femininity in boys, and as a result we observe more gender non-conforming boys being perceived as in need of “treatment”. I don’t think anything in life is this simple, but the fact that the intake data are so asymmetric – and yet the persistence data are so even – does require some explanation.

Where things get weird is if you start trying to push this line of reasoning too far. The study isn’t really designed to investigate this kind of thing. If you do push this line of thought to its end point, what you would have to conclude is that while “being trans” has roughly the same rates in both sexes, it is feminine boys (and not masculine girls) who are being disproportionately referred to gender identity clinics incorrectly. Suffice it to say, I don’t think that many of the folks criticising the current standards for treating gender dysphoria would be particularly pleased at this conclusion, but strictly speaking that is what follows if you try to assume that “desistence” is a good proxy for “not actually trans” when interpreting these data.

Perhaps it’s best if we all agree not to do that?

A statistical complaint

Okay, earlier when I said I don’t have a critique of the paper itself, I lied. I do have a criticism.

I think their multivariate analysis reported in Table 3 is wrong in a way that potentially undermines one of their conclusions. The basic idea is sensible: they took the variables with strong relationships in the initial analysis (Table 2) as the basis for a multivariate logistic regression, and found that several of them made unique contributions to the prediction equation (left column of Table 3). The resulting fitted model accounts for 58% of the variance in the outcome. It’s a moderately successful regression model, and so far it all looks good.

My gripe with this analysis is that they fall for a common fallacy: as Andrew Gelman and Hal Stern grumbled back in 2006, the difference between significance and non-significance is not necessarily significant. In their general conclusions Steensma et al (2013) argue that the predictors of desistence for natal boys and natal girls are different, but they never actually test that hypothesis. What they do instead is run two separate regressions, one for natal boys and another for natal girls, then look at which variables came out significant in each case. That’s NOT a test of group difference, and if you look at Table 3 properly you can see that for the most part there aren’t any meaningful differences. For instance, “age at intake” is a significant predictor for the natal boys but not natal girls. Aha, a difference!

Yeah, no. If you look at the point estimate of the odds ratio in Table 3, it’s 1.90 for natal boys and an almost identical 1.98 for natal girls. I feel very confident that this is not a significant difference between groups 😀 — what’s happened here is that the effect hovers on the borderline of statistical significance for both groups, but the sample size is larger for natal boys. As a consequence, the corresponding confidence interval is narrower in that regression. The end result is that one group shows a significant effect due to increased statistical power, and the other doesn’t. There is no evidence of differential predictive ability here.

Across the board, the confidence intervals for the natal boy odds ratio and the natal girl odds ratio overlap considerably. Without access to the raw data it’s very hard to know if there really is a difference here — though I should note in passing that there are a couple of those predictors where I’d guess there actually is a real difference, but the paper doesn’t report any analysis to that effect.
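For what it’s worth, there is a standard back-of-envelope way to test this sort of difference from published numbers alone: recover each standard error from the reported confidence interval, then compute a z statistic for the difference of the two log odds ratios. Here’s a sketch using the 1.90 and 1.98 point estimates mentioned above, but with placeholder confidence intervals that I’ve made up for illustration, since I’m not copying the paper’s actual values:

```python
import math

def se_from_ci(lo, hi, z=1.96):
    """Recover the SE of a log odds ratio from its 95% CI."""
    return (math.log(hi) - math.log(lo)) / (2 * z)

def diff_z(or1, ci1, or2, ci2):
    """z statistic for the difference of two independent log odds ratios."""
    se1, se2 = se_from_ci(*ci1), se_from_ci(*ci2)
    return (math.log(or1) - math.log(or2)) / math.sqrt(se1 ** 2 + se2 ** 2)

# Point estimates are the 1.90 (natal boys) and 1.98 (natal girls) from
# Table 3; the CIs are PLACEHOLDERS, not the paper's actual intervals.
z = diff_z(1.90, (1.02, 3.54), 1.98, (0.95, 4.13))
print(f"z = {z:.2f}")  # far short of the 1.96 needed for significance
```

With point estimates that close together, almost any plausible pair of intervals gives a z statistic near zero, which is exactly the Gelman and Stern point: one regression clearing the significance bar while the other misses it is not evidence that the groups differ.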

This is a bit of a problem, because if we go back to the authors’ main conclusions, the second one was this:

Clinical recommendations for the support of children with GD may need to be developed independently for natal boys and for girls, as the presentation of boys and girls with GD is different, and different factors are predictive for the persistence of GD

They might actually be right about this, but as far as I can tell they haven’t run the analysis to demonstrate it. Still, it’s an easy one to check. Just re-run the regression on the full data set and include the appropriate interaction terms.

What about the follow-up questionnaires?

Oh right. I almost forgot. Remember how people are focused on the mailout questionnaires sent at follow-up for some reason? The authors do talk about them a bit on page 587, but there’s not much of interest to say. People who returned to the clinic later on (persisters) have more gender dysphoria than those who didn’t return but were still sufficiently motivated to return the survey. Well, yeah: those people didn’t have high levels of gender dysphoria to start with, as the childhood data in Table 1 illustrate.

But regardless, that’s pretty hard to interpret in detail because we don’t know much about why the desisters didn’t come back, and why the nonresponders didn’t respond. Commendably, the authors don’t really make a big deal out of this - it’s not especially informative, and they don’t treat it as such.

So what were the successful predictors?

Looking over the results and discussion, it seems to me the most interesting point in the paper is slightly buried. At the childhood intake, there were quite a lot of different measurements taken. So which of these measures turns out to be the best predictor of whether a child subsequently persists with transition? It isn’t sex, or degree of transition, or whatever. It’s the cognitive subscale from the GIIC (Gender Identity Interview for Children). I don’t know anything about the psychometric properties of this instrument nor what the questions are, so I’ll just defer to Steensma et al (2013) here, who write:

Persisters indicated that they believed that they were the “other” sex, and the desisters indicated they wished they were the “other” sex; this difference may also underlie our finding of a higher report of cognitive cross-gender identification in the persisters than in the desisters (emphasis added)

If I’m reading this correctly (I might not be, because I don’t know the GIIC), it sounds rather like they’re arguing that self-reported gender identity turns out to be the best predictor of whether someone actually transitions. That’s more or less exactly what adult transgender people say about the distinction between gender identity and discomfort with gender roles. I can’t say I’m surprised, but it’s terribly nice to see the data backing up that intuition.