The 99.9999%: more thoughts on stats in the autism sequencing paper

Yesterday I got incensed about a quote in a story in the NYT from a prominent autism researcher about the significance of findings in their recent paper (which described the sequencing of protein-coding genes from autistic individuals, their parents and siblings).

The statement that so offended me, from the lead author of the paper, was that he was 99.9999% sure that a gene identified in his study plays a role in causing autism. It’s a ridiculous assertion, completely at odds with what his group says in the paper. It needs to be corrected.

However, my original post also included some statistical analyses that were based on a cursory reading of their work, and, as a result, didn’t directly speak to the central claims of the paper. My basic critique is unchanged – the 99.9999% claim is insupportable – but this got me interested in the fine details of their results and analysis, and I thought it would be useful to post my thoughts here.

Let me also say at the outset that, while I didn’t like what was said in the NYT, and don’t agree with everything the authors say in the paper, I am not trying to take them to task here. Analyzing this kind of data is difficult, and there are all sorts of complexities to deal with.

First, a summary of what they did and found. The core data are sequences of the “exomes” (essentially the protein-coding portion of the genome) from 238 children with autism spectrum disorders (ASD), both of their unaffected parents, and (in 200 cases) an unaffected sibling.

The analyses they present focus on the families with data for unaffected siblings, enabling them to compare the transmission of inherited variants (those present in the parents) and de novo mutations (those not present in the parents) between affected and unaffected siblings. They observed no differences in the transmission of inherited variants between affected and unaffected individuals, but found a significant increase in the number of de novo mutations that change proteins in affected individuals. Specifically, there were 125 de novo, protein-altering (missense, nonsense or splicing) mutations in affected individuals compared to 87 in siblings, a significant difference (p=.01).
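As a rough sanity check on that p-value, here is a minimal sketch of a two-sided exact binomial (sign) test. It assumes that, under the null, each of the 212 mutations is equally likely to fall in the affected child or the sibling. This is my own simplification and not necessarily the test the authors used; the function name is made up for illustration.

```python
from math import comb

def two_sided_binomial_p(k, n):
    """Two-sided exact p-value for k 'successes' in n fair coin flips (assumed null)."""
    tail = max(k, n - k)
    # Probability of a split at least this lopsided in one direction
    upper = sum(comb(n, i) for i in range(tail, n + 1)) / 2**n
    # Double it for the two-sided test, capping at 1
    return min(1.0, 2 * upper)

# 125 de novo protein-altering mutations in affected children, 87 in siblings
p_value = two_sided_binomial_p(125, 125 + 87)
print(f"{p_value:.3f}")
```

Under this simple null the result lands in the neighborhood of 0.01, which is at least consistent with the reported figure.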

This observation provides reasonable support for the hypothesis that de novo mutations are associated with autism. But it does not indicate which – if any – of these specific mutations is involved. After all, that there were 87 de novo protein-altering mutations in unaffected siblings suggests that many of those identified in autistic individuals are not involved in the disease. There is also the possibility that the elevated mutation rate is a secondary consequence of some other factor that leads to autism, and that none of these specific mutations are actually directly involved in the disease.

Given that the observation of a de novo mutation in one gene in one affected individual conveys limited information about its involvement in autism, the authors focus on cases where independent mutations were observed in the same gene in different affected individuals. The reasoning is that such an observation would be unlikely to occur by chance in a gene that is not involved in ASD.

This is where I initially mistook what they did. I assumed the quote in the NYT story was referring to the chances of observing the same gene twice amongst the 125 de novo mutations in affected individuals, and pointed out that we actually expect this to happen at least 30% of the time (I say at least because the 30% number comes from assuming random mutations are equally likely to occur in all genes – which is not correct for reasons such as differences in gene size, GC content, etc… – all things the authors factor into their calculations). Indeed, the observation in the paper that two genes are hit twice in the set of 125 is not a statistically significant finding, and, by itself would offer no evidence that these genes are involved in ASD – and the authors do not assert that it does.
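The 30% figure is just the birthday problem. Here is a minimal sketch, assuming roughly 21,000 genes (the count used elsewhere in this post) and that every gene is equally likely to be hit, which, as noted, is a simplification. The function name is mine.

```python
def prob_repeat_uniform(n_mutations, n_genes):
    """P(at least one gene is hit twice) when all genes are equally likely targets."""
    p_all_distinct = 1.0
    for i in range(n_mutations):
        # The (i+1)-th mutation must avoid the i genes already hit
        p_all_distinct *= (n_genes - i) / n_genes
    return 1.0 - p_all_distinct

p_repeat = prob_repeat_uniform(125, 21000)
print(round(p_repeat, 2))  # about 0.31
```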

Instead the authors focused on a small subset of mutations – those that introduce a premature stop codon into a gene (thereby generating a truncated protein, or, because of a process known as nonsense-mediated decay, likely decreasing the expression level) or alter a splice site (potentially affecting the structure of the gene). The numbers here are a lot smaller. In the affected individuals there were ten nonsense and five splicing mutations, while there were only five nonsense and no splicing mutations in the unaffected siblings. And, crucially, in the set of 15 such mutations one gene – SCN2A – appeared twice.

So now the question is, is this a significant observation? Under the simplest of models, if you picked genes 15 times randomly from a set of 21,000 you’d expect to hit at least one gene twice with a probability of around .005 – making it a reasonably significant observation.
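A quick way to see where .005 comes from is to count pairs: 15 mutations form 105 pairs, and under the uniform model each pair shares a gene with probability 1/21,000. The sketch below compares that shortcut with the exact birthday-problem value; the variable names are mine.

```python
from math import comb

N_GENES = 21000     # gene count used in the text
N_MUTATIONS = 15    # nonsense + splicing mutations in affected individuals

# Shortcut: expected number of same-gene pairs, a good approximation when it is small
n_pairs = comb(N_MUTATIONS, 2)
p_approx = n_pairs / N_GENES

# Exact: one minus the probability that all 15 mutations hit distinct genes
p_distinct = 1.0
for i in range(N_MUTATIONS):
    p_distinct *= (N_GENES - i) / N_GENES
p_exact = 1.0 - p_distinct

print(n_pairs, round(p_approx, 4), round(p_exact, 4))  # 105 0.005 0.005
```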

However, this is actually an overestimate of the significance, as differences in gene size, base composition, etc… make it more likely that a random mutation will land on some genes than others, thereby increasing the probability of seeing the same gene twice. The authors did extensive simulations that take this into account, and, restricting their analysis to the 80% of genes expressed in the brain, they conclude the observation of two nonsense/splicing variants in brain expressed genes is significant, at a p-value of 0.008.
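To illustrate why unequal genes raise the chance of a repeat, here is a small sketch that computes the repeat probability exactly for arbitrary per-gene weights. The log-normal weights are purely hypothetical stand-ins for gene size and composition effects. They are not the authors' mutation model, and the resulting number should not be compared to their 0.008.

```python
import random
from math import factorial

def prob_repeat(weights, n_draws):
    """P(some gene is hit at least twice) in n_draws independent weighted draws."""
    total = sum(weights)
    # e[k] accumulates the degree-k elementary symmetric polynomial of the
    # per-gene probabilities; P(all draws distinct) = n_draws! * e[n_draws]
    e = [1.0] + [0.0] * n_draws
    for w in weights:
        p = w / total
        for k in range(n_draws, 0, -1):
            e[k] += p * e[k - 1]
    return 1.0 - factorial(n_draws) * e[n_draws]

N_GENES = 21000
rng = random.Random(0)  # fixed seed so the illustration is repeatable
uniform = [1.0] * N_GENES
skewed = [rng.lognormvariate(0.0, 1.0) for _ in range(N_GENES)]  # hypothetical weights

p_uniform = prob_repeat(uniform, 15)
p_skewed = prob_repeat(skewed, 15)
print(round(p_uniform, 4), round(p_skewed, 4))
```

The uniform case reproduces the roughly .005 figure, and any unevenness in the weights pushes the repeat probability above it.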

However, it is worth noting (from the authors’ Figure S8) that under conservative but reasonable estimates of the de novo mutation rate and number of genes involved in ASD, the degree to which the data implicate SCN2A specifically is weaker, with a q-value (probability that the gene is not involved in ASD under various models) of around 0.03. Again, this may seem a bit counterintuitive: given that their data say it’s significant that they saw the same gene twice, and they found only one such gene, how could that gene not be involved? But one actually has more power to validate the general model that de novo nonsense/splicing mutations are contributing to ASD than to implicate specific genes. This is why State’s assertion in the NYT that SCN2A was 99.9999% likely involved in ASD was pretty egregious – it is simply not consistent with their own data.

There are a few other things to note here.

First, the p-values and q-values they report are not specific to an individual gene – they are the average probability of observing a double hit in non-ASD genes and the average probability that a double-hit gene is not involved in ASD. But SCN2A is relatively large (2000 amino acids), and thus the observation of two mutations in this gene is somewhat weaker evidence for its involvement in ASD than it would be for a smaller gene. I haven’t done a full simulation, but given that SCN2A is 4-5x larger than average, it should be on the order of 20x more likely to be doubly hit by chance than a typical gene (the chance of two independent hits scales roughly with the square of gene size), and thus the average q-value reported is an underestimate. It would be easy, using the simulations the authors already have on hand, to ask what the false-discovery rate is when the doubly hit gene is 2000 amino acids or longer. I suspect it would no longer be significant.
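The 20x estimate follows from the fact that the chance of two independent hits in one gene grows roughly with the square of its share of the mutational target. Here is a back-of-the-envelope sketch, assuming (my simplification) that a gene's chance of being hit is proportional to its size and taking 4.5x as the midpoint of the 4-5x range:

```python
from math import comb

def prob_gene_hit_twice(p_gene, n_mutations):
    """P(one specific gene receives at least 2 of n independent mutations)."""
    return sum(
        comb(n_mutations, k) * p_gene**k * (1 - p_gene) ** (n_mutations - k)
        for k in range(2, n_mutations + 1)
    )

N_GENES = 21000
N_MUTATIONS = 15
p_typical = 1 / N_GENES   # average-sized gene
p_large = 4.5 / N_GENES   # assumed: 4.5x the average target size

ratio = prob_gene_hit_twice(p_large, N_MUTATIONS) / prob_gene_hit_twice(p_typical, N_MUTATIONS)
print(round(ratio, 1))  # close to 4.5**2
```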

The model also fails to consider the possibility that such fairly significant mutations in many genes might be lethal, and thus would never be observed. Hard to get a great estimate of what fraction of genes this might be, and the number is probably small given that they’ll almost all be heterozygous, but, again, given that the observations are only marginally significant, this possibility seems worth considering.

Finally, the more I read the paper, the more uncomfortable I grew with the way that the paper moved back and forth from non-synonymous to nonsense/splicing mutations, depending on where they got statistical significance. They start out by arguing that there is a significant increase in the number of de novo non-synonymous mutations in ASD-affected individuals. They get statistical significance here because there are a relatively large number of such mutations. They then look for cases where the same gene was hit twice, and find two. But this is not a significant observation – it fails to distinguish the possibility that a subset of ASD-involved genes were being hit from the null model of genes being hit randomly. However, for one of these pairs they noticed that there were two nonsense mutations. There wasn’t a significant enrichment of de novo nonsense mutations in cases (10) vs controls (5), so they added in the five splicing mutations from cases (there were none in controls) and got a marginally significant enrichment (p=.02). Then they looked at how likely it would be to find the same gene hit twice by nonsense/splicing mutations, and got a marginally significant result.
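To see how much difference adding the five splicing mutations makes, here is a sketch using a one-sided exact binomial test, assuming that under the null each mutation is equally likely to land in a case or a control. I don't know that this is the test the authors used, but it happens to reproduce the reported p=.02.

```python
from math import comb

def one_sided_binomial_p(cases, controls):
    """P(at least `cases` of the mutations fall in cases) under a 50/50 null."""
    n = cases + controls
    return sum(comb(n, k) for k in range(cases, n + 1)) / 2**n

p_nonsense = one_sided_binomial_p(10, 5)   # nonsense only: 10 vs 5
p_combined = one_sided_binomial_p(15, 5)   # nonsense + splicing: 15 vs 5

print(round(p_nonsense, 3), round(p_combined, 3))  # 0.151 0.021
```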

It’s possible to justify this path of analysis from first principles, as nonsense/splicing mutations are different from missense mutations – and maybe this was part of the analysis design from the beginning. But the way the paper was set out, it felt that they were hunting for significant p-values – which is a big no-no. What if they had observed that highly conserved amino acids in some gene had been hit by the same missense mutation in two families? Would they have pursued this result and evaluated its significance? This is a crucial question, because if they pursued the nonsense observation simply because it was what they observed, then their statistics are wrong, as they need to be corrected for all the other possible double-hit leads they would have pursued. This is not a subtle effect either – such a correction would almost certainly render the results insignificant.
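To give a feel for the size of that correction, here is a minimal sketch of how the chance of at least one "hit" grows with the number of analyses one would have been willing to pursue. It takes the paper's p-value of 0.008 and assumes the analyses are independent; the counts of alternative analyses are hypothetical, since nobody outside the group knows the real number.

```python
def familywise_p(p_single, n_analyses):
    """Chance that at least one of n independent null analyses reaches p_single."""
    return 1.0 - (1.0 - p_single) ** n_analyses

P_REPORTED = 0.008  # double-hit p-value reported in the paper

for n in (1, 5, 10, 20):  # hypothetical numbers of leads that might have been pursued
    print(n, round(familywise_p(P_REPORTED, n), 3))
```

With ten or so plausible alternative leads, the corrected value is already above the conventional 0.05 threshold.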

I don’t know the details of how this experiment was planned. Maybe they always intended to do this exact analysis in the first place, and thus it’s completely kosher. But the scientific literature is filled with post facto statistical analyses, in which people do an experiment, make an observation, and then evaluate the significance of this observation under the assumption that this was always what they were looking for in the first place.

It’s sort of like how, in baseball broadcasts, the announcers are always saying things like “This batter has gotten hits in his first at bat in 20 straight games played on Sunday afternoon”. They say this because it sounds so improbable – and in some sense it is, as this specific streak is, indeed, improbable. But if you consider all the possible ways you can slice and dice a player’s past performance, it is inevitable that there would be some kind of anomaly like this – rendering it statistically uninteresting.

I’m not saying that something this extreme happened in this autism paper. But the way the data were presented in the paper definitely made it seem like they were looking for a statistically significant observation on which to sell their paper (to Nature and to the public).

And it’s a shame – the data in the paper are cool. But does it really make sense to make such a big deal out of what is, at best, a single marginally significant observation? What if they hadn’t chosen one of those two families for their study? Would the result be uninteresting? Of course not.

In the end, what this paper should have said was, we generated a lot of cool data, we found some evidence that de novo mutations are enriched in kids with ASD relative to their siblings, but we need more data – a lot of it – to really figure out what’s going on here. Unfortunately, in the world we live in, this would have been dismissed as kind of boring, and likely not worthy of a Nature paper (although far less interesting genome papers are published there all the time).

So the authors made a big deal out of an interesting single observation, when they should have waited for more data. And then, probably for the same reasons, they oversold the result to the press – and ended up expressing an indefensible 99.9999% confidence in SCN2A’s involvement in ASD to a reporter.

The importance of these studies of increased rates of de novo mutations in autism cohorts can be better appreciated by studying the rate of sperm mutations in healthy volunteer donors. Molina et al (2011) studied sperm mutations in 10 healthy male volunteer donors, focusing on three mutations identified in individuals with genetic syndromes that also have high rates of autism. The three regions specifically examined in healthy donors were 7q11.23 (Williams syndrome), 15q11-13 (Prader-Willi syndrome), and 22q11 (DiGeorge/velo-cardio-facial syndrome). Most genetic and epigenetic cases of Williams syndrome, Prader-Willi syndrome and 22q11 deletion syndrome are caused by de novo mutations rather than inherited events. Mutations (deletions and duplications) in all three regions were found in the sperm of all the volunteer donors. The appearance in a family of a child with any of these genetic disorders appears to be a random event that can strike any family at any time for no discernible reason other than pure chance. Interestingly, the rate of sperm mutations (minimum 10,000 sperm) appears to be higher than the prevalence rates of these genetic syndromes in the general population. Reduced sperm motility, increased rates of fetal loss, and under-reported population prevalence rates could all explain this phenomenon.

results from scientific studies are routinely sexed up to warrant publications in scn. this, in itself, is not surprising, but the fact that all big and established investigators indulge themselves in this practice is.

i am surprised michael eisen is surprised, which i am sure he sees day in and day out his own department folks doing. now, what he is doing to expose this publicly is commendable and we should all support this. just out of curiosity, michael, how many senior established investigators support your endeavors like this?

Aside from the “hunting for significant p values” problem, their p values could also be less significant than they seem at face value because of the multiple sampling problem. Labs all over the place are looking for correlations like these, so of course some labs are going to hit them by chance. And it’s specifically those low p value results that will get published in Nature.

On the other hand, the fact that their one significant result also popped up in another independent case (mentioned in the Neale paper), that SCN1A also came up in O’Roak, and that the sodium channels were implicated in another study nearly 10 years ago (http://www.ncbi.nlm.nih.gov/pubmed/12610651) hints that it might not be a fluke and they actually were reasonably conservative about what was called significant.

I think the underlying problem is that these studies just barely have sufficient power to detect a whole bunch of very rare variants that cause phenotypes which are inappropriately (if unavoidably) lumped together as ASD. Such studies have to sell a story that shows at least incremental progress towards discovering the “cause of Autism” so that the research can continue and make it to the next round with more patients and more sequencing. Despite its notable statistical exaggerations, the NYT article actually did a reasonably good job of communicating that message.

Andrew, for what it’s worth, according to the Neale paper, no additional SCN2A mutations came up in either of the other exome sequencing papers, in either cases or controls. This WEAKENS the case for SCN2A.

They do mention that a de novo splice site mutation came up in another, unpublished, case. But it’s hard to evaluate the significance of this observation without knowing how many cases were looked at, etc…

Nonetheless, I’m not saying that SCN2A is not involved in autism. It seems like a pretty reasonable candidate. But lots of other pretty reasonable candidates that appeared to be bolstered by data stronger than presented for SCN2A have failed to pan out in other genetic studies. And I still think it was irresponsible of State to assert a degree of confidence at least four orders of magnitude stronger than the data support.

Note also that a case/control could have more than one rare variant, which adds a level of complexity to simulating data to derive an empirical p-value. Most of the statistical methods for analyzing rare variants require some kind of collapsing over the set of observed variants. As noted, single variant analyses with small sample sizes are really best used as hypothesis generating tools.

They estimate the total number of genes causing autism to be on the order of a thousand, so the 21000 number in your calculation should be changed to 1000. Now what are the chances that only 1 gene will appear twice in 200 cases?
And what are the chances that there will be almost no overlap with two other studies with 200 cases?
In addition I wouldn’t bet that the “stop” codons are enriched in the set. First, a 5:15 ratio is not a significant enrichment compared with 87:125; second, assuming random mutation you would expect an 18/61 ratio of stop codons (the number of aa codons that can turn to stop in 1 mutation), but they observe 10/125. This is a significant decrease, maybe due to negative selection, so I would not expect early truncation to play a significant role in autism.
Taking the three papers together I think that a more reasonable conclusion one should make is, de novo SNPs are not a significant cause of autism.

Michael Eisen

I'm a biologist at UC Berkeley and an Investigator of the Howard Hughes Medical Institute. I work primarily on flies, and my research encompasses evolution, development, genetics, genomics, chemical ecology and behavior. I am a strong proponent of open science, and a co-founder of the Public Library of Science. And most importantly, I am a Red Sox fan.