
Wednesday, September 25, 2013

Genome sequencing yields masses of data. It's one of the founding justifications for the currently fashionable term Big Data. The jury is still out on how much of it is meaningful in any sort of clinical way as opposed to other sorts of data; that's fine, it's early days yet. But too much of current thinking seems to rest on ideas that are outdated in fundamental ways. It would be helpful to get beyond this.

A new paper in The American Journal of Human Genetics ("Actionable, Pathogenic Incidental Findings in 1,000 Participants’ Exomes", Dorschner et al.) reports on a study of gene variants in the exomes of 1,000 participants in the National Heart, Lung, and Blood Institute Exome Sequencing Project. How many variants associated with genetic conditions that might be undiagnosed does each individual carry? The study addresses the issue of how much individuals should be told about "incidental findings" in their genome or exome sequences.

The investigators looked at single nucleotide variants in 114 genes in 500 European American and 500 African American genomes. They found 585 instances of 239 unique variants identified by the Human Gene Mutation Database (HGMD) as disease-causing. Of these, 16 autosomal-dominant variants in 17 people were thought to be potentially pathogenic; one individual had 2 variants. A smattering of other variants not listed in HGMD were found as well. The paper reports a frequency of ~3.4% and ~1.2% of pathogenic variants in individuals of European and African descent, respectively.

The 114 genes were chosen by a panel of experts, and pathogenicity determined by the same.

“Actionable” genes in adults were defined as having deleterious
mutation(s) whose penetrance would result in specific, defined medical
recommendation(s) both supported by evidence and, when implemented,
expected to improve an outcome(s) in terms of mortality or the avoidance
of significant morbidity.

Variants were classified as pathogenic, likely pathogenic VUS (variant of uncertain significance), VUS and likely benign VUS. Classification criteria included allele frequency of the variant (if low in the healthy population, it was considered to be more likely pathogenic than if high, relative to disease frequency), segregation evidence, number of reports of affected individuals with the variant, and whether the mutation has been reported as a new mutation or not. The group decided not to return VUS incidental findings to the individual if the variant was in a gene unrelated to the reason they were included in the study in the first place.

Reviewers of the data followed stringent criteria to classify alleles. For example, variants were considered suspect if they were identified by the HGMD as disease-causing. But they were not considered disease-causing if the allele frequency was common enough that this meant, relative to disease frequency, that the allele alone couldn't be causal.

This raises a point that we've made before, too often to a deaf audience: the variant is not 'dominant', and the 150-year-old term due to Mendel should be dropped from usage in favor of a more accurate conception of probabilistic causation (see below). If the allele was more common than the disease, other alleles or factors must also be involved. Or, perhaps the allele was improperly classified as causal in the first place.

And, "maximum allowable allele frequencies for each disease were calculated
under a very conservative model, including the assumption that the given
disorder was wholly due to that variant." Disease frequencies were overestimated when they weren't known. But, is dominance a function of disease frequency? Suggesting such a thing should raise big red flags about semantics and the conceptual working frameworks being used.
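The logic of such a frequency filter can be sketched in a few lines. This is a minimal illustration, not the procedure Dorschner et al. actually used: it assumes a fully penetrant dominant allele in Hardy-Weinberg proportions that accounts for every case of the disease, so that prevalence equals the carrier frequency 1 - (1 - q)^2. The function names and the example numbers are hypothetical.

```python
import math


def max_allele_frequency(prevalence: float) -> float:
    """Largest allele frequency q consistent with the given disease prevalence.

    Assumption (ours, for illustration): one fully penetrant dominant allele
    explains every case, so prevalence = 1 - (1 - q)**2.
    """
    if not 0.0 <= prevalence <= 1.0:
        raise ValueError("prevalence must be between 0 and 1")
    return 1.0 - math.sqrt(1.0 - prevalence)


def too_common_to_be_sole_cause(allele_freq: float, prevalence: float) -> bool:
    """True if the allele is more frequent than the disease could allow."""
    return allele_freq > max_allele_frequency(prevalence)


# Hypothetical numbers: a disease affecting 1 person in 500.
print(max_allele_frequency(0.002))
print(too_common_to_be_sole_cause(0.01, 0.002))    # allele seen in 1% of chromosomes
print(too_common_to_be_sole_cause(0.0005, 0.002))  # allele seen in 0.05%
```

Note that the cutoff comes entirely from disease frequency and the assumed model; nothing in it measures what the allele actually does.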

The eight participants with confirmed pathogenic (versus likely pathogenic) mutations included three with increased risk of breast and ovarian cancer (MIM 604370, caused by BRCA1 mutations, or MIM 612555, caused by BRCA2 mutations), one with a mutation in LDLR, associated with familial hypercholesterolemia (MIM 143890), one with a mutation in PMS2, associated with Lynch syndrome (MIM 614337), and two with mutations in MYBPC3, associated with hypertrophic cardiomyopathy (MIM 115197), as well as one person with two SERPINA1 mutations, associated with the autosomal-recessive disorder alpha-1-antitrypsin deficiency (MIM 613490).

Fewer actionable alleles were found in African Americans than European Americans, presumably because fewer studies have been done in this population and fewer causal alleles identified. Dorschner et al. did not have access to phenotypes of the people included in this study, so can't know their health status with respect to these variants. Nor, of course, can they know whether individuals might have a condition for which a causal allele was not found.

They report few pathogenic alleles, though the fact that they looked only for alleles associated with adult onset conditions could partially explain this. And, their criteria were stringent. And, of course, they were looking only for single gene disorders, so it's not a surprise that they identified so few potentially pathogenic alleles, in fact. Of course, single gene disorders are only a small subset of conditions that might affect us.

A 2011 Cell paper we've mentioned before ("Exome sequencing of ion channel genes reveals complex variant profiles confounding personal risk assessment in epilepsy," Klassen et al.) looks at this question from a different angle. Klassen et al. compared the coding sequences of 237 ion channel genes (known to be associated with epilepsy) in affected and unaffected people. They found rare variants in Mendelian disease genes at equivalent prevalence in both groups. That is, healthy people were as likely to have purportedly causal variants as those with sporadic, idiopathic epilepsy. They caution that finding a variant is only a first step.

The unjustified dominance of 'dominance'
We feel compelled to comment again, and further, on the terminological gestalt involved in papers such as these, as we commented yesterday (and have done in earlier posts).

The concept of dominant single-locus causation goes back to Mendel, who carefully chose traits in peas that worked that way. He knew other traits didn't. He learned that not all 'dominant' traits showed 'Mendelian' inheritance and even came to doubt his own theory later in life.

There are traits that seem to 'segregate' in families in classical Mendelian fashion. There is, for such traits, a qualitative (e.g., yes/no) relationship between genotype and phenotype (trait). When a dominant allele is present, you always get the trait. This means we see (in the case of dominance) about 1/2 of offspring of an affected parent who are affected. Much of the 20th century in human genetics was spent trying to fit patterns of inheritance to such single-gene models. But it was very clear that dominance (even when there was evidence for it) wasn't dominance! The traits were not always black and white (or should we say green and yellow?). And the segregation proportion wasn't 50%. What to do?

The conviction that the trait was due to the effects of a single locus seemed to have strong support, so the concept of 'penetrance' was introduced. This is not the first time that a fudge factor has been used to force a model to fit data when it didn't really. The idea in this case is that the inheritance of the allele (variant) from the parent had a 50% probability, but that the allele, once present, did not always cause the trait. If you can add a factor of 'incomplete penetrance', that can vary from 0 to 1, then you can fit a whole lot of data that otherwise wouldn't support Mendelian causation.
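The arithmetic behind the fudge factor is short enough to write out. This is a minimal sketch with hypothetical function names and made-up observed fractions: with a 50% chance of transmitting the allele, the expected affected fraction among offspring is 0.5 times the penetrance, so any observed fraction up to 0.5 can be 'explained' by choosing the penetrance to match.

```python
def expected_affected_fraction(penetrance: float, transmission: float = 0.5) -> float:
    """Expected fraction of affected offspring of one affected (heterozygous) parent."""
    if not 0.0 <= penetrance <= 1.0:
        raise ValueError("penetrance must be between 0 and 1")
    return transmission * penetrance


def implied_penetrance(observed_fraction: float, transmission: float = 0.5) -> float:
    """Penetrance needed to make a single-locus dominant model fit the data."""
    if not 0.0 <= observed_fraction <= transmission:
        raise ValueError("observed fraction exceeds what the model can produce")
    return observed_fraction / transmission


# Made-up segregation proportions: each one is 'rescued' by some penetrance value.
for observed in (0.5, 0.3, 0.1):
    print(observed, implied_penetrance(observed))
```

Because the second function returns an answer for every input in range, a good fit by itself says little about whether the single-locus model is right.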

What we now know is that there are many variants at genes associated with single-gene traits, that other genes almost always also contribute (along with environmental factors as well, most of the time), and that the trait itself is quantitative: the same 'A' dominant allele doesn't always cause the same degree of severity and so on. In other words, the trait is mainly (in a statistical sense) due to the presence of variation in a single gene, but the effect depends on the specific allele that is involved in a given case and also is affected by the rest of the genome.

In other words, there is a quantitative relationship between genotype and phenotype. This is a general, accurate description of the pattern of causation. The pattern is not 'Mendelian' and the causation is not dominant. Or, to be clearer, the extreme cases are close to, or even exactly, single-allele dominant, but this is the exception that proves (tests) the quantitative-relationship rule.

We should stop being so misled by constraining legacy terminology. Mendel did great work. But we don't have to keep working with obsolete ideas.

10 comments:

As always, a lucid reminder of the fallacies of simplistic thinking in genetics. But also a reminder that biology as a science is actually built on idiosyncratically simple examples: we learn through the fortunate discovery of exceptional cases that teach us elementary principles.

At a more profound epistemic level, and in defense of the use of fudge factors such as 'penetrance' (as long as one is aware that it serves as a place holder for yet to be identified modifier genes - which most students do) I would like to add this:

What is wrong with teaching students the concepts of dominant vs. recessive? I think we have no alternative. For how can one appreciate 'gray' if one has not internalized the concepts of black and white? How can one appreciate the idea of a political ‘center’ if one has no concept of liberal vs. conservative? What is ‘lukewarm’, if one is unaware of hot and cold?

Human language and thinking is in some sense digital if not binary, and to abnegate the convenience of discretizing reference points in our communication of concepts would be highly impractical. An indifferent continuum would be a horror. I am not sure how effective a science education program would be that bans all extremes and starts right in the fuzzy middle of lukewarm, gray, moderate .. while banning cold and hot, black and white, etc. as practically non-existing idealisations or extremes?

To discover laws, we must continue to see the world through the lens of “black and white” (or yellow and green?) as Mendel did. Yes, we know: he was wrong.

But as a famous scientist once said: It is better to be wrong than vague. So let’s be clear.

Well, Sui, you won't be surprised that I disagree. If we have been taught to think in black and white you might see gray as 'wrong' in the sense of your comment.

But if we were taught in terms of a + b, we would be more accurate and would not be conceptually misled. Here 'a' is the allelic effect of one allele at a locus (or whatever the genetic referent is) and 'b' is the effect of the other (in a diploid).

For continuous traits, like severity, S = a + b, clearly and consistently, and the challenge is to estimate or explain the allelic values.

If we truly believe a trait to be dichotomous, then we can simply say that a or b can = 0. This does raise the question of how to account for dominance, where essentially one wants a or b to equal 1 ('present'). This can accurately be accommodated, in a realistic and biologically sensible way, by saying that we assign dichotomous values to a potentially quantitative variable (the state of the trait), and if a + b > T we use the term 'affected'.
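The a + b > T idea can be sketched as follows. This is only an illustration under stated assumptions: the function names, the allelic effect values and the threshold are all hypothetical, and the model is strictly additive.

```python
def severity(a: float, b: float) -> float:
    """Quantitative trait value as the sum of the two allelic effects."""
    return a + b


def is_affected(a: float, b: float, threshold: float) -> bool:
    """Dichotomous label applied on top of the quantitative value."""
    return severity(a, b) > threshold


T = 0.5        # hypothetical threshold
NORMAL = 0.0   # hypothetical effect of the common allele
STRONG = 1.0   # one copy alone exceeds T, so it looks 'dominant'
WEAK = 0.3     # two copies are needed to exceed T, so it looks 'recessive'

print(is_affected(STRONG, NORMAL, T))  # True
print(is_affected(WEAK, NORMAL, T))    # False
print(is_affected(WEAK, WEAK, T))      # True
```

In this sketch 'dominant' and 'recessive' are not properties of the alleles; they fall out of where the threshold sits relative to the allelic effects.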

Since there can always be exceptions (e.g., singularities in mathematics), we could say that sometimes our definitions etc. are such that a given allele by itself causes some observable state.

Or we could even try to think (even better than all of the above?) in terms of a phenotype being a superposition of many possible different causal effects, even expressed probabilistically.

To me, this is more realistic and less misleading than Mendelian concepts which were invented 100 years before the functional nature of DNA was known.

I forgot to add that even a + b doesn't account for interactions or quantitative dominance deviations (standard issues in quantitative genetics). And, like incomplete penetrance, partial dominance is a tacit assumption that the ground state of life is dominance, when most of the time actual dominance is the exception.

Further, we clearly still have a lot to learn about gene action and interactions, gene by environment interactions, the effects of various kinds of RNA, genomic background and so forth, surely including things we still don't know about. Teaching by starting from a baseline that we know at best is inadequate becomes worse and worse pedagogy with every new bit of understanding.

There's a difference between how to teach stuff based on what we know now and how the field itself has advanced over more than a century (which is what I think Sui is talking about). For teaching at the intro. level: teach Mendel and make sure everyone understands him, and then *immediately* introduce the reasons his model is an over-simplification and his pre-selection of "mendelizing" traits was unrepresentative. But I would argue that Mendel's simple model (and even his biased selection of traits) was absolutely essential to the furtherance of the field of genetics -- indeed for all the later discoveries that show how wrong-headed he was. I would say the same for the Watson-Crick model, which didn't have things like introns or regulatory sequences in it, but was essential for the advances that led to the discovery of such things. I like simple models. Indeed I like simple models PRECISELY BECAUSE reality is so damned complicated and hard to understand. The trick is never to forget that a simple model is nothing more than a *model*, an intellectual tool, not an accurate representation of reality (a.k.a. God's Truth, which we may never know). The human brain is feeble and needs all the crutches it can get -- as long as it knows they're crutches. Or, to quote Alfred North Whitehead (who really wasn't much of a philosopher), "Seek simplicity and distrust it."

Teaching Mendel as history is fine if it's taught that way, and I agree and always say that he enabled the greatest scientific advance perhaps in history for the following century plus. The problem in my view, as I've said, is that too many fail to note that the 'approximation' is not a good one much if not most of the time. This applies even to professionals at the highest level, who (as we tried to suggest in this and the previous post) still often work with 1863 perspectives.

I would say also that a better approach can be made by saying that this is the reality, and if we pare it down to a degenerate case ('dominance') we can see some usefulness, etc.

After all, from this point of view, we don't teach calculus by first focusing on singularities and then generalizing to differentiable functions.

So in that sense I think your idea of a simplified model is rather different. Simplified models have, often at least, general use even if as approximations. But simplified Mendelian models don't have general use in understanding genetics; they are degenerate rather than simplified, in that sense.

Indeed. The essential thing is your caveat of immediately introducing how much more is now known. That apparently doesn't happen nearly often enough. We've blogged about 2 papers in the last week or two, in major journals, based on the simplistic Mendelian view of the world. The first was a state-of-the-art how-to for finding genes for rare diseases -- success rates are on the order of 20% using this approach. There are multiple reasons why that would be so, certainly -- it's hard -- but the assumption of dominant/recessive alleles is surely one of them.

I'd just add that my comments were about the history of the field, not current practice (or education) -- except insofar as we still lean on simplified models as crutches as we lurch toward further understanding. But if we're doing our job right, our "simple" models get more and more complex, as they should. To cite another early 20th-century philosopher (Einstein -- and I've probably got the quote wrong), "Your model should be just as simple as it needs to be -- and no more!"

Yes I think this is a right way to put it. But if we haven't got any better ideas, and need to keep the grants flowing, and have to use the Big Data technologies for that, or to promise quick success, then we are pressured to invoke simple promises.....just like preachers do!

For an interesting meditation on (and by no means a wholehearted defense of) simple models, check out Evans, M.R. et al. (2013) Do simple models lead to generality in ecology? Trends in Ecology and Evolution 28:578-83. Their answer to the question in the title: not so much.
