Editor’s Note: We are pleased to share the following post from Genomena, a blog maintained by Nathaniel Pearson (Director of Research at Knome). The entry is reposted here with permission. - Janine Holley

May 30, 2012 : Nathaniel Pearson : Three new papers spotlight a glut of rare variants in our genomes, with key insights for human history and health.

Bolstered by the papers’ data from more than 80 million copies of individual human genes, the growing catalog of such rare variants casts our recent ancestors’ rampant population growth into more precise temporal relief — and should, in the long run, help finely trace the geographic sojourns of particular copies of human chromosome segments. More importantly, however, many of those rare variants likely figure centrally in our health.

These basic insights have been clear to geneticists for a long time, and it’s great to see them percolate through the lay press. The new data papers scoured every letter of many genes in thousands of people, and found a bumper crop of spelling variants that are each found in just one or a few of those people. The third paper summarized what such findings suggest about precisely how big the human population has been over time, and roughly what they mean for efforts to understand disease.

Altogether, the findings cast such bright light on our origins and health because, under simple assumptions1, geneticists can predict how often variants that do (or don’t) greatly alter proteins should pop up in a given proportion of people, if our ancestors were steady in number, and if proteins weren’t especially important for health. And those are two big ifs.
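That null expectation can be made concrete with a quick sketch. Under the standard neutral model for a constant-size population, the expected number of variants seen in exactly i copies of a sample of n gene copies is proportional to 1/i; the numbers below are illustrative only.

```python
# Illustrative only: the neutral site frequency spectrum for a sample of n
# gene copies from a constant-size population. theta is the population
# mutation parameter; only relative proportions matter here.

def neutral_sfs(n, theta=1.0):
    """Expected variant count at each copy number i = 1..n-1 (proportional to 1/i)."""
    return {i: theta / i for i in range(1, n)}

sfs = neutral_sfs(10)
singleton_share = sfs[1] / sum(sfs.values())
# Even at equilibrium, singletons are the single biggest class of variants;
# recent explosive growth pushes their share far higher still.
print(f"expected singleton share (n=10): {singleton_share:.2f}")
```

The observed spectra in the new papers are skewed toward rarity well beyond this baseline, which is precisely the signal of recent growth discussed below.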

The new data highlight that real patterns of such variant frequencies in our genomes drastically flout those null expectations — and they call sensible attention to rare variants, which underlie that deviation, as we search for the genetic basis of disease. More specifically, the papers all underscore two broad insights that have been clear for several years:

• Our population has skyrocketed, but just for the past few millennia — a trend that’s left a strong signature of many young, rare spelling variants in our genomes.

• Many of those rare variants may be making us sick.

A tippy tree, laden with rare fruit

The findings in the new papers hinge on a simple insight: the more widely common a genetic variant is, the older it likely is. This is because old variants have typically been carried down many branches of the growing human family tree, spreading far and wide on the planet. By contrast, variants that just arose recently are typically confined to recently sprouted, geographically narrow branches of the tree.

While details of very early human population dynamics are hard to precisely infer2, the new data, along with much other genetic and ancillary historical evidence (see Keinan and Clark’s reference citations, for starters), suggest that our population has grown extremely fast in the past few millennia. Such growth has, effectively, stretched the human family tree at its tips: the tree’s young twigs look longer3, in units of generations, than we’d otherwise expect, given how long the trunk and inner branches are. And because new genetic variants pop up roughly randomly (by mutation) on the branches as they grow, the long, fast-growing tips of the tree harbor more of its total load of mutations than they would have, had the tree grown at a constant rate.

You can picture each such mutation as if it were a little brainstorm in the head of the late Dr. Seuss. Had Seuss drawn genomic family trees, he might have represented each mutation as an odd, never-before-seen kind of fruit, confined to the branch (big or small, and including its sub-branches) where the mutation struck. Many of the rife rare variants in our genomes can thus be thought of as distinctive fruits, each confined to just one or a few twigs amid a great, bushy tree.

In this light, the new papers affirm what’s become clear over the past few years, as we sequence more and more people’s whole genomes: we’ll still be finding new human genetic variants for a long time, even after having sequenced many more of us.4

And, as long as our population continues to dramatically balloon — a system out of equilibrium, in population genetic terms — the tree will continue to loosely resemble an inflationary universe, its various branches speeding apart from each other via new mutations. In this analogy, the genetic counterpart of the red-shift that signals cosmic expansion is, roughly speaking, the overall skew in frequency, toward rarity, of our genetic variants.

Rare variants in disease

Visions of the human family tree, tips bent toward our probing grasp by newfound fruit, may recall the myth of another tree. Apt, then, that the crop of rare variants in our genomes may include much of the fruit of human affliction.

Rare variants are thought to figure centrally in disease for two related reasons: as we’ve seen, most such variants are rare because they arose recently, so haven’t had time to spread widely among people; and young variants, by definition, haven’t withstood natural selection for long.

Such selection — often assisted by chance — tends to keep harmful variants rare, or purge them from the population altogether. Non-harmful rare variants, by contrast, are in principle free to get more common (though chance often strikes them down too).

That is, over time, consistently harmful variants tend to vanish, especially if the population is big enough to stably harbor a rich variety of alternative variants; meanwhile, variants that happen not to harm their carriers are free to spread, whether by chance or, in rare cases, by helping their carriers have more kids than others do.

Together, these trends mean that a snapshot of the rare variants we carry today, like a minute’s worth of the world’s newest tweets, is likely enriched for items that will soon be either gone5 or, in a few cases, more common.

And they help explain why surveys of the common genetic variants covered by fast, cheap SNP chip screens rarely offer clear insight into disease risk. For a given stretch of the genome, such common variants do distinguish big branches of the human family tree from each other, making them quite informative of ancestry. But a consensus has emerged that the long tail of human genetic diversity — all those rare variants — is where we’ll find much of the genetic contribution to disease risk.

Spotting which rare variants harm us, however, turns out to be tough.

Proof of burden

Take the extreme case of a variant found in just one woman, among everyone on earth. If we split humanity into those who get a given disease in life, and those who don’t, our chosen woman must fall into one group or the other. And if we look at enough diseases, she’ll eventually fall into the sick group for at least one of them.

But it’s clearly too far a leap to infer that the unique variant she carries made her sick. That is, the variant’s distribution among people with and without the disease simply can’t be statistically significant, given how rare it is overall.
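Why significance is out of reach here can be seen in a toy calculation (cohort sizes made up): for a variant carried by exactly one person, the one-sided Fisher exact p-value collapses to the probability that the lone carrier lands in the sick group purely by chance.

```python
# Toy calculation with made-up cohort sizes. For a variant carried by
# exactly one person, the one-sided Fisher exact p-value reduces to the
# probability that the lone carrier is a case purely by chance.

def singleton_p_value(n_cases, n_controls):
    """P(the single carrier falls in the case group | no true association)."""
    return n_cases / (n_cases + n_controls)

# In a balanced study, no matter how large, the best attainable p-value
# for a singleton is 0.5 -- far from significance, even before any
# multiple-testing correction across millions of variants.
p = singleton_p_value(1000, 1000)
print(f"best attainable p-value for a singleton (1000 vs 1000): {p}")
```

No amount of additional sequencing of the same cohorts changes this; only extra lines of evidence, of the kinds described next, can move the needle.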

To meet this inherent obstacle to squarely implicating a given rare variant in a given disease, geneticists leverage other insights. If the variant really is too rare to show up on further screening of more sick or healthy people — and that’s a place where the new data are already helping us at Knome, as we shortlist intriguing variants for research clients — they next ask how readily it may affect physiology by altering the amount or chemical makeup of a protein encoded by a gene that either harbors the variant itself, or sits near it in the genome.

And, next, they may look at more people with the disease in question, and ask whether other rare variants tend to cluster nearby in their genomes, more so than in other people’s. In recent years, as richly detailed data on human genetic variation have started to flow, geneticists have been honing rare variant burden tests specifically to find such regions. Refining such tests, and gathering more genetic and phenotypic data to feed them, stands to bring many key insights into the genetic basis of disease (and on a time frame shorter, we can certainly hope, than that needed for natural selection itself to weed all those harmful variants from the crop of rare variants we carry!).

A new drug

To thoroughly catalog the rare variants that pepper our genomes, of course, we have to read what DNA letters we carry at each site in the genome, rather than just at those sites already known to vary in spelling (as in SNP chips). The newly published work furthers that effort, by carefully sifting through particular sets of genes in many thousands of people — more people than have ever been so comprehensively sequenced together.

Notably, the Novembre group’s paper focuses on a few hundred genes already thought to help govern how the body responds to particular drugs. Such genes are actually an intriguing testing ground for the notion that rare variants crucially shape not just disease risk, but other phenotypes (outward traits) too.

Many drugs derive from defense chemicals made by plants and molds — nature’s organic chemists extraordinaire — that our ancestors have long eaten, breathed, and otherwise touched. But modern folks have also tinkered greatly with such drugs, concentrating, combining, and diversifying them in our quest to prevent and cure diseases. As such, many drugs, and cocktails thereof, are (like other facets of our overall diets) fairly new parts of the human environment.

Drugs we take are thus exposing even the most common (read: oldest) variants in our genomes to novel regimes of natural selection. Many such drugs work better, at particular doses, in some people than others — and such variation may often trace largely to variation in our genomes.

Looking ahead, I’m intrigued to see whether rare genetic variants turn out to explain unusual responses to particular drugs as well as (or better than) they explain particular diseases — or, alternatively, whether such variation in drug response traces largely to common variants in our genomes.

Tall trees: the diversity skyline

An intriguing tidbit in the Akey group’s paper is a spatial contour of overall genetic diversity across thousands of genes in our genomes. Plotting the classic measure of nucleotide diversity — that is, how often two randomly chosen chromosomal copies of a genome site differ in spelling — Akey’s post-doc Jacob Tennessen et al. predictably found the strongest peak in diversity in the HLA gene cluster on chromosome 6. Expressed on the surface of immune response cells, these genes work, in large part, to help us fight infection — a job thought to be well served by great genetic diversity within a genome, which presumably helps its carrier respond to many kinds of germs.
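For the curious, nucleotide diversity is easy to compute in miniature: it is the average fraction of sites at which two randomly chosen sequence copies differ. The four short sequences below are invented for illustration; real estimates average over thousands of sampled chromosomes.

```python
from itertools import combinations

# Toy computation of nucleotide diversity (pi): the average fraction of
# aligned sites at which two randomly chosen sequence copies differ.
# These four short sequences are invented for illustration.

seqs = [
    "ACGTACGT",
    "ACGAACGT",
    "ACGTACTT",
    "ACGAACGT",
]

def pairwise_diff(a, b):
    """Fraction of aligned sites at which two sequences differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

pairs = list(combinations(seqs, 2))
pi = sum(pairwise_diff(a, b) for a, b in pairs) / len(pairs)
print(f"nucleotide diversity (pi): {pi:.4f}")
```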

Byzantine in its sequence variation, HLA turns out to play surprising functional roles in mate choice, drug response, and diseases from multiple sclerosis to narcolepsy. Notably, women and other female great apes likely pick their mates in part (and unconsciously) by how they smell, thanks partly to what versions of HLA they and their suitors carry. Such preferences are thought to help preserve genetic variation longer here than elsewhere in the genome — so well, in fact, that your copies of some HLA genes more closely resemble some gorillas’ copies than some other people’s copies…and those gorillas’ HLA genes are likewise closer to yours than to each other’s!

Essentially, even the inner branches of the family tree of this part of the genome are incredibly long, stretching back ten-fold more generations than is typical. As we’ll see in a coming post, the overall depth of the tree for a given part of the genome can be thought of as a rough proxy for how big the ancestral population for that part of the genome has, over time, tended to be.

Other peaks in genetic diversity — lower than HLA, but still prominent — include odorant receptor and keratin (hair/skin protein-making) genes, which are widely presumed to accumulate functionally unimportant variation, reflecting less stringent evolutionary constraint in people than in some other mammals. Strikingly, however, the Akey group also found that another immune response gene, DEFB108B, marks a peak in genetic diversity roughly as tall as that of the much better known HLA cluster. It’ll be intriguing to learn more about what DEFB108B does in our bodies, and whether its remarkable diversity reflects HLA-like importance, or keratin-like dispensability.

Stay tuned on that front. As more of us are sequenced and phenotyped, we’ll learn much more about which of our variants — among the common ones, and the newly commonplace rare ones — matter most, and how. Much of what we learn will speak directly to the pending challenges of genomically personalized medicine, as framed in fervent discussion of another recent paper, both at large

1Back-of-the-envelope estimates typically ignore any complications from non-random mating, variation in mutation rate, and so forth — but are quite robust.

2Moreover, the history of human population change has, of course, varied in space (among regional sub-populations), as well as over time. Notably, the new papers suggest that such variation may be fairly minor in the grand scheme, dwarfed by the remarkable overall recent growth. And Keinan and Clark note that sample sizes, in particular, may add roughly as much noise to the picture as do real underlying variables.

3Ultimately, the length of these twigs tracks how long many randomly chosen pairs of extant copies of our chromosomes have descended along separate lines.

4In the end, you likely harbor a dozen or so brand new genetic variants that arose by mutation only in you. But you also likely harbor plenty of other very rare variants that, until we sequence your genome, will have never been spotted in anyone else.

5Note that this doesn’t mean that no one with harmful variants has kids — after all, everyone carries some such variants, and people are breeding just fine. Rather, because a given variant can be inherited independently of other variants in the same genome, and may wreak harm only in combination with another copy of itself (or some other variant), people simply tend to have more kids who inherit more copies of healthier alternative variants than kids who inherit more copies of harmful ones. Moreover, much of the natural selection in question likely happens beyond our view, before pregnancy begins, when unhealthy early embryos fail to implant and thrive in the womb.

Editor’s Note: We are pleased to share an article submitted by Jeffrey Rosenfeld, PhD. Jeffrey is a Bioinformatics Scientist in the Division of High Performance and Research Computing at the University of Medicine and Dentistry of New Jersey (UMDNJ) and a Research Associate in the Division of Invertebrate Zoology at the American Museum of Natural History.

May 15, 2012 : Jeffrey Rosenfeld : In the past few years, the price of sequencing has plummeted, and now, for a few thousand dollars, the complete sequence of an individual can be obtained. Even so, many scientists have opted to sequence just the exome (coding regions) of an individual and to ignore the rest of the genome. This focus on the exome has some justification, but I think that it is shortsighted; despite the higher cost, sequencing a complete genome is more valuable, even if that means sequencing fewer samples.

The supporters of exome sequencing generally make the following points to justify their position:

A. The sequencing of an exome is much cheaper than the sequencing of a genome. It must be substantially cheaper to sequence 1% of the genome than the whole genome.
B. We don’t understand how to interpret non-coding variants and therefore we should limit our sequencing to genes that are well annotated.
C. Variants that are associated with a genetic disease are more likely to be found in a coding region since they directly alter the structure of a protein.

I am not going to deny that there is some validity to these points, but I don’t think that they outweigh the shortcomings of exome sequencing and the benefits of whole genome sequencing that I will outline below. I understand that this is a contentious issue, and I welcome your comments whether you agree or disagree with my position.

1. Cost

The first reason people generally look to exome sequencing is cost. Intuitively, the sequencing of 1% of the genome (the exome) should be cheaper than sequencing the entire genome. While this is true, the price differential is nowhere near 1:100 and is closer to 2:1 or 3:1, depending upon how the costs of sequencing are calculated. Currently, a whole genome costs ~$4,000 and an exome costs ~$1,500. Why are these prices so close to each other? The answer is that the actual reagent cost of running the sequencer is not the only factor in the cost of a genome or an exome. For either type of experiment, library prep is required, along with the costs associated with setting up a sequencing run of any size. For an exome, there is the additional cost of purchasing the selection kit, which allows one to extract the coding sequences from raw DNA either using a microarray or in solution. This kit can cost several hundred dollars, and is therefore a substantial portion of the cost of exome sequencing.

Because of the lack of a strong cost differential, the economic argument favoring exome sequencing is not very strong. For the same amount of funding, a researcher would need to choose between, say, 30 exomes and 10 genomes. While 30 samples are obviously better than 10, this is not a great differential. It is much less than the 1:100 differential that one would naively expect given the relative sizes of the exome and the genome. An additional factor affecting the cost of exome sequencing is the time required to perform the hybridization: the Nimblegen protocol requires 72 hours of hybridization, and the Agilent approach requires 24 hours. These times add a delay to the time from sample to sequence, which may be problematic for clinical applications. As an example, the Ion Torrent machine is being pitched as a tool for rapid sequencing that will produce results in a single day. When an exome is targeted using Agilent or Nimblegen, this will grow to at least two to four days.
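The budget trade-off above is simple arithmetic; here is a quick sketch using the article's ballpark prices (the budget figure is hypothetical, and real quotes vary by provider and year):

```python
# Back-of-the-envelope comparison using the article's ballpark prices
# (~$1,500 per exome, ~$4,000 per genome); the budget figure is hypothetical.

EXOME_COST = 1_500
GENOME_COST = 4_000
budget = 45_000  # hypothetical grant budget, in dollars

n_exomes = budget // EXOME_COST    # samples if we sequence exomes only
n_genomes = budget // GENOME_COST  # samples if we sequence whole genomes
ratio = GENOME_COST / EXOME_COST   # actual price differential

print(f"{n_exomes} exomes vs {n_genomes} genomes; price ratio ~{ratio:.1f}:1")
```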

2. Exome coverage

The definition of an exome is somewhat elusive. It can refer to:

a) All coding exons of the genome
b) (a) plus microRNA genes
c) (a) plus 5’ UTR and 3’ UTR regions
d) Unannotated transcripts that have been discovered in RNA-seq experiments or from the ENCODE project
e) All "functional" portions of the genome

These five definitions will include very different portions of the genome, and some of them, such as (e), are difficult to define in and of themselves. It has been shown in multiple studies that there is pervasive transcription along substantial portions of the genome. Should all of these regions be considered part of the exome? In general they are not included in the exome kits, since their inclusion would push the size of the exome much closer to that of a genome, and any potential savings from the lesser amount of sequencing would decrease. Instead, the exome is generally limited to coding genes with some level of annotation, along with microRNAs and, to some extent, UTRs.

Each of the vendors that produce exome kits has taken a different approach to defining the exome. A recent paper (http://www.nature.com/nbt/journal/v29/n10/full/nbt.1975.html) compared the exome selection offerings from the three main players in the field: Agilent, Nimblegen, and Illumina.

This figure gives a great comparison of the different technologies. Firstly, the approaches to selecting the exome sequence differ: Nimblegen uses overlapping DNA baits, Agilent uses RNA baits which are distinct but contiguous, and Illumina uses distinct DNA baits that are not contiguous and contain breaks of un-targeted sequence. Because of this, Nimblegen contains many times the number of probes as the other two technologies. The rest of the figure shows Venn diagrams illustrating the overlap between the targeted regions. For two different definitions of human genes, RefSeq and Ensembl, there is substantial agreement between the technologies, as indicated by the 28.5 and 28.4 Mb of sequence that they all cover. The biggest discrepancy is with regard to UTR regions, where Illumina has 28 Mb that are missing from the other two platforms.

A different technique to assess coverage is to look at the amount of the exome target from a particular kit that is covered at a sufficient threshold to make a confident call of a variant. For many scientists, a threshold of 20x coverage is required to trust a variant derived from an exome sequence. Any loci with lesser amounts of coverage are ignored. Since the general sequencing coverage for an exome is 80x, in theory it should be no problem to achieve 20x coverage of the entire targeted region. In practice, this is not the case, for three reasons. Firstly, exome sequencing, as with all sequencing, produces reads in a statistical distribution, not evenly along the genome. Randomly, some regions are going to have their DNA sequenced more often and thus have a higher number of reads. This idea forms the basis of the famous Lander-Waterman statistics that are used for designing sequencing projects. The second reason for variation in coverage is that some of the baits used for selecting the exomic DNA will have a higher affinity than other baits, mainly due to GC content. Those probes with higher affinity for their targets will produce greater amounts of sequenced DNA. The final concern is due to the repetitive nature of the genome. The selection probes need to target a unique location in the genome to ensure that they are truly obtaining the DNA that they intend to select. If the targeted region is repeated in the genome, then sequence from all of the matching regions will be equally selected. Many human genes share domains with other proteins, and any shared sequences cannot be targeted. This is equivalent to the problem of uniquely mapping sequencing reads, which is a big concern in the use of short sequence reads. Any reads that map to more than one location in the genome cannot be uniquely placed and are generally discarded.
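The first of these reasons can be quantified with the idealized Lander-Waterman picture, in which read depth at a site is roughly Poisson-distributed around the mean coverage. A minimal sketch, assuming pure Poisson sampling with no capture or GC bias:

```python
import math

# Idealized Lander-Waterman sketch: read depth at a site is modeled as
# Poisson-distributed around the mean coverage, with no capture or GC bias.

def fraction_at_least(threshold, mean_depth):
    """P(depth >= threshold) when depth ~ Poisson(mean_depth)."""
    term = math.exp(-mean_depth)  # P(depth == 0)
    below = 0.0
    for k in range(threshold):
        below += term
        term *= mean_depth / (k + 1)  # advance to P(depth == k + 1)
    return 1.0 - below

# At 80x mean coverage, pure sampling noise should leave essentially no
# site below the 20x calling threshold.
frac = fraction_at_least(20, 80.0)
print(f"sites at >=20x under ideal 80x coverage: {frac:.6f}")
```

That this idealized model predicts essentially complete 20x coverage at 80x mean depth underscores that the shortfalls seen in practice come largely from bait affinity, GC content, and repeats, not from sampling noise alone.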

These concerns are illustrated in this figure from Agilent regarding their SureSelect sequencing:

This is an old figure, but I think that while the numbers might have changed a bit, the overall message remains. The read depth is extremely variable, and you do not achieve anything close to 100% coverage of the exome. While accurate data are available for 80% of the exome (depth > 20x), this also means that 20% of the exome is missed. In odds terms, for a disease study where an exomic variant correlates with the disease, there is a 1-in-5 chance of not having the variant included in the data. A researcher could conclude that there is no coding variant associated with their disorder when in actuality it simply fell into the 20% that was missed. An error level of 20% is not trivial and cannot be lightly dismissed.

3. Whole Genomes

When a whole genome is sequenced, many of the issues regarding exome sequencing are not relevant. There is no need to buy a hybridization kit or to wait for the kit to hybridize. While there are sequencing biases (as there are in any sequencing experiment), there are not the additional biases introduced by the exome selection. Overall, there is probably the standard 5% error in sequencing, giving a confidence level of 95%.

But the biggest gain from whole-genome sequencing is that the entire genome (excluding some unclonable regions) is obtained. If one wants to focus on the exome because it is easier to understand and interpret, one can easily filter out the non-coding portions of the genome to obtain an in silico exome. This is an easy operation to perform, and if a positive result is not found in the exome, then you already have the rest of the genome sequenced and can begin looking for an intronic variant related to splicing, or a non-coding promoter or enhancer variant. In a traditional exome experiment, this is not possible. If no variant is found in the exome, then there is no result, and one needs to go back and sequence the whole genome from scratch.
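A minimal sketch of what that in silico filtering looks like, with made-up exon intervals and variant calls (a real pipeline would read BED and VCF files and use an interval tree):

```python
# Minimal sketch of an "in silico exome": with a whole genome sequenced,
# the exome is just a filter. All intervals and variants below are made up.

exons = {  # hypothetical exon intervals, 0-based half-open, per chromosome
    "chr1": [(100, 200), (500, 650)],
    "chr2": [(50, 120)],
}
variants = [("chr1", 150), ("chr1", 400), ("chr2", 60), ("chr2", 9000)]

def in_exome(chrom, pos):
    """True if the position falls inside any annotated exon."""
    return any(start <= pos < end for start, end in exons.get(chrom, []))

coding = [v for v in variants if in_exome(*v)]
noncoding = [v for v in variants if not in_exome(*v)]
# If the coding set turns up nothing, the non-coding variants are already
# in hand -- no need to return to the sequencer.
print(coding)     # [('chr1', 150), ('chr2', 60)]
print(noncoding)  # [('chr1', 400), ('chr2', 9000)]
```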

To give a picture of the fraction of disease-associated variants that are coding or non-coding, I looked at the UCSC collection of GWAS studies. The current list contains 5,454 unique SNP loci that were identified as part of a GWAS study. Of these, 3,047 (56%) are not within coding genes. Thus, more than half of the identified important genomic variants are not in coding regions and would not be covered by exomes. (Some of these SNPs may be in UTRs or non-coding RNAs, which are targeted by some of the platforms.)

I see this as a betting situation. Would you rather spend $1,500 and have a 44% chance of getting the answer, or spend $4,000 and have a 95% chance of getting the answer? I think that the $4,000 genome is much more reasonable. Just because we don’t understand non-coding sequence does not mean that we can or should ignore it. As scientists, we have an obligation to try our best to investigate human disease, not to focus only on things that are easy to understand.

As a final point, there has been some recent talk concerning variants that are only found from exome sequencing and not genome sequencing. These results are not a fair comparison of apples to apples. The exomes are generally sequenced at 80x coverage, and the genomes are sequenced at 30x coverage. For the specific variants under discussion, 80x sequencing coverage is required to identify them from any technique. This 80x coverage could have been of just the exome, or the entire genome. If the whole genome were sequenced to 80x for a true comparison, then I am confident that there would not have been an advantage for the exome over the genome.