The Pith: The rarer the genetic variant, the more likely that variant is to be specific to a distinct population. Including information about the distribution of these genetic variants, which current techniques miss, can greatly increase the precision of statistical inferences.

A few days ago I mentioned in passing an article in The New York Times which reported on results from a paper illustrating how starkly differentiated populations might be on rare alleles, that is, genetic variants present at very low frequencies. It turns out that many of these low frequency variants are private to particular populations, in contrast to higher frequency variants which span varied human populations. The explanation presented by one of the authors of the referenced paper was that higher frequency variants presumably date back to a time before human populations had become geographically diversified across the world. Shared variants at higher frequencies then are shadows of shared past history. In contrast, rare variants are a reflection of more recent events, narrowing the circle of those affected.

I have now read the paper in question, Demographic history and rare allele sharing among human populations. From what I can gather The New York Times article was really an elaboration upon some of the issues which were highlighted in the discussion. The “meat” of the paper in terms of methods and results is actually rather technical and deeply embedded in the language of mathematical statistics. For example:

After further consideration, I have decided that I shall spare you my own clumsy exposition in plain English as to the details of site frequency spectrum calculations. There are after all enough points of interest in the paper at which I can throw my verbal talents more effectively. First, the abstract:

High-throughput sequencing technology enables population-level surveys of human genomic variation. Here, we examine the joint allele frequency distributions across continental human populations and present an approach for combining complementary aspects of whole-genome, low-coverage data and targeted high-coverage data. We apply this approach to data generated by the pilot phase of the Thousand Genomes Project, including whole-genome 2–4× coverage data for 179 samples from HapMap European, Asian, and African panels as well as high-coverage target sequencing of the exons of 800 genes from 697 individuals in seven populations. We use the site frequency spectra obtained from these data to infer demographic parameters for an Out-of-Africa model for populations of African, European, and Asian descent and to predict, by a jackknife-based approach, the amount of genetic diversity that will be discovered as sample sizes are increased. We predict that the number of discovered nonsynonymous coding variants will reach 100,000 in each population after ∼1,000 sequenced chromosomes per population, whereas ∼2,500 chromosomes will be needed for the same number of synonymous variants. Beyond this point, the number of segregating sites in the European and Asian panel populations is expected to overcome that of the African panel because of faster recent population growth. Overall, we find that the majority of human genomic variable sites are rare and exhibit little sharing among diverged populations. Our results emphasize that replication of disease association for specific rare genetic variants across diverged populations must overcome both reduced statistical power because of rarity and higher population divergence.

The first figure illustrates one of the clearest, though most unsurprising, findings in the paper: the lack of overlap of rare alleles across two distinct populations. In this panel they’re comparing Chinese from Beijing (CHB) and Yoruba from Nigeria (YRI). They focused on rare alleles, defined as variants present in 15 or fewer out of 100 in their sample. The union of the two populations yielded ~3,300 alleles, but only ~200 of these were shared between the populations. In other words, well over 90% of these alleles were private to one population or the other. This immediately clues you in on the peculiarity of these genetic variants, as you should know that at any random polymorphic gene there will be far less between population variance than this. The zone of intersection on the histogram is notably “flat,” while it is “cool” on the heat map. In contrast, the “edges” of the graphs, which are defined by alleles exclusive to each respective population, exhibit a wide distribution in counts (observe that there are many more very rare alleles than moderately rare alleles).
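To make the tally concrete, here is a minimal Python sketch of how one might count shared versus private rare alleles from per-population allele counts. The function name, the input format, the toy counts, and my reading of "rare" as "at most 15 copies in each sample" are all illustrative assumptions, not the authors' pipeline.

```python
def rare_allele_sharing(counts_a, counts_b, max_count=15):
    """Tally shared and private rare variants for two population samples.

    counts_a, counts_b: dicts mapping a variant ID to the number of copies
    of the allele observed in each sample (hypothetical input format).
    A variant is treated as rare if it is seen at least once overall and
    at most max_count times in each sample (an assumption).
    """
    shared = private_a = private_b = 0
    for variant in set(counts_a) | set(counts_b):
        a = counts_a.get(variant, 0)
        b = counts_b.get(variant, 0)
        # Skip variants absent from both samples or too common in either one.
        if a + b == 0 or a > max_count or b > max_count:
            continue
        if a > 0 and b > 0:
            shared += 1
        elif a > 0:
            private_a += 1
        else:
            private_b += 1
    total = shared + private_a + private_b
    private_fraction = (private_a + private_b) / total if total else 0.0
    return {
        "shared": shared,
        "private_a": private_a,
        "private_b": private_b,
        "private_fraction": private_fraction,
    }


# Toy, made-up counts purely for illustration (not data from the paper).
toy_chb = {"v1": 1, "v2": 3, "v3": 0, "v4": 20}
toy_yri = {"v1": 2, "v3": 5, "v4": 1, "v5": 1}
toy_result = rare_allele_sharing(toy_chb, toy_yri)

# The rough figures quoted in the post: ~3,300 rare alleles, ~200 shared.
post_private_fraction = (3300 - 200) / 3300
print(f"{post_private_fraction:.1%}")  # prints 93.9%
```

With the post's rounded numbers the private fraction comes out just under 94%, which is where "well over 90%" comes from.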

An important aspect of this paper is that they synthesized results from “high coverage” and “low coverage” sequencing efforts. The former is highly accurate in terms of the actual state of the genome, but often very targeted and narrow (in this paper they focused on the exons of a set of genes, the regions which actually encode proteins). In contrast, the latter covers wider swaths of the genome (the full genome in this case), but may not be as accurate. One can immediately imagine the problem when one is focusing on low frequency variants: errors in the data as well as limitations in the sample size may result in inflation or omission of alleles. When it comes to high frequency alleles error is of less account because a mistake here and there will not change the qualitative assessment. In any case, by comparing the rare variants found in deeply covered regions of the genome with the presumed underestimates yielded by the more thinly covered projects, the authors generated parameters which allowed them to project the proportion of private alleles as a function of frequency across populations.

To the left you see a set of series on a line chart generated by their method. On the x-axis you have the minor allele frequency (the frequency of the rarer variant at a locus). For the y-axis you have the proportion of alleles shared across the two populations. What is notable to me is how even two closely related populations tend to differ a great deal at very low frequencies! The Chinese data needs a little explanation I think. The Chinese in Denver are almost certainly skewed toward a South Chinese sample. Historically American Chinese were disproportionately Cantonese, while the newer immigration waves tend to be Fujianese, whether directly from Fujian, or ethnic Fujianese from Taiwan (where they are the majority). Though likely cosmopolitan, the Beijing Chinese are obviously going to sample more from the north of the country. This difference shows up on PCA plots, where the Beijing and Denver Chinese samples exhibit the distances from populations to their north and south that you’d expect if the latter were derived from southern Chinese populations.

The fact that very rare alleles are not shared across even closely related populations should not be too surprising when you think about it of course (everything is so obvious in hindsight!). For example, much of southern China was populated by Han ~1,500 years ago, during the first interregnum between Chinese dynasties (a period of disunity of particularly great length, lasting three centuries). During the Song, ~1000 A.D., the Yangtze region and provinces to the south definitively surpassed the Yellow River basin in demographic heft. Without taking into account migration, this gives about 1,000 years on average, or 40 generations (assuming 25 years per generation), for new genetic variants to arise which might be private to the Han of the north and south of China respectively. The same process writ small certainly applies within putative populations, and there are going to be family private alleles. That is, genetic markers of recent origin distinctive to family lineages (more broadly construed we already know this with tandem repeats, but here we’re focusing on single nucleotide polymorphisms, changes on one base pair).
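The back-of-the-envelope arithmetic here can be sketched in a few lines of Python. The 1,000 years and 25 years per generation come from the paragraph above; the mutation rate is the directly measured figure quoted by a commenter below, and the round genome size of three billion sites is my own assumption, so treat the final number as an order-of-magnitude illustration only.

```python
def generations_elapsed(years, years_per_generation=25):
    """Number of generations spanned by a stretch of time."""
    return years / years_per_generation


def new_mutations_per_lineage(generations, rate_per_site, sites):
    """Expected new mutations accumulated along a single line of descent.

    rate_per_site is per site per generation; sites is the number of sites
    in one haploid genome copy. Both values used below are assumptions.
    """
    return generations * rate_per_site * sites


generations = generations_elapsed(1_000)  # 40.0 generations

# Assumed inputs: 1.1e-8 per site per generation, ~3e9 sites per haploid genome.
expected = new_mutations_per_lineage(generations, 1.1e-8, 3.0e9)
print(f"{generations:.0f} generations, ~{expected:.0f} new mutations per lineage")
```

Under these assumptions each line of descent picks up on the order of a thousand new single nucleotide changes over 40 generations, nearly all of which will stay very rare and local.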

Finally, let’s hit their main demographic finding, which received a lot of coverage in The New York Times. They estimated that the last common ancestor of Asians and Africans in their data set was on the order of ~50,000 years before the present. This is absolutely unsurprising. As they note this is entirely consonant with the archeological record. What is fascinating is the confidence: 45 to 69 thousand years over the 95% interval. This immediately seemed congenially narrow to me, and they confirm this by reviewing earlier estimates with noisier data sets which had much larger intervals. Here is the rough demographic model which they inferred from their data:

CEU refers to Utah whites, CHB to Chinese in Beijing, JPT to Japanese, and YRI are Yoruba. You can see that their estimate of the last common ancestor of Europeans and Asians is ~23 thousand years B.P., in line with other calculations, though a touch on the low side for my own taste. The N refers to population sizes, while the nature of the tree illustrates the non-African bottleneck followed by demographic expansion vs. the relatively constant African population size over the past ~100,000 years.

The real good stuff comes in the discussion. Here’s something that jumped out at me: “It should be emphasized that, because we use a single Western African population as our African panel, the divergence described by our model might have occurred earlier than the actual Out-of-Africa event.” Within the discussion it is noted repeatedly that their results are sensitive to a host of conditions. For example, they were limited in the populations they used, and their demographic-historical model was obviously not as complex as it could have been. These results then perhaps should be seen as an important guide, and a pointer to things to come, rather than a substantive marker to lay down and take to heart. Given the populations they had and the data available the method outlined here seems very useful, but there are still limitations imposed by the population set and the nature of the data (which will be obviated in the near future).

Finally, there’s the practical payoff in medical genetics. The New York Times accurately reflected the inference one could make from this: if lots of diseases which are common are due to a host of rare variants, then it is even more important to gain a better understanding of fine-grained human variation. Risk alleles found in one population via genome-wide association have often been found to predict well in other populations, but if these more common variants are part of our common ancestral heritage, then they should be relatively robust to genetic background. Such may not be the case with many rare variants, which reflect the peculiarities of more recent history. If medicine is to be truly personal in the genomic sense, then it seems likely that it will be more context dependent than had been hoped 10 years ago.

So theoretically, if the rarer SNPs were identified by population, we could pare down genetic datasets to exclude more common SNPs. That should cut down the time required for ADMIXTURE runs, or some new software based on this finding. Is there any database of allele frequency by population for SNPs?

http://blogs.discovermagazine.com/gnxp Razib Khan

#1, you want to look for “ancestrally informative markers.” these are at higher frequency than these alleles, but they are good at differentiating between populations (high Fst). you’ll find lists in papers online. google scholar it.
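For anyone wanting to rank markers by how well they separate two populations, here is a minimal Python sketch of a per-SNP Fst calculation. It uses the textbook heterozygosity form, Fst = (H_T - H_S) / H_T, for two equally weighted populations; the function name and example frequencies are hypothetical, and published marker lists may use different estimators.

```python
def fst_two_populations(p1, p2):
    """Wright's Fst at one biallelic SNP for two equally weighted populations.

    p1, p2: frequency of the same allele in each population.
    Returns 0.0 when the site is monomorphic across both populations.
    """
    # Mean expected heterozygosity within populations.
    h_within = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2
    # Expected heterozygosity in the pooled population.
    p_bar = (p1 + p2) / 2
    h_total = 2 * p_bar * (1 - p_bar)
    if h_total == 0:
        return 0.0
    return (h_total - h_within) / h_total


# Made-up frequencies: a strongly differentiated marker and an uninformative one.
informative = fst_two_populations(0.1, 0.9)
uninformative = fst_two_populations(0.5, 0.5)
print(round(informative, 2), round(uninformative, 2))  # prints 0.64 0.0
```

A marker at similar frequency everywhere scores near zero, while one that is common in one population and rare in the other scores high, which is what makes it useful for ancestry inference.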

well, if u like alfred remember the HGDP browser and the HapMap browser.

gcochran

The estimate of the mutation rate they used was 2.36 × 10⁻⁸ per generation. Recent direct measurements from family triads show about 1.1 × 10⁻⁸ per generation. I believe that fresher number, if correct, would materially change their estimates of population split dates.
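A rough way to see the size of the effect: when split dates are calibrated by the mutation rate, the inferred time scales, to a first approximation, inversely with the rate assumed. The Python sketch below applies that simple rescaling to the dates quoted in the post. The inverse-scaling shortcut and the function name are assumptions for illustration; refitting the paper's full demographic model could give different numbers.

```python
def rescale_time(time_years, assumed_rate, new_rate):
    """Rescale a mutation-calibrated date to a different mutation rate.

    Assumes the inferred time is inversely proportional to the mutation
    rate, which is a first-order simplification.
    """
    return time_years * assumed_rate / new_rate


PAPER_RATE = 2.36e-8   # per generation, as quoted in the comment above
DIRECT_RATE = 1.1e-8   # per generation, from family-based measurements

# Dates quoted in the post, in years before present.
african_asian_split = rescale_time(50_000, PAPER_RATE, DIRECT_RATE)
interval = (
    rescale_time(45_000, PAPER_RATE, DIRECT_RATE),
    rescale_time(69_000, PAPER_RATE, DIRECT_RATE),
)
european_asian_split = rescale_time(23_000, PAPER_RATE, DIRECT_RATE)

print(f"African/Asian split: ~{african_asian_split:,.0f} years")
print(f"95% interval: ~{interval[0]:,.0f} to ~{interval[1]:,.0f} years")
print(f"European/Asian split: ~{european_asian_split:,.0f} years")
```

Under this simplification every date is multiplied by about 2.1, so a ~50,000 year split would move to a bit over 100,000 years.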

http://johnhawks.net/weblog John Hawks

Passages in the discussion may be there to satisfy reviewers with different results.

http://washparkprophet.blogspot.com ohwilleke

“They estimated that the last common ancestor of Asians and Africans in their data set was on the order of ~50,000 years before the present. This is absolutely unsurprising. As they note this is entirely consonant with the archeological record. What is fascinating is the confidence: 45 to 69 thousand years over the 95% interval. This immediately seemed congenially narrow to me, and they confirm this by reviewing earlier estimates with noisier data sets which had much larger intervals.”

It seems as if the inclusion or exclusion of outlier populations, like Tibetans, Ainu, aboriginal Australians, Papuans, Khoisan, Ket, etc. could easily skew this a great deal, particularly if the people mostly descended from later waves greatly outnumber the people from early waves.

I could easily see how excluding critical parts of the global population from the sample, even if those populations account for only 1%-2% of the total, could shift the age range by many tens of thousands of years.

http://wiringthebrain.blogspot.com kjmtchl

The major conclusion from this paper prompts a big “Duh!” in response. Of course rarer variants are population-specific. They are rare because they are recent, and/or being selected against. Most will be neutral but of those that have any phenotypic effect, vastly more will be deleterious than advantageous. Rare variants are therefore far more likely to contribute to disease. This means that genome-wide association studies combining diverse populations and looking only at globally common variants are not sampling the variants of highest interest (explaining why they have not found much, at least for disorders with a strong effect on fitness, where predisposing variants are highly unlikely to ever become common). In contrast, some genome-wide association studies have been performed by segregating samples into population-specific bins and then summing hits anywhere in a gene across populations to give a gene-wide association value (allowing for different variants in different populations to be playing a part). As any rare variant will necessarily arise on the background of some common haplotype, analysing the common SNPs can detect a signal due to a linked rare variant. It may thus be possible to go back to GWAS datasets with this approach and re-mine them for interesting hits. On the other hand, it is becoming easier just to do whole-exome or whole-genome sequencing, which has the advantage of revealing all the variation and possibly pinpointing the real pathogenic mutations.


Gene Expression

This blog is about evolution, genetics, genomics and their interstices. Please beware that comments are aggressively moderated. Uncivil or churlish comments will likely get you banned immediately, so make any contribution count!

About Razib Khan

I have degrees in biology and biochemistry, a passion for genetics, history, and philosophy, and shrimp is my favorite food. In relation to nationality I'm an American Northwesterner, in politics I'm a reactionary, and as for religion I have none (I'm an atheist). If you want to know more, see the links at http://www.razib.com