This article has a correction. Please see:

Pinpointing Genetic Selection

The human genome contains hundreds of regions with evidence of recent positive natural selection, yet, for all but a handful of cases, the underlying advantageous mutation remains unknown. Current methods to detect the signal of selection often result in the identification of a broad genomic region containing many candidate regions that vary among individuals. By combining existing statistical methods, Grossman et al. (p. 883, published online 7 January) developed a method, termed Composite of Multiple Signals, which can increase the ability to pinpoint the specific variant under selection. Several candidate regions under selection in human populations were identified.

Abstract

The human genome contains hundreds of regions whose patterns of genetic variation indicate recent positive natural selection, yet for most the underlying gene and the advantageous mutation remain unknown. We developed a method, composite of multiple signals (CMS), that combines tests for multiple signals of selection and increases resolution by up to 100-fold. By applying CMS to candidate regions from the International Haplotype Map, we localized population-specific selective signals to 55 kilobases (median), identifying known and novel causal variants. CMS can not only identify individual loci but also implicate the precise variants selected by evolution.

Numerous methods have been developed to exploit signatures left by positive natural selection to identify genomic regions in the human genome harboring recent local adaptations, presumably to such pressures as infectious disease, changes in diet, and new environments (1, 2). Hundreds of such regions have been identified, but they are typically large (hundreds of kilobases to megabases) and contain many genes and thousands of polymorphisms. In only a handful has there been much progress in identifying the causal mutations and extracting biological insights about their function. More powerful methods are needed to pinpoint the exact mutations driving evolution, especially as increasingly powerful sequencing technologies make it possible to sequence the genomes of humans and many other species.

Initial surveys of selective events have relied on three patterns of variation caused by a new beneficial mutation rising quickly in prevalence in a population: (i) Long haplotypes: An allele under positive selection increases in frequency so rapidly that long-range associations with neighboring polymorphisms (the “long-range haplotype”) are not disrupted by recombination. (ii) High-frequency derived alleles: A new (nonancestral, or derived) allele rises to a frequency higher than expected under genetic drift, carrying neighboring derived alleles with it. (iii) Highly differentiated alleles: Positive selection in one geographic region causes larger frequency differences between populations than for neutrally evolving alleles. In humans, these three signals are detectable back to between 30,000 and 80,000 years ago (2).

If each signature provides distinct information about selective sweeps, combining the signals should have greater power for localizing the source of selection than any single test. As inputs to a composite statistic we chose two established metrics for haplotype length (iHS and XP-EHH) (3, 4) and one for population differentiation (FST) (5). We also developed and incorporated two additional tests. ΔDAF tests for derived alleles that are at high frequency relative to other populations; it is more sensitive for distinguishing selected alleles than the simple derived allele frequency (DAF, fig. S1). ΔiHH measures the absolute rather than the relative length of haplotypes and is particularly sensitive for identifying lower-frequency selected alleles.
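The two frequency-based inputs can be illustrated with a short Python sketch. It assumes that ΔDAF is the derived allele frequency in the focal population minus the mean derived allele frequency in the comparison populations, and it uses a simple heterozygosity-based F_ST for a single biallelic site in two equally weighted populations. The function names are hypothetical, and the F_ST estimator shown is a textbook form, not necessarily the one used in (5).

```python
from statistics import mean


def delta_daf(daf_focal, daf_others):
    """Assumed definition: DAF in the focal population minus the
    mean DAF across the comparison populations."""
    return daf_focal - mean(daf_others)


def fst_two_pop(p1, p2):
    """Heterozygosity-based F_ST at one biallelic site for two equally
    weighted populations: (H_T - H_S) / H_T. Illustrative only."""
    p_bar = (p1 + p2) / 2
    h_total = 2 * p_bar * (1 - p_bar)          # expected heterozygosity, pooled
    if h_total == 0:
        return 0.0                              # site is fixed everywhere
    h_within = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2
    return (h_total - h_within) / h_total


if __name__ == "__main__":
    # Hypothetical frequencies: derived allele common in one population only.
    print(delta_daf(0.9, [0.1, 0.2]))
    print(fst_two_pop(0.9, 0.1))
```

A derived allele that is common in one population and rare elsewhere scores high on both measures, which is the pattern expected around a population-specific sweep.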

To characterize each test’s ability to localize signals of recent local adaptation spatially and to distinguish causal variants from nearby neutral markers, we simulated neutrally evolving regions and regions containing a positively selected allele by standard coalescent approaches (6). We tested a range of demographic models, including a standard neutral model; a calibrated model of European, East Asian, and West African populations; and several more extreme models. Regions under selection were modeled as containing a single, centrally located selected variant that appeared within the last 5000 to 30,000 years, was subject to a specified intensity of selection, and rose to present-day frequencies ranging from 20 to 100% (table S1).

For each model set we generated 1500 replicates, each consisting of 1 Mb of simulated sequence data (~10,000 polymorphisms) for 120 chromosomes from each population. In addition, we generated a data set that matched the frequency distribution and density of Phase II of the International Haplotype Map Project (HapMapII) (7).

Under all scenarios, each of the five statistics had distinguishable distributions for causal and for neutral variants (including neutral variants in selected regions). The FST and XP-EHH signals peaked more narrowly around the causal variant, making them useful for spatial localization, but poorly distinguished the precise causal variant (Fig. 1 and fig. S2). In contrast, iHS, ΔiHH, and ΔDAF contributed little to spatial resolution, but better distinguished causal variants. The five tests were nearly uncorrelated in neutral regions, and only weakly correlated for neutral variants within selected regions (fig. S3). In the latter case, correlation was appreciable only immediately around the causal variant.

As each of the five tests had power to distinguish selected from nonselected variants and the tests were only weakly correlated for neutral variants, we combined them in a composite likelihood statistic, termed the composite of multiple signals (CMS). For each test i, we estimated from simulation the probability P of a score s_i if the variant is selected and if it is unselected. Assuming a uniform prior probability of selection π, the CMS score is the approximate posterior probability that the variant is selected:

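\[
\mathrm{CMS} = \prod_{i} \frac{P(s_i \mid \text{selected})\,\pi}{P(s_i \mid \text{selected})\,\pi + P(s_i \mid \text{unselected})\,(1-\pi)} \tag{1}
\]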

We calculate the CMS score and significance (on the basis of the genome-wide distribution of scores) for every variant. To localize a signal, the distribution of CMS scores across the entire region is used to estimate a posterior probability curve for the position of the causal variant and determine 90% credible intervals [supporting online material (SOM)].
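As a minimal sketch of the per-variant score, the description above can be read as a product, over tests, of each test's posterior probability of selection under the prior π. The Python below follows that reading. The function name and the example probabilities are hypothetical, and in practice the two likelihoods for each test would come from the simulated score distributions rather than being supplied by hand.

```python
from math import prod


def cms_score(p_selected, p_unselected, prior):
    """Product over tests of the per-test posterior probability of selection.

    p_selected[i]   : P(s_i | selected), estimated from simulations
    p_unselected[i] : P(s_i | unselected), estimated from simulations
    prior           : uniform prior probability of selection (pi)
    """
    if len(p_selected) != len(p_unselected):
        raise ValueError("need one pair of likelihoods per test")
    factors = []
    for ps, pu in zip(p_selected, p_unselected):
        numerator = ps * prior
        denominator = numerator + pu * (1 - prior)
        if denominator == 0:
            raise ValueError("score has zero probability under both models")
        factors.append(numerator / denominator)
    return prod(factors)


if __name__ == "__main__":
    # Hypothetical likelihoods for five tests at a single variant.
    sel = [0.30, 0.25, 0.40, 0.20, 0.35]
    unsel = [0.02, 0.05, 0.03, 0.10, 0.04]
    print(cms_score(sel, unsel, prior=0.001))
```

Because the factors are multiplied as if the tests were independent, the result is only an approximate posterior, which is consistent with the observation that the five tests are weakly correlated for neutral variants.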

In simulations, CMS showed power both to localize the selection signal spatially and distinguish the causal variant (Fig. 1, K and L). Whereas single tests provided weak localization (~1 Mb), CMS localized the signal to an average 89 kb (for full sequence data) and contained the causal variant in 90% of cases. With sparser genotype data (corresponding to HapMapII), CMS localized to 104 kb, even when the causal variant was absent from the data set. CMS also showed greater specificity for the causal variant. At score thresholds giving 90% power to detect the true causal variant, the individual tests identified ~500 to 1500 candidate causal variants per region, whereas CMS narrowed the signal to ~100 (table S2). The causal variant was among the top 20 variants in half of cases and was the highest-scoring variant in a quarter of cases, with high power given that we included sweeps to frequencies as low as 20%. The power for sweeps where the causal allele is at high frequency (>50%) is even greater, with the causal variant among the top 10 variants in half of cases (table S3).

The CMS results were robust under all demographic scenarios tested (constant population size and bottlenecks of varying strengths), even though the test was optimized for a single model (6) (fig. S4). The most extreme bottleneck scenarios did increase the number of high-scoring variants in neutral regions, but the false-positive rate remained below 0.01% in all cases (SOM) (8). These false positives occurred as isolated points, easily distinguishable from the clearly defined peaks found in selected regions (table S4).

We then applied CMS to empirical human data for 185 candidate regions identified as under recent positive selection in HapMapII data. The data set includes 3.1 million variants genotyped in three populations: Northern Europeans, West Africans (Yoruba from Nigeria), and East Asians (Chinese and Japanese) (7).

As positive controls, we examined several well-characterized regions under positive selection (Figs. 2 and 3). In three regions (containing, respectively, SLC24A5, LCT, and EDAR), a putative causative variant has been previously identified and genotyped in HapMapII (2, 3). In each region, the variant was within the top 10 CMS scores, out of 1000 to 1500 variants in the region. We also examined four regions (350 kb to 1 Mb) containing pigmentation-related genes (MATP, TYRP1, OCA2 and HERC2, and KITLG) that are suggested targets of recent selection, but where no candidate variant has been proposed (1, 9, 10). CMS improved the spatial resolution by 3- to 80-fold, and in each case, the narrowed region contains a single pigmentation-related gene. In each case, a strong CMS signal is found at a variant known to be associated in the human population with eye color or skin pigmentation (9).

Localizing selection at MATP. Scores of six individual tests (A to F) and CMS (G) for a region containing MATP. A nonsynonymous SNP [rs16891982, F374L (Phe374→Leu), red dotted line] associated with pigmentation is believed to be the mutation under selection.

We then examined the remaining 178 candidate HapMapII regions, containing ~1500 genes, for which the selected locus and variant are unknown. After application of CMS, 64 regions contained a single gene, 35 contained multiple genes, and 79 contained no genes at all. CMS suggested numerous intriguing coding and regulatory functional candidates (figs. S5 and S6 and table S5).

Many regions include striking amino acid changes (table S6). For example, CMS localized a region on chromosome 10 with evidence for selection in East Asians to the protocadherin gene PCDH15. The third-highest-ranking variant is an acidic-to-nonpolar (Asp435→Ala) mutation altering a highly conserved residue predicted to lie in the Ca2+-binding site at the interface of cadherin repeats in the protein’s extracellular domain (SOM) (Fig. 4A and figs. S7 and S8) (11). PCDH15 plays a role in development of inner-ear hair cells and maintaining retinal photoreceptors (12, 13). Another signal in East Asians localized to the leptin receptor, LEPR. The highest-scoring variant is a Lys109→Arg change in LEPR associated with blood pressure, glucose response, and body mass index (14).

Many signals, however, are localized to intergenic regions or regulatory changes in gene regions, suggesting that selected variants may lie in regulatory elements (which also harbor many variants affected in complex diseases). For example, a signal of selection in West Africans localized to a single gene, PAWR. Several high-scoring variants show strong association with PAWR expression uniquely in West Africans, and with no other genes in the region (fig. S9). Another signal in West Africans localized to a 22-kb region containing two genes, USF1 and ARHGAP30. Several high-scoring single-nucleotide polymorphisms (SNPs) in USF1 show strong association with USF1 expression uniquely in West Africans. One variant lies within an experimentally determined transcription factor binding site (15).

Beyond identifying individual gene and polymorphism targets, by reducing the number of genes within each region from about eight to about one, the method reveals instances of multiple genes in the same pathway showing signs of selection. For example, in addition to PCDH15, four genes linked to cochlear function or Usher syndrome (1, 16) show evidence for selection in East Asia. We used the PANTHER Gene Ontology database to test for this enrichment on all CMS-localized regions from HapMapII (SOM) (17). We found statistically significant enrichment for several categories (table S7): sensory perception genes (including PCDH15) are enriched for selection in East Asia, immune-related genes in West Africa, and genes related to homeostasis and metabolism in all three populations.
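The actual enrichment procedure is described in the SOM. As a hedged illustration of what category enrichment means, the sketch below computes a one-sided binomial tail probability: the chance of seeing at least k genes from a category among n localized genes if each gene fell in that category with a background probability p. The function name and all counts are hypothetical and are not taken from table S7.

```python
from math import comb


def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p): probability of observing at least
    k category genes among n genes under background fraction p."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1))


if __name__ == "__main__":
    # Hypothetical counts: 6 of 40 localized genes fall in a category
    # that makes up 3% of the annotated background.
    print(binomial_upper_tail(6, 40, 0.03))
```

A small tail probability suggests that the category is overrepresented among the localized genes relative to the background, subject to correction for the number of categories tested.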

CMS can narrow candidate regions for recent local adaptation in humans and identify small numbers of candidate polymorphisms. For this kind of event, we may already be close to the limit on localization from population signals alone. According to our simulations, each causal variant has on average 20 perfect proxies (fig. S10), all essentially indistinguishable from the causal variant. Identifying specific causal variants may thus require functional characterization of small sets of candidates.

The CMS method can be adapted to a wider range of selective regimes, including detecting (i) older selection occurring any time after the divergence of human populations (50,000 to 75,000 years) (FST and ΔDAF would here become the predominant CMS signals) and (ii) selection on standing variation or very old selection (by incorporating additional population-based tests). It can be applied to nonhuman species with population samples of dense genotype or sequence data, as these increasingly become available; the details of the appropriate CMS test would depend on the demographic history and population structure of the species.

Within human genetics, the research community is currently generating data sets of human variation in many populations, through initiatives such as the 1000 Genomes Project (18). With continuing improvements in sequencing technology, it will be possible to examine nearly every variant in the genome in many individuals and populations. With such data emerging for humans and other species, it may be possible to observe much of evolution’s most recent handiwork and identify many of the functional adaptations that work to shape species.

PCDH15 is expressed in the neurosensory epithelium of the eye and ear and mutant alleles are responsible for both USH1F and DFNB23. Hum. Mol. Genet. 12, 3215 (2003). doi:10.1093/hmg/ddg358 pmid:14570705