Abstract

The study of recent natural selection in human populations has important applications to human history and medicine. Positive natural selection drives the increase in frequency of beneficial alleles and plays a role in explaining diversity across human populations. By discovering traits subject to positive selection, we can better understand the population-level response to environmental pressures, including infectious disease. Our study examines unusual population differentiation among three large data sets to detect natural selection. The populations examined, African Americans, Nigerians, and Gambians, are genetically close to one another (Fst < 0.01 for all pairs), allowing us to detect selection even with moderate changes in allele frequency. We also develop a tree-based method to pinpoint the population in which selection occurred, incorporating information across populations. Our genome-wide significant results corroborate loci previously reported to be under selection in Africans, including HBB and CD36. At the HLA locus on chromosome 6, results suggest the existence of multiple, independent targets of population-specific selective pressure. In addition, we report a genome-wide significant (p = 1.36 × 10^-11) signal of selection in the prostate stem cell antigen (PSCA) gene. The most significantly differentiated marker in our analysis, rs2920283, is highly differentiated in both Africa and East Asia and has prior genome-wide significant associations with bladder and gastric cancers.
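As a rough illustration of why low background differentiation helps, the sketch below computes a one-degree-of-freedom chi-square statistic for the allele frequency difference between two populations. It scales the squared difference by the variance expected from drift (taken here as 2 × Fst × p(1 - p)) plus sampling noise. This form of the statistic, the function names, and the example frequencies and sample sizes are assumptions made for illustration; they are not the study's exact procedure or data.

```python
import math


def chi2_1df_pvalue(x):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    # For 1 d.o.f., P(X > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(x / 2.0))


def differentiation_statistic(p1, n1, p2, n2, fst):
    """Squared frequency difference scaled by its assumed neutral variance.

    p1, p2 : sample allele frequencies in the two populations
    n1, n2 : number of sampled chromosomes in each population
    fst    : genome-wide Fst between the two populations
    """
    p_bar = (p1 * n1 + p2 * n2) / (n1 + n2)
    # Assumption: neutral drift contributes variance 2 * Fst * p(1 - p).
    drift_var = 2.0 * fst * p_bar * (1.0 - p_bar)
    # Binomial sampling noise in the two frequency estimates.
    sampling_var = p_bar * (1.0 - p_bar) * (1.0 / n1 + 1.0 / n2)
    return (p1 - p2) ** 2 / (drift_var + sampling_var)


# Hypothetical SNP: the same 10-point frequency difference, evaluated
# against a close pair (Fst = 0.005) and a distant pair (Fst = 0.15).
stat_close = differentiation_statistic(0.30, 4000, 0.40, 4000, fst=0.005)
stat_far = differentiation_statistic(0.30, 4000, 0.40, 4000, fst=0.15)
p_close = chi2_1df_pvalue(stat_close)
p_far = chi2_1df_pvalue(stat_far)
```

Under these assumptions, the same frequency difference yields a much larger statistic for the closely related pair than for the distant pair, which is the intuition behind comparing populations with Fst below 0.01.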

PCA Analysis of Population Structure

This analysis of population structure in our main data sets shows Europeans and Nigerians forming separate, tight clusters. African Americans form a cline between the Nigerian and European clusters; this cline is indicative of varying degrees of European ancestry. The Gambian samples are separated from the Nigerians on PC2, form separate but overlapping clusters, and show evidence of European-like admixture within the Fula subpopulation.

Tree Estimates From Sample Data

(A) This tree was estimated using unrelated individuals from the YRI, CEU, and CHB populations sampled as part of the International HapMap Project Phase III. The branch lengths show strong concordance with estimated pairwise values for Fst.

(B) This tree was estimated using our main data sets of African-American, Nigerian, and Gambian samples after accounting for significant European-like admixture in the African-American and Gambian data sets. We note that the second tree is scaled approximately by a factor of 100 with respect to the first. The values quoted are based on genome-wide average estimates of Fst.
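The relationship between pairwise distances and branch lengths in a three-population tree can be sketched as follows. On an unrooted tree with three leaves, each pairwise distance is the sum of two branch lengths, so the three branches can be recovered by simple arithmetic. The function name and all numeric values below are hypothetical, and applying the same decomposition to per-SNP squared frequency differences is an illustrative assumption (similar in spirit to branch-specific statistics), not necessarily the tree-based method developed in the study.

```python
def branch_lengths(d_ab, d_ac, d_bc):
    """Solve d_ab = a + b, d_ac = a + c, d_bc = b + c for (a, b, c)."""
    a = (d_ab + d_ac - d_bc) / 2.0
    b = (d_ab + d_bc - d_ac) / 2.0
    c = (d_ac + d_bc - d_ab) / 2.0
    return a, b, c


# Hypothetical genome-wide pairwise distances (not values from the study).
genome_a, genome_b, genome_c = branch_lengths(0.003, 0.005, 0.006)

# Hypothetical SNP whose frequency shifted only in population A.
freq_a, freq_b, freq_c = 0.45, 0.30, 0.30
snp_a, snp_b, snp_c = branch_lengths(
    (freq_a - freq_b) ** 2,
    (freq_a - freq_c) ** 2,
    (freq_b - freq_c) ** 2,
)
# Comparing each per-SNP branch with its genome-wide counterpart would
# indicate which population carries the excess differentiation; here
# the entire signal falls on branch A.
```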

Q-Q Plots of Population Differentiation in Africans

(A) We compare the actual and expected distribution of selection statistics. The red line represents expectation under neutrality. A fat tail of highly differentiated markers is evident, consistent with multiple selective events.

(B) We repeated the analysis after removing the 5 Mb regions containing each of our most significant SNPs and still observe a fat tail of highly differentiated markers.
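For readers unfamiliar with how such a plot is built, here is a minimal sketch of the coordinates behind a Q-Q plot of p-values. Each observed p-value is ranked and paired with the quantile expected under a uniform (neutral) distribution, both on a -log10 scale. The function name, the plotting-position convention (rank - 0.5)/n, and the toy p-values are assumptions for illustration only.

```python
import math


def qq_points(pvalues):
    """Return (expected, observed) -log10 p-value pairs, smallest p first.

    All p-values must be strictly positive.
    """
    observed = sorted(pvalues)
    n = len(observed)
    points = []
    for rank, p in enumerate(observed, start=1):
        # Expected quantile of the rank-th smallest of n uniform p-values.
        expected = (rank - 0.5) / n
        points.append((-math.log10(expected), -math.log10(p)))
    return points


# Toy data: 100 perfectly uniform p-values lie on the diagonal.
null_pvalues = [(i - 0.5) / 100 for i in range(1, 101)]
null_points = qq_points(null_pvalues)

# Replacing the smallest p-value with an extreme one lifts that point
# above the diagonal, which is what a fat tail looks like on the plot.
tail_points = qq_points([1e-11] + null_pvalues[1:])
```

Points lying on the diagonal match the neutral expectation; points rising above it at the right-hand end correspond to markers that are more differentiated than expected.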

Genome-Wide Population Differentiation in Africans

All values are reported after correcting for variation in Fst according to the amount of background selection. We note genome-wide significant peaks in the HLA locus on chromosome 6 and CD36 on chromosome 7. HLA has a major role in immunity with multiple prior disease associations, and CD36 is known for its role in malaria resistance. We also observe a highly suggestive peak at PSCA (chromosome 8) tightly linked to a protein-altering variant with prior associations with gastric and bladder cancers. The highly suggestive signal at HBB is unsurprising given its role in malaria resistance. HLA, HBB, and CD36 have been previously reported as targets of selection.

Distribution of Allele Frequencies at PSCA

The allele frequencies of the most differentiated SNP at PSCA are plotted in 52 distinct ethnic groups genotyped as part of the Human Genome Diversity Project. We note the high degree of differentiation in East Asia, Africa, and South America (inset, upper right). Although the small sample sizes of these populations hinder analysis of selection, studying the selective pressures in each of them might elucidate the cause of the large allele frequency differences at PSCA.