Abstract:
It has hitherto been difficult to obtain genome-wide data from the Near East. By targeting the inner ear region of the petrous bone for extraction [Pinhasi et al., PLoS One 2015] and using a genome-wide capture technology [Haak et al., Nature, 2015] we achieved unprecedented success in obtaining genome-wide data on more than 1.2 million single nucleotide polymorphism targets from 34 Neolithic individuals from Northwestern Anatolia (~6,300 years BCE), including 18 at greater than 1× coverage. Our analysis reveals a homogeneous population that is genetically a plausible source for the first farmers of Europe in the sense of (i) having a high frequency of Y-chromosome haplogroup G2a, and (ii) low Fst distances from early farmers of Germany (0.004 ± 0.0004) and Spain (0.014 ± 0.0009). Model-free principal components and model-based admixture analyses confirm a strong genetic relationship between Anatolian and European farmers. We model early European farmers as mixtures of Neolithic Anatolians and Mesolithic European hunter-gatherers, revealing very limited admixture with indigenous hunter-gatherers during the initial spread of Neolithic farmers into Europe. Our results therefore provide overwhelming support for the migration of Near Eastern/Anatolian farmers into southeast and Central Europe around 7,000-6,500 BCE [Ammerman & Cavalli-Sforza, 1984; Pinhasi et al., PLoS Biology, 2005]. Our results also show differences between early Anatolians and all present-day populations from the Near East, Anatolia, and Caucasus, showing that the early Anatolian farmers, just as their European relatives, were later demographically replaced to a substantial degree.

Ancient European haplotype enrichment in modern Eurasian populations.

Institutes
1) Graduate Program in Molecular Medicine, University of Maryland School of Medicine, Baltimore, MD; 2) Institute for Genome Sciences, Program in Personalized and Genomic Medicine, Department of Medicine, University of Maryland School of Medicine, Baltimore, MD.

Abstract:
The diversification of modern European populations is a fascinating puzzle that has recently advanced due to the sequencing of ancient European genomes. We analyzed 732 modern West Eurasian individuals using three ancient samples from the Lazaridis et al. Human Origins Array dataset. Specifically, we determined ancient European haplotype enrichment by calculating pairwise differences (PWD) between each ancient European individual and modern Western Eurasian individuals in 50-SNP blocks. Across all population groups, modern Western Eurasians had the fewest PWD with the farming Stuttgart individual and the most PWD with the Loschbour and Motala12 hunter-gatherer individuals, confirming the observation of Lazaridis et al. that modern Europeans are more closely related to ancient individuals from a farming community. We selected SNP blocks for gene ontology enrichment analysis with GOrilla based on 1) the 10% of regions with the greatest differences in PWD between groups, and 2) the 10% of those regions from the first criterion that most closely correlated with the geography of those groups. Most SNP blocks positively correlated with PC1 (latitude) and PC2 (longitude); we therefore focused on outliers that negatively correlated with biogeography. For SNP blocks that negatively correlated with PC1, “regulation of chondrocyte development”, “androsterone dehydrogenase activity”, and “antigen processing and presentation of endogenous peptide antigen” had the highest enrichment scores in the comparisons with the Stuttgart, Loschbour, and Motala12 individuals, respectively. Interestingly, the “alpha-beta T cell receptor complex” and “interleukin-17 receptor activity” (including CD3D,E,G and IL17RC,E) were enriched in the Loschbour and Motala12 comparisons of SNP blocks that were positively correlated with PC2.
In addition, the Stuttgart individual had the lowest PWD disparity across all modern populations for the SNP blocks that contain the IL17R and CD3 genes, which potentially indicates selection acting on these immune system haplotypes from the Stuttgart individual, consistent with the continual close interaction of the Stuttgart farmer and modern Europeans with animals and their exposure to zoonotic disease. In conclusion, our approach of calculating PWD in small SNP blocks supported prior conclusions made by Lazaridis et al. and illuminated small genomic haplotypes that are of importance to the evolution of modern West Eurasian populations.
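The block-wise PWD calculation described above can be illustrated with a minimal sketch. It assumes genotypes are coded as alternate-allele counts (0, 1, 2) with `None` for missing calls, and it reports the mean per-site difference in each block; the coding, the function name `block_pwd`, and the per-site averaging are illustrative assumptions, not details given in the abstract.

```python
from typing import List, Optional


def block_pwd(ancient: List[Optional[int]],
              modern: List[Optional[int]],
              block_size: int = 50) -> List[float]:
    """Mean per-site genotype difference in consecutive SNP blocks.

    Genotypes are alternate-allele counts (0/1/2); None marks a missing call.
    Sites missing in either individual are skipped (an assumption).
    """
    if len(ancient) != len(modern):
        raise ValueError("genotype vectors must have the same length")
    result = []
    for start in range(0, len(ancient), block_size):
        pairs = zip(ancient[start:start + block_size],
                    modern[start:start + block_size])
        diffs = [abs(a - m) for a, m in pairs
                 if a is not None and m is not None]
        # A block with no jointly called sites has no defined PWD.
        result.append(sum(diffs) / len(diffs) if diffs else float("nan"))
    return result
```

Comparing one ancient individual against every modern individual then amounts to calling `block_pwd` once per modern sample and summarizing each block across a population group.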

Clarifying the disputed role of FOXP2 in modern human origins.

Abstract:
Identified for its pivotal role in the development of spoken language, the FOXP2 gene is also known for its controversial role in human evolution. Early genetic work identified a selective sweep for two derived amino acid substitutions in FOXP2 during recent human evolution (within the past 200,000 years), supported in large part by detection of an extremely low Tajima’s D value at the gene. When the genomes of other ancient hominids were found to contain the same fixed genetic variants, however, the conflicting timelines between the signals of selection obtained from the molecular sequence of the gene as compared to divergence time estimates between humans and other ancient hominid species were irreconcilable. Selection for these two amino acids thus appears not to be human-specific, yet many papers continue to work from a hypothesis of positive selection of FOXP2 in humans. Here, we comprehensively re-analyze FOXP2 with next-generation genomic datasets comprising hundreds of individuals and thousands of SNPs. Specifically, we test for fine-scale molecular patterns in the gene and between various human populations in order to resolve estimates of selection. We are unable to replicate the original negative D signal in the expanded human genomic datasets, despite having many more variants, more diverse individuals, and greater statistical power. We can, however, mimic the negative D result when running calculations on a subset of the HGDP genomic dataset with a sample of human populations comparable to the original work; i.e. one-third Africans and two-thirds individuals who underwent the Out-of-Africa expansion. The D signal thus appears to have been due to the pooling of Africans and non-Africans together for analyses, which increases the number of segregating sites relative to pairwise genetic differences. Such a result seems to have been an unintended consequence of a small sampling strategy.
We apply additional selective sweep statistics and haplotype analysis to this locus to evaluate evidence for selection over the past 200,000 years, finding indications of balancing selection in Africans but not non-Africans. FOXP2 does not appear to have undergone a recent selective sweep, as had been previously proposed.
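For readers unfamiliar with the statistic at the center of this abstract, the following is a minimal sketch of Tajima's D computed from a set of aligned haplotypes, using the standard formula (Tajima 1989). The input format (equal-length strings, one per chromosome) and the function name are assumptions for illustration; this is not the authors' pipeline. D becomes negative when the number of segregating sites is large relative to the mean pairwise difference, which is the relationship the abstract invokes.

```python
from itertools import combinations
from math import sqrt


def tajimas_d(haplotypes):
    """Tajima's D for equal-length haplotype strings; None if no site varies."""
    n = len(haplotypes)
    if n < 4:
        raise ValueError("need at least 4 sequences for a usable variance")
    length = len(haplotypes[0])
    # S: number of segregating (polymorphic) sites.
    seg_sites = sum(1 for i in range(length)
                    if len({h[i] for h in haplotypes}) > 1)
    if seg_sites == 0:
        return None
    # pi: mean number of pairwise differences.
    pairs = list(combinations(haplotypes, 2))
    pi = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs) / len(pairs)
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    return (pi - seg_sites / a1) / sqrt(variance)
```

A sample in which every variant is a singleton yields a negative D, whereas variants at intermediate frequency yield a positive D.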

Haplogroup C Phylogeny for Altaian Populations and its Implications for the Peopling of Siberia and the Americas.

Abstract:
Characterization of mitochondrial DNA at a genomic level is very important since it provides opportunities for more accurately estimating the timing and directionality of prehistoric human migrations from a maternal perspective. The Altai Mountains are located at the geographic center of the Eurasian landmass, and have been a hotspot of human activities since ancient times due to their geographic location and rich natural resources. Aiming to contribute to a better understanding of the prehistoric human expansions in Siberia and subsequent colonization of the Americas, we sequenced and characterized eighteen whole mtDNA genomes belonging to haplogroup C from Altaian populations. The sequenced Altaian mtDNAs represent all four subgroups of haplogroup C (C1, C4, C5, and C7), and two of them belong to C1a, the Asian sister branch of Native American C1. The Altaian whole mitochondrial sequences were analyzed together with 313 previously published haplogroup C sequences from different parts of the world. The analyses of whole mitochondrial genomes reveal that haplogroup C lineages in Siberia are distributed without any specific association with geography or language, and suggest northeastern Siberia as a place of origin for haplogroup C and its subbranches C1, C4, C5, and C7. The analyses also indicate that Native American haplogroup C types are distantly related to their Siberian sister branches. Given the distribution pattern of haplogroup C in Eurasia, the timing of expansions could be inferred from the age estimates of the lineages within haplogroup C.
Age estimation of haplogroup C sequences in our data set via ρ statistics shows that haplogroup C has a TMRCA of 31.25 kyr (24.13-38.56), and its subbranches C1, C4, C5, and C7 have TMRCAs of 21.64 kyr (16.83-26.55), 24.88 kyr (16.65-33.41), 19.76 kyr (13.63-26.08), and 27.2 kyr (16.69-38.17), respectively. Still, it is almost impossible to pinpoint the geographic origin of Native Americans and the directionality of prehistoric migrations in Siberia with certainty. Based on the results of the current study, the Amur region in northeastern Siberia could be the geographic origin for ancestral Native Americans. In order to obtain a clearer picture of human population movements in Siberia and the Americas from a maternal perspective, more mitochondrial genomes need to be sequenced, especially mitochondrial genomes belonging to the relatively diverse haplogroups C and D.
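As a rough illustration of how a ρ-based age is obtained, the sketch below computes ρ as the mean number of mutations separating each sampled sequence from the inferred root haplotype and converts it to years with a clock rate. The rate used in the example (3,000 years per mutation) is a placeholder chosen only for the arithmetic, not the calibration used in the study, and the confidence intervals quoted above require tree-aware variance formulas that are not reproduced here.

```python
def rho_statistic(mutations_from_root):
    """Mean number of mutations between each sample and the root haplotype."""
    if not mutations_from_root:
        raise ValueError("need at least one sampled sequence")
    return sum(mutations_from_root) / len(mutations_from_root)


def rho_age_years(mutations_from_root, years_per_mutation):
    """Convert rho to a TMRCA estimate; the clock rate is supplied by the caller."""
    return rho_statistic(mutations_from_root) * years_per_mutation


# Four hypothetical sequences carrying 2, 3, 1 and 2 mutations relative to the root.
example_counts = [2, 3, 1, 2]
example_age = rho_age_years(example_counts, years_per_mutation=3000.0)  # placeholder rate
```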

Genetic, Geographic and Cultural Reconstruction of an Ancient Endogamous Community.

Abstract:
The provenance of a rare R1a1 Y-haplogroup (Y-HG) subtype designated as 657A lies in proximity to an ancient migration route running through Afghanistan but is largely absent from other geographic locations. A clan of 657A Brahmin “founder” family lineages within the Goud Saraswat community (GSB) in a town in Western India was identified in which 15 of 16 males from nine families were R1a1 Y-HG, including 10 who were 657A. TMRCA calculations using pairwise comparisons to control cohorts suggested a probable migration history for this priestly subgroup. To support this genetic narrative we present archeological, toponymic, numismatic, linguistic, iconographic, architectural, sociological and literary data. Specifically, in this study we test two main hypotheses regarding these 657A families: (1) Using Y-HG centroid analysis, chi-square analysis of TMRCA distributions and archeological find-spots, and discriminant function analysis we show that the parental Z93 L342.2 sub-clade in which 657A occurs originated in West Asia and that 657A individuals migrated toward the southeast by a Bolan Pass route distinct from the traditionally presumed route of “Vedic” ingress into the Indian subcontinent; and (2) Priestly 657A lineages in Western India retain distinct family practices with respect to literacy, religious practice and migration not shared by other more orthodox Brahmins of canonical geographic origin within the same community, despite intermarriage. Long-term transmission of differentiated family practices within a single patrilineal endogamous community has rarely been documented.

Reconstructing genetic history of Siberian and Northeastern European populations.

Abstract:
Siberia and Western Russia are home to some of the least studied ethnic groups in the world, and their genetic history holds keys to understanding the peopling of the world. We present whole-genome sequencing data from 28 individuals belonging to 14 distinct indigenous populations from that region. We used these datasets together with an additional 32 modern-day and 15 ancient human genomes to build and compare autosomal, Y-DNA and mtDNA trees and delineate genetic history. Our analyses uncover complex migratory processes that shaped the genetic landscapes in Asia and Europe. Admixture events between ancient Siberian groups resulted in distinct ancestries of present-day Western and Eastern Siberians. Western Siberians share genetic affinity with modern Europeans. Both can trace their ancestry to the lineage of a 24,000-year-old Siberian Mal’ta boy. Eastern Siberians have much weaker genetic affinity with Europeans, and their ancestors separated from East Asians much later (approximately 10,000 years ago). A major migration wave from Eastern Siberians into Western Siberian groups occurred approximately 7,000 years ago, and it extended into Northeastern Europe. This is based on the admixtures we observed between Siberians and lineages represented by the 5,000-year-old hunter-gatherer Ire8 from the Pitted Ware Culture excavated in Sweden, the 2,900-year-old Iron Age Hungarian IR1 from the Mezocsat Culture, and modern-day northeastern Europeans. Our whole-genome data based on a broad sample of populations in Siberia and Western Russia provide new high-resolution insights into the genetic history of Eurasians.

Ages of mitochondrial DNA lineages coincide with the spread of agriculture in Finland.

Abstract:
The current inhabitants of Finland in Northeastern Europe are quite unique in terms of their genetic composition. Based on Y chromosomal and genome-wide studies, Finns differ from other European populations: in particular, the Y chromosomal diversity is reduced and distinctive. In contrast, the Finnish mitochondrial DNA (mtDNA) haplogroup distribution is similar to that of other European populations. The mitochondrial gene pool in modern Europeans is a mixture of Mesolithic hunter-gatherer associated haplogroups (U and V) and Neolithic associated farmer haplogroups (H, J, K and T). The frequency of hunter-associated haplogroup U in Finland is one of the highest in Western Eurasia. Also, it is more common in Eastern and Northern parts of the country while farmer haplogroups are more frequent in Southern and Western Finland. In this study we compiled a comprehensive data set of 833 modern Finnish complete mtDNA sequences from the public databases and utilized coalescent-based Bayesian phylogenetic inference (BEAST v.1.8.1) to perform fine-resolution phylogenetic analyses on the sequences. We also used previously published radiocarbon-dated ancient complete mtDNA sequences from Western Eurasia in our analysis as calibration points for the phylogenetic trees, enhancing their accuracy. Our results demonstrate that among Finns, many typically “European” haplogroups, both hunter-gatherer and farmer associated, actually comprise lineages specific to Finns. Several of these lineages, despite being rather common in the present Finnish population, are virtually absent from other populations. The oldest of these haplogroups date back over 7,000 years, though most appear to be around 3,000-5,000 years old. This period coincides with the arrival and especially the spread of agriculture and the Corded Ware culture in Finland.
Age estimates are also concurrent with the arrival of another culture, the textile ceramics, into Finland from the Volga region (the main period of textile ceramics lies between 1,700 and 1,000 BC). According to these results there is distinct evidence that the arrival of these cultural entities also influenced the Finnish mitochondrial DNA pool and that this impact is still visible in modern-day Finns.

An empirical recombination for demographic inference and IBD detection.

Abstract:
Genome-wide data facilitate the investigation of genomic relatedness between individuals within or across populations, providing an insight into demographic histories. Genomic regions of identity by descent (IBD) in individuals, co-inherited from common ancestors, can be detected and analyzed to reveal genetic relatedness for demographic inference. Many methods have recently been developed to detect IBD regions, aiming at detecting identical regions that are statistically unlikely to occur without common ancestors. Some employ coalescent or probabilistic models to identify such IBD regions with significantly low frequencies of occurrence; the others use non-coalescent and non-probabilistic models to detect IBD regions with long lengths, which serve as proxies for low frequencies. However, owing to the high computational cost of coalescent or probabilistic models, the former are usually unsuited to large datasets, and because they detect no short IBD regions, the latter cannot provide comprehensive information for demographic inference. We propose an empirical approach that is able to infer demographic histories and to detect IBD regions simultaneously. This approach comprises an empirical model of recombination and an IBD detection algorithm. The empirical model builds coalescent trees with recombination events based on genomic similarities of individuals, and the detection algorithm incorporates the information of coalescent trees with recombination events to identify IBD regions. These two procedures can be executed iteratively until no new IBD regions are found and the coalescent trees no longer change. In addition, the two procedures can be run in parallel in each iteration to improve computational efficiency. We applied our method to simulated data and two real datasets: the 1000 Genomes dataset and the HLA alleles in Taiwan populations.
First, in simulation analysis, our method is able to infer demographic histories and to detect short IBD regions with high accuracy while maintaining high computational efficiency. Second, in the 1000 Genomes dataset, our approach reveals not only recent demographic events based on long detected IBD regions, but also ancient histories from short IBD regions. Finally, in the HLA alleles in Taiwan populations, we demonstrate the pure utility of the empirical recombination model for recent demographic inference. Therefore, our proposed method is capable of detecting IBD regions efficiently and making demographic inference comprehensively.

A new locus of genetic resistance to severe malaria is associated with a locus of ancient balancing selection.

Abstract:
We describe a genome-wide association study of severe malaria susceptibility using DNA from over 10,000 individuals from across sub-Saharan Africa with replication in a further 15,000. We identify a new locus of association near the glycophorin gene cluster on chromosome 4, which encodes red cell surface proteins previously shown to interact with malaria parasite surface receptors during invasion, and determines the MNS blood group. A single haplotype at this locus, common in parts of East Africa, confers 33% protection against severe malaria, and is linked to variation displaying signatures of ancient balancing selection. We describe attempts to elucidate the possible causal mutations, including imputation into an African-enriched reference panel and the refinement and imputation of large structural variants in the region. This association brings the number of loci confirmed by GWAS to be associated with severe malaria to four, all of which are involved in red blood cell function or morphology, and at least three of which display unambiguous signals of balancing selection. These analyses bring important new insights into malaria biology and may have implications for genome-wide association studies of infectious diseases more generally.

The evolutionary impact of Denisovan ancestry in Australo-Melanesians.

Abstract:
Analyses of genome sequences from archaic and modern humans have documented major admixture events between the ancestors of Neanderthals and non-Africans as well as between the Denisovans (a sister-group of the Neanderthals) and populations in island south-east Asia. Understanding the impact of these ancient admixture events on evolution and phenotypes is a central goal in human population genomics. While a number of recent studies have made progress towards understanding the structure and impact of Neanderthal admixture [Sankararaman et al. Nature 2014; Vernot and Akey Science 2014], the Denisovan admixture event remains poorly understood. To this end, we adapted a statistical method previously developed for inferring Neanderthal ancestry to infer Neanderthal and Denisovan local ancestries in Melanesian populations. We applied this method to a dataset of high-coverage whole-genome sequences from 11 Melanesian individuals (2 Aboriginal Australians, 1 Bougainville Islander, 8 Papua New Guineans) that were sequenced as part of the Simons Genome Diversity Project to infer maps of Denisovan and Neanderthal ancestry in these populations. Power to confidently infer Denisovan ancestry is estimated to be about half that of Neanderthal ancestry – a consequence of the greater divergence of the sequenced Denisovan genome from the ancestral population. Nevertheless, our statistical method identifies around 38,000 Neanderthal-derived alleles and around 25,000 Denisovan-derived alleles. Using the confidently inferred ancestries across multiple individuals, we can reconstruct about 150 Mb of the genome of the introgressing Denisovan. We observe that the proportion of both Denisovan and Neanderthal local ancestry is reduced in regions of the genome with strong background selection.
This observation is consistent with a model in which Neanderthal and Denisovan alleles are subject to strong purifying selection in the admixed Melanesian populations analogous to the previous observation of strong purifying selection against Neanderthal alleles in non-Africans. In addition, we document a number of regions with elevated proportions of archaic ancestry (including a previously reported example at the STAT2 locus) which represent putative candidates for adaptive introgression.

Abstract:
The 1000 Genomes Project data harbor information about a great variety of relationships which can be recovered using identity by descent (IBD) analysis. Short IBD segments convey information about events far back in time because the shorter IBD segments are, the older they are assumed to be. At the same time, longer IBD segments can be used to detect more recent relationships as they occur in families. The identification of short IBD segments becomes possible through next generation sequencing (NGS), which offers high variant density and reports variants of all frequencies. However, only recently has HapFABIA been proposed as the first method for detecting very short IBD segments in NGS data. HapFABIA utilizes rare variants to identify IBD segments with a low false discovery rate. We applied HapFABIA to the 1000 Genomes Phase 3 whole genome sequencing data to identify IBD segments which are shared within and between populations as well as with the genomes of Neandertal and Denisova. Using the proportion of IBD segments an individual shares with any other individual in the data set, we were able to discover first degree relatives, whom we subsequently removed from further analyses. Not only are most IBD segments found in Africans, but also each African individual has about ten times more IBD segments than any East Asian, South Asian, or European individual. Furthermore, the number of IBD segments of an individual correlates with that individual's degree of African ancestry as reported by other methods. IBD segments can be used to recover the population of origin of an individual and find individuals with wrong population labels. By comparing the rare variants that tag an IBD segment with the genome of Neandertal and Denisova, we were able to find IBD segments shared with these ancient genomes.
We extracted two types of very old IBD segments that are shared with Neandertals/Denisovans: (1) longer segments primarily found in East Asians, South Asians, and Europeans that indicate introgression events outside of Africa; (2) shorter segments mainly shared by Africans that may indicate events involving ancestors of humans and other ancient hominins within Africa. Our results from the autosomes are further supported by an analysis of chromosome X, on which segments that are shared by Africans and match the Neandertal and/or Denisova genome were even more prominent.

Abstract:
Human populations throughout the world have had to adapt to novel pathogens and environments; this adaptive evolution has shaped present-day genomes. Here, we introduce novel frameworks for detecting adaptive sweeps from de novo mutations that are easily extensible to detecting adaptive evolution from standing variation. While current methods for detecting adaptive mutations rely on single statistics that probe one of three major signatures of a sweep (long-range haplotype blocks, changes in the site frequency spectrum, and population differentiation), recently, composite methods have shown increased power by combining multiple statistics. However, these methods falter when a subset of their component statistics is undefined, as often happens with long-range haplotype statistics, and they yield scores that are fundamentally difficult to interpret. Our approach classifies local targets of selective sweeps within multiple populations in a way that combines multiple statistics, has an easy probabilistic interpretation, and deals naturally with undefined statistics. We introduce two classifiers that infer the probability that a new locus has undergone a sweep, based on distributions learned from demographic simulations. The first is a Naïve Bayes classifier, which assumes independence among component statistics, while the second uses a machine-learning tool called an Averaged One-Dependence Estimator (AODE) to allow for pairwise dependencies. In simulated data, we show that the Naïve Bayes classifier vastly outperforms state-of-the-art methods in detection and localization of sweep signals, in some cases reducing the number of false positive predictions by seven-fold. We show that this classifier performs particularly well when identifying completed sweeps and fast sweeps, which have great biological significance. For a subset of sweep parameters, the AODE further improves classification performance.
In data from the 1000 Genomes Project, we show that both classifiers can detect known sweep targets, including the DARC locus in West Africans, the EDAR locus in East Asians, and the SLC24A5 locus in Europeans. We also show that the dependency structure implemented in the AODE is necessary for detection of some signals, including the CD36 locus in West Africans, which harbors malaria resistance alleles. Our methods produce fewer false positives and negatives compared to existing approaches, thus identifying promising targets for experimental validation.
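The way a Naïve Bayes classifier can combine several sweep statistics while tolerating undefined ones can be sketched as follows. This is an illustrative toy, not the authors' implementation: the class-conditional densities here are hypothetical Gaussians supplied by the caller, whereas the abstract states that the distributions are learned from demographic simulations. An undefined statistic (`None`) is simply skipped, so it contributes no evidence either way.

```python
from math import exp, log, pi, sqrt


def gaussian(mean, sd):
    """Return a Gaussian density function (a stand-in for a simulated distribution)."""
    return lambda x: exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * sqrt(2.0 * pi))


def sweep_posterior(stats, lik_sweep, lik_neutral, prior_sweep=0.5):
    """Posterior probability of a sweep under the independence assumption.

    stats: dict mapping statistic name -> value, or None when undefined.
    lik_sweep / lik_neutral: dicts mapping statistic name -> density function.
    """
    log_odds = log(prior_sweep) - log(1.0 - prior_sweep)
    for name, value in stats.items():
        if value is None:
            continue  # undefined statistic: leave the odds unchanged
        log_odds += log(lik_sweep[name](value)) - log(lik_neutral[name](value))
    return 1.0 / (1.0 + exp(-log_odds))


# Hypothetical densities for two statistics (names and parameters are placeholders).
LIK_SWEEP = {"iHS": gaussian(2.0, 1.0), "Fst": gaussian(0.4, 0.1)}
LIK_NEUTRAL = {"iHS": gaussian(0.0, 1.0), "Fst": gaussian(0.1, 0.1)}
```

Because the output is a posterior probability rather than an arbitrary composite score, it can be read directly as the classifier's belief that the locus was swept, given the prior and the supplied densities.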

Abstract:
Lim et al. (PLoS Genetics 2014) showed recently that loss-of-function (LoF) and missense variants at 0.5-5% frequency are enriched in the Finnish population compared to Non-Finnish Europeans, providing an opportunity to study downstream effects of these variants in Finns. However, this change in frequencies may not be confined to the coding region. To this end, we have studied the enrichment of variants in the Finnish population across the whole genome. To study the bottleneck effects across the whole genome, we analyzed single nucleotide variants (SNVs) from 1463 low coverage whole genome sequences from both Finland (~4.6x) and the UK (6x). These samples were processed together by the Haplotype Reference Consortium to harmonize the variant calls and minimize the batch effects. As observed previously, we see a 1.34x enrichment of the LoF variants (p-value LoF = 0.056) in the 2-5% minor allele frequency (MAF) range and 1.1x enrichment in the missense variants (p-value missense = 2.95e-05). Further, we studied the enrichment of variants across the whole genome. We found significant enrichment in Finns in the MAF range from 0.5-5%, with maximum enrichment in the MAF range of 2-5% (p-value = 6.4e-323). We also see enrichment across different functional sub-categories in Finns with the highest enrichment observed for conserved regions (p-value conserved_regions=9.36e-24, p-value TFBS=6.02e-46, p-value promoter=1.67e-11, p-value enhancers=0.001), although not as considerable as for the LoF variants. Furthermore, in the regulatory regions, rare and low frequency variants (MAF <= 2%) are enriched beyond expected bottleneck effects. When limiting the analysis to the 23,441 variants that were enriched at least 100x in Finns, genes in pathways related to neuron development, signal transduction and cation transport channels were observed to be significantly over-represented after correcting for multiple testing.
These results show that the enrichment of low frequency variants in founder populations is not limited to coding loss-of-function and missense variants, but is also observed in conserved regions and regulatory elements. This finding provides opportunities to study the downstream health effects of these variants outside of the coding regions in founder populations with multiple bottleneck effects, such as Finns.

Reconstructing the Genetic History of Indigenous Caribbean Populations.

Abstract:
In collaboration with the Garifuna/Kalinago of St. Vincent, the First People’s Community of Arima, Trinidad, and Taíno descendant communities in Puerto Rico, we are conducting an anthropological genetic study of the prehistoric and historic settlement of the Caribbean. Using genetic data generated with the GenoChip, we are evaluating hypotheses concerning the original settlement of the Greater and Lesser Antilles, as well as the expansion of Carib and Arawakan-speaking populations into this region over the past few thousand years. Our initial results suggest that the Greater Antilles were colonized by indigenous populations from South America and possibly Mesoamerica, whereas the Lesser Antilles were settled by only South American groups. In addition, while sharing some indigenous mtDNA (maternal) and Y-chromosome (paternal) lineages in common, populations from the Greater and Lesser Antilles otherwise appear to be largely genetically distinct from each other. Autosomal SNP data from these indigenous Caribbean communities further expand our understanding of the genetic contributions from African, European and South Asian populations since European contact. Overall, this study demonstrates the region’s first peoples’ ongoing legacy in shaping the genetic diversity of contemporary Caribbean populations.

Abstract:
Because the number of X chromosomes differs for men and women, comparisons between sex-linked and autosomal genetic loci reveal sex-biased patterns of human demography. Using 44 high-coverage whole genomes from a diverse global set of 11 human populations we quantified the strength of selective constraint on different chromosomes, found evidence of sex-biased colonization, and determined whether recent migrations are matrilocal or patrilocal. Relative amounts of genic and intergenic diversity were similar across all studied populations regardless of subsistence pattern or geography. The strength of selective constraint on genes was greater for X-linked loci compared to autosomal loci – a pattern that is consistent with selection against deleterious recessive alleles. The ratio of X chromosome to autosome diversity (Q) was greater than the null expectation of 0.75 for African populations and less than 0.75 for non-African populations, with lower values of Q for populations located farther from Africa. This pattern is consistent with a male-biased serial founder effect model, and computer simulations suggest a plausible out-of-Africa bottleneck size of 320-340 males and 60-70 females. Using PSMC, we found evidence of large historic population sizes for West African Pygmies, but not Hadza or Sandawe populations. Genetic distances revealed female-biased gene flow between Hadza and Sandawe hunter-gatherers, between Maasai pastoralists and African farmers, and between Chinese and Japanese populations. We found evidence of male-biased gene flow between African farmers and hunter-gatherers, and between different African farmer populations. This calls into question the idea that patrilocality is coupled with the emergence of agriculture.
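The null expectation of 0.75 for Q follows from the standard formulas for effective population size with unequal numbers of breeding males and females. The sketch below evaluates that expectation for a constant-size, neutrally evolving population; it is a textbook equilibrium calculation under those assumptions and is not the serial founder simulation described in the abstract, so it ignores bottleneck dynamics and sex differences in mutation rate.

```python
def expected_q(n_males, n_females):
    """Expected ratio of X-linked to autosomal effective population size.

    Autosomes: Ne = 4 * Nm * Nf / (Nm + Nf)
    X chromosome: Ne = 9 * Nm * Nf / (4 * Nm + 2 * Nf)
    """
    if n_males <= 0 or n_females <= 0:
        raise ValueError("both sexes must have a positive breeding size")
    ne_autosome = 4.0 * n_males * n_females / (n_males + n_females)
    ne_x = 9.0 * n_males * n_females / (4.0 * n_males + 2.0 * n_females)
    return ne_x / ne_autosome
```

With equal numbers of males and females the ratio is 0.75; an excess of breeding males pushes it below 0.75 and an excess of breeding females pushes it above, which is the direction of effect the abstract relies on when interpreting Q in African and non-African populations.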

The recent production of population-scale genomic data offers an unprecedented opportunity to understand how natural selection has shaped human phenotypic variation within populations. Sardinia has a rich history of genetic studies driven by its relative isolation and high incidence of malaria, which was endemic there until eradication efforts in the 1940s. To identify signatures of recent positive selection in Sardinia, we use 23 million single nucleotide polymorphisms from low-coverage whole genomes of 3,514 Sardinians along with data from the 1000 Genomes project. Using haplotype (iHS, nSL), cross-population (Fst, PBS, XP-EHH), and site-frequency-spectra (CLR) based statistics we find many genetic variants show evidence of selection. To assess the significance of these selection statistics, we use an empirical null distribution generated from randomly chosen variants matched by minor allele frequency, local recombination rate, and background score. We also evaluate these statistics relative to a null, neutral model using a demographic history inferred from deeply sequenced Sardinian individuals. We show that selection statistics computed for outlier variants cannot be explained by neutral forces alone. By intersecting genome-wide-association study data for hundreds of traits in Sardinia with publicly available functional genomic databases we find that autoimmunity-related genes are significantly enriched for these putatively adaptive variants. Taken together, these results illustrate the importance of characterizing both the demographic history of and phenotypic variation within a population, and especially the utility of whole-genome-sequence data, when proposing and interpreting genetic signatures of positive selection.
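One way to picture the matched empirical null described above is the following minimal sketch. Each variant is assigned to a stratum defined by bins of minor allele frequency, local recombination rate, and background score, and a focal variant's statistic is ranked against background variants in the same stratum. The dictionary keys, the bin edges, and the use of all matched variants rather than a random draw are illustrative assumptions, not details from the abstract.

```python
from bisect import bisect_right


def stratum(variant, maf_edges, rec_edges, b_edges):
    """Bin a variant by MAF, recombination rate and background score."""
    return (bisect_right(maf_edges, variant["maf"]),
            bisect_right(rec_edges, variant["rec"]),
            bisect_right(b_edges, variant["b"]))


def empirical_p(focal, background, maf_edges, rec_edges, b_edges):
    """Empirical p-value of the focal score among matched background variants.

    Returns None when no background variant falls in the focal stratum.
    """
    key = stratum(focal, maf_edges, rec_edges, b_edges)
    matched = [v["score"] for v in background
               if stratum(v, maf_edges, rec_edges, b_edges) == key]
    if not matched:
        return None
    exceed = sum(1 for s in matched if s >= focal["score"])
    # Add-one correction so the p-value is never exactly zero.
    return (exceed + 1) / (len(matched) + 1)
```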

Adaptation in global human populations has been hard, soft and polygenic.

Institutes
1) Department of Bioengineering and Therapeutic Sciences, University of California at San Francisco, San Francisco, CA 94158; 2) Institute for Human Genetics, University of California at San Francisco, San Francisco, CA 94158; 3) Institute for Quantitative Biosciences (QB3), University of California at San Francisco, San Francisco, CA 94158.

Abstract:
There is ample debate about the strength and mode of natural selection that has occurred in recent human evolution. This is particularly so for classical hard sweeps, during which an adaptive allele quickly drags a single haplotype to high frequency. An alternative model of adaptation involves soft sweeps, whereby multiple haplotypes are brought to high frequency (i.e. when a previously segregating neutral or slightly deleterious allele becomes adaptive in a new environment). Yet another alternative model includes polygenic selection, whereby complex phenotypes driven by multiple loci across the genome are selected. Here we develop new statistics designed to identify both hard and soft sweeps, by tracking the decay of homozygosity of the k-most frequent haplotypes away from a core locus. We evaluate our statistics with rigorous simulations under multiple realistic models of human demography and find that they have high power. We then integrate signals of selection across the genome to identify characteristic signals of polygenic selection. We apply our approaches to a large dataset of 1,728 unrelated individuals spanning 20 worldwide human populations from the 1000 Genomes Project. We find a large number of novel regions consistent with soft sweeps, particularly in African populations, and instances of polygenic selection driving the regulatory architecture of several genes. We then use an Approximate Bayesian Computation framework to infer selection parameters for these regions.
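
The idea of tracking the homozygosity of the k most frequent haplotypes away from a core locus can be sketched as follows. The abstract does not define the statistic, so the pooling rule used here (treating the k most frequent haplotypes as a single class) and both function names are assumptions made for illustration only.

```python
from collections import Counter


def top_k_homozygosity(haplotypes, k):
    """Homozygosity after pooling the k most frequent haplotypes into one class."""
    if not haplotypes or k < 1:
        raise ValueError("need at least one haplotype and k >= 1")
    n = len(haplotypes)
    freqs = sorted((c / n for c in Counter(haplotypes).values()), reverse=True)
    pooled = sum(freqs[:k])
    return pooled ** 2 + sum(f ** 2 for f in freqs[k:])


def homozygosity_decay(samples, core, k, max_flank):
    """Statistic in windows of growing half-width (in sites) around the core site."""
    values = []
    for flank in range(1, max_flank + 1):
        lo, hi = max(0, core - flank), core + flank + 1
        values.append(top_k_homozygosity([s[lo:hi] for s in samples], k))
    return values


# Toy phased haplotypes, one string of 0/1 alleles per chromosome.
hard = ["0101"] * 9 + ["1010"]                  # one dominant haplotype
soft = ["0101"] * 5 + ["0011"] * 4 + ["1010"]   # two common haplotypes
print(top_k_homozygosity(hard, 1), top_k_homozygosity(soft, 1))
print(top_k_homozygosity(soft, 2))
```

In the toy data, the sample with two common haplotypes has low homozygosity for k = 1 but recovers a high value for k = 2, which is the kind of contrast such a statistic relies on to separate soft from hard sweeps.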

The relative effective population size of chromosome X and the autosomes along distinct branches of the human population tree.

Abstract:
In recent years, many studies have focused on the effective population size of chromosome X relative to the autosomes. This comparison can be useful to reveal past demographic processes, differences in the histories of males and females, and the action of natural selection. We have recently shown how the ratio of nucleotide diversity between the two (X-to-Autosome ratio; X/A), when compared between pairs of populations (relative X/A), can be used to uncover sex-biased processes in human history. While this strategy serves to alleviate the response of genetic diversity to the influence of events in a time range that largely predates the split of the studied populations, a different and more natural approach to capture recent changes occurring after populations split can be formulated based on the differentiation of allele frequencies between populations, as commonly summarized by the FST statistic. Here, we consider population differentiation in humans, and extend beyond simple pairwise comparisons, using allele frequency differences across several populations to learn about the ratio of X-to-autosomal effective population size along distinct branches in the tree of human populations. We then test these for differences from the expectation of equal female-to-male breeding ratios, as well as differences between different branches. Using coalescent simulations of a variety of previously published human demographic models, we show that our approach is able to capture the ratio of interest and is more accurate than estimates based only on pairwise FST across all pairs of populations. We then turn to the latest data from the 1000 Genomes Project, controlling for the effect of uncertainty associated with low coverage sequencing, as well as the influence of linked selection (background selection or hitchhiking), all of which differentially affect the X chromosome and the autosomes.
Estimating the X-to-autosomal effective population size ratio for branches leading to different 1000 Genomes populations, as well as for internal branches in the population tree, points to a higher female effective population size in African-specific population history, but not in non-Africans. More interestingly, we localize previously-debated observations to a significant increase in male effective population size on the branch leading to all non-African populations, suggesting male-biased processes associated with the Out-of-Africa event.

Estimation of growth rates for populations and haplogroups using full Y chromosome sequences.

Abstract:
Evolutionary processes affecting a population influence gene genealogies across the genome. Coalescent theory provides the mathematical framework to connect realized genealogies to the underlying evolutionary processes. However, in most cases, information about the genealogies is obtained only indirectly through the observation of genetic variation. Therefore, in general, very limited information about any individual locus is available. As the longest non-recombining portion of the human genome, the Y chromosome accumulates mutations relatively quickly. When large amounts of sequence are used, the Y chromosome provides an unparalleled ability to resolve the structure and coalescence times of its genealogy. Because patterns of variation in the Y chromosome are only influenced by processes affecting men, they can be used to study both demographic and social phenomena. The 1000 Genomes Project includes whole Y-chromosome data from more than 1000 men and has an extensive representation of most lineages that have experienced recent massive expansions in size. Though the dynamics of population growth have likely changed over time, we are more interested in the growth rates at the times of these rapid expansions than in an average effect. To study this, we have developed a new method that takes advantage of the temporal resolution provided by Y-chromosome data and of historical data, while accounting for the uncertainties associated with the coalescent and mutational processes. We estimate the growth rates for several branches of the Y-chromosome tree, including those in Europe, sub-Saharan Africa and South Asia. We estimate that several lineages within the European R1b, sub-Saharan African E1b, and South Asian R1a haplogroups experienced growth rates of at least 20-60% per generation at the onset of their massive expansions, some 3-5 thousand years ago. These high growth rates are comparable to those experienced by human populations during the 20th century.
However, we find that most observed genealogies are unlikely to be the result of whole population expansion or of natural selection.
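
To give a sense of scale for growth rates of 20-60% per generation, the arithmetic below converts a constant per-generation growth rate into the number of generations needed for a given fold increase. It assumes simple geometric growth, which is not necessarily the growth model used in the study; the 1000-fold target and the function name are hypothetical.

```python
import math


def generations_to_expand(fold_increase: float, growth_per_generation: float) -> float:
    """Generations needed to grow `fold_increase`-fold at a constant per-generation rate."""
    if fold_increase <= 0 or growth_per_generation <= 0:
        raise ValueError("fold increase and growth rate must be positive")
    # N_t = N_0 * (1 + r)^t, solved for t.
    return math.log(fold_increase) / math.log(1.0 + growth_per_generation)


for rate in (0.2, 0.6):  # the range quoted in the abstract
    print(rate, round(generations_to_expand(1000, rate), 1))
```

Under these assumptions a 1000-fold expansion takes roughly 38 generations at 20% growth and roughly 15 generations at 60% growth.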

Abstract:
Understanding how natural selection has shaped the existing genetic variation within humans is a major goal in population genetics. With the growing understanding that many human diseases and complex traits have a polygenic genetic architecture, it has been hypothesized that adaptation in recent human history might be largely polygenic as well. The increased frequency in northern Europe of many alleles associated with tall stature has been the major supporting example for the polygenic adaptation model. However, beyond this outstanding example, the nature and extent of polygenic adaptation in recent human history is still poorly understood. Current methods for testing for polygenic adaptation, based on allele frequency differences between populations, do not account for the linkage disequilibrium between loci. In turn, there is no general framework available for testing for adaptation over one set of functionally related loci, while controlling for possible causal effects (on allele frequency differences) by other genetically-linked genomic features. For example, one would like to test for adaptation among known GWAS hits, controlling for the selection for height; or to test for selection within regulatory regions, controlling for possible selection on non-synonymous sites; or to control for admixture effects on allele frequency differences, etc. To address this need, we have developed POLARIS, a novel and general method for POLygenic Adaptation Regression analysIS. Our method is based on a multivariate normal model for the frequency differences between populations, which is structured to explicitly represent linkage disequilibrium, drift and annotation-dependent polygenic adaptation. The method allows one to test, and control, for annotation-dependent effects on both the mean and variance of allele frequency difference, giving it great flexibility to mix directed and undirected hypotheses.
As we demonstrate with an initial analysis of publicly available datasets, POLARIS opens the road for a richer and more extensive characterization of the nature and extent of polygenic adaptation in recent human history.

Rare variants are a large source of heritability for gene expression patterns.

Abstract:
Understanding the genetic architecture of complex traits is a central challenge in human genetics. There currently exists a large disparity between heritability estimates from family-based studies and large-scale genome-wide association studies (GWAS), which has been sensationalized as the “missing heritability problem”. Among the possible explanations for this disparity are rare variants of large effect that are not tagged by genotyping platforms. However, recent population genetic models suggest that the conditions under which rare variants are expected to substantially contribute to heritability may be fairly limited. To better understand the heritability of complex phenotypes, we investigated the role of cis alleles in gene expression levels across European and African individuals using RNA and whole genome sequencing data from the GEUVADIS and 1000 Genomes Projects. In particular, we investigate whether rare variants are likely to be a source of missing heritability in expression across genes. Using variance-component methods, we partitioned the heritability of expression levels explained by cis variants for each gene in the genome across several frequency bins from rare (≤1%) to common (>10%). We performed extensive simulations to validate our heritability estimation procedure. We find that when pooling all variants in cis (within 500 kb of a gene), heritability estimates are on average h_c^2 = 17.6% (with 4.7% of genes having h_c^2 > 50%). Using variance-component methods, we find that in cis, rare variants (MAF ≤ 1%) contribute significantly more heritability than common variants (MAF > 10%) across genes (p_MWU = 1.1×10^-6). In particular, 35.6% of h_c^2 across genes is contributed by rare variants, while common variants contribute 22.3%. This observation suggests that rare variants play a substantial role in the heritability of gene expression patterns, which is inconsistent with neutral evolutionary forces operating on the cis-regulatory architecture of most genes.
We discuss our results in the light of recent population genetic models of quantitative traits, and highlight the importance of understanding how natural selection can shape the genetic architecture of gene expression in humans. We conclude by discussing implications for studying a variety of complex phenotypes in humans.

Population differentiation analysis of 54,734 European Americans reveals independent evolution of ADH1B gene in Europe and East Asia.

Abstract:
Population differentiation is a widely used approach to detect the action of natural selection. Existing methods search for unusual differentiation in allele frequencies across discrete populations, e.g. using FST. Loci that are unusually differentiated with respect to the genome-wide FST or with respect to a null distribution of FST are reported as signals of selection. These approaches are particularly powerful for closely related populations with large sample sizes. However, population genetic data are often not naturally partitioned into discrete populations. We developed a test for selection that uses SNP loadings from principal components analysis (PCA). For a given PC reflecting geographic ancestry, under the null hypothesis of no selection, the square of the SNP loadings, rescaled by a scaling factor derived from the eigenvalue of the PC, follows a chi-square (1 d.o.f.) distribution. This statistic is able to infer selection with genome-wide significance, a key consideration in genome scans for selection. We confirmed via simulations that this statistic has correct null calibration under a wide range of demographies and is well-powered to detect selection at large sample sizes. We applied the method to a cohort of 54,734 European Americans genotyped on genome-wide arrays. PCs were inferred using our FastPCA software (running time: 57 minutes). The top 4 PCs corresponded to clines of Irish, Eastern European, Northern European, Southeast European and Ashkenazi Jewish ancestry, validated via PCA projection of samples of known ancestry. We detected genome-wide significant signals of selection at 4 known selected loci (LCT, HLA, OCA2 and IRF4) and 3 novel loci: ADH1B, IGFBP3 and IGH. Two of the three novel loci could not be detected using discrete-population tests (or other existing tests).
The ADH1B gene is associated with alcoholism (via the same coding SNP rs1229984 producing a signal in our selection scan) and has been shown to be under recent selection in East Asians (via a haplotype-based test for recent selection); we show here that it is a rare example of independent evolution on two continents. The IGFBP3 gene and IGH locus have been implicated in breast cancer and multiple sclerosis, respectively. Our results show that application of our PC-based selection statistic to large data sets can infer novel, genome-wide significant signals of selection at loci linked to disease traits.
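
The logic of the PC-based test (squared SNP loadings, suitably rescaled, compared against a chi-square distribution with 1 degree of freedom) can be sketched as below. The abstract derives the scaling factor from the eigenvalue of the PC; this sketch instead rescales the squared loadings so that they average 1 across SNPs, which is an assumption standing in for that factor. The function name is hypothetical and this is not the FastPCA implementation.

```python
import math


def pc_selection_scan(loadings):
    """Return a (statistic, p_value) pair per SNP for one PC's loading vector."""
    m = len(loadings)
    norm_sq = sum(x * x for x in loadings)
    if m == 0 or norm_sq == 0:
        raise ValueError("need a non-empty, non-zero loading vector")
    results = []
    for x in loadings:
        # Assumed rescaling: statistics average 1, the mean of a chi-square(1).
        stat = m * x * x / norm_sq
        # Upper tail of chi-square(1): P(Z^2 > stat) = erfc(sqrt(stat / 2)).
        p_value = math.erfc(math.sqrt(stat / 2.0))
        results.append((stat, p_value))
    return results


# Toy loading vector: 99 unremarkable SNPs and one with an extreme loading.
toy = [0.1] * 99 + [2.0]
stat, p_value = pc_selection_scan(toy)[-1]
print(stat, p_value)
```

Because the statistic has a known null distribution, each SNP receives a p-value that can be compared directly with a genome-wide significance threshold, without partitioning the samples into discrete populations.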

Genetic origins and admixed ancestry characterization of Japanese people.

Abstract:
A modern human population found at a certain geographic location is often descended from multiple ethnic groups owing to the complex migration history of human expansion. In Japan, although it has been studied extensively over the past decades, the genetic origins of Japanese people remain controversial. Current genetic evidence supports a dual model, which suggests that the Japanese people descend mainly from an early settlement of human populations during the Upper Paleolithic period (i.e., Jomon people), followed by an admixture event with people who migrated from the Korean peninsula around 2,300 years ago (i.e., Yayoi people). However, the genetic origin(s) of the native Jomons remains unclear. Tracing the genomic signatures of admixture history can not only reveal the unknown human migration events but also provide critical information that can facilitate the genetic profiling of disease susceptibility, which is critical for the success of personalized medicine. Here, we analyzed a combined dataset of the whole genome SNP genotyping data from 2,277 individuals sampled globally across >100 populations for a total of 19,290 SNPs (after intersecting the two datasets). We performed principal component analysis to project individuals onto a series of orthogonal axes to reveal the genetic structure among diverse ethnic groups. After separating the genetic components contributed from the populations representing the Yayoi, we identified several candidate populations that share common non-Yayoi ancestry with the modern Japanese people. Our results suggest that the genetic origins of Jomons may consist of multiple migration events from both Southeast and Northeast Asia. Surprisingly, we also identified an additional migration wave from the Hmong population.
We assigned local ancestry (LA) on the phased chromosomes of the mainland and Okinawa Japanese by performing RFmix (which used the identified candidate ancestral populations to infer the LA tracts in admixed chromosomes by finding the most likely sequence of ancestries through maximum a posteriori estimation). Because an ancient admixture event allows more recombination events to break LA tracts into shorter segments than a recent one, and the LA tract-length distributions differ significantly between the Yayoi, Hmong, and Jomon ancestries (in descending order), our results suggest that the Hmong migration may have occurred before the Yayoi migration.
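
The link between tract length and admixture time can be made concrete with a commonly used single-pulse approximation, in which tracts of an ancestry contributing a fraction m of the genome T generations ago have exponentially distributed lengths with mean 1 / ((1 - m) T) Morgans. This model, the function name and the numbers in the example are assumptions for illustration; they are not taken from the study and ignore continuous or repeated migration.

```python
def generations_since_admixture(mean_tract_morgans: float, ancestry_proportion: float) -> float:
    """Invert the single-pulse approximation: mean length = 1 / ((1 - m) * T)."""
    if mean_tract_morgans <= 0:
        raise ValueError("mean tract length must be positive")
    if not 0 <= ancestry_proportion < 1:
        raise ValueError("ancestry proportion must be in [0, 1)")
    return 1.0 / ((1.0 - ancestry_proportion) * mean_tract_morgans)


# Hypothetical mean tract lengths (in Morgans) for two ancestries at equal proportion.
older = generations_since_admixture(0.01, 0.5)
younger = generations_since_admixture(0.02, 0.5)
print(older, younger)
```

Shorter mean tracts map to older admixture, which is the reasoning behind ordering the ancestries by their tract-length distributions.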

Abstract:
Saudi Arabia is the largest Gulf Cooperation Council (GCC) country. Its population consists of different tribes that originated in the northern, western, eastern, central and southern regions of Saudi Arabia. Due to political and cultural reasons, there has historically been very limited admixture between different tribes. People from the different Saudi tribes then migrated from Saudi Arabia, contributing to the foundation of the populations now inhabiting other Gulf countries. Few population genetics research projects have been conducted on this highly consanguineous population, which has been shown to have one of the highest prevalences in the world of recessive disorders and common metabolic diseases, especially diabetes. It is therefore important to identify the genetic substructures of the Saudi population, both to help in tracing the migratory genetic flows that contributed to other Gulf populations, and to permit the design of efficient genetic studies aimed at the identification of risk factors underlying common and rare diseases in the GCC countries. We carried out the largest population genetic study in Saudi Arabia to date, by genotyping 2,150 Saudi nationals sampled from different regions of Saudi Arabia using Axiom GWH-96 arrays (Affymetrix). Model-based and model-free clustering were applied to these data, including in our analyses data on eight populations (encompassing Europe, America, Oceania, East Asia, Central South Asia, Middle East, Africa and Qatari populations) from the Human Genome Diversity Project (HGDP) data set.
We identified clear clustering of the Saudi samples into different subgroups, with some tribes showing similarity with Central East Asian (Kalash Pakistan, Balochi Pakistan, Sindhi Pakistan, Makrani Pakistan and Brahui Pakistan subpopulations), European (Orkney Islands Europe, Russian Europe and Russian Caucasus subpopulations) and Qatari populations, while other tribes appear to show specificity of background. These data strongly support the presence of genetic stratification within the Saudi population, and suggest the presence of subgroups that are characterized by a unique genetic background different from other Arabian populations. Our findings constitute a valuable resource for the investigation of both general and population-specific genetic risk variants associated with different disorders in this population.

Abstract:
Purpose.- Denmark has strong historical bonds not only with Norway and Sweden, but also with Western and Eastern Europe through a series of invasions, conquests and alliances. In addition, within Denmark, industrialization in the second half of the 19th century led to considerable migration from the countryside to the cities. In this work we explore the extent to which such distant and more recent historical events left their mark on the genetic structure of the current Danish population.
Methods.- We ran an extensive genetic analysis on the Where Are You From? data set of ~600 students from 36 high schools across Denmark. Each student provided a saliva sample for DNA analysis and completed an online questionnaire about family origin, education level and basic biometrical data. All participants gave their informed consent and the Ethical Committee of the University of Aarhus approved the study. Genotyping was outsourced to 23andMe and more than 500,000 SNPs were available for analysis. After merging our data with data from POPRES, we ran PCA and ADMIXTURE to detect genetic structure. For more fine-grained effects, we identified each individual’s closest genetic relatives through IBD tract sharing and calculated the geographic distance between the individual’s place of birth and the weighted average geographic coordinates of their closest relatives. Finally, we explored population structure within Denmark as the result of recent admixture with adjacent populations by use of an IBD-based local ancestry method (i.e. “chromosome painting”).
Results.- Although Denmark forms a distinguishable cluster from neighboring countries in the PCA plots (compatible with isolation-by-distance), no strong structure was observed within the country. Similarly, ADMIXTURE revealed high levels of homogeneity in the Danish samples compared to other North European countries.
However, we did observe significant correlation between PC1 (south-north orientation) and average grandparental geographic coordinates rotated clockwise by ~30°. Also, the IBD-based geographic correlation analysis revealed that Danes tend to live near their closest genetic relatives at a median distance of 100 km – significantly closer than the random expectation. Finally, chromosome painting revealed strong genetic influence from neighboring Nordic (Sweden and Norway) and Germanic (Germany and Holland) countries and negligible influence from Finland, France and Portugal.

Assessing the benefits of priors that encourage sparsity for estimating ancestral admixture from genome-wide data.

Abstract:
Several recent papers have demonstrated the benefits of using sparse matrix factorization techniques—sparse factor analysis and non-negative matrix factorization—to infer population structure from genetic polymorphism data. The primary strength of sparse matrix factorization is its flexibility; it can capture a wide range of population structure scenarios, and can do so in a way that often has a natural interpretation. For example, sparse matrix factorization is able to recapitulate a mixture of continuous and discrete population structure, whereas other methods, such as PCA and STRUCTURE, cannot do this. However, we have found that this flexibility can come at a cost: in realistic demographic settings, it incorrectly predicts individual admixture proportions. We hypothesize that this is because sparse matrix factorization does not completely specify an admixture model. Motivated by this, we propose a model-based approach, building on ADMIXTURE, that encourages sparsity in the admixture proportions (or “loadings”). We encourage sparse estimates by introducing an exact L0-norm penalty term in the cost function that penalizes non-zero admixture proportions, then we iteratively solve for the model parameters using a hybrid EM algorithm. This penalty can also be interpreted as a prior on the number of ancestral populations contributing to an individual’s genome. We explore the behaviour of penalized and unpenalized admixture estimates in data from the Human Genome Diversity Project. Although the idea of encouraging sparse admixture estimates has been suggested previously, to our knowledge the features of this approach have not been empirically assessed in real genetic data from human populations.
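
The penalized objective described above can be sketched for a single individual as the ADMIXTURE-style binomial negative log-likelihood plus a term proportional to the number of non-zero admixture proportions. The function name, the tolerance used to decide that a proportion is non-zero, and the toy inputs are assumptions; the hybrid EM optimisation itself is not shown.

```python
import math


def penalized_cost(genotypes, q, freqs, lam, tol=1e-6):
    """Negative log-likelihood of one individual plus lam * (number of non-zero q_k).

    genotypes: allele counts (0, 1 or 2) per SNP
    q:         admixture proportions, one per ancestral population (sum to 1)
    freqs:     freqs[k][j] is the allele frequency of SNP j in population k
    """
    neg_log_lik = 0.0
    for j, g in enumerate(genotypes):
        # Individual-specific allele frequency as a mixture over populations.
        p = sum(q_k * f_k[j] for q_k, f_k in zip(q, freqs))
        p = min(max(p, 1e-9), 1.0 - 1e-9)  # guard against log(0)
        neg_log_lik -= g * math.log(p) + (2 - g) * math.log(1.0 - p)
    l0_norm = sum(1 for q_k in q if q_k > tol)
    return neg_log_lik + lam * l0_norm


genotypes = [2, 0, 1]
freqs = [[0.9, 0.1, 0.5], [0.2, 0.8, 0.5]]
print(penalized_cost(genotypes, [1.0, 0.0], freqs, lam=1.0))
print(penalized_cost(genotypes, [0.99, 0.01], freqs, lam=1.0))
```

In this toy case the sparse solution pays the penalty once and the near-identical dense solution pays it twice, so the penalty steers the estimate toward fewer contributing ancestral populations when the likelihoods are close.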

Abstract:
We studied sex-biased population histories from Consortium on Asthma among African-ancestry Populations in the Americas (CAAPA) high-coverage whole genomes (~30x depth). CAAPA comprises 673 individuals who are African-American, African, Afro-Caribbean (Barbados, Jamaica), or Latin-American (Colombia, Brazil, Puerto Rico, Honduras, Dominican Republic). X chromosomes show a decrease of European ancestry as estimated with ADMIXTURE, consistent with a history of European male-driven colonization. CAAPA Latin Americans have female-biased Native American ancestry (5.36% mean excess X-chromosomal), male-biased European ancestry (1.36% mean excess autosomal), and female-biased African ancestry (6.72% mean excess X-chromosomal). Some CAAPA African-descent populations have never been studied genetically. The Garifuna from Honduras have very little autosomal European ancestry (2.2%) but high Native American ancestry (16.6%). The Afro-Brazilians from Condé have a high proportion of African ancestry (50.5%). The Cartagena Colombians (from one of two slave ports in South & Central America) have more African ancestry than the TGP Colombians (CLM): on average CAAPA individuals have 31.1% autosomal and 29.7% X-chromosomal African ancestry and TGP CLM have 7.7% and 6.8%, respectively. Y and MT haplotype analysis support the above sex-biased admixture findings: Afro-Caribbeans have African mitochondria, Latin Americans have a mix of African and Native American mitochondria, yet both groups have mostly European Y chromosomes. We identify three Native American Y haplotypes in the Honduran Garifuna only, highlighting their unique history. Unexpectedly we identified a new subgroup of MT-E1a1a that suggests a connection with the Malagasy slave trade. We apply a novel method to infer sex-biased demography during specific time epochs to autosomal and X-chromosomal site frequency spectra. 
CAAPA Latin Americans show evidence for a female bias over a longer time scale, male-biased bottlenecks Out-of-Africa and into the Americas, and male-biased admixture events. We analyze ancestry tracts with the program TRACTS to estimate timings and magnitudes of sex-biased admixture events. Overall, our findings recapitulate the complex history of the Americas and highlight key differences between populations based on their local admixture histories. As this is the first time some of these unique populations have been studied, this represents a valuable population and medical genetic resource.
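
The comparison of X-chromosomal and autosomal ancestry can be turned into rough female and male contributions from a source population with a simple linear approximation: autosomal ancestry A = (f + m) / 2 and X-chromosomal ancestry X = (2f + m) / 3, where f and m are the source population's contributions among female and male ancestors. This approximation and the function name are assumptions; it is not the epoch-based method described in the abstract, and noisy inputs can give values outside [0, 1]. The input values below are hypothetical.

```python
def sex_specific_contributions(autosomal: float, x_linked: float):
    """Solve A = (f + m) / 2 and X = (2f + m) / 3 for the (female, male) contributions."""
    female = 3.0 * x_linked - 2.0 * autosomal
    male = 4.0 * autosomal - 3.0 * x_linked
    return female, male


# Hypothetical ancestry fractions: more of the ancestry on the X than on the autosomes.
female, male = sex_specific_contributions(autosomal=0.30, x_linked=0.35)
print(female, male)
```

An excess of an ancestry on the X chromosome relative to the autosomes translates into a larger female than male contribution, and the reverse pattern indicates a male bias.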

The Demographic Patterns Revealed by New World African Diaspora Genome.

Abstract:
One of the great interests in human genetics research is to understand human population structure, demographic patterns, and evolutionary history. New World populations, such as African Americans and Latino Americans with African ancestry, provide good examples for studying large migration and admixture events in recent human history. Three questions in particular are: 1) where did different sources of admixture come from, 2) when did admixture happen, and 3) what is the difference among subpopulations. To answer these questions we will make use of the Consortium on Asthma among African-ancestry Populations in the Americas (CAAPA), which contains high coverage (~30× depth) whole genome sequence data of 952 individuals of African ancestry. These individuals were selected from populations in North and South America, the Caribbean and continental Africa to form a large spectrum of New World African Diaspora. We merged all CAAPA data with 1963 individuals from the publicly available Human Origins genotype data. After filtering rare variation (MAF < 5%), there are 389,397 SNPs in autosomal chromosomes left for the analysis of population admixture in this work. The filtered SNPs are first phased using Shapeit with the reference panel from 1000 Genomes Project. PCA-based local ancestry estimation on the CAAPA dataset is performed with PCAdmix, using the continental reference samples from the Human Origins dataset. Ancestry-specific PCA (ASPCA) analysis of PCAmask, in which ancestry-specific regions from Europeans, Africans, and Native Americans are masked in PCA with sub-continental reference panels, reveals that the European ancestry in these New World African Diaspora populations comes from two main parts of Europe: Northwest (English/French) and Southwest (Spanish). We use Malder and Tracts to identify the timing of admixture in these populations.
The African and Native-American ancestries admixed with each other about 13-16 generations ago, and European ancestry entered these populations later, 6-8 generations ago. We show that the origin and time of European introgression are different between New World African ancestry populations. Our results clearly reflect the ancestry patterns of African admixed populations in America and provide a general pipeline to study the evolutionary history of other New World populations.

Percent African admixture is associated with telomere length in a healthy adult population.

Abstract:
Africans and African Americans (AA) have been shown to have longer leukocyte telomere length (LTL) than persons with European ancestry (EA), but the extent to which this is a function of a finer scale of admixture (i.e. percent African ancestry) remains unknown. We examined whether the percentage of African admixture is associated with telomere length in 283 healthy subjects (average age = 42.3, age range = 21-80 years, n female = 161; n AA = 127). Telomere length was calculated from whole genome sequence (WGS) data (>30x coverage on the Illumina HiSeq platform) for 7 contiguous repeats of the telomere motif (TTAGGG or CCCTAA) using the approach of Ding et al (2014). Admixture was estimated using 50,000 LD-pruned SNPs in STRUCTURE using three ancestral groups to calculate the % African and % European ancestry. Standard linear models were used to evaluate the association between telomere length and admixture estimates. In the AAs, average African ancestry was 80% (range: 45-98%) and average European ancestry was 20% (range: 2-55%). In the EAs average European ancestry was 99% (range: 72-99%) and average African ancestry was 1% (range: 0.01-19%). As previously observed, an overall comparison between the two groups reveals longer telomere length in AAs than EAs (84364 vs 78560 kb, p=0.0008). On a continuous scale, % of African ancestry was significantly correlated with telomere length (r=0.22, p=0.0002). Furthermore, in AAs, there is a strong association between % African ancestry and telomere length (each percent increase in African admixture was associated with an increase in telomere length of 6284 kb, p=0.0036). Given minimal variation in the EAs, there is no observed association with % African ancestry and telomere length in this group (p=0.1552). We confirm the prior observations that telomere length is different between AAs and EAs, and show here that within African Americans the % of African admixture is a significant predictor of telomere length. 
Future studies of racial differences in telomere length may need to account for differences in the proportion of African ancestry among subjects, particularly within African Americans.

Founder effects and bottlenecks can damage fitness by letting deleterious alleles drift to high frequencies. This almost certainly imposed a burden on Neanderthals and Denisovans, archaic hominid populations whose genetic diversity was less than a quarter of the level seen in humans today. A more controversial question is whether the out-of-Africa bottleneck created differences in genetic load between modern human populations. Some previous studies concluded that this bottleneck saddled non-Africans with potentially damaging genetic variants that could affect disease incidence across the globe today (e.g. Lohmueller, et al. 2009; Fu, et al. 2014), while other studies have concluded that there is little difference in genetic load between Africans and non-Africans (e.g. Simons, et al. 2014; Do, et al. 2015). Although previous studies have devoted considerable attention to simulating the accumulation of deleterious mutations during the out-of-Africa bottleneck, none to our knowledge have incorporated the fitness effects of introgression from Neanderthals into non-Africans. We present simulations showing that archaic introgression may have had a greater fitness effect than the out-of-Africa bottleneck itself, saddling non-Africans with weakly deleterious alleles that accumulated as nearly neutral variants in Neanderthals. Assuming that the exome experiences deleterious mutations with additive fitness effects drawn from a previously inferred gamma distribution, we predict that the fitness of the average Neanderthal was about 50% lower than the fitness of the average human, implying the existence of strong selection against early Neanderthal-human hybrids. This is a direct consequence of mutation accumulation during a period of low Neanderthal population size that is thought to have lasted ten times longer than the out-of-Africa bottleneck (Pruefer, et al. 2014). 
Although our model predicts some transmission of deleterious Neanderthal variation to present-day non-Africans, it also predicts that many Neanderthal alleles have been purged away, depleting conserved genomic regions of Neanderthal ancestry as observed empirically by Sankararaman, et al. (2014). Our results imply that the deficit of Neanderthal DNA from functional genomic regions can be explained without the action of epistatic reproductive incompatibilities between human and Neanderthal alleles.
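The claim in the abstract above, that weakly deleterious alleles behave as nearly neutral in a small population, can be illustrated with Kimura's diffusion approximation for the fixation probability of a new additive mutation. This is only a sketch, not the authors' simulation: the population sizes and selection coefficient are hypothetical values chosen for illustration, and the fitness convention (1, 1+s, 1+2s, with effective size equal to census size) is an assumption.

```python
import math


def fixation_probability(s, n):
    """Kimura's approximation for a new additive mutation in a diploid
    population of size n (assumed equal to the effective size).
    s is the heterozygote selection coefficient (negative = deleterious)."""
    initial_freq = 1.0 / (2 * n)
    if s == 0:
        return initial_freq  # neutral limit
    # (1 - exp(-4*n*s*p0)) / (1 - exp(-4*n*s)), written with expm1 for accuracy
    return math.expm1(-4 * n * s * initial_freq) / math.expm1(-4 * n * s)


def relative_to_neutral(s, n):
    """Fixation probability as a fraction of the neutral expectation 1/(2n)."""
    return fixation_probability(s, n) * 2 * n


# Hypothetical values: the same weakly deleterious allele in a small
# and in a ten-times-larger population.
small = relative_to_neutral(-1e-4, 1_000)
large = relative_to_neutral(-1e-4, 10_000)
print(f"small population: {small:.3f} of neutral rate")
print(f"large population: {large:.3f} of neutral rate")
```

Under these assumed values the allele fixes at roughly four-fifths of the neutral rate in the small population but at well under a tenth of it in the larger one, which is the qualitative pattern the abstract relies on.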

Y-chromosome diversity suggests southern origin and Paleolithic backwave migration of Austro-Asiatic speakers from eastern Asia to the Indian subcontinent.

Abstract:
Analyses of an Asian-specific Y-chromosome lineage (O2a-M95), the dominant paternal lineage (60.65% on average) in Austro-Asiatic (AA) speaking populations, who are found on both sides of the Bay of Bengal, led to two competing hypotheses of this group’s geographic origin and migratory routes. One hypothesis posits the origin of the AA speakers in India and an eastward dispersal to Southeast Asia, while the other places an origin in Southeast Asia with westward dispersal to India. Here, we collected samples of AA-speaking populations from mainland Southeast Asia and southern China and then analyzed both the Y-chromosome and mtDNA diversities. Combining our samples with previous data, we generated a comprehensive picture of the O2a-M95 lineage in Asia, including both AA and Daic speaking populations. We demonstrated that the O2a-M95 lineage originated in southern East Asia among the Daic-speaking populations ~20-40 thousand years ago and then dispersed southward to Southeast Asia after the Last Glacial Maximum before moving westward to the Indian subcontinent. This migration resulted in the current distribution of this Y-chromosome lineage in the AA-speaking populations. Further analysis of mtDNA diversity showed a different pattern, supporting a previously proposed sex-biased admixture of the AA-speaking populations in India.

Historical mating patterns in the U.S. revealed through admixture and IBD patterns from genome-wide data from over 800,000 individuals.

Abstract:
Within a diverse population like the United States, many individuals are admixed, with ancestry from many worldwide regions. Non-random mating and migration can result in non-random combinations of ancestries within admixed individuals (i.e., certain sets of ancestries may be common, and others may be rare); such dynamics can also affect patterns of identity-by-descent (IBD) among admixed and non-admixed individuals. To gain insight into historical mating and migration, we study genome-wide genotype data of over 800,000 AncestryDNA customers, as well as a subset of over 400,000 born in the US. First, we use a supervised algorithm to estimate individuals’ genetic admixture proportions across 26 global regions. We measure correlations between the estimated ancestries, and find that certain sets of ancestries frequently co-occur in individuals’ estimates. Such relationships may reflect historical events; e.g., the association between ancestry from the Americas and the Iberian Peninsula could reflect Colonial Era admixture. In addition to historical mating patterns, however, the admixture inference procedure and the delineation of global regions could also impact such correlations. Second, to disentangle whether these trends could reflect mating patterns and preferences, we examine associations between the estimated ancestries of the parents of over 10,000 trios. Observed correlations agree with many of those identified within individuals, and potentially reflect more recent historical trends. Third, we extend our study to IBD patterns in an inferred IBD network among genotyped individuals. Sub-clusters of the IBD network, which can often be annotated by ethnicity or historical US migration, are often inter-connected by bridging IBD connections; we highlight several connected sub-clusters in light of findings from genetic ancestry.
Finally, we corroborate findings from these three analyses, as well as their potential timescales, by examining over 500,000 AncestryDNA customer pedigrees. Associations of country-level birth locations between pairs of couples support many of the non-random associations of ethnicities and IBD connections identified using genetic data. Many of the associations we observe reflect historical phenomena, and while not conclusive about their cause, suggest that many individuals with admixed ancestry, including those in the US, have present-day genetic signatures reflecting the migration and subsequent non-random mating of their ancestors.

Discovery of a previously unknown ancestral origin of the modern Taiwanese population that is distinct from the north-south gradient seen in other Han Chinese populations using the Taiwan Biobank.

Abstract:
The aim of the Taiwan BioBank is to build a nationwide biomedical research database that integrates genomic profiles, lifestyle patterns, dietary habits, environmental exposure histories, and long-term health outcomes of 300,000 Taiwanese residents (representing almost 1.5% of the Taiwanese population). We describe here results from 8265 samples that were genotyped using the Taiwan BioBank array, which was specifically designed for the Taiwanese population. After data quality control, genotype data for 589,016 single-nucleotide polymorphisms (SNPs) in 7203 unrelated individuals were denoted as TWB7203 and further analyzed. The 7203 individuals were clustered into three cline subgroups: 4.5% were of northern Han Chinese ancestry, 77.6% were of southern Han Chinese ancestry, and 17.8% were an admixture of Han Chinese and a previously unknown ethnic group. This unknown group was genetically distinct from neighboring southeast Asian groups and Austronesian tribes, but was similar to the southern Han Chinese. Long-range linkage disequilibrium and flips of major alleles at about 400 SNPs across the major histocompatibility complex region suggested that the previously unknown group may have experienced evolutionary events different from those of the other southern Han Chinese. The difference was further supported by the unique pattern of body measurements in this unknown group. Genome-wide summary statistics for the ethnic subgroups of TWB7203 were released through a publicly accessible web-based calculation platform, Taiwan View (http://taiwanview.twbiobank.org.tw/taiwanview/twbinfo.do), on which genome-wide association analyses can be performed using TWB7203 as the reference. The release of this large-scale population-level and subpopulation-level genomic information will greatly benefit human genetic research.

Fine scale population structure of Spain and the genetic impact of historical invasions and migrations.

Abstract:
As well as being linguistically and culturally diverse, the Iberian Peninsula is unusual among European regions in that its demographic history includes a prolonged and large-scale occupation by people of predominantly north-west African origin. Therefore, the Iberian Peninsula provides a unique opportunity for studying fine-scale population structure and admixture, and for testing cutting-edge methods of detecting complex or subtle population genetic patterns. Previous studies using Y-chromosome, mtDNA and autosomal data have detected limited genetic structure in Iberia. However, powerful new methods and larger datasets mean it has recently become possible to detect and characterise genetic differentiation at a sub-national level. We performed the largest and most comprehensive study of Spanish population structure to date by analysing a dataset of ~1,400 Spanish individuals typed at ~700,000 SNPs. Using the fineSTRUCTURE method we detected striking and rich patterns of population differentiation within Spain, at scales down to tens of kilometres. Strikingly, the major axis of genetic differentiation in Spain runs from west to east, while conversely there is remarkable genetic similarity in the north-south direction. To infer details of historical population movements into Spain, we analysed Spain alongside a sample of ~6,000 individuals from Europe, North Africa, and sub-Saharan Africa. Across Spanish groups, we identify varying genetic contributions from north-west African ancestral populations, at times that all fall within the period of Islamic occupation. We also identify Basque-like admixture within Spanish groups to the south of the Basque-speaking region, implying southerly gene flow from this region. This analysis has revealed details of the strengths and weaknesses of different approaches to investigating population genetic history, as well as providing important new insights into the complex genetic history of Spain.

Prevalence of an archaic high altitude adaptive EPAS1 haplotype in the Himalayas.

Abstract:
Genetic, biochemical and morphological changes have enabled humans to adapt to living at high altitudes in Asia, Africa and South America. High altitude adaptation in Tibetans is reportedly influenced by introgression of a 32.7 kb long haplotype from the Denisovans, an extinct branch of archaic humans. This haplotype lies within the endothelial PAS domain protein 1 (EPAS1), a transcription factor acting in the hypoxia inducible factor pathway. A parallel study indicated that the same haplotype had probably entered the Tibetan population from the Sherpa, a high altitude adapted population from Nepal, thus suggesting that most likely the Denisovan introgression occurred in a population ancestral to the Sherpa and Tibetans. We genotyped 22 single nucleotide variants (SNVs) in this region in 1,550 Eurasian individuals, including 1,233 from Bhutan and Nepal residing at altitudes ranging from 86 – 4,550 m above sea level. Derived alleles for 5 SNVs (rs115321619, rs73926263, rs73926264, rs73926265, rs55981512) that characterize the core Denisovan haplotype (AGGAA) were present at high frequency not only in Tibetans and Sherpa, but also among many ethno-linguistic groups from Bhutan and Nepal. The frequency of the Denisovan core haplotype in these populations shows a significant correlation with altitude (Spearman’s correlation coefficient = 0.797, p-value = 6.996 × 10^-12). The Denisovan derived alleles were also observed at frequencies of 3-14% in the 1000 Genomes Project African samples and an additional 10 East and South Asian samples shared the Denisovan haplotype that extends beyond the 32 kb region. These additional samples enabled us to refine the haplotype structure and identify candidate functional variants that might be driving the selection signal.
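The altitude correlation reported in the abstract above is a Spearman rank correlation, which can be sketched as a Pearson correlation computed on ranks. The function names and the example altitudes and haplotype frequencies below are hypothetical and are not the study's data; ties are handled by average ranks, which is one common convention.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-population altitude (m) and core-haplotype frequency.
altitude_m = [100, 500, 2000, 4000]
haplotype_freq = [0.0, 0.1, 0.3, 0.6]
print(spearman_rho(altitude_m, haplotype_freq))
```

Because only ranks enter the calculation, the statistic measures monotonic association and is insensitive to the exact shape of the altitude-frequency relationship.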

Abstract:
Skin lightening among Eurasians is considered an adaptation to high latitude environments, and likely occurred independently in Europe and eastern Asia due to convergent evolution. In Europeans, several genes responsible for lightening have been found, but for East Asians the situation remains unclear. We conducted a genome-wide comparison between dark-skinned Africans and Austro-Asiatic speaking aborigines and light-skinned northern Han Chinese, and identified a pigmentation gene, OCA2, showing unusually deep allelic divergence between them. An amino acid substitution (His615Arg) of OCA2, prevalent in most eastern Asian populations but absent in Africans and Europeans, was significantly associated with skin lightening in northern Han Chinese. Further transgenic and targeted gene modification analyses in zebrafish and mice both recapitulated the phenotypic effect of the OCA2 variant, resulting from decreased melanin production. Our results indicate that OCA2 plays a key role in the convergent skin lightening of East Asians during recent human evolution.

Institutes
1) Department of Ecology and Evolutionary Biology, University of Arizona, Tucson, AZ; 2) Interdisciplinary Program in Statistics, University of Arizona, Tucson, AZ; 3) Arizona Research Laboratories, Division of Biotechnology, University of Arizona, Tucson, AZ; 4) Department of Molecular and Cellular Biology, University of Arizona, Tucson, AZ.

Abstract:
Siberia is one of the coldest environments on Earth and has great seasonal temperature variation. Recent archeological studies indicate that humans have occupied Siberia for at least ~45,000 years, and persisted through the Last Glacial Maximum in North Eurasia. As early modern humans dispersed from their ancestral tropical African homeland into much cooler environments, long-term settlement in Siberia undoubtedly required biological adaptation to severe cold stress, dramatic variation in photoperiod, as well as limited and highly variable food resources. Humans are the only primate species other than the Japanese macaque that has adapted to boreal conditions, where temperatures remain far below freezing for more than half the year, pointing to intense selection pressures that likely drove the enhancement of physiological processes that generate and conserve heat. Physiological evidence, such as differences in basal metabolic rates and brown adipose tissue, suggests genetic adaptations in Arctic populations to life at high latitude. Because many of these physiological traits, including body mass and metabolic processes, are highly polygenic, we sought signatures of polygenic selection in Siberian populations. We sequenced exomes of individuals from two indigenous Siberian populations: the Nganasans (N = 21), who are the northernmost indigenous group in the world, and the Yakuts (N = 21), who live in the coldest regions on our planet. To detect polygenic selection, we performed gene-set enrichment analysis using pathways from the NCBI Biosystems database as well as a set of candidate genes that have been previously implicated in cold adaptation. The significance of the candidate gene sets for polygenic selection was assessed using whole-exome coalescent simulations to account for potential biases caused by demographic processes and heterogeneity in mutational and recombination rates across the entire genome.
Our results thus give insight into the complex polygenic basis of adaptation to life in cold environments in human populations.

Abstract:
Complete high coverage individual genome sequences carry the maximum amount of information for reconstructing the evolutionary past of a species in the interplay between random genetic drift and natural selection. Here we use a novel dataset of 447 human genomes sequenced at 40X on the same platform (Complete Genomics) and processed with uniform bioinformatic pipelines. Based on SNP-chip data we generally chose three samples to represent each population of interest. We cover a wide range of mostly Eurasian populations with additional populations from Oceania, South America and Africa. Here we describe the dataset in terms of data quality and newly recovered genetic variation that originates predominantly from previously subsampled continental regions. Using MSMC, D-statistics and fineSTRUCTURE we have shown that the peopling of the world from Africa is best explained by at least two migration waves (See Lawson et al abstract nr …). Here we expand on these conclusions by investigating short IBD segment sharing patterns using diCal, HapFABIA etc. We also disentangle split times involving the two migrations out of Africa (OoA), by running MSMC separately on genome chunks derived from OoA1 and OoA2. We also present detailed regional population histories in reconstructions of past dynamics of effective population size and population split times.

Abstract:
The ability to taste phenylthiocarbamide (PTC) and 6-n-propylthiouracil (PROP) is a classic polymorphic trait that is mediated by the TAS2R38 bitter taste receptor gene. These taste phenotypes have been shown to be correlated with the ability to taste other taste-active compounds, as well as with food habits. Nonetheless, several features of its evolutionary significance and population dynamics are still unresolved. In particular, it is not clear why the worldwide frequency of the TAS2R38 non-taster AVI haplotype is very high, almost equivalent to that of the taster PAV haplotype. While the long-standing hypothesis suggests that balancing selection has been acting on this locus, other theories have emerged more recently. We performed a detailed analysis of the TAS2R38 gene and its surrounding regions in a sample of 5511 individuals belonging to 104 different worldwide populations. Our results show no departures from neutral expectations. This suggests that recent demographic events have had a major role in shaping the genetic diversity at this locus, suggesting a reconsideration of the classic hypothesis. We also hypothesize that interactions with the adjacent maltase-glucoamylase (MGAM) gene may have contributed to the current distribution of PAV and AVI haplotypes. One hypothesis is that the distribution of the uncommon TAS2R38 AAI haplotype is interpretable as the product of a recent recombination event that occurred in Africa, after the Out Of Africa (OOA) event. Collectively, our results offer novel insights into the evolutionary history of the TAS2R38 gene, showing a relaxation of the selective forces previously acting on this gene, and providing a new hypothesis for the observed present-day worldwide distribution of AVI and AAI haplotypes.

Abstract:
Adaptive evolution in recent human history remains poorly characterized. Human population genetics has focused on strong selective sweeps. However, our understanding of other selective patterns and their effects on patterns of human genetic diversity is still limited. Although there is compelling evidence for recent selection pushing increased height in northern Europe, the literature is devoid of other strong notable examples of recent polygenic selective events. We develop a non-comparative scoring method for individual polymorphisms to detect signals of recent adaptation, based on using singleton density to approximate haplotype age. Simulations suggest that this method preferentially detects recent evolutionary events with several different sweep patterns. We apply this method to 1,600 individuals from the ALSPAC cohort and confirm known selection signals in Northern Europeans, as well as broader signals of polygenic selection. We investigate associations with these signals and demonstrate that these signals are robust to population allele frequency differences in Europeans. We use this method in combination with population allele frequency differences to identify novel signals of polygenic adaptation in modern Europeans.

Mexicans are a recent admixture of Amerindians, Europeans and Africans. We performed local ancestry analysis of Mexican samples from two genome-wide association studies obtained from dbGaP and discovered that at the MHC region Mexicans have an excess of African ancestral alleles compared to the rest of the genome, which is the hallmark of recent selection in admixed samples. The estimated selection coefficients are 0.07 and 0.09 for the two datasets, which puts our finding among the strongest known cases of selection observed in humans, alongside the lactase gene in northern Europeans and the sickle-cell allele in Africans. Inaccurate Amerindian training samples were a major concern for the credibility of previously reported selection signals in Latinos. Taking advantage of the flexibility of our statistical model, we devised a model-fitting method that can learn Amerindian ancestral haplotypes from the admixed samples, which allows us to infer local ancestries for Mexicans using only European and African training samples. The strong selection signal at MHC remains without Amerindian training samples. One wonders why such a strong selection signal was not discovered by the 1000 Genomes Project in their analysis of Mexican samples using other competing local ancestry inference models. Our simulation studies suggested that the approach adopted by the 1000 Genomes admixture analysis group, which used consensus estimates from four methods, is perhaps to blame. Finally, we pointed out that medical history studies suggested such a strong selection signal is plausible in Mexicans.

Abstract:
Genetic variants with strong reproductive disadvantages are evolutionarily constrained and generally remain rare in populations. However, these variants can still exist at higher frequencies in young populations, such as Finns, when negative selection hasn’t had time to counteract the effect of genetic drift on rare alleles. Thus, population isolates provide a valuable study design to explore the role of rare genetic variants in complex traits. In Finland, the youngest settlement is in the north and east parts of the country, dating back to a small number of founder families only a few centuries ago. In addition, this region has a higher prevalence of schizophrenia and intellectual disability (ID). We explored this hypothesis by producing whole exome sequence (WES) and GWAS data from 352 patients from Northern Finland with an ICD-10 diagnosis of ID of unknown etiology, and their 293 family members (97 trios, 109 duos and 146 index cases). The Northern Finland Intellectual Disability Project (NFID) exomes were combined with 8000 Finnish exomes sequenced in the Sequencing Initiative Suomi project (SISu, http://sisuproject.fi/). As expected, we observed a comparable number of large CNVs and de novo mutations as reported in similar patient collections, both of these categories being enriched in the NFID patients. Given the genetic origin of NFID, we expected to observe variants enriched in Finland that are 1) strong acting recessive variants that seem Mendelian but account for ~1% of a 1% phenotype rather than all of a 1/10000 phenotype and 2) dominant alleles with odds ratios in the range of 2-5. As per our hypothesis we discovered a Finnish-specific recessive cause of ID in 4 cases, homozygosity of a variant in CRADD (p=4e-8). The variant is not observed in homozygous state in 61 000 individuals worldwide (http://exac.broadinstitute.org/) or in 8000 Finnish individuals.
We also observed Finnish enriched dominant missense variants in multiple genes (OR range 3-6) including a gene encoding for TUBA1A1 (OR 5.2, p=4e-8). Significant and promising variants are replicated by sequencing additional Northern Finnish ID cases and their family members (n=315; 150 cases; 51 trios). In conclusion, we demonstrate young founder populations as a powerful resource to study rare variants. Specifically, we show that an enrichment of deleterious alleles increases power to detect causal and disease associated variants that would require very large sample sizes in more diverse populations.

Abstract:
Genome wide association studies (GWAS) have begun uncovering the genetic architecture of a wide array of human quantitative traits, including morphological traits like height and BMI as well as complex diseases like schizophrenia and diabetes. Interpreting GWAS results in evolutionary terms can provide us with an unprecedented insight into the evolution of quantitative traits in humans and may help guide the design of future mapping studies. Evolutionary processes, such as mutation, selection, drift and pleiotropy, shape the genetic architecture of quantitative traits but very few existing models incorporate them in a way that’s meaningful for GWAS interpretation. We extend Fisher’s Geometric Model to quantitative variation and use it to obtain predictions of the genetic architecture of quantitative traits under different evolutionary scenarios. Under this model, we relate the phenotypic effects of variants to the selection acting upon them and we see that weakly and strongly selected variants are expected to carry more of the variance than nearly neutral sites and therefore be easier to detect in GWAS. However, variants under very strong selection would be too rare to be detected. Pleiotropy is represented by the dimensionality of the trait space and pleiotropic effects weaken the relationship between effect size and selection. The overall effect of pleiotropy on GWAS success is to effectively increase GWAS power. Our analysis suggests that the increase in GWAS success with increasing study size may be highly sigmoidal for some traits and that the increase may be quite dramatic once a large enough study size is reached. This model may provide the basis for using GWAS results to infer the strengths of evolutionary forces shaping quantitative traits in humans.

Abstract:
Allele frequency estimates in Chinese people are an important input for the genetic map of the Chinese population and for other epidemiological studies, including the molecular prevalence of genetic diseases. The Chinese Han populations in the 1000 Genomes Project, released in 2012, have been the most widely used database for variants, especially SNPs, in Chinese people, providing a precise map of genome variation and accelerating many Chinese population studies. With only 90 Han people in China sequenced, however, the sample is very small compared with a Chinese population of over a billion, and the estimated allele frequencies may deviate from those of the population at large due to sampling randomness. It was estimated that over 1,000,000 pregnant women took sequencing-based non-invasive prenatal testing (NIPT) for fetal aneuploidy screening in China from January 2011 to June 2015. A low-coverage (~0.1X) WGS strategy was mostly used in clinical labs in China; the resulting data represent a large-scale, randomly sampled population and constitute a significant genetic database for Chinese populations, with informative phenotypes including territorial distribution, maternal age, nationality and region. Until now, most population-based algorithms for computing allele frequencies were developed and validated on 30X WGS data and were not appropriate for low-coverage sequencing data. It has been hard to extract population knowledge from such large but ultra-low coverage data, which require specific models for population SNP calling, demanding computational tools and mass storage. In this study we developed a maximum-likelihood method to estimate allele frequency in the Chinese population and applied it to NIPT data from over 150,000 samples. A Chinese genetic map of over 150,000 Chinese people was built and genome-wide allele frequencies in this large-scale sample were studied.
We also analyzed the prevalence of common single-gene disorders such as thalassemia, DMD, SMA and hearing loss in different regions in China from 2011 to 2015. Our findings were compatible with current epidemiology reports in Chinese populations and give a picture of the molecular prevalence of genetic diseases in China. To our knowledge, it was the largest population studied to date. Our studies improved the understanding of variants in Chinese populations, promoting more potential uses of NIPT samples in population genetics.
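The abstract above does not describe its maximum-likelihood method, so the following is only a generic sketch of the idea under a deliberately simple model, not the authors' algorithm. It assumes that at ~0.1X coverage each read at a site is an independent draw of one allele from the population, and that reads are miscalled with a known symmetric error rate; the function name, parameters and example counts are hypothetical.

```python
def ml_allele_frequency(alt_reads, total_reads, error_rate=0.0):
    """Maximum-likelihood alternate-allele frequency at one site.

    Assumed model: each read shows the alternate allele with probability
    q = p*(1 - e) + (1 - p)*e, where p is the population frequency and e a
    symmetric base-call error rate. The binomial MLE of q is
    alt_reads / total_reads; inverting the relation and clipping to [0, 1]
    gives the MLE of p.
    """
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    if not 0 <= alt_reads <= total_reads:
        raise ValueError("alt_reads must lie between 0 and total_reads")
    if not 0.0 <= error_rate < 0.5:
        raise ValueError("error_rate must be in [0, 0.5)")
    observed = alt_reads / total_reads
    corrected = (observed - error_rate) / (1.0 - 2.0 * error_rate)
    return min(1.0, max(0.0, corrected))


# Hypothetical site: 108 alternate reads among 1000 reads pooled across
# samples, with an assumed 1% error rate.
print(ml_allele_frequency(108, 1000, error_rate=0.01))
```

A realistic method would also need to handle multiple reads from one individual, the mixture of maternal and fetal DNA in NIPT samples, and site-specific error rates, none of which this sketch attempts.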

Abstract:
Purpose: Historical and linguistic studies have suggested that Roma people, living mainly in Europe, migrated into the continent from South Asia about 1000-1500 years ago. Genetic studies, based on the examination of Y chromosome and mitochondrial DNA data, confirmed these findings. Recent genetic studies based on genome-wide Single Nucleotide Polymorphism (SNP) data further investigated the history of Roma and, among many other findings, suggested that the South Asian ancestry in Roma originates mainly from the Northwest region of India.
Methods: In this study, also using genome-wide SNP data, we attempted to refine these findings using a significantly larger number of European Roma samples. We also had the opportunity to use more data from distinct Indian ethnic groups, which provided a higher resolution of the Indian population. The study uses several ancestry estimation methods: the algorithmic method of principal component analysis, and model-based methods that apply a Bayesian approach and use Markov chain Monte Carlo or maximum likelihood estimation.
Results: According to our analyses, Roma showed significant common ancestry with Indian ethnic groups of the states of Jammu and Kashmir, Punjab, Rajasthan, Gujarat and Uttarakhand, e.g. with Kashmiri Pandit, Punjabi, Meghawal, Gujarati and Tharu. However, we also found strong common ancestry with Pashtun and Sindhi, ethnic groups living in Pakistan. Populations of Northeast India also have strong common ancestry with Roma. These ethnic groups are Brahmin, Kshatriya, Vaish.
Conclusion: We can conclude that Northwest India plays an important role in the South Asian ancestry of Roma, but Roma have similarly strong common ancestry with some Pakistani ethnic groups, and populations in the east region of North India could also have served as a source of the Indian ancestry of Roma. However, ethnic groups of the southern region of India do not show a strong relationship with the Roma people living in Europe.

Abstract:
Modern India is a region of remarkable cultural, linguistic, and genetic diversity with over 4,500 anthropologically well-defined groups. Large genetic differentiation has been observed between many of these groups, reflecting strong founder events with effects that have been preserved in some cases for thousands of years due to low genetic exchange between groups. We undertook a systematic survey to assess the strength of founder events in over 1200 individuals from over 230 Indian groups genotyped on Affymetrix (6.0 and Human Origins) and Illumina (650K) arrays. These groups include tribes, castes, and religious groups with a wide range of census sizes and span every state in India. We also analyzed Ashkenazi Jews and Finns, two groups known to have high rates of recessive diseases due to strong founder events. To determine the severity of founder events, we measured the total length of the genome inherited identical-by-descent (IBD) in each group. The data were phased with Beagle 3.3.2, and detection of IBD fragments was performed using FastIBD and GERMLINE. The HaploScore algorithm was used to filter out false positive fragments. To reduce the influence of recent consanguinity, we excluded closely related individuals detected by the presence of very long IBD segments. We quantified the IBD score for a group as the combined length of IBD segments between 3 and 20 cM long, averaged over all pairwise comparisons within the group. We find that over 100 Indian groups in our dataset have founder effects stronger than in Ashkenazi Jews and Finns, including many groups with large census sizes (>1 million). This represents an extraordinary opportunity for biological discovery and potential reduction of genetic disease burden through mapping of recessive disease genes and prenatal counseling.
Future work should focus on better characterization of the history and relationships amongst the founder events, as well as mapping variants associated with genetic diseases in the groups with the strongest founder events.
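The IBD score defined in the abstract above (combined length of 3 to 20 cM segments, averaged over all within-group pairs) can be sketched in a few lines. The input layout, the function name and the example segments are hypothetical, and treating the 3 and 20 cM bounds as inclusive is an assumption; phasing, IBD detection and HaploScore filtering are taken as already done.

```python
def ibd_score(individuals, segments, min_cm=3.0, max_cm=20.0):
    """Average per-pair total length (cM) of IBD segments within a group.

    individuals: iterable of sample IDs belonging to the group.
    segments: iterable of (id_a, id_b, length_cm) tuples, one per IBD segment.
    Segments outside [min_cm, max_cm], or involving anyone outside the
    group, are ignored. Pairs sharing no segment still count in the average.
    """
    members = set(individuals)
    n_pairs = len(members) * (len(members) - 1) // 2
    if n_pairs == 0:
        raise ValueError("need at least two individuals in the group")
    total_cm = 0.0
    for id_a, id_b, length_cm in segments:
        in_group = id_a in members and id_b in members and id_a != id_b
        if in_group and min_cm <= length_cm <= max_cm:
            total_cm += length_cm
    return total_cm / n_pairs


# Hypothetical group of three samples; sample D is not in the group.
group = ["A", "B", "C"]
example_segments = [
    ("A", "B", 5.0),   # counted
    ("A", "B", 25.0),  # too long, e.g. recent relatedness
    ("B", "C", 4.0),   # counted
    ("A", "D", 10.0),  # D is outside the group
    ("A", "C", 2.0),   # too short
]
print(ibd_score(group, example_segments))  # (5.0 + 4.0) / 3 pairs
```

Dividing by the number of possible pairs, not only the pairs that share a segment, keeps scores comparable between groups of different sizes.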

Abstract:
The UK’s 100,000 Genomes Project has begun consenting participants to its main programme through a network of 11 NHS Genomic Medicine Centres (GMCs) spanning over 70 local delivery partner institutions. It has four main aims: (1) to bring benefit to patients; (2) to create an ethical and transparent programme based on consent; (3) to enable new scientific discovery and medical insights; and (4) to kick-start the development of a UK genomics industry. The project focuses on patients with a rare disease and their families (approximately 50,000 genomes), as well as patients with certain common cancers (about 25,000 tumour-normal pairs). Whole genome sequencing of DNA extracted from blood samples is performed to at least 30x depth for germline samples and 75x for tumour samples, using Illumina’s HiSeq X Ten sequencing platform. PCR-free library preparation is employed when feasible. All germline samples are required to achieve at least 15x high-quality coverage over 95% of the autosomal genome. One of the innovations of the project is the collection of phenotype data from the GMCs in a comprehensive and standardised manner, either through direct data entry or by the population of data models directly from Electronic Health Record systems. Data models for each of the 120 currently eligible rare genetic conditions, as well as for cancers, have been developed in consultation with clinical experts to support clinical data capture and phenotyping. This approach was designed to enable clinical interpretation and large-scale genomic research. Clinical data models include questions about the presence or absence of human phenotype ontology (HPO) terms, additional clinical tests, and family history. To date, data models include 1370 different HPO terms, with a median 37 terms per condition and range 2 (hyperinsulinism) to 116 (mitochondrial disorders). Over 200 phenotypes have been proposed for addition to the HPO.
One of the important benefits of the programme will be to obtain HPO term frequencies based on large numbers of patients, rather than occurrence in OMIM entries, informing more powerful variant prioritisation strategies.
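
The germline quality criterion quoted in this abstract (at least 15x coverage over 95% of the autosomal genome) can be illustrated with a minimal Python sketch. The function name `passes_coverage` and the idea of passing a flat list of per-position depths are assumptions made for illustration; the project's actual pipeline and its definition of "high-quality" coverage are not described in the abstract.

```python
def passes_coverage(depths, min_depth=15, min_fraction=0.95):
    """Return True if at least `min_fraction` of positions reach `min_depth`.

    `depths` is a hypothetical list of per-position read depths over the
    autosomal genome. The defaults mirror the thresholds quoted in the
    abstract (15x over 95% of positions).
    """
    if not depths:
        raise ValueError("depths must be non-empty")
    # Count positions whose depth meets or exceeds the threshold.
    covered = sum(1 for d in depths if d >= min_depth)
    return covered / len(depths) >= min_fraction


# Toy example: 96 well-covered positions and 4 poorly covered ones.
sample_ok = [30] * 96 + [5] * 4
# Toy example: only 90 of 100 positions reach the threshold.
sample_low = [30] * 90 + [5] * 10
```

Here `passes_coverage(sample_ok)` is true because 96% of positions reach 15x, while `passes_coverage(sample_low)` is false.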

I wonder why Neanderthal FOXP2 segments failed to survive in modern human genomes following admixture; this seems to suggest there was some crucial difference. From this paper:

A strong depletion of Neandertal lineages spanning ~17 Mb on 7q encompasses the FOXP2 locus (Fig. 2A), a transcription factor that plays an important role in human speech and language (13).

Is there some explanation available?

FOXP2 is an important gene, and mutation of it has strong phenotypic consequences. This is probably because of genetic interactions, as well as physical interactions with other proteins.

Neanderthal alleles of this protein may have had effects large enough to be strongly selected against in a background of mostly modern human alleles at other genes.

Neanderthals had increased genetic drift due to long times at very low population density. This increases the chance that slightly negative alleles become fixed. Transferred to modern humans, they probably were more than slightly negative.
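
The drift argument above can be made concrete with Kimura's diffusion approximation for the fixation probability of a new mutation. The sketch below is only an illustration under stated assumptions: genic (additive) selection with coefficient `s`, a diploid population whose census and effective sizes are both `n_e`, and purely illustrative population sizes that are not estimates for Neanderthals or modern humans.

```python
import math


def fixation_probability(s, n_e):
    """Kimura's approximation for a new mutation at frequency 1/(2*n_e).

    s   : selection coefficient (negative means deleterious), genic selection.
    n_e : effective population size (assumed equal to census size here).
    """
    if s == 0:
        # Neutral case: fixation probability equals the initial frequency.
        return 1.0 / (2 * n_e)
    # (1 - exp(-2s)) / (1 - exp(-4*n_e*s)), written with expm1 for accuracy.
    # Note: very large values of |4*n_e*s| with s < 0 can overflow.
    return math.expm1(-2 * s) / math.expm1(-4 * n_e * s)


# Illustrative values only: the same slightly deleterious allele
# in a small and in a tenfold larger population.
s = -0.001
p_small = fixation_probability(s, 1_000)
p_large = fixation_probability(s, 10_000)
```

With these made-up numbers the allele is far more likely to fix in the small population than in the large one, which is the qualitative point of the comment: selection against slightly deleterious alleles is less effective when the population is small.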

“Genome-wide data on 34 ancient Anatolians identifies the founding population of the European Neolithic.”

Great paper and very timely. The title is a bit presumptuous, however; as the abstract says, “[it] is genetically a *plausible* source for the first farmers of Europe..”

“Ancient European haplotype enrichment in modern Eurasian populations.”

“[..]the Stuttgart individual had the lowest PWD disparity between all modern populations for the SNP blocks that contain the IL17R and CD3 genes, which potentially indicates selection acting on these immune system haplotypes from the Stuttgart individual consistent with the Stuttgart farmer and modern Europeans’ continual close interaction with animals and zoonotic disease exposure.”

This is pretty interesting. I get a lot of comments from people about HLA haplotypes being under selection and therefore useless for population genetics studies; however, I have not yet seen any plausible suggestions of such selection among prehistoric Europeans [recent B*53 selection in West Africa due to malaria is well documented]. This paper offers a good example, though whether it also applies to the MHC locus I don’t know.

“We propose an empirical approach that is able to infer demographic histories and to detect IBD regions simultaneously. This approach comprises of an empirical model of recombination and an IBD detection algorithm. The empirical model builds coalescent trees with recombination events based on genomic similarities of individuals, and the detection algorithm incorporates the information of coalescent trees with recombination events to identify IBD regions. These two procedures can be executed iteratively till no new IBD regions found and no new changes in coalescent trees. In addition, the two procedures can be in parallel in each iteration to improve computational efficiency. We applied our method in simulated data and two real datasets: the 1,000 genomes and the HLA alleles in Taiwan populations.
[..]Finally in the HLA alleles in Taiwan populations, we demonstrate the pure utility of the empirical recombination model for recent demographic inference. Therefore, our proposed method is capable of detecting IBD regions efficiently and making demographic inference comprehensively.”

This sounds not unlike what I have been doing with HLA allele/haplotype data. Very interesting to read this piece.

“Strong selection at MHC in Mexicans since admixture.”

I had a look at the top 100 HLA haplotypes of Mexicans/Chicanos [N= 261,235] taken from the USA National Marrow Donor Programme. Of the top 10 haplotypes, 6 are of European and 4 are of Native American origin, so this is probably correct.

There are so many interesting papers here, thanks for posting them! Will take me a while to go through them all.

Regarding the abstract on ancient Anatolians, it’s great to finally have some DNA from that region. The FSTs are small, but not minuscule like the differences between modern English and modern French. It’s more like English vs Belorussian and English vs Corded Ware DNA. Is that the kind of result you’d expect to find between early Anatolians, early Germans, and early Spaniards? I have little concept of what would be the “right” number if EEFs were direct descendants of these samples.
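
For readers wondering how an FST figure like the ones in the abstract is computed at all, here is a minimal Python sketch of Hudson's estimator in the ratio-of-averages form. This is an assumption for illustration: the abstract does not say which estimator the authors used, and the function name, inputs, and toy frequencies below are hypothetical.

```python
def hudson_fst(freqs1, freqs2, n1, n2):
    """Hudson's FST estimator, combined across SNPs as a ratio of averages.

    freqs1, freqs2 : sample allele frequencies per SNP in each population.
    n1, n2         : number of sampled chromosomes in each population.
    """
    numerator = 0.0
    denominator = 0.0
    for p1, p2 in zip(freqs1, freqs2):
        # Squared frequency difference, corrected for finite sample size.
        numerator += (
            (p1 - p2) ** 2
            - p1 * (1 - p1) / (n1 - 1)
            - p2 * (1 - p2) / (n2 - 1)
        )
        # Probability that two alleles drawn from different populations differ.
        denominator += p1 * (1 - p2) + p2 * (1 - p1)
    if denominator == 0:
        raise ValueError("no informative SNPs (both populations fixed alike)")
    return numerator / denominator


# Toy data: two populations with slightly different frequencies at 4 SNPs.
fst_close = hudson_fst([0.50, 0.20, 0.70, 0.40], [0.52, 0.22, 0.68, 0.41], 40, 40)
# Toy data: a single SNP fixed for different alleles in each population.
fst_fixed = hudson_fst([1.0], [0.0], 40, 40)
```

The estimator is 1 when the populations are fixed for different alleles and hovers around 0 (it can be slightly negative) when frequencies are nearly identical, so values such as 0.004 or 0.014 sit close to the "barely differentiated" end of the scale.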

I am curious, do you know anything about what was going on around other parts of the Black Sea at this time? It was a freshwater lake until 5600 BC, was it not? It seems like an odd coincidence that the transition to saltwater took place right around the time the Linear Pottery Culture moved up the Danube.

Neanderthal article by Harris and Nielsen – Neanderthals accumulated a higher percentage of deleterious mutations due to low population size. This maybe raises the question of whether specific Neanderthal vs modern human anatomical and genetic adaptations actually have much to do with what we presuppose makes modern humans special, or whether we’re mostly just a small-faced, narrow-bodied bunch who happened to live in a region suited to maintaining a high population size and avoided a lot of population crashes. It also seems they should consider the likelihood of archaic admixture within Africa.

Adaptation in global human populations – it always seems to make intuitive sense to me that a more genetically diverse population should have fewer sweeps of particular variants to very high frequency, as there is more “competition” for selective pressure (and, to a lesser extent, this may be why there seem to be more derived variants in East Asia vs West Eurasia, on a smaller scale of difference).