Paleo-Eskimo and Siberians

Tatiana Tatarinova

USC, CHLA, USA/Russia

Tracing individuals to the place where their DNA was formed at the population level is a formidable challenge in genetic anthropology, population genetics, and personalized medicine [1]. The vast progress in developing resources for identifying candidate gene loci for medical care and drug development [2] has been largely unmatched in the field of biogeography and ancestral inference. Only in the past decade have researchers begun harnessing high-throughput genetic data to improve our understanding of global patterns of genetic variation and their correlation with geography. This correlation is not surprising, because genetic variation is largely determined by a demographic history of inbreeding or admixture, which often varies between geographic regions. Although interest in biogeography methods has grown in the past few years, only a few computational tools exist, particularly for the analysis of mixed individuals [3-7]. These methods can be local (focusing on the origin of chromosomal segments), such as Lanc-CSV [8], LAMP-LD [9], and MULTIMIX [10]; global (estimating average ancestral proportions across the genome), such as ADMIXTURE [11], STRUCTURE [12, 13], and reAdmix [7]; or both, such as HAPMIX [14] and LAMP [9, 15]. Some popular applications are PCA-based. For humans, PCA was shown to be accurate to within 700 kilometers in Europe [3]. Spatial Ancestry Analysis (SPA and SPAMIX) [16] is an advanced tool that explicitly models allele frequencies. SPAMIX is reported to have an accuracy of 550 kilometers for individuals admixed from two ancestries. Algorithms like mSpectrum [18], HAPMIX [13], and LAMP [8] achieve good accuracy at continental resolution [18] but do not achieve country-level resolution.

Related tools like BEAST [17], STRUCTURE [13], and Lagrange [18] are either inapplicable to autosomal data or cannot be used to study recent admixture in humans, animals, and plants. We note that looking at the Y chromosome and mtDNA alone is insufficient for detailed biogeographic analysis, since closely related populations have similar distributions of haplogroups. To address these limitations, we recently developed an admixture-based tool, Geographic Population Structure (GPS), which can accurately infer the ancestral origin of unmixed individuals [19]. GPS infers the geographical origin of individuals by comparing their “genetic signatures” to those of reference populations known to exhibit low mobility in the recent past. The accuracy of GPS was demonstrated by assigning 83% of worldwide individuals to their country of origin and 65% to a particular region of that country. Applied to over 200 Sardinian villagers, GPS placed 25% of them in their villages and ≈ 50% within 50 kilometers of their villages. However, contemporary individuals are not necessarily sedentary: they often migrate to different areas and bear offspring of mixed geographical origins. GPS would incorrectly place such offspring at the midpoint between the parental origins, a result unsuitable for pharmacology, forensics, and genealogy; GPS is therefore not equipped to handle mixed individuals. Moreover, individuals often have an indication of at least one of their possible origins, which could be used to improve the prediction, but existing tools are not designed to consider such information. To address these limitations, we developed reAdmix [7], a tool that models individuals as a mix of populations and can use user input to improve prediction accuracy.
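The midpoint problem can be illustrated with a toy calculation. The Python sketch below is not the published GPS or reAdmix algorithm; it only assumes that a genetic signature is a vector of admixture proportions and that reference populations have known coordinates. The reference panel, the function names (nearest_reference, best_two_way_mix), and the grid search over mixing weights are all hypothetical choices made for illustration.

```python
import numpy as np

# Hypothetical reference panel: admixture signatures (each row sums to 1)
# and (latitude, longitude) coordinates for four made-up populations.
REF_SIGNATURES = np.array([
    [0.80, 0.15, 0.05],  # population 0
    [0.10, 0.80, 0.10],  # population 1
    [0.05, 0.15, 0.80],  # population 2
    [0.40, 0.40, 0.20],  # population 3, genetically intermediate
])
REF_COORDS = np.array([
    [40.0, 9.0],
    [60.0, 90.0],
    [10.0, 40.0],
    [50.0, 50.0],
])


def nearest_reference(signature):
    """Single-origin assignment: index of the closest reference signature."""
    distances = np.linalg.norm(REF_SIGNATURES - signature, axis=1)
    return int(np.argmin(distances))


def best_two_way_mix(signature, steps=20):
    """Mixture assignment: search all pairs of references and a grid of
    mixing weights for the combination that best reproduces the signature.

    Returns (error, index_a, index_b, weight_of_a).
    """
    best = (np.inf, None, None, None)
    n = len(REF_SIGNATURES)
    for i in range(n):
        for j in range(i + 1, n):
            for k in range(steps + 1):
                w = k / steps
                mix = w * REF_SIGNATURES[i] + (1 - w) * REF_SIGNATURES[j]
                err = float(np.linalg.norm(mix - signature))
                if err < best[0]:
                    best = (err, i, j, w)
    return best


if __name__ == "__main__":
    # Simulated offspring of parents from populations 0 and 1.
    child = 0.5 * REF_SIGNATURES[0] + 0.5 * REF_SIGNATURES[1]
    idx = nearest_reference(child)
    print("single-origin guess:", idx, REF_COORDS[idx])
    print("two-way mixture:", best_two_way_mix(child))
```

In this toy panel, the single-origin assignment places the simulated offspring in population 3, which lies roughly halfway between the two parental locations, whereas the two-way search recovers populations 0 and 1 in equal proportions.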

After demonstrating the accuracy of reAdmix on simulated datasets, we applied the algorithm to analyze individuals of presumed Ket origin. The Kets, an ethnic group in the Yenisei River basin, Russia, are considered the last nomadic hunter-gatherers of Siberia, and the Ket language has no transparent affiliation with any language family. We collected data from 46 unrelated Ket samples and 42 samples from neighboring ethnic groups (Uralic-speaking Nganasans, Enets, and Selkups). We compared the GenoChip SNP array data for the Ket, Selkup, Nganasan, and Enets populations to a worldwide collection of populations based on 130K ancestry-informative markers [20]. We applied the GPS and reAdmix algorithms to infer the provenance of the samples and confirm self-reported ethnic origin. Combining the output of the two algorithms, we identified a subset of non-admixed Kets among the self-identified Ket individuals and nominated two individuals for whole-genome sequencing. Analysis of these carefully selected individuals enabled us to establish that Kets belong to a group of modern populations closest to an ancient source of Siberian ancestry in Saqqaq.