April 28, 2011

In my comments on Moorjani et al. (2011), I argued that the admixture proportions presented in the paper are inaccurate, and gave my reasoning behind this claim. Moorjani et al. (2011) also present STRUCTURE 2.2 results.

Naturally, I wanted to see whether independent admixture estimates for some of the same populations had been published in the literature. This brought me to Pugach et al. (2011), which introduced a wavelet-based admixture estimation method called StepPCO. In that paper, the authors presented estimates of the extent and timing of admixture for some populations also included in Moorjani et al. (2011). They also compared with HAPMIX, a well-known method using a completely different methodology, and presented their comparative data in this table and in the body of their paper.

Hence, we now have 4 different estimates of admixture for some populations. To these, I decided to add supervised ADMIXTURE 1.1 results. I used CEU and YRI as "West Eurasian" and "Sub-Saharan" references so that I would be in accordance with these other methods.

(The ADMIXTURE results were obtained by merging the datasets in PLINK with the --geno 0.001 option, then pruning the combined set for LD with --indep-pairwise 50 5 0.3)
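As a rough illustration of what a missingness threshold such as --geno 0.001 does, here is a minimal Python sketch. It is not PLINK's implementation: the function name `filter_by_missingness`, the use of NaN for missing calls, and the toy matrix are all assumptions made for the example. LD pruning is not shown.

```python
import numpy as np

def filter_by_missingness(genotypes, max_missing=0.001):
    """Keep SNP columns whose missing-call rate is at most max_missing.

    genotypes: 2-D array (individuals x SNPs), with np.nan marking a missing call.
    Returns the filtered matrix and the boolean mask of kept columns.
    """
    genotypes = np.asarray(genotypes, dtype=float)
    # Fraction of individuals with a missing call at each SNP.
    missing_rate = np.isnan(genotypes).mean(axis=0)
    keep = missing_rate <= max_missing
    return genotypes[:, keep], keep

# Toy data (hypothetical): 4 individuals, 3 SNPs; the second SNP has one missing call.
toy = np.array([
    [0, 1, 2],
    [1, np.nan, 2],
    [0, 2, 1],
    [2, 1, 0],
])
filtered, kept = filter_by_missingness(toy)
```

With a threshold this strict, a single missing call in a small merged dataset is enough to drop a marker, which is why the filter effectively retains only markers typed in all source datasets.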

The following table summarizes the estimates.

Note that these are estimates of Sub-Saharan admixture assuming two parental populations; also, Moorjani et al. break up the Bedouin sample into the two distinct groups it is composed of, so I have taken a weighted average of the figures in their paper.
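The weighted average mentioned above can be sketched as follows. The subgroup estimates and the weights are made-up placeholders, not figures from Moorjani et al., and weighting by subgroup sample size is an assumption of this sketch.

```python
def weighted_average(values, weights):
    """Average of values, each weighted by the matching entry in weights."""
    total = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total

# Hypothetical example: two subgroups with admixture estimates of 10% and 4%,
# weighted by hypothetical sample sizes of 30 and 10 individuals.
combined = weighted_average([10.0, 4.0], [30, 10])
```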

It is difficult to make meaningful statistical inferences from only a few comparison points, but I do observe that STRUCTURE 2.2 on 13,900 markers gives the highest estimates, followed by Moorjani et al.

The other three methods cannot be ordered, giving higher estimates in some populations and lower in others. They all give, however, lower estimates than both Moorjani et al. (2011) and STRUCTURE.

As I've explained in my earlier post, Moorjani et al. (2011) have higher estimates of admixture because they measure it by comparing populations' shift on the East Eurasian-African axis, ignoring the Asian-shift of North Europeans and adding it to the African-shift of southern Caucasoids. This leads them to infer a few percentage points of African admixture in populations that have virtually none (such as Sardinians and North Italians, even Swiss French). For populations that do have noticeable African admixture (such as those in the table), their overestimates amount to a few percentage points.

Three of the methods also provide an estimate of time since admixture:

There is no simple relationship between these times, but an obvious pattern is that the dates of Moorjani et al. are younger, perhaps less than 50% of the other two methods.

Conclusion

Clearly, the art of admixture estimation is still in its infancy, and different methods provide different results even with a simple 2-population model. I've argued how the results of one method can be harmonized with those of the others, but I don't have a ready explanation about the substantial age differences. Pugach et al. argue that their method is better than HAPMIX, but the differences between the two seem small compared to the differences of both methods to ROLLOFF (the method of Moorjani et al.'s paper).

The discrepancy is even more interesting if one takes into account the fact that HAPMIX and ROLLOFF were developed by many of the same people. Hopefully, someone will be able to figure out the cause behind the discrepancy. A commenter on my earlier post suggested that ROLLOFF produces younger ages because its age estimation is tied to its inflated admixture proportions. This could be true; however, the discrepancy exists even in populations where the relative difference is small.

A speculative historical coda

Historical explanations about the circumstances of this admixture need to be made with some caution, due to the uncertainty about admixture times.

For example, a doubling of the Moorjani et al. age estimates would disentangle the Sub-Saharan element in Levantine Arabs from the Islamic epoch. A doubling of the admixture date for Jewish populations, as presented by Moorjani et al., would bring that admixture's age to the middle of the 2nd millennium BC, a period in which the Hebrews were said to be in Egypt, where they may have collectively acquired a small African admixture.

Hopefully, with time and full genome sequencing, we will get a better idea of what these African signals in some West Eurasian populations represent.

April 27, 2011

The single recent Out-of-Africa (SROOA) model has been the "default position" in the scientific community for a long time, and it has found its way to popular consciousness by a number of TV documentaries and magazine articles.

This model has recently been shaken by the discovery that some modern humans are closer to extinct hominins: the Neandertals and Denisovans. Within SROOA this is quite unexpected, as these groups were thought to be irrelevant side branches of the human family tree: differential affiliation of some modern humans to them means that they are not.

This has invariably led to an acceptance that SROOA is basically wrong in its purist form, and an alternative form including some assimilation of archaic hominins is moving to become the new default position.

Now, a new paper that works within the SROOA tradition itself finds another surprising piece of evidence that contradicts SROOA. But, first, let's examine what SROOA entails:

A human group (Proto-Eurasians) moved Out of Africa, starting to become differentiated from Africans

At a later time, different Proto-Eurasian groups became effectively separated by geography and started diverging from each other, eventually evolving to become Europeans, East Asians, Australian Aborigines, etc.

Under this model, different Eurasian groups are expected to have exactly the same position vis a vis Africans, once drift is accounted for. Each population's divergence from Africans can be analyzed into two components, e.g., an African-Proto-Eurasian one, and a Proto-Eurasian-Chinese one. The genetic drift (away from Africans) in the first part is the same for all non-African populations (e.g., Chinese or Tuscans), while in the latter it is dependent on population history.

The new paper shows why this model is wrong: Europeans are closer to Africans than East Asians are. This is totally unexpected under SROOA, as Europeans are expected to be as distant from Africans as any other Eurasian group (once drift is accounted for).

I had noticed this phenomenon in my experiments for quite some time, and highlighted the possibility that haplogroup E1b1b, which is shared by West Eurasians and Africans may have something to do with it.

If autosomally African-like populations moved out of Africa carrying these Y-chromosomes, then we would expect only E1b1b-bearing populations to be shifted towards Africans relative to the Chinese. The authors rightly reject this idea because they discover that all European populations, including a wide variety of North European ones, where haplogroup E is non-existent, are closer to Africans than the Chinese are.

So, we are left with a bit of a problem:

There is clearly something "common" between West Eurasians (even North Europeans) and Sub-Saharan Africans that is not shared by East Asians, a factor X which brings them closer to each other than would be expected by the SROOA

The best candidate of this "common" element (Y-haplogroup E) does not have the expected distribution.

The problem can be solved in either of two ways. I have argued for both of them at some time or another, so it would be a good idea for a paper to come out and consider how one could be preferred over the other.

Solution #1: West Eurasian Back-Migration into Africa

The first solution is based on the idea that there has been a major episode of back-migration into Africa that is not captured by the standard model. Clearly, this cannot have been a recent event, as Sub-Saharan Africans largely lack (except in the North and East) certifiably West Eurasian derived markers. But, the event need not have been particularly recent: as long as it occurred after West Eurasians began to diverge from East Asians it would have established the genetic closeness observed by the authors.

I believe that the best signal of such a potential back-migration involves haplogroup E. This is the dominant patrilineage of black Africans by far, and almost certainly had an African (and probably an east African) origin. However, its sister haplogroup D occurs as a relic in Asia, among people such as Tibetans, Ainu, or Andaman Islanders. Where did the ancestral clade DE develop? If I was a betting man, I would say that somewhere between the Indian Ocean (where the Andamanese live), and East Africa.

An early movement of DE-bearing men from Arabia into East Africa would serve to bring a West Eurasian autosomal component into Africa. The DE lineage carried by these men would then evolve into E in East Africa itself, and go on to (almost) completely replace pre-existing African Y-chromosomes, leaving haplogroups A and B at high frequencies in a few relic African hunter-gatherer populations.

From the Eurasian perspective, the problem would evaporate: West Eurasians' autosomal shift to Africans is not correlated with haplogroup E frequencies, because the latter was not initially associated with a Sub-Saharan-like autosomal gene pool.

Solution #2: Archaic admixture in East Asians

A second potential solution would interpret African-West Eurasian closeness not as evidence of a common population element in West Eurasians and Africans, but as a consequence of a population element in East Asians that both West Eurasians and Africans lack.

The obvious candidate for such an element involves archaic admixture in East Asians. This is no longer an exotic possibility, given the evidence for such admixture that the sequencing of the Neandertal and Denisovan genome has produced.

Under this scenario, Europeans and East Asians are genetically close because of their common Proto-Eurasian ancestry, and they both shared an initial same distance to Africans, but East Asians diverged from both West Eurasian and African populations by admixing with archaic humans they encountered in the East.

A test of solution #1

Suppose that solution #1 is correct. Consider also that:

West Eurasians are shifted relative to East Eurasians by x% towards San Bushmen, who have some of the highest frequencies of non-E chromosomes and are expected to be least affected by any sort of West Eurasian->Africa back-migration.

West Eurasians are shifted relative to East Eurasians by y% towards Yoruba, who have extremely high E-haplogroup frequencies and are expected to be more affected by West Eurasian->Africa back-migration than the San are.

If y is larger than x, then the prediction of the theory is supported.

It is not clear to me whether East Asian archaics/West Eurasian back-migration into Africa, or a combination of these factors may account for the observed phenomenon. I should point out a second Out-of-Africa expansion into West Eurasia (the authors' preferred model) is also not out of the question, but this cannot be easily tied to a particular event in prehistory or uniparental marker.

The study of human origins has just gotten even more interesting...

Genome Research doi:10.1101/gr.119636.110

Human population dispersal “Out of Africa” estimated from linkage disequilibrium and allele frequencies of SNPs

Brian P. McEvoy et al.

Abstract

Genetic and fossil evidence supports a single, recent (less than 200,000 yr) origin of modern Homo sapiens in Africa, followed by later population divergence and dispersal across the globe (the “Out of Africa” model). However, there is less agreement on the exact nature of this migration event and dispersal of populations relative to one another. We use the empirically observed genetic correlation structure (or linkage disequilibrium) between 242,000 genome-wide single nucleotide polymorphisms (SNPs) in 17 global populations to reconstruct two key parameters of human evolution: effective population size (Ne) and population divergence times (T). A linkage disequilibrium (LD)–based approach allows changes in human population size to be traced over time and reveals a substantial reduction in Ne accompanying the “Out of Africa” exodus as well as the dramatic re-expansion of non-Africans as they spread across the globe. Secondly, two parallel estimates of population divergence times provide clear evidence of population dispersal patterns “Out of Africa” and subsequent dispersal of proto-European and proto-East Asian populations. Estimates of divergence times between European–African and East Asian–African populations are inconsistent with its simplest manifestation: a single dispersal from the continent followed by a split into Western and Eastern Eurasian branches. Rather, population divergence times are consistent with substantial ancient gene flow to the proto-European population after its divergence with proto-East Asians, suggesting distinct, early dispersals of modern H. sapiens from Africa. We use simulated genetic polymorphism data to demonstrate the validity of our conclusions against alternative population demographic scenarios.

April 26, 2011

Let me preface this by saying that I don't doubt that there exists some Sub-Saharan admixture in some West Eurasian (Caucasoid) groups, and I've quantified the different types of African admixture that can be found in many such groups, most recently here.

However, there are serious methodological flaws in a new paper by Moorjani et al. which render its estimates unreliable. This is unfortunate, as the authors assembled an important dataset, but they only consider a very simplistic model of 2-population admixture which is completely inappropriate for the problem they are studying.

Caucasoids on the Chinese-San axis of variation

Moorjani et al. motivate their study by projecting various West Eurasian groups from Europe and the Near East onto the first principal component of variation defined by CHB (Chinese) and San (Bushmen). The reasoning is the following:

To study the signal of African gene flow into West Eurasian populations, we began by computing principal components (PCs) using San Bushmen (HGDP-CEPH- San) and East Eurasians (HapMap3 Han Chinese- CHB), and plotted the mean values of the samples from each West Eurasian population onto the first PC, a procedure called ‘‘PCA projection’’ [17,18]. The choice of San and CHB, which are both diverged from the West Eurasian ancestral populations [19,20], ensures that the patterns in PCA are not affected by genetic drift in West Eurasians that has occurred since their common divergence from East Eurasians and South Africans.

This is indeed a good idea: if some Caucasoid group A has a common ancestral element with Sub-Saharans that is lacking in another Caucasoid group B, then A is expected to be shifted towards the San side of the first PC relative to B. Indeed, this is what the authors observe:

We observe that many Levantine, Southern European and Jewish populations are shifted towards San compared to Northern Europeans, consistent with African mixture, and motivating formal testing for the presence of African ancestry (Figure 1, Figure S2).

However, this is clearly a case of seeing the glass half full. The authors prefer the hypothesis that some Caucasoid groups have African ancestry, although the hypothesis that other Caucasoid groups have East Asian ancestry can equally well explain the observed pattern. Indeed, both hypotheses may explain the phenomenon they observe.

For example, African ancestry in Palestinians has been well-documented, so Palestinians are expected to be San-shifted relative to northern Europeans. On the other hand, East Eurasian ancestry has also been well-documented in HGDP Russians, so we expect them to be CHB-shifted relative to southern Europeans.

Things are not that clear for other Caucasoid populations, e.g., southern Europeans or northwestern Europeans. The authors assume that the different position of these two groups on the San-Chinese axis is due only to Sub-Saharan admixture in southern Europeans. This implicit assumption is the Achilles' heel of the paper.

Tests of population admixture

Because of genetic drift, two populations that diverged from a common ancestor will have different allele frequencies. However, imagine if we looked at these allele differences and saw that a population A not only had different frequencies than B, but also the difference in frequencies tended to be in the direction of a Sub-Saharan population. For example, at some locus f(A)=0.4, f(B)=0.3, and f(Sub-Saharan)=0.1. You can see that B's frequency deviates from A's in the direction of Sub-Saharans. This may occur due to random drift for one particular marker, but if it occurs systematically across the genome, then admixture is a likely explanation. This is the basis of the 3-population test used by the authors.
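The core quantity behind this kind of test can be sketched in a few lines of Python. This is only an illustration of the idea, not the authors' implementation: the name `f3_like` is mine, and the formal test also assesses significance across the genome, which is omitted here. A negative value means the target's frequencies tend to lie between those of the two reference populations.

```python
import numpy as np

def f3_like(target, ref1, ref2):
    """Mean over markers of (target - ref1) * (target - ref2).

    Each argument is an array of allele frequencies, one entry per marker.
    """
    target, ref1, ref2 = (np.asarray(x, dtype=float) for x in (target, ref1, ref2))
    return float(np.mean((target - ref1) * (target - ref2)))

# The single-locus example from the text: f(A)=0.4, f(B)=0.3, f(Sub-Saharan)=0.1.
# B is the target; A and the Sub-Saharan population are the references.
stat = f3_like([0.3], [0.4], [0.1])
```

For this one locus the product is (0.3-0.4)*(0.3-0.1) = -0.02. A single marker proves nothing; it is the average over many markers that matters.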

Another idea is to see whether frequency differences between A and B are correlated with frequency differences between Sub-Saharans and another Eurasian population unrelated to either A or B. Differences between Caucasoids and Sub-Saharans are (in part) due to divergence between Sub-Saharans and ancestral Eurasians. Suppose, for example, that we've identified a group (e.g., Papuans) unlikely to have admixed with Caucasoids. If B differs from A (over many markers) in the same direction that Sub-Saharans differ from Papuans, this is consistent with the notion that B has some Sub-Saharan admixture that A lacks. This is the basis of the 4-population test.

Note that because of symmetry, a highly negative value in their 4-population test (x, CEU, Papuan, YRI) indicates Sub-Saharan admixture, while a highly positive one would indicate "Papuan" admixture! The authors do observe positive values, suggesting that some northern European populations are Papuan-shifted even with respect to CEU, most notably Russia with a Z-score of 11.4. Thankfully, we are spared a paper on Papuan admixture in Russia.
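The sign convention can be checked with a small Python sketch. The function name `f4_like` and all allele frequencies below are hypothetical, chosen only to show the direction of the effect; the published test additionally reports Z-scores, which are not computed here.

```python
import numpy as np

def f4_like(x, ceu, papuan, yri):
    """Mean over markers of (x - ceu) * (papuan - yri)."""
    x, ceu, papuan, yri = (np.asarray(a, dtype=float) for a in (x, ceu, papuan, yri))
    return float(np.mean((x - ceu) * (papuan - yri)))

# Hypothetical allele frequencies at three markers.
ceu = np.array([0.40, 0.60, 0.20])
papuan = np.array([0.50, 0.30, 0.70])
yri = np.array([0.10, 0.90, 0.30])

# Two artificial test populations: CEU with 10% YRI, and CEU with 10% Papuan.
x_african = 0.9 * ceu + 0.1 * yri
x_papuan = 0.9 * ceu + 0.1 * papuan

african_stat = f4_like(x_african, ceu, papuan, yri)  # negative
papuan_stat = f4_like(x_papuan, ceu, papuan, yri)    # positive
```

In this toy setup the same statistic comes out negative for the YRI-mixed population and positive for the Papuan-mixed one, which is the symmetry described above.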

Comparison to the Indian Cline work

These tests are an important statistical tool, and many of this paper's authors have used them before to study the Indian Cline of populations. However, the current paper has two important shortcomings in comparison to Reich et al. (2009).

In their study of the Indian Cline, Reich et al. (2009) excluded groups that were shifted towards CHB, thus ensuring that they were left with groups that could be modeled as a simple mix of two ancestral population elements.

Moreover, they used the Onge, a relatively isolated population from the Indian Ocean, as a control group that could be said to form a clade with Ancestral South Indians to the exclusion of West Eurasians. In the current paper it is simply assumed that northern Europeans have no African admixture.

Application of the test to each West Eurasian population (using A = YRI and B= CEU) finds little or no evidence of mixture in North Europeans but highly significant evidence in many Southern European, Levantine and Jewish groups (Table 1).

In other words: taking CEU (a northern European population) as the standard, northern Europeans have no evidence of African admixture.

Sardinians: an important test case

Sardinians are an important test case for the authors' model. Their 3-population test shows no evidence of admixture, while the 4-population test does. Moreover, their STRUCTURE analysis shows a trivial 0.2%, whereas the authors estimate their Sub-Saharan admixture as 2.9%.

Let's begin by performing a PCA analysis of Sardinians, CHB, and CEU, which is shown below.

(All PCA analyses are done in smartpca as implemented in EIGENSOFT 4.0 beta, with numoutlieriter set to 0. All analyses are performed over datasets merged in PLINK with the --geno 0.001 flag, which effectively keeps only common markers and ensures a high quality dataset.)

CEU is shifted towards CHB relative to Sardinian. This is made more visually obvious if we blow up the CEU/CHB portion of the above plot:

CEU is shifted towards CHB by 2.4% relative to Sardinians. This is quite close to the 2.5% East/South Asian K=3 admixture for Britons in my most recent analysis, done with a different East Asian reference and a different method (ADMIXTURE); the CEU sample of White Utahns has been repeatedly shown to be most similar to people from the British Isles or Northwestern Europe.
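One simple way to express such a shift as a percentage is to take the position of a population's mean on the principal component as a fraction of the distance between the two populations that anchor the axis. The sketch below assumes that definition; the function name and the PC1 coordinates are hypothetical, with the numbers chosen only so that the toy output matches the 2.4% quoted above.

```python
def pc_shift_percent(test_mean, baseline_mean, pole_mean):
    """Percent of the baseline-to-pole distance covered by the test population on one PC."""
    return 100.0 * (test_mean - baseline_mean) / (pole_mean - baseline_mean)

# Hypothetical PC1 means (not taken from the actual smartpca output).
sardinian_pc1 = 0.0
chb_pc1 = 0.100
ceu_pc1 = 0.0024

shift = pc_shift_percent(ceu_pc1, sardinian_pc1, chb_pc1)
```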

Now, let's look at Sardinians, CHB, and YRI:

and a blowup:

Sardinians are shifted 1.1% relative to CEU towards YRI. Again, this is close to the 0.9% K=3 Sub-Saharan ADMIXTURE result I recently obtained.

So, where does the 2.9% Sub-Saharan admixture in Sardinians come from? Moorjani et al. estimate this percentage under the assumption that Northern Europeans are not shifted towards Chinese, i.e., that East Eurasians are irrelevant. Clearly, as we have seen, this is wrong. As we shall see, this erroneous assumption leads to the erroneous admixture estimate.

2.9% Sub-Saharan admixture in Sardinians (?)

Now, I will demonstrate how the spurious 2.9% result can be obtained. In doing so, it will become obvious that Moorjani et al. arrived at it by ignoring the East Asian shift of their northern European sample.

Here is a PCA plot of Sardinians, CEU, CHB, YRI:

and the blowup:

When we run all four populations together, Sardinians are shifted towards YRI along Dimension 1, and CEU are shifted towards CHB along Dimension 2. Given that the eigenvalue for PC1 is approximately twice (50.15) that for PC2 (25.31), and doing a little high school geometry on the triangle (Sardinian, CEU, YRI), we project Sardinian onto the CEU-YRI line, intersecting at point X. We thus obtain the estimated "CEU" admixture as:

The example of the Sardinians showed how failing to control for East Eurasian shift tends to overestimate the degree of Sub-Saharan admixture. Another test case is that of Ashkenazi Jews. The authors find no evidence of admixture with their 3-population test, but do find such evidence with their 4-population test, as well as with STRUCTURE.

On a PCA plot of CHB, Ashkenazi (Behar et al. 2011), and CEU, the Ashkenazi are shifted 3.3% towards CHB along eigenvector 1.

On a PCA plot of YRI, CEU, and Ashkenazi, the Ashkenazi are shifted by 5.3% towards YRI.

In the case of the Sardinians, their African-shift together with CEU's Asian-shift caused Sardinians/CEU to diverge on the African-Asian axis, and Moorjani et al. took the entirety of this divergence to represent African admixture in Sardinians.

In this case Ashkenazi are both Asian- and African-shifted relative to CEU. The two shifts partially cancel each other out: Ashkenazi are pulled towards Africans on the YRI-CHB axis because of their YRI-shift, and away from them because of their CHB-shift. Failing to account for these processes, the authors assume that only Sub-Saharan admixture in Ashkenazi can account for the different positions of CEU and Ashkenazi on the Asian-African axis, coming up with 2.8-3.2% "Sub-Saharan admixture" in two different samples.

And here is a second way of seeing how this spurious admixture estimate follows from the phenomenon I am describing. In terms of Fst, CEU are 0.76 times as distant from CHB as they are from YRI (Fst=0.129 and 0.17, respectively). In other words, Sub-Saharan admixture is more "potent" at shifting a population than East Eurasian ancestry is. Ashkenazi are YRI-shifted by 5.3%, and they are CHB-shifted by 3.3%. Multiplying the latter by 0.76 we obtain: 5.3-0.76*3.3 = 2.8%!

In other words, the 2.8% Sub-Saharan admixture in Ashkenazi Jews is a compromise between two different phenomena in a tug-of-war. It is not an accurate estimate of admixture.
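The arithmetic of this tug-of-war can be reproduced in a few lines of Python. The inputs are the figures quoted above; the variable names are mine.

```python
# Fst values quoted in the text.
fst_ceu_yri = 0.17
fst_ceu_chb = 0.129

# How "potent" an East Eurasian shift is relative to a Sub-Saharan one.
relative_potency = fst_ceu_chb / fst_ceu_yri

# Ashkenazi shifts relative to CEU, in percent.
yri_shift = 5.3
chb_shift = 3.3

# The CHB-shift, scaled by its relative potency, offsets part of the YRI-shift.
net_shift = yri_shift - relative_potency * chb_shift

print(round(relative_potency, 2))  # 0.76
print(round(net_shift, 1))         # 2.8
```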

Papuans

I have also carried out an experiment with Sardinians, Ashkenazi Jews, CEU, and Papuans instead of CHB, as Papuans are also used in the paper as an outgroup population.

and the blowup:

It is clear that the populations show differential shift towards Papuans that is concordant with their above-described shift towards the Chinese.

Luhya and Bilala

Failure to correct for differential shift towards Chinese/Papuans is problem enough, but the paper also fails to properly take into account non-West African populations. North African groups are conspicuous in their absence, while the HapMap3 Luhya (LWK) and a Bilala sample are used to represent East Africa.

Henn et al. (2011) contains Tuscan, Yoruba, Maasai, Bulala samples, so I ran the Tuscans as test data in a supervised ADMIXTURE 1.1 analysis together with these African groups, HGDP-CEPH North_Italian, and HapMap3 CEU. That is, I'm playing along -for the sake of argument- with the idea that East Eurasians are irrelevant, and Tuscans can be seen as a mixture of CEU "Europeans" and African groups.

The results are unambiguous: Tuscans/North Italians are found to be 2.1%/1.2% "Maasai" and 0% of all the other African groups. In other words whatever element there is in common between Tuscans and Africans is not particularly West African.

The inclusion in the paper of the HapMap3 Luhya (Bantu) but not of the HapMap3 Maasai is puzzling, and the choice of one group over the other is passed over in silence.

In my own experiments, I distinguish between North, Sub-Saharan, and East African ancestral components.

Beyond a binary worldview

Much more can be said, but let's summarize: the model of Moorjani et al. (2011) fails because:

It does not account for the West-East Eurasian axis, folding everything onto the North European-Sub-Saharan African one

It undersamples African diversity by excluding both North African and East African populations

Perhaps I'll add more in the future, but I believe I've already said enough to cast serious doubt on this paper's conclusions.

PLoS Genet 7(4): e1001373. doi:10.1371/journal.pgen.1001373

The History of African Gene Flow into Southern Europeans, Levantines, and Jews

Previous genetic studies have suggested a history of sub-Saharan African gene flow into some West Eurasian populations after the initial dispersal out of Africa that occurred at least 45,000 years ago. However, there has been no accurate characterization of the proportion of mixture, or of its date. We analyze genome-wide polymorphism data from about 40 West Eurasian groups to show that almost all Southern Europeans have inherited 1%–3% African ancestry with an average mixture date of around 55 generations ago, consistent with North African gene flow at the end of the Roman Empire and subsequent Arab migrations. Levantine groups harbor 4%–15% African ancestry with an average mixture date of about 32 generations ago, consistent with close political, economic, and cultural links with Egypt in the late middle ages. We also detect 3%–5% sub-Saharan African ancestry in all eight of the diverse Jewish populations that we analyzed. For the Jewish admixture, we obtain an average estimated date of about 72 generations. This may reflect descent of these groups from a common ancestral population that already had some African ancestry prior to the Jewish Diasporas.

April 24, 2011

April 23, 2011

I have decided to generate a new major data dump of ADMIXTURE results. In comparison to previous such experiments:

The focus is entirely on West Eurasians (Caucasoids).

I have excluded all potential relatives from the source datasets, as well as several populations that tend to create uninformative clusters of their own (e.g., Druze or Ashkenazi Jews); exceptions are populations of great anthropological interest (e.g., Basques).

I have included all relevant Dodecad Ancestry Project populations with 5+ participants.

I have developed a new way of "framing" the region of interest by choosing appropriate sets of individuals from outside of it.

"Framing" populations

I have, since the beginning of my ADMIXTURE experiments, emphasized the importance of including appropriate population controls designed to squeeze out minor distant admixture in populations of interest, so that it does not confound the inference of region-specific components.

This leads to a problem: there are many possible sources of admixture. For example, we do not know a priori which set of African populations may have contributed to Caucasoid populations, or which set of East Asian ones. We could choose e.g., the Yoruba and the Chinese to represent Sub-Saharans and East Asians, but that might exclude possible sources of variation, and lead to Yoruba- and Chinese- specific clusters rather than more general Sub-Saharan and East Asian ones. If we included more population controls, we would cover more possible sources of variation, but ADMIXTURE would infer components of little interest (e.g., between Pygmies vs. Bushmen or Mongols vs. Chinese)

To avoid this, I propose to create meta-populations consisting of a single individual from many populations, i.e., a Yoruba, a Mandenka, a San, a Mbuti Pygmy, etc. for Sub-Saharan Africa, or a Miaozu, a Han, a Mongol, a She, a Hezhen, etc. for East Asia. That way we are both helping ADMIXTURE infer general components, while at the same time preventing it from inferring non-region specific ones.
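A minimal Python sketch of how such a meta-population could be assembled is given below. The function name, the fixed random seed, and the sample IDs are all hypothetical; the only idea taken from the text is that each source population contributes exactly one individual.

```python
import random

def build_meta_population(samples_by_population, seed=0):
    """Pick one individual from each population to form a framing group.

    samples_by_population: dict mapping population name -> list of individual IDs.
    Returns a dict mapping population name -> the single chosen ID.
    """
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    return {
        pop: rng.choice(sorted(ids))
        for pop, ids in sorted(samples_by_population.items())
    }

# Hypothetical sample IDs for a Sub-Saharan framing group.
sub_saharan = {
    "Yoruba": ["YRI_01", "YRI_02"],
    "Mandenka": ["MAN_01", "MAN_02"],
    "San": ["SAN_01"],
    "Mbuti_Pygmy": ["MBU_01", "MBU_02"],
}
frame = build_meta_population(sub_saharan)
```

Because no source population contributes more than one individual, none of them can form a cluster of its own, while the group as a whole still spans the region's diversity.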

Results

The entirety of the results presented here can be downloaded. They include:

At K=3, we observe the emergence of West Eurasian, Sub-Saharan, and East/South Asian components.

The impact of the Sub-Saharan component is felt most distinctly in North Africa and the Near East, especially among Arabs; the impact of the East/South Asian one in West Asia and Northeastern Europe, especially among Finnic and Turkic speakers.

It is interesting to note that 39.8% of the Indian_D sample is assigned to the E/S Asian component. I had previously estimated in a roundabout way, and in a slightly smaller sample that the Ancestral South Indian component in Project participants was 33.3%, so ADMIXTURE has roughly managed to infer correctly that about 1/3 of this Indian sample's ancestry is more closely related to East Asians than to West Eurasians.

At K=4, the first split within the Caucasoid group appears: a component centered on Europe, and one on West/South Asia.

Many populations possess both these components in clinal proportions.

The European component shrinks to insignificance in Arabians, such as Saudis and Yemenis.

The West/South Asian component shrinks to insignificance in Northeast Europeans, such as Finns, Lithuanians, north Russians, and Chuvash.

At K=5, a new Mediterranean component emerges. This is highly represented in populations to the North, South, and East of the Mediterranean sea.

This component is noteworthy for its absence in India and Northeastern Europe.

In Northeastern Europe, the Mediterranean component is hardly represented at all, whereas the West/South Asian component, freed of its K=4 Mediterranean associations now makes its appearance.

Conversely, in the West Mediterranean, among Basques, Sardinians, Moroccans, and Mozabites the West/South Asian component vanishes to non-existence.

At K=6, a North African component emerges.

Notice its presence in the Near East and parts of Southern Europe.

The two regions can be contrasted in terms of their African components, with very high North/Sub-Saharan African ratio in Europe vs. much lower in the Near East.

The explanation for this seems straightforward, as Europe was affected by North Africa in prehistoric and historic times, whereas the Near East also shares a border with more southern parts of the African continent, as well as the potential influence of the medieval slave trade that seems to have affected Muslim Near Eastern populations disproportionately.

At K=7, a Southwest Asian component emerges which is highest in Arabia and East Africa. I could've called this Red Sea, but I've reserved this name for a similar component that emerges at higher K.

It is clear that this is the main Caucasoid component present in East Africa.

It vanishes to non-existence in the Northern fringe of Europe, in the British Isles, Scandinavia, and among the Finns and Lithuanians.

Another interesting aspect of its distribution is its presence in Pakistan but not India. Perhaps, in this case, it reflects historical contacts between the Islamic Near East and parts of South Asia.

At K=8, we observe most of the familiar components from the K=10 analysis of the Dodecad Project. However, the use of the framing populations has meant that these components emerge before either Africans or East Eurasians split.

Now, the South Asian component appears, which swallows up most of the E/S Asian component that previously linked South with East Asians. This component extends a great way to the Near East and eastern parts of the Caucasus.

Quite interestingly, the remainder of the Caucasoid component in South Asia that is not absorbed by the new South Asian component seems to be split between the West Asian and North/Central European components, with an absence of the South European component.

It is among the Lezgins of the Caucasus that such a combination occurs, on the western shore of the Caspian Sea. The same combination of Caucasoid components also occurs in Uzbeks and Chuvash.

I conclude from this that the Caucasoids who entered South and Central Asia were probably derived from the eastern fringes of the Caucasoid world where only the West Asian (in the south) and North/Central European (in the north) are in existence. The area around the Caspian Sea seems like an excellent candidate for their origin, as I have speculated before, as that region has two important properties:

It is transitional between predominantly N/C European populations to the north and predominantly W Asian populations to the south

It is the border of the influence of the S European element, with Georgians possessing some of it, while Lezgins do not.

At K=9, we see the emergence of specific Sardinian and Basque components. Normally this is undesirable, but, I believe this breakup serves to divide the previously inferred South European component meaningfully.

What was South European in lower K seems to have an Atlantic vs. Mediterranean dimension, with the Basque/Sardinian ratio being particularly high in the Atlantic facade of Europe. Conversely, this ratio is low in the Mediterranean as we move eastwards: it is already low in Italy and the Balkans and becomes virtually zero in Cypriots, Armenians, and Levantine Arabs.

North Africa is also particularly interesting in having a low Basque/Sardinian ratio, even in Morocco. It appears that Sardinians are a much better proxy of European influences in the region than Basques are.

K=10 is particularly exciting because, for the first time, there is clear evidence of structure in the North/Central European component, which can now be split into Northwestern and Northeastern ones.

The NW European component is maximized in Orcadians, and people from the British Isles in general, as well as in Scandinavia. These populations have a low NE/NW ratio, as do the French, Iberians, and Italians.

Conversely, Balto-Slavs have a high NE/NW ratio.

Interestingly, Greeks have a balanced NE/NW ratio (1.2), intermediate between Italians and Balto-Slavs. Similar balanced ratios are also found among Lezgins (1.08), Turks, and Iranians. I conclude that Slavic or other Eastern European admixture cannot account for the entire presence of this component in Greeks.

Indians have a 1.8 NE/NW ratio. In Pakistan this is 6.5, in Uzbeks it is 2.9, and in the North Eurasian_Ra it is 14.2. My conclusion is that a single migration of steppe people from eastern Europe cannot account for the presence of North European-like genes in Asia.

I propose that a palimpsest of population movements has brought such elements into the interior of Asia: the migration of the early Indo-Iranians from West Asia or the Balkans, with a balanced NE/NW ratio, and the migration of steppe people from Eastern Europe, with a high NE/NW ratio. The latter did affect much of Asia, but it is in India, where Iranian groups did not penetrate in great numbers, that the lower ratio of the Indo-Aryans has been best preserved.

The case of the Finns is also interesting, as there is a surplus of NE over NW European elements. Their position is intermediate between Scandinavians and Lithuanians/Russians but toward the latter. So, Finns appear (i) to have a substratum similar to Balto-Slavs, (ii) to be influenced by Scandinavians, and (iii) to retain a balance of East Eurasian elements (5.8% in this analysis) preserving the legacy of their linguistic ancestors from the east. At present it is difficult to determine how much of the NE European component in Finns is due to their eastern ancestors, who were presumably mixed Caucasoid/Mongoloid long before they arrived in the Baltic, and how much was absorbed in situ.

At K=11 the Ethiopian/East African component emerges, absorbing some of the Red Sea and Sub-Saharan components from the previous K=10 run.

In comparison to the East African component of the Dodecad Project analysis, this component is closer to West Eurasians than to Sub-Saharan Africans, and a residual Sub-Saharan element remains in the two East African (Ethiopian and East_African_D) population samples. Presumably this is due to the more complete sampling of Sub-Saharan genetic diversity using the Sub_Saharan_H "framing" population.

Outside Africa, both E African/Sub-Saharan components are present in the Near East and North Africa with higher E African/Sub-Saharan ratios in the Near East and lower ones in North Africa.

In Europe, such ratios are low in the few populations where African admixture is present, together with some N African. We can probably conclude that African admixture is mostly due to North Africans, and African-influenced Near Eastern populations, rather than directly from Sub-Saharan Africa.

At K=12 the first uninformative cluster emerged, centered on Iraqi Jews, hence I decided to stop the analysis at this point.

Population Portraits

There is a plethora of population portraits in the download bundle, showing how admixture proportions vary in individuals within populations, and how they vary between successive K.

Here is, for example, the K=11 portrait of Cypriots. A picture of overall homogeneity of this sample emerges, but notice how the NW European and NE European have disjoint presence in the Cypriot individuals, with 5 having some of the former, 6 having some of the latter, and only 1 of these having both.

Compare with Lezgins (right) where these two components occur in all individuals. Whatever this admixture represents, it must be old enough if it is so uniformly distributed in the population.

Here are the Georgians at K=10. Notice that their NE European component is unevenly distributed, and in every case where it occurs it is accompanied by a thin slice of East Asian. This may well indicate partial Russian or other Eastern European ancestry in these individuals.

Notice how Lezgins, who live north of the Caucasus mountains, possess some of the N/C European component, which the Armenians, who live to the south of them, lack. This should come as no surprise, as the Lezgins inhabit parts of the ancient Sarmatia Asiatica. Compare with Iranians, who are differentiated from their Indo-European Armenian neighbors by the presence of a "S Asian" component, which, in turn, ties them to their Indo-Aryan linguistic relatives.

Much more can be said, but I'll let readers explore the data on their own, and draw their own conclusions from them.

April 20, 2011

An updated tree of Y-chromosome Haplogroup O and revised phylogenetic positions of mutations P164 and PK4

Shi Yan et al.

Y-chromosome Haplogroup O is the dominant lineage of East Asians, comprising more than a quarter of all males on the world; however, its internal phylogeny remains insufficiently investigated. In this study, we determined the phylogenetic position of recently defined markers (L127, KL1, KL2, P164, and PK4) in the background of Haplogroup O. In the revised tree, subgroup O3a-M324 is divided into two main subclades, O3a1-L127 and O3a2-P201, covering about 20 and 35% of Han Chinese people, respectively. The marker P164 is corrected from a downstream site of M7 to upstream of M134 and parallel to M7 and M159. The marker PK4 is also relocated from downstream of M88 to upstream of M95, separating the former O2* into two parts. This revision evidently improved the resolving power of Y-chromosome phylogeny in East Asia.

April 19, 2011

The authors make a good argument that if hunter-gatherers had adopted agriculture early on, then there would be a signal of population growth in hunter-gatherer derived haplogroups at the same time as the onset of the farming economy. I don't follow mtDNA age estimation as closely as that of the Y-chromosome, but the fact that their method estimates demographic growth for Neolithic haplogroups that coincide with the onset of the Neolithic, while it estimates earlier growth for Paleolithic haplogroups, does suggest to me that they're onto something.

They write:

Our results are consistent with recent publications using ancient DNA to assign the maternal affinities of early agriculturalists and hunter-gatherers. Our LGP European sample includes the U5a and U5b1 haplogroups, associated with Mesolithic hunter gatherers at the majority of archaeological sites in Bramanti et al. (3) dated to older than 5,000 ya. The presence of similar mtDNA haplogroups in Mesolithic hunter-gatherers and contemporary Europeans supports a model involving (maternal) population continuity, even if this Mesolithic ancestry makes up only a fraction of contemporary European genomes. U5a, U5b1, V, and 3H combined account for ≈15% of western Europeans mtDNA haplogroups (7) (Table S1).

As I've argued in this blog over the last few years, the mtDNA evidence speaks of near total replacement of forager mtDNA in Europe. In a sense there is continuity, as the forager mtDNA has not gone extinct, but I guess that is a matter of words. We could just as easily say that there is population continuity in parts of South America where a predominant Caucasoid population has absorbed a small native population.

What's more important is to see that modern and ancient mtDNA evidence are in agreement. This is slightly ironic, as it was modern mtDNA evidence -as interpreted in the late 90s- that largely launched the once successful and now semi-retired "Europeans are largely Paleolithic" meme.

PNAS doi: 10.1073/pnas.0914274108

Rapid, global demographic expansions after the origins of agriculture

Christopher R. Gignoux et al.

The invention of agriculture is widely assumed to have driven recent human population growth. However, direct genetic evidence for population growth after independent agricultural origins has been elusive. We estimated population sizes through time from a set of globally distributed whole mitochondrial genomes, after separating lineages associated with agricultural populations from those associated with hunter-gatherers. The coalescent-based analysis revealed strong evidence for distinct demographic expansions in Europe, southeastern Asia, and sub-Saharan Africa within the past 10,000 y. Estimates of the timing of population growth based on genetic data correspond neatly to dates for the initial origins of agriculture derived from archaeological evidence. Comparisons of rates of population growth through time reveal that the invention of agriculture facilitated a fivefold increase in population growth relative to more ancient expansions of hunter-gatherers.

April 17, 2011

While it has been generally accepted that the Philistines originated in the Aegean, new archaeological research from the Levant shows that they were not the first Aegean peoples to influence the area of Canaan. How strange that we've gone from a "legendary" Minos, to the excavation of the Bronze Age Minoan civilization, to a gradual confirmation of its thalassocracy, as described by the ancient authors.

A recent and ongoing excavation at the remains of an expansive Middle Bronze Age Canaanite palace in the western Galilee region of present-day Israel is opening a new window on the possible presence of ancient Minoans at an ancient Canaanite palace, revealing what may be the earliest known Western art found in the eastern Mediterranean.

Known as Tel Kabri (located near its namesake kibbutz not far from historic Acco and the resort town of Nahariya on the coast of Israel), the site features an early Middle Bronze Age (MB I) palace dated to the 19th century B.C.E., making it, along with ancient Aphek and possibly Megiddo, the earliest MB palace discovered in present-day Israel. This conclusion was drawn as a result of excavations conducted there as recently as December 20, 2010 to January 10, 2011. But the tell-tale signs of an Aegean presence or influence at the site show up in a later developmental phase of the palace structure some 150 to 200 years later in the overlying MB II palace dated to the 17th century. Reports Dr. Eric Cline of George Washington University and Co-Director of the excavations along with Assaf Yasur-Landau of Haifa University, "Excavations conducted by [Aharon] Kempinski and [Wolf-Dietrich] Niemeier from 1986 to 1993 at the site of Tel Kabri -- now identified as the capital of a Middle Bronze Age Canaanite kingdom located in the western Galilee region of modern Israel -- revealed the remains of a palace dating to the Middle Bronze (MB) II period (ca. 1700 - 1550 B.C.E.). Within the palace, Kempinski and Niemeier discovered an Aegean-style painted plaster floor and several thousand fragments originally from a miniature Aegean-style wall fresco."(1) The new excavations under the direction of Cline and Yasur-Landau have added to the discovery. Reports Cline, et al., "During the 2008 and 2009 excavations at Tel Kabri more than 100 new fragments of wall and floor plaster were uncovered. Approximately 60 are painted, probably belonging to a second Aegean-style wall fresco with figural representations and a second Aegean-style painted floor."(2)

April 15, 2011

Recently, I've presented some conflicting evidence about the origin of modern humans in either East, or Central, or South Africa. A new paper looks at the decay of phonemic diversity in world languages, coming to the conclusion that language itself may have originated in Southwest Africa.

I will only point out two caveats:

We cannot assume that click speakers of the African Southwest are necessarily indigenous to that region, and

It is possible that the greater phonemic diversity is due to ancient admixture between quite divergent peoples who possessed two different types of phonemic inventories, while most Africans inherited only the phonemic inventory of one of these peoples, which then decayed as per the author's theory away from Africa.

In fact, I suspect that something much like #2 (the Afrasian-Palaeoafrican hypothesis) probably took place. I'll leave an evaluation of this paper to the experts, but it seems to me that South Africans do not only have a richer phonemic inventory, but also a set of phonemes (clicks) that seems peculiar to them, and may possibly be part of a different language system than the one that most other languages use.

Human genetic and phenotypic diversity declines with distance from Africa, as predicted by a serial founder effect in which successive population bottlenecks during range expansion progressively reduce diversity, underpinning support for an African origin of modern humans. Recent work suggests that a similar founder effect may operate on human culture and language. Here I show that the number of phonemes used in a global sample of 504 languages is also clinal and fits a serial founder–effect model of expansion from an inferred origin in Africa. This result, which is not explained by more recent demographic history, local language diversity, or statistical non-independence within language families, points to parallel mechanisms shaping genetic and linguistic diversity and supports an African origin of modern human languages.

April 14, 2011

Young infants are known to prefer own-race faces to other race faces and recognize own-race faces better than other-race faces. However, it is entirely unclear as to whether infants also attend to different parts of own- and other-race faces differently, which may provide an important clue as to how and why the own-race face recognition advantage emerges so early. The present study used eye tracking methodology to investigate whether 6- to 10-month-old Caucasian infants (N = 37) have differential scanning patterns for dynamically displayed own- and other-race faces. We found that even though infants spent a similar amount of time looking at own- and other-race faces, with increased age, infants increasingly looked longer at the eyes of own-race faces and less at the mouths of own-race faces. These findings suggest experience-based tuning of the infant's face processing system to optimally process own-race faces that are different in physiognomy from other-race faces. In addition, the present results, taken together with recent own- and other-race eye tracking findings with infants and adults, provide strong support for an enculturation hypothesis that East Asians and Westerners may be socialized to scan faces differently due to each culture's conventions regarding mutual gaze during interpersonal communication.

Our main analysis gives a 95% highest posterior probability density interval of 7110-9750 years Before the Present, in line with the so-called Anatolian hypothesis for the expansion of the Indo-European languages.

...

The reconstruction of known ages presented in Section 4.3 further validates our ability to predict time depths. After several analyses of two data sets (Chapter 5), all our results agree with the Anatolian hypothesis that the spread of the Indo-European family started around 8000 BP. None of our analyses agree with the Kurgan theory that the spread started between 6000 and 6500 BP.

Missing data in a stochastic Dollo model for binary trait data, and its application to the dating of Proto-Indo-European

Robin J. Ryder, Geoff K. Nicholls

Summary. Nicholls and Gray have described a phylogenetic model for trait data. They used their model to estimate branching times on Indo-European language trees from lexical data. Alekseyenko and co-workers extended the model and gave applications in genetics. We extend the inference to handle data missing at random. When trait data are gathered, traits are thinned in a way that depends on both the trait and the missing data content. Nicholls and Gray treated missing records as absent traits. Hittite has 12% missing trait records. Its age is poorly predicted in their cross-validation. Our prediction is consistent with the historical record. Nicholls and Gray dropped seven languages with too much missing data. We fit all 24 languages in the lexical data of Ringe and co-workers. To model spatiotemporal rate heterogeneity we add a catastrophe process to the model. When a language passes through a catastrophe, many traits change at the same time. We fit the full model in a Bayesian setting, via Markov chain Monte Carlo sampling. We validate our fit by using Bayes factors to test known age constraints. We reject three of 30 historically attested constraints. Our main result is a unimodal posterior distribution for the age of Proto-Indo-European centred at 8400 years before Present with 95% highest posterior density interval equal to 7100–9800 years before Present.

That, at least, is the hope. But a comparative study of linguistic traits published online today (M. Dunn et al. Nature doi:10.1038/nature09923; 2011) supplies a reality check. Russell Gray at the University of Auckland, New Zealand, and his colleagues consider the evolution of grammars in the light of two previous attempts to find universality in language.

The most famous of these efforts was initiated by Noam Chomsky, who postulated that humans are born with an innate language-acquisition capacity — a brain module or modules specialized for language — that dictates a universal grammar. A few generative rules are then sufficient to unfold the entire fundamental structure of a language, which is why children can learn it so quickly. Languages would diversify through changes to the 'parameter settings' of the generative rules.

The second, by Joshua Greenberg, takes a more empirical approach to universality, identifying traits (particularly in word order) shared by many languages, which are considered to represent biases that result from cognitive constraints. Chomsky's and Greenberg's are not the only theories on the table for how languages evolve, but they make the strongest predictions about universals.

Gray and his colleagues have put them to the test using phylogenetic methods to examine four family trees that between them represent more than 2,000 languages. A generative grammar should show patterns of language change that are independent of the family tree or the pathway tracked through it, whereas Greenbergian universality predicts strong co-dependencies between particular types of word-order relations (and not others). Neither of these patterns is borne out by the analysis, suggesting that the structures of the languages are lineage-specific and not governed by universals.

Languages vary widely but not without limit. The central goal of linguistics is to describe the diversity of human languages and explain the constraints on that diversity. Generative linguists following Chomsky have claimed that linguistic diversity must be constrained by innate parameters that are set as a child learns a language. In contrast, other linguists following Greenberg have claimed that there are statistical tendencies for co-occurrence of traits reflecting universal systems biases, rather than absolute constraints or parametric variation. Here we use computational phylogenetic methods to address the nature of constraints on linguistic diversity in an evolutionary framework. First, contrary to the generative account of parameter setting, we show that the evolution of only a few word-order features of languages are strongly correlated. Second, contrary to the Greenbergian generalizations, we show that most observed functional dependencies between traits are lineage-specific rather than universal tendencies. These findings support the view that—at least with respect to word order—cultural evolution is the primary factor that determines linguistic structure, with the current state of a linguistic system shaping and constraining future states.

April 13, 2011

I had linked to the conference versions of this work in ISABS 2007 and AAPA 2008. Now there is a new paper in American Anthropologist on the topic of variation in four Central Anatolian settlements with very different origins. As I've mentioned before, this is a great illustration of the problem of uncritically treating modern Anatolian Muslim samples as representatives of the Neolithic population, for at least two reasons:

Modern Anatolian Muslims are only a part of the recent population of Anatolia, with a great part of the Christian population exchanged, killed, or forced to leave.

Modern Anatolian Muslims are ethnically, linguistically and religiously heterogeneous, and many of them have historical memories of descent from elsewhere, e.g., the Balkans, the Caucasus, or Central Asia

So, I have always been skeptical of grand reconstructions of prehistory based on treating modern Anatolian Muslims as little modified descendants of the Neolithic population. I don't doubt that a great deal of their ancestry, averaged over broad scales, is Neolithic Anatolian, but extracting this ancestry from the mosaic of heterogeneous Muslim groups inhabiting the modern Republic of Turkey is no trivial task.

From the current paper, in support of point #1:

Even though the data from official Ottoman records and other sources such as local church accounts are contradictory (Charanis 1972; Vryonis 1986), it is clear that Greek and Armenian peoples comprised the majority, or at least a very sizeable minority, of the late-19th-century Anatolian population (Finkel 2005; Levy 2002; Shaw et al. 1976). Muslim Turkic and Kurdish groups from different ancestral clans and, more importantly, from different sects of Islam made up the largest remaining part of the population (Cahen 1968, 2001; Finkel 2005). Therefore, the population of the Ottoman Empire came from different ancestral backgrounds, lived together, and constructed communities based on their religious affiliations.

In support of point #2, the authors have studied the inhabitants of a Central Anatolian region:

We worked in four geographically proximate Central Anatolian settlements located southeast of Ankara (see Figure 1) to test the abovementioned hypotheses and elucidate the regional complexity of Anatolian population history. Because of the current political sensitivities concerning ethnic–religious identity in Turkey, especially those relating to the Alevis and Kurds, the names of the specific settlements we visited are not identified in this article. Instead, for the sake of clarity, pseudonyms are used. This region, which we refer to as “Yuksekyer,” was selected for the study because, based on oral traditions and available historical records, it is home to linguistically, ethnically, and religiously distinct groups that live in close geographic proximity to one another. To assess the possible regional, religious, and ethnic differentiation in Central Anatolia, we collected additional samples from a settlement from Kizilyer, another region located about 500 kilometers east of Yuksekyer, the inhabitants of which are predominantly Alevi Turks.

"Yuksekyer" is 2,500 sq. km in size, so it's quite small in the broader Anatolian context. Here are details on the studied settlements:

Merkez is the current political and bureaucratic center with about 6,500 inhabitants, making it the most populous settlement in the region (Devlet İstatistik Enstitüsü 2001). It was probably founded by Cerkez (Circassian) people who had migrated there from the Caucasus region in the 14th century.

...

The inhabitants of the oldest known settlement, Eskikoy, claim a pre-Ottoman Karaman ancestry that traces back to a Turkic population that occupied the Konya region during the 13th century (Finkel 2005). To date, no historical records that confirm the connection between Eskikoy and its putative Karaman origin have been found. However, Ottoman records mention the presence of the Eskikoy settlement in the Yuksekyer region and place its foundation at around 1500 C.E.

...

The residents of Gocmenkoy identify themselves with the Afsar clan of the Oguz tribe, to which the Kayi and Turkmen lineages also belong (Cahen 1968). Their oral history, supported by local historical records, indicates that these people came from Central Asia in the 16th century.

...

The Kurdish-speaking inhabitants of Dogukoy were the last immigrants to populate Yuksekyer. They purportedly came into the region around 200 years ago from southeastern Turkey.

...

For comparative purposes, we also collected 30 additional samples from another Central Anatolian region, Kizilyer, which lies about 500 kilometers east of Yuksekyer. This region is roughly comparable to Yuksekyer region in its size and population density.

There are a lot of interesting genetic data, but I will focus on the Y-haplogroup profiles of the different populations:

A few things stand out:

Haplogroup L is limited to Gocmenkoy. L is divided into informative subclades and is one of the less studied and more mysterious Caucasoid haplogroups; the authors erroneously state that it is more frequent in East Eurasians. In the Gocmenkoy sample it could very well be the legacy of Turkicized Central Asian Iranian speakers in which it is found at a high frequency. Most of the haplogroup Q is also found in Gocmenkoy and this may represent a genuine Turkic element.

Haplogroup J1 is largely limited to Merkez; this is not surprising as Merkez is said to have been founded by Circassians, and J1 occurs at a substantial frequency in parts of the Caucasus

Haplogroup N is mostly found in Eskikoy, and this is also a likely marker of Central Asian Turkic groups

I find the high frequency of J2a (64.5%) in Dogukoy, the Kurdish settlement to be noteworthy. This haplogroup is also probably found at a high frequency in Parsis (although technically only J was studied in the relevant study), and I've noted before that a high J2/J1 ratio contrasts West Asian Indo-Europeans from Semitic groups. J2a also occurs at a high frequency in Indian upper caste populations, whereas it is virtually absent in low castes and tribals.

The fact that the names of the settlements had to be obfuscated by the authors speaks volumes about the political sensitivity of the subject. Hopefully this line of research will be allowed to continue.

ABSTRACT Previous population genetics studies in Turkey failed to delineate recent historical and social factors that shaped Anatolian cultural and genetic diversity at the local level. To address this shortcoming, we conducted focused ethnohistorical fieldwork and screened biological samples collected from the Yuksekyer region for mitochondrial, Y chromosome, and autosomal markers and then analyzed the data within an ethnohistorical context. Our results revealed that, at the village level, paternal genetic diversity is structured among settlements, whereas maternal genetic diversity is distributed more homogenously, reflecting the strong patrilineal cultural traditions that transcend larger ethnic and religious structures. Local ancestries and origin myths, rather than ethnic or religious affiliations, delineate the social boundaries and projected identities among the villages. Therefore, we conclude that broad, ethnicity-based sampling is inadequate to capture the genetic signatures of recent social and historical dynamics, which have had a profound influence on contemporary genetic and cultural regional diversity.

April 12, 2011

An interesting problem in population genetics is the following: how often are two individuals from a population A more similar to each other than either of them is to an individual from another population B?

Suppose you have a similarity function for individuals, sim(a, b)

(In my experiments I will use identity-by-state (IBS) as calculated by PLINK (--cluster --matrix) as a measure of similarity, but any symmetrical similarity function (that is, sim(a,b)=sim(b,a)) will do.)
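As a rough illustration of what an IBS similarity measures, here is a minimal Python sketch. It assumes genotypes coded as minor-allele counts (0, 1 or 2) with `None` for missing calls; the function name `ibs_similarity` and the toy genotypes are made up for illustration, and this is not PLINK's own code.

```python
def ibs_similarity(g1, g2):
    """Proportion of alleles shared identical-by-state between two individuals.

    Genotypes are assumed to be coded as allele counts 0, 1 or 2;
    None marks a missing call (an assumption of this sketch).
    """
    shared = 0
    compared = 0
    for x, y in zip(g1, g2):
        if x is None or y is None:
            continue  # skip markers where either genotype is missing
        shared += 2 - abs(x - y)  # 2, 1 or 0 alleles shared at this marker
        compared += 2
    if compared == 0:
        raise ValueError("no markers with both genotypes called")
    return shared / compared


# Hypothetical genotypes at five markers.
a = [0, 1, 2, 2, None]
b = [0, 2, 0, 2, 1]
print(ibs_similarity(a, b))  # 0.625, and the same with the arguments swapped
```

The measure is symmetrical by construction, which is the only property the concordance calculation needs.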

We want to calculate the rate at which the following condition occurs: sim(a, a') > sim(a, b) and sim(a, a') > sim(a', b)

If this condition holds, then a and a' from population A are more similar to each other than either of them is to an individual b from population B. We then say that the trio of individuals is concordant.

I will use the indicator function I(a, a', b) = 1 in case of a concordant trio, and =0 otherwise.

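I can then estimate the probability of concordance, if I have n individuals from A and m from B, as follows: the sum of I(a, a', b) over all unordered pairs {a, a'} from A and all individuals b from B, divided by n(n-1)m/2.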

The rationale behind this formula is straightforward: we are counting the number of all concordant trios, and dividing by n(n-1)m/2 since there are n(n-1)/2 pairs of individuals from A, and each pair is compared against all m individuals from B.
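The counting described above can be sketched in a few lines of Python. This is only a sketch: the function name `concordance_ratio`, the use of strict inequalities (so ties count as non-concordant), and the toy one-dimensional similarity in the usage example are all assumptions for illustration.

```python
from itertools import combinations


def concordance_ratio(sim, pop_a, pop_b):
    """Fraction of trios (a, a', b) in which a and a' are more similar to each
    other than either is to b. `sim` is a symmetric function of two individuals.
    """
    n, m = len(pop_a), len(pop_b)
    if n < 2 or m < 1:
        raise ValueError("need at least two individuals in A and one in B")
    concordant = 0
    for a, a2 in combinations(pop_a, 2):  # n(n-1)/2 unordered pairs from A
        s = sim(a, a2)
        for b in pop_b:  # each pair is compared against all m individuals from B
            if s > sim(a, b) and s > sim(a2, b):
                concordant += 1
    return concordant / (n * (n - 1) * m / 2)


# Toy usage: individuals are points on a line, similarity is negative distance.
toy_sim = lambda x, y: -abs(x - y)
print(concordance_ratio(toy_sim, [0.0, 0.1, 0.2], [0.9, 1.0]))  # 1.0
```

With real data, `sim` would simply look up the pairwise IBS value for the two individuals.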

The expected value of this concordance ratio can vary between 0.25 and 1:

It is 1 if the two populations are so well-differentiated that every trio is concordant.

On the other hand, if the two populations are genetically identical, then each similarity comparison is equivalent to a coin toss (probability = 0.5) and we are testing this condition for two different individuals from A: hence the probability of concordance for each trio is 0.5*0.5 = 0.25.

In a finite sample of individuals it is possible that the concordance ratio estimate may actually be lower than 0.25.
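The baseline for genetically identical populations can also be checked by simulation. The following Python sketch draws all individuals from one set of allele frequencies, labels half of them "A" and half "B" arbitrarily, and averages the concordance ratio over many replicates. Everything about the setup is an assumption made for illustration: unlinked SNPs, uniformly distributed allele frequencies, the group sizes, the number of replicates, and the helper names.

```python
import random
from itertools import combinations

def ibs(g1, g2):
    """Mean identity-by-state similarity of two genotype vectors coded 0/1/2."""
    return sum(1 - abs(x - y) / 2 for x, y in zip(g1, g2)) / len(g1)

def concordance_ratio(sim, idx_a, idx_b):
    """Fraction of trios (a, a', b) in which sim(a, a') exceeds both cross similarities."""
    concordant = sum(
        1
        for i, j in combinations(idx_a, 2)
        for k in idx_b
        if sim[i][j] > sim[i][k] and sim[i][j] > sim[j][k]
    )
    n, m = len(idx_a), len(idx_b)
    return concordant / (n * (n - 1) * m / 2)

def simulate_identical(rng, n_per_group=6, n_snps=300):
    """Sample 2 * n_per_group individuals from ONE population and split them in two."""
    freqs = [rng.uniform(0.05, 0.95) for _ in range(n_snps)]
    people = [
        [(rng.random() < p) + (rng.random() < p) for p in freqs]  # two allele draws
        for _ in range(2 * n_per_group)
    ]
    sim = [[ibs(x, y) for y in people] for x in people]
    idx_a = list(range(n_per_group))
    idx_b = list(range(n_per_group, 2 * n_per_group))
    return concordance_ratio(sim, idx_a, idx_b)

rng = random.Random(2011)  # fixed seed so the run is repeatable
estimates = [simulate_identical(rng) for _ in range(50)]
baseline = sum(estimates) / len(estimates)
print(round(baseline, 3))
print(min(estimates), max(estimates))
```

The first printed value is the average over replicates and can be compared with the theoretical baseline; the second line shows how far individual small-sample estimates stray from it.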

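An interesting property of the concordance ratio is its asymmetry, that is, the concordance ratio of A against B is in general not equal to the concordance ratio of B against A.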

We will see how this property gives some useful insight in some of the following examples.

#1. Han vs. Yoruba

This is based on the Stanford HGDP set, including only individuals as recommended by Rosenberg (2006) in his H952 set. For each experiment only SNPs with at least a 99% genotyping rate have been retained.

The first experiment is designed to showcase the concordance ratio in two well-differentiated human populations, 44 Han Chinese and 21 Yoruba Nigerians. The analysis is based on 617,602 SNPs.

The concordance ratio is 1 in both directions. That is, two Han Chinese are always more similar to each other than to a Yoruba and vice versa.

I repeated the experiment, sequentially thinning the marker set randomly by a factor of 10 using PLINK's --thin 0.1 argument. Concordance remained 1 for ~61k, ~6k, and ~600 markers, and became different from 1 (but still greater than 0.95) with 50 SNPs.

Genome-wide, or across a sizeable number of markers, the concordance of Han vs. Yoruba and vice versa is perfect, while for a few random SNPs it may not be.

#2. Britons vs. Mexicans

In the next experiment, I have used 90 Britons (GBR) and 69 Mexican Americans from Los Angeles (MXL) from the 1000 Genomes Project. I have included only SNPs with 99%+ genotyping rate that are also included in the HGDP Stanford data, for a total of 351,521 markers.

Notice the previously mentioned asymmetry: two Britons are virtually always closer to each other than they are to a Mexican, but a Mexican is sometimes closer to a Briton than he is to a fellow Mexican. This is due to the fact that Mexicans have variable European admixture, so a substantially European-admixed Mexican may be closer to a Briton than he is to one of his substantially Amerindian-admixed compatriots.
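The effect of variable admixture on the asymmetry can be reproduced with simulated data. The Python sketch below is not based on the GBR or MXL genotypes; it uses two hypothetical source populations with independent, uniformly distributed allele frequencies, an unadmixed population A drawn from the first source, and a population B whose members carry anywhere between 0% and 75% ancestry from that source. Unlinked SNPs, the sample sizes and all names are assumptions made for the example.

```python
import random
from itertools import combinations

def ibs(g1, g2):
    """Mean identity-by-state similarity of two genotype vectors coded 0/1/2."""
    return sum(1 - abs(x - y) / 2 for x, y in zip(g1, g2)) / len(g1)

def concordance_ratio(sim, idx_a, idx_b):
    """Fraction of trios (a, a', b) in which sim(a, a') exceeds both cross similarities."""
    concordant = sum(
        1
        for i, j in combinations(idx_a, 2)
        for k in idx_b
        if sim[i][j] > sim[i][k] and sim[i][j] > sim[j][k]
    )
    n, m = len(idx_a), len(idx_b)
    return concordant / (n * (n - 1) * m / 2)

def draw_genotypes(freqs, rng):
    """One diploid genotype per SNP: two independent allele draws."""
    return [(rng.random() < p) + (rng.random() < p) for p in freqs]

rng = random.Random(69)
n_snps = 2000
# Hypothetical allele frequencies for two well-differentiated source populations.
freq_src1 = [rng.uniform(0.05, 0.95) for _ in range(n_snps)]
freq_src2 = [rng.uniform(0.05, 0.95) for _ in range(n_snps)]

# Population A: 10 unadmixed individuals from source 1.
pop_a = [draw_genotypes(freq_src1, rng) for _ in range(10)]

# Population B: 10 individuals whose share of source-1 ancestry runs from 0 to 0.75.
pop_b = []
for i in range(10):
    share = i / 12
    mixed = [share * p1 + (1 - share) * p2 for p1, p2 in zip(freq_src1, freq_src2)]
    pop_b.append(draw_genotypes(mixed, rng))

people = pop_a + pop_b
sim = [[ibs(x, y) for y in people] for x in people]
idx_a, idx_b = list(range(10)), list(range(10, 20))

ratio_a_vs_b = concordance_ratio(sim, idx_a, idx_b)  # pairs from A, outgroup from B
ratio_b_vs_a = concordance_ratio(sim, idx_b, idx_a)  # pairs from B, outgroup from A
print(ratio_a_vs_b, ratio_b_vs_a)
```

In this setup pairs from the homogeneous population A should be concordant far more often than pairs from B, because a member of B with a large share of source-1 ancestry resembles the individuals of A more than a member of B with little such ancestry.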

#3: Various Europeans

In the final experiment, I included HGDP European populations, together with 12 Dodecad Project Greeks. The analysis is based on 492,176 SNPs. Each row in the following table represents the first argument of the concordance ratio function, and each column the second one.

Here is a way to read the table, using Greek_D as an example:

The last row represents a test in which pairs of Greeks were compared against individuals from the population of each column. These comparisons were concordant 84.4% of the time against Russians (the most distant population), and 37.5% of the time against Tuscans (the closest one).

The last column represents tests in which Greeks were used as an "outgroup" for comparison against pairs of individuals from each row. These comparisons were concordant (ratio ~1) for most populations, except the Tuscans (47.9%), North Italians (71.5%), and Adygei (81.9%).

Let's do the same using French as an example:

With a pair of French individuals against individuals from other populations, concordance ranged from 36.2% (for North Italians) to 95.8% (for Adygei).

When French individuals were used as an outgroup to compare against pairs of individuals from other populations, concordance ranged from 63.1% (for North Italians) to 100% (for Sardinians). The latter means that a pair of Sardinians is always closer to each other than to a French sample (or, at least, an HGDP French one).

Conclusion

The study of concordance is an interesting thought experiment that illustrates how genome-wide comparisons between individuals show the following:

Two individuals from a homogeneous population are virtually always more similar to each other than to an individual from a genetically differentiated population

Two individuals from a population may, or may not, be more similar to each other than to an individual from a genetically related population

More variable populations are usually more discordant with respect to other populations, whereas very homogeneous populations tend to be concordant

The concordance ratio is useful for personal genomics customers, as it puts their IBS similarities to various other individuals in perspective. For example, a Greek should not be surprised if he matches a particular Tuscan more than he does a fellow Greek, nor should he seek mysterious Italian ancestors because of it, as such a discordant result occurs frequently.

The concordance ratio is also useful because it provides a truly model-free test of population differentiation:

It is different from techniques such as PCA which allow the separability of individuals from different populations by projecting them on a number of dimensions, the first few of which are usually correlated with the inter-population fraction of genetic diversity. Hence, individuals that appear well-separated on a few PCA dimensions may in fact be overall more genetically similar to individuals from other populations across the full marker set. The concordance ratio avoids any accusation of privileging aspects of the genome (the ones that differentiate populations), as it is based on a single genome-wide similarity function for individuals.

It is also different from clustering algorithms such as ADMIXTURE that infer allele frequencies in putative ancestral populations, again implicitly using markers with high frequency differences to estimate admixture proportions. Hence, a "match" in a marker with low population differentiation is treated differently as a source of evidence than a match in a marker of strong population differentiation. The concordance ratio avoids this issue by using a single similarity function for individuals that does not privilege one marker over another.

R Code

Code for the calculation of the concordance ratio can be downloaded from here as an R function. Two files are required:

A symmetrical similarity matrix, as output by the plink --matrix --cluster command. Any similarity matrix file in the PLINK MIBS format will do.

A file in which each row has a population name and the number of individuals from that population.

An example of the latter file count.txt for Experiment #3 is:

North_Italian 12
Russian 25
Orcadian 15
Sardinian 28
Tuscan 8
French 28
French_Basque 24
Adygei 17
Greek_D 12

Of course, individuals must appear in that order in the plink file, i.e., first the 12 North Italians, then the 25 Russians, etc.
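The ordering requirement amounts to assigning each population a contiguous block of rows in the similarity matrix. A minimal Python sketch of that bookkeeping follows; it assumes that the name and the count on each line are separated by whitespace, and the function name population_ranges is made up for the example.

```python
def population_ranges(lines):
    """Map each population name to the block of matrix rows it occupies.

    Assumes each non-empty line holds a name and a count separated by
    whitespace, and that individuals appear in the matrix in the same
    order as the lines of the count file.
    """
    ranges = {}
    start = 0
    for line in lines:
        if not line.strip():
            continue                      # skip blank lines
        name, count = line.split()
        ranges[name] = range(start, start + int(count))
        start += int(count)
    return ranges

example = """North_Italian 12
Russian 25
Orcadian 15
Sardinian 28
Tuscan 8
French 28
French_Basque 24
Adygei 17
Greek_D 12"""

ranges = population_ranges(example.splitlines())
print(ranges["Russian"])    # range(12, 37)
print(ranges["Greek_D"])    # range(157, 169)
```

Row indices here are zero-based, so the 12 North Italians occupy rows 0 to 11 and the 169 individuals of Experiment #3 end at row 168.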

Assuming you have such a file, e.g., in binary BED/BIM/FAM format, you first calculate the IBS matrix:

plink --matrix --cluster --bfile datafile

This creates a plink.mibs file. Then, in R, after changing to the directory that contains plink.mibs, count.txt and the source code gamma.r, you enter:

source("gamma.r")

gamma(simfile="plink.mibs", popfile="count.txt")
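For readers who do not use R, the same workflow can be sketched in Python. This is not a port of gamma.r, whose internals are not shown here; it assumes that the similarity file is a square, whitespace-delimited matrix with one row per individual and no header, and it uses a tiny made-up matrix written to a temporary file in place of a real plink.mibs. The function names and the populations P and Q are hypothetical.

```python
import os
import tempfile
from itertools import combinations

def read_similarity_matrix(path):
    """Read a square, whitespace-delimited similarity matrix (no header assumed)."""
    with open(path) as handle:
        return [[float(x) for x in line.split()] for line in handle if line.strip()]

def concordance_ratio(sim, idx_a, idx_b):
    """Fraction of trios (a, a', b) in which sim(a, a') exceeds both cross similarities."""
    concordant = sum(
        1
        for i, j in combinations(idx_a, 2)
        for k in idx_b
        if sim[i][j] > sim[i][k] and sim[i][j] > sim[j][k]
    )
    n, m = len(idx_a), len(idx_b)
    return concordant / (n * (n - 1) * m / 2)

def concordance_table(sim, ranges):
    """All ordered population pairs: key (first argument, second argument)."""
    return {
        (a, b): concordance_ratio(sim, ranges[a], ranges[b])
        for a in ranges
        for b in ranges
        if a != b
    }

# Made-up stand-in for plink.mibs: 2 individuals from P, then 2 from Q.
toy_rows = [
    "1.00 0.90 0.70 0.75",
    "0.90 1.00 0.72 0.74",
    "0.70 0.72 1.00 0.73",
    "0.75 0.74 0.73 1.00",
]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.mibs")
    with open(path, "w") as handle:
        handle.write("\n".join(toy_rows) + "\n")
    sim = read_similarity_matrix(path)

table = concordance_table(sim, {"P": range(0, 2), "Q": range(2, 4)})
for (first, second), value in table.items():
    print(first, "vs", second, value)
```

Each population needs at least two individuals when it is used as the first argument, since otherwise there are no pairs to test and the denominator is zero.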

PS: The concordance ratio should not be confused with Witherspoon's ω fraction. That is defined by comparing all pairs of between- and within- population distances, and ranges between 0 (highest concordance in my terminology) and 0.5 (lowest concordance). The concordance ratio, on the other hand, tests all possible trios of individuals, and it also has the asymmetrical property explained above.


Dienekes' Anthropology blog is dedicated to human population genetics, physical anthropology, archaeology, and history.

You are free to reuse any of the materials of this blog for non-commercial purposes, as long as you attribute them to Dienekes Pontikos and provide a link to either the individual blog entry or to Dienekes Anthropology Blog.
