The heart of the argument Pereltsvaig and Lewis present seems to be that some key assumptions in the model that Bayesian phylogeneticists are using to make inferences about the emergence and spread of Indo-European languages are wrong. And, those incorrect assumptions lead to empirical results which are also wrong. Though it was difficult for me to follow much of the deep dive into technical linguistics (thanks for that Asya!), some of the problems with inferences are pretty easy to see. They note that in the supplements of the 2012 paper (second one above) the Romani language is placed as an outgroup to the other members of the Indo-Aryan family. This seems wrong to Pereltsvaig and Lewis, and from what I know it is wrong. Linguistic consensus is that Romani dialects are related to those of Northwest India. It turns out that the genetics favors this, as their South Asian ancestry does seem to derive from Northwest Indian populations. We can go on with details in this vein, and the authors do, assembling a list of fallacious inferences, but what’s the root of the problem?

One of the major weaknesses brought up in The Indo-European Controversy: Facts and Fallacies in Historical Linguistics is that these Bayesian phylogenetic models utilize lexical information as data inputs, in particular a set of a few hundred cognates. There are two elements to the objection. First, the choice of cognates might be biased, or at least bias the output. Second, vocabulary may not be the best foundation on which to generate a phylogeny of language. Rather, something like grammar may be more phylogenetically informative. The authors of the above works under criticism actually state they’re trying to use grammar as an input too. But in any case, the tendency for vocabulary to be exchanged between nearby groups, irrespective of their phylogenetic origin, is presumably the reason that the Romani languages drifted far enough away from the other Indo-Aryan languages to seem like an outgroup. No matter how ingenious your method, if your input data is biased or not informative, your output is not likely to be useful. Pereltsvaig and Lewis allude to the fact that linguistics has not found its “atoms” yet. I’d state it differently: linguistics lacks its DNA sequence. Using a biological analogy, these linguistic applications of Bayesian phylogenetics are attempting to discern evolutionary history from phenotype.
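To make the borrowing problem concrete, here is a minimal sketch in Python. It is not the Bayesian method used in the papers, just a naive pairwise lexical distance, and the cognate codes are invented for illustration (the letters C, D and E stand in for hypothetical loanwords picked up from non-Indo-Aryan neighbors). It shows how replacing inherited words with loans inflates apparent distance even when the true phylogenetic relationship is unchanged.

```python
# Hypothetical cognate-class codes for five meanings.
# The values are invented for illustration; they are NOT real data.
cognates = {
    "Hindi":   ["A", "A", "A", "A", "A"],
    "Punjabi": ["A", "A", "A", "A", "B"],
    "Romani":  ["A", "C", "A", "D", "E"],  # C, D, E stand in for loanwords
}

def lexical_distance(x, y):
    """Fraction of meanings where two languages do NOT share a cognate class."""
    mismatches = sum(1 for a, b in zip(x, y) if a != b)
    return mismatches / len(x)

d_hindi_punjabi = lexical_distance(cognates["Hindi"], cognates["Punjabi"])
d_hindi_romani = lexical_distance(cognates["Hindi"], cognates["Romani"])

print("Hindi-Punjabi:", d_hindi_punjabi)
print("Hindi-Romani: ", d_hindi_romani)
```

In this toy case Romani looks far more divergent than Punjabi purely because of the borrowed items, which is the kind of artifact that could push a language toward an outgroup position.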

The second major problem with the papers coming out of the Bayesian phylogenetic tradition in linguistic history is an incorrect model assumption: that populations expand purely through diffusion-like processes. If you read the detailed methods it’s pretty clear that they’re converging on the joint posterior probability of tree given the data as well as the geographic distribution assuming a demic diffusion framework. The Indo-European Controversy tackles extensively the historiography of migrations, or lack thereof. Before World War II archaeologists naively traced migrations through the change in cultural forms, while after World War II the backlash became so strong that the null was always that pots, rather than people, were on the move. And, when people were on the move in pre-state societies, it was envisaged in almost a mechanical fashion, as individuals on the farming frontier had higher fertility, and so endogenous growth simply swamped out other groups like European hunter-gatherers. Part of its appeal isn’t just ideological, it’s an elegant model. Historical detail and contingency isn’t relevant, and inter-group conflict can be sidestepped. It’s all about endogenous growth of a population assuming particular resources, until it hits a Malthusian limit in the locality.
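The “endogenous growth until a Malthusian limit” idea can be sketched with a discrete logistic growth model. This is my own simplification, not the model in the papers (a full demic diffusion or wave-of-advance model would add a spatial diffusion term), and the growth rate, carrying capacity and starting population below are arbitrary assumptions.

```python
def logistic_trajectory(n0, rate, capacity, steps):
    """Discrete logistic growth: each step adds rate * N * (1 - N / capacity)."""
    trajectory = [float(n0)]
    for _ in range(steps):
        n = trajectory[-1]
        trajectory.append(n + rate * n * (1.0 - n / capacity))
    return trajectory

# Hypothetical parameters, chosen only to illustrate the shape of the curve.
trajectory = logistic_trajectory(n0=10, rate=0.03, capacity=1000.0, steps=500)

print("start:", trajectory[0])
print("end:  ", round(trajectory[-1], 2))
```

The population rises smoothly and then flattens as it approaches the local limit. Nothing in the model allows for sudden replacement, conflict, or any other historical contingency, which is exactly its appeal and its weakness.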

Unfortunately this model is almost certainly wrong for human history. Ancient DNA has revolutionized everything, because it has shown just how punctuated demographic shifts can be. Ancient DNA reveals key stages in the formation of central European mitochondrial genetic diversity highlighted this dynamic a few years back. More recently, Population genomics of Bronze Age Eurasia and Massive migration from the steppe is a source for Indo-European languages in Europe indicate discontinuity. I want to emphasize the term discontinuity, as this is very different from gradual diffusion. Rather than a methodologically individualistic model, where higher fertility in farmsteads or at least villages gradually resulted in the transition from one group to another, a more likely scenario in my opinion is inter-group tension, conflict, and amalgamation. In some cases, near total replacement. It may not always have been violent. Rather, agriculturalists on the Malthusian margins may not have been able to withstand the shock of a new culture arriving and sequestering critical resources (an analogy I’m thinking of is the massive collapse of Roman culture in the Balkans whenever the imperial limes withdrew toward the coasts; without state support and scaffolding the way of life of the Latin peasantry just wasn’t feasible, so they quickly migrated or died off).

For example, it looks as if the Uygurs are not descended in large part from the first Indo-Europeans on the fringes of western China. I took the data the Reich lab posted and, after reducing the number of populations, ran TreeMix on it. Below are 10 plots. The West Eurasian ancestry of the Uygurs is not overwhelmingly Northern European-like. Weirdly the graphs below suggest it is somewhat less Northern European than the West Eurasian ancestry contributing to the Hazara! Though that may be an artifact of some sort. The point is that, as suggested by many scholars, it seems highly likely that the Indo-European population of the Tarim basin was a composite, and that Tocharians and Indo-Iranians were both present. And they probably did not appear at the same time.

So a second question that came to mind has to do with the origin of the Indo-Aryans, and the genetic history of the Indian subcontinent. About five years ago I told John Hawks that I was skeptical of too much European-like contribution to the Indian population because not enough European pigmentation alleles were segregating in the population. My inference was based on a wrong assumption. It turns out that the earliest steppe dwellers were not particularly pale of mien going by their genetic architecture on pigmentation loci. My objection has no basis, because the modern European phenotype is very new, and likely post-dates the arrival of Indo-Europeans to India. Additionally, there is suggestive evidence of a steppe connection, such as the widespread presence of the “European” allele for lactase persistence in Northwest India. This allele is new, and swept up in frequency very recently. Its presence in Northwest India almost certainly indicates non-trivial demographic connections.

The blogger at Eurogenes has illustrated the dynamic, but it’s pretty obvious that Northwest Indian populations have some affinity to the Yamnaya population in particular. Below are the results from TreeMix using a narrower set of populations than above. Notice how Pathan tends to move toward the Yamnaya…..

But why the affinity to the Pathan, and not the Iranian samples? Who knows. I’ll pull down the data set from the Willerslev lab soon, but I think ancient DNA from India is going to have to answer the question. But I’m curious how the “Out of India” people spin this, because they will have a ridiculous rationale….

A new paper in Science claims to have ascertained the locus of origin of the Indo-Europeans, Mapping the Origins and Expansion of the Indo-European Language Family. These are bold claims, and naturally have triggered a firestorm. No surprise, the same happened with these researchers when they published the result in 2003 that Proto-Indo-European flourished ~9,000 years ago, in alignment with an “Anatolian hypothesis,” as opposed to a “Steppe/Kurgan hypothesis.” The original paper in 2003 utilized phylogenetic methods which are common within biology, and applied them to linguistics. This second paper now incorporates spatial information into their model, to generate an explicit locus of origination, in addition to the dates for the bifurcations of the node.

In relation to results I think that the figure to the left is the most important, because it gives us their inferred dates of separation between various Indo-European language families. Observe that Italic and Celtic did not diverge in prehistory, but in history (i.e., the Sumerians and Egyptians were flourishing at the time). Additionally, the diversification pattern is not a simple “rake,” there is internal structure. They may date the origin of Indo-European languages to the early Holocene, but the diversification seems to have happened in steps and pulses. Though the authors support the Anatolian hypothesis, they also seem quite comfortable acknowledging that the real story is more complex, though you wouldn’t get that from the media.

Dr. Anthony, noting that neither he nor Dr. Atkinson is a linguist, said that cognates were only one ingredient for reconstructing language trees, and that grammar and sound changes should also be used. Dr. Atkinson’s reconstruction is “a one-legged stool, so it’s not surprising that the tree it produces contains language groupings that would not survive if you included morphology and sound changes,” Dr. Anthony said.

Dr. Atkinson responded that he did indeed run his computer simulation on a grammar-based tree constructed by Don Ringe, an expert on Indo-European at the University of Pennsylvania, but that the resulting origin was, again, Anatolia, not the Pontic steppe.

There’s an asymmetry here. The historical linguists have compelling and transparent rationales to make for why the Steppe thesis should be preferred over the Anatolian one. Lay persons can make assessments about historical linguistic models which are based on common sense, such as words which span all Indo-European languages and might give clues to the geographical and temporal point of origin. In response, you have Bayesian phylogenetics. At some point in the future I suspect all of this research will make recourse to Bayesian phylogenetics, but at this stage of the game even most people who use Bayesian phylogenetic packages don’t really understand how they work.

I may not grok the methods in detail, but I do appreciate that the authors simulated data to test their methods, and, that their methods worked for cases where we know the answer. For example, the method correctly inferred the geographical origin of the Romance languages, and their time of diversification. But in this situation we know the answer. How about in cases where we don’t?

I noticed this strange plot in the supplements. I’ve highlighted Romani, the language of the Roma. The fact that Romani is an outgroup to the Indo-Aryan languages illustrates some deep problem with their method. Romani did not start diverging from other Indo-Aryan languages 3-3,500 years ago. It started diverging 1-1,500 years ago. We know this because that’s when the Roma start showing up in the Islamic world and parts of southeast Europe. It may be that the most diverged Indo-Aryan language just happened to be the one which migrated out of India, but I don’t think that’s the case. Rather, the non-Indo-Aryan influences on Romani must be impacting its affinity to other Indo-Aryan languages, even if they are core words.

With that skepticism entered into the record, I can broadly credit the possibility proposed here in the most general sense. We know from genetic clustering algorithms that Indo-European populations within Europe seem enriched for a “West Asian” element vis-a-vis their non-Indo-European neighbors. I’m talking here mostly about the Basque and Finns, though arguably the Sardinians were Indo-Europeanized only during the Roman era, and they should count as well. But I’m pretty sure that the Indo-Aryans are the ones who brought the “European” component found in low levels across northwest South Asia to the subcontinent. The Indo-Iranians diverged from the European Indo-Europeans ~4,000 BC, and I suspect this may have happened along the broad trans-Caucasian and Russian fringe. This is where contact was made with Uralic peoples. The authors of the paper themselves point to the viability of the Kurgan hypothesis in this modified form in the text. I don’t see why the archaeologists are all worked up though (unlike the historical linguists).

(Republished from Discover/GNXP by permission of author or representative)

A few days ago I observed that pseudonymous blogger Dienekes Pontikos seemed intent on throwing as much data and interpretation into the public domain via his Dodecad Ancestry Project as possible. What are the long term implications of this? I know that Dienekes has been cited in the academic literature, but it seems more plausible that this sort of project will simply distort the nature of academic investigation. Distort has negative connotations, but it need not be deleterious at all. Academic institutions have legal constraints on what data they can use and how they can use it (see why Genomes Unzipped started). Not so with Dienekes’ project. He began soliciting for data ~2 months ago, and Dodecad has already yielded a rich set of results (granted, it would not be possible without academically funded public domain software, such as ADMIXTURE). Even if researchers don’t cite his results (and no doubt some will), he’s reshaping the broader framework. In other words, he’s implicitly updating everyone’s priors. Sometimes it isn’t even a matter of new information, as much as putting a spotlight on information which was already there. Below is a slice of a bar plot from Worldwide Human Relationships Inferred from Genome-Wide Patterns of Variation. It uses STRUCTURE with K = 7. To the right of the STRUCTURE slice are two plots of individual data on French and French Basque from the same HGDP data set using ADMIXTURE at K = 10 from Dodecad.

Repeated runs and higher K’s make it clear that the French Basque lack a “West Asian” aspect which other French, and Iberians as well, have. Some of this is clear in the paper I referenced above as well…the key is you have to look at the supplements at K = 6. Because the Basque are the only native non-Indo-European speakers in Western Europe, their origin and relationship to nearby populations has always been of interest (they also have the highest Rh- frequency of world populations). Granted, the French Basque are very similar genetically to the French as a whole. But, it is obviously highly informative that they lack an ancestral component in totality which seems to exist at low but consistent levels across Western European populations. The only other European population at K = 15 who lack the West Asian component in totality are Finns (the Lithuanians come very close).

This is all preamble to a discussion of a post Dienekes put up today, A solution to the problem of Indo-Aryan origins. Remember that Dienekes has been “playing” with ADMIXTURE for only a few months. To claim to have found a ‘solution’ to a problem as intellectually and politically intractable and explosive as this is rather bold. The crux of the matter is that at certain confluences of K’s and population sets Dienekes has discovered a distinctive signature of ancestry which seems to be modal on the north slope of the Caucasus, and spans India and Europe. He terms this “Dagestani,” due to the fact that among a population sample from this province in Russia this ancestral component is overwhelmingly dominant. The patterns of Dagestani admixture in Europe and India are curious and suggestive.

1 – In Europe the frequencies are low, but irregularly distributed (excepting around the North Caucasus). Scandinavians and British have appreciable fractions, Finns and Southern Europeans do not. Here’s Dienekes:

Interpreting this pattern is not easy, but it does seem that this component seems to have a V-like distribution, achieving its maximum in Caucasus and its environs, then undergoing a diminution, and achieving a secondary (lower) frequency mode in NW Europe.

The surprising appearance of the homonymous Dagestan component in India suggests a widespread presence of a common ancestry element. The West Asian element, by comparison seems to have a more normal Λ-like distribution around its center in Anatolia-Caucasus-Iran region. It does reach the Atlantic coast, but is lacking in Scandinavia and Finland, and also in India itself.

2 – South Indian Brahmins have appreciable fractions, but non-Brahmins in the same region do not. In contrast, those who come from Indo-Aryan speaking backgrounds do seem to have Dagestani ancestral components, irrespective of other aspects of ancestry. For example Pakistanis don’t have that much more Dagestani than South Indian Brahmins or Gujaratis. Also compare the relatively narrow window of Dagestani ancestry variance among Dodecad South Asians (I’m DOD075). DOD088 is from what I recall a Reddy from Andhra Pradesh, a non-Brahmin but non-low caste. It is interesting that they have a high proportion of “Pakistan,” but no Dagestani. I have ~10% Dagestani, but no Pakistani.

Below is K = 10 for a selection of populations. Dienekes has now included two non-Indo-European speaking Pakistani populations: the Brahui (Dravidian) and Burusho (linguistic isolate in the mountains of Pakistan):

Some general patterns are evident. The light blue is indicative of generic “Indian” ancestry. It is not found in appreciable proportions outside of subcontinental populations (or those of recent subcontinental origin). The same goes for the red and light orange. For your reference the dark orange is a “Northern European” component, modal in Lithuania. The light and dark green are both East Asian components. The dark blue is a “West Asian” component modal in Georgia, and prominent across Europe, declining as a function of distance from the eastern shore of the Black Sea (this is surely the West Asian which distinguishes the French from the French Basque). I believe that the light purple dominant in the Brahui and the light red dominant in the Burusho probably form as a compound the aforementioned Pakistani component. The dark purple is the Dagestani.

First, a word on the Brahui. These are a group of tribes who reside in northern Balochistan in Pakistan. A small number are even to be found in Afghanistan. Historically they have had close relations with the Baloch, an Iranian speaking cluster of tribes who totally envelop the Brahui. The Brahui do speak a Dravidian language, of a family dominant in South India and found in isolated regions of Central and Eastern India. There are two broad models for the existence of a Dravidian language in Pakistan. The first is that the Brahui are remnants of more widely spoken Dravidian languages which date back to the Indus Valley civilization. The second is that the Brahui arrived during the medieval period from another region of South Asia where Dravidian languages were more common. Assuming either model, it has long been presumed that their involution by the Baloch has had a strong impact on the Brahui genetically; the two groups are very close. This is evident in Dienekes’ results as well. But observe that the Baloch are the group which seems more cosmopolitan in ancestry than the Brahui. If the Brahui were Dravidians from deep in India it seems that they would have a greater residual component of India-specific ancestry (light blue and orange). This is not so. In fact the Baloch have more of the Indian ancestral component than the Brahui. The Brahui component is found across Pakistan, and into India, albeit at lower proportions. Naturally, the Baloch have the second highest fraction. I believe these results should shift us toward the position that the Brahui are indigenous in relation to the Baloch, and that the Baloch ethnic identity emerged through the shift of a Brahui substrate, as evidenced by the greater cosmopolitanism of the Baloch. Additionally, Dienekes observes that the Brahui have a lower proportion of the Dagestani component than most other Pakistani groups, and several Indo-Aryan groups in India proper.

The Burusho are even more interesting than the Brahui. Unlike the Brahui the Burusho are very isolated in the mountainous fastness of Baltistan in northern Pakistan. Additionally, their language, Burushaski, is a linguistic isolate. Others of the class are Basque and Sumerian. In general it is assumed that linguistic isolates were once part of broader families of languages which have gone extinct. Burushaski probably persists in large part because of the geography which its speakers inhabit. Mountainous areas often preserve ethnic and linguistic diversity because the terrain allows for the persistence of local variety. I believe it is plausible that the Burusho have been far more isolated than the Brahui. This seems to show up in the ADMIXTURE plot: the Burusho have a greater proportion of their modal ancestral component than the Brahui. Additionally, the Burusho have an even smaller component of Dagestani than the Brahui.

Below is a chart Dienekes constructed ordered by proportion of Dagestani for his South Asian populations. Next to it I’ve placed a chart from a PCA which has some of the same population samples. Compare & contrast:

The PCA is looking at between population variation in totality. So naturally the Dagestani component isn’t going to be predictive of that. Rather, it speaks to the possibility which Dienekes is mooting: that the Dagestani component spread in the Indian subcontinent with the Indo-Aryans specifically, overlying the local resident substrate. In South India this meant that Brahmins brought this, mixing with the indigenous Dravidian population. In Pakistan the Indo-Aryans, and Iranians, were overlain on a substrate which were the ancestors of the Burusho and Brahui. The dominant signal of genetic relationship has to do with the substrate, not the Indo-Aryans. So that’s what’s going to show up on the PCA. In other PCA plots the model where South Indian Brahmins are a linear combination of a Pakistani-like population and a Dravidian population becomes clearer. But when you look at ancestry using something like ADMIXTURE you have the potential to tease apart different components, and so uncover relationships which may have been obscured when looking at aggregate variation.
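The “linear combination” idea can be sketched in a few lines of Python. This is not what ADMIXTURE or PCA do internally. It is a simple least-squares estimate of a two-way mixing fraction from allele frequencies, and every frequency below is invented for illustration rather than taken from any real population.

```python
def admixture_fraction(mixed, source_a, source_b):
    """Least-squares alpha such that mixed ~ alpha*source_a + (1-alpha)*source_b."""
    num = sum((m - b) * (a - b) for m, a, b in zip(mixed, source_a, source_b))
    den = sum((a - b) ** 2 for a, b in zip(source_a, source_b))
    if den == 0:
        raise ValueError("source populations are identical at every marker")
    return num / den

# Hypothetical allele frequencies at four markers (NOT real data).
source_a = [0.10, 0.80, 0.30, 0.60]   # stand-in for a "Pakistani-like" source
source_b = [0.50, 0.20, 0.70, 0.10]   # stand-in for a "Dravidian" source

# Build a synthetic mixed population with a known 30% contribution from A.
true_alpha = 0.3
mixed = [true_alpha * a + (1 - true_alpha) * b
         for a, b in zip(source_a, source_b)]

alpha = admixture_fraction(mixed, source_a, source_b)
print("estimated fraction from source A:", round(alpha, 3))
```

A population that really is a two-way mix falls on the line between its two sources, which is why it shows up as an intermediate point on a PCA. A third, minor component such as the Dagestani one barely moves that point, so it only becomes visible when ancestry is decomposed explicitly.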

Dienekes’ model seems to posit three steps in rapid succession ~4,000 years ago. A background variable which must be mentioned is that one must account for the Mitanni, a dominant Syrian power circa 1500 BC where a non-Indo-European language was the lingua franca, and yet a definite Indo-Aryan element existed within the elite. Indo-Aryan specifically because the Indo-European element within the Mitanni was not Iranian, but specifically Indo-Aryan. An easy explanation for this is that the Indo-Aryan component of the Indo-Iranian branch of the Indo-European languages crystallized outside South Asia, and independently reached Syria and India. In Syria it went extinct, while in India it obviously did not. By Dienekes’ model the Mitanni would be rather closer to the urheimat of the Indo-Aryans.

An aspect of his model which I do not understand is why it has to be Indo-Aryan, instead of Indo-Iranian. The South Asian population in which the Dagestani component is modal, the Pathans, are Iranian, not Indo-Aryan. Additionally, this model seems to not speak in detail to the existence of the Dagestani element among Europeans. Here is a sorting of European populations (with Iranians included) by the Dagestani component:

| Population | Dagestan |
|---|---|
| Urkarah | 93 |
| Lezgins | 47.9 |
| Stalskoe | 38.7 |
| Adygei | 16.4 |
| Orcadian (Orkney) | 12.6 |
| Georgians | 12.4 |
| White_Utahns | 11.2 |
| Iranian | 10.9 |
| Scandinavian_D | 10.2 |
| Armenian_D | 9.9 |
| German_D | 9.1 |
| Turks | 8.8 |
| Armenians | 8.4 |
| French | 7.9 |
| Hungarians | 7.5 |
| Russian_D | 6.3 |
| Spanish_D | 4.6 |
| North_Italian | 4.5 |
| Spaniards | 4.4 |
| Romanian | 4.1 |
| Finnish_D | 4.1 |
| Russian | 4 |
| Greek_D | 3.8 |
| Portuguese_D | 3.6 |
| Tuscan | 3.5 |
| Tuscans | 3.4 |
| Lithuanians | 2.9 |
| S_Italian_Sicilian_D | 2.8 |
| Belorussian | 2.5 |
| Cypriots | 2 |
| Sardinian | 1.5 |
| French_Basque | 0.7 |

There is here a strange pattern of rapid drop off from the Caucasus, and a bounce back very far away, on the margins of Germanic Northwestern Europe. This to me indicates some sort of leapfrog dynamic. A well known illustration of this would be the Ugric languages. The existence of Hungarian on what was Roman Pannonia is a function of the mobility and power of Magyar horsemen, and their cultural domination over the Romance and Slavic speaking peasantry (their genetic impact seems to have been slight). No one believes that Germanic languages are closely related to Indo-Aryan (rather, if there is structure in Indo-European beyond Indo-Iranian, Celtic, etc., it would place the Indo-Iranian languages with Slavic). So what’s going on? I think perhaps the Dagestani component is in part a reflection of the common Indo-European origin in that region. For whatever reason that signal is diminished in much of the rest of Europe. Perhaps Southern Europe was much more densely populated when the Indo-Europeans arrived. Additionally, it seems highly likely that in places like Sardinia, much of Spain, and Cyprus, Indo-European speech came through cultural diffusion (elite emulation) and not population movement. Or perhaps we’re seeing the vague shadows of population admixtures on the Pontic steppe, where distinct Germanic and Indo-Iranian confederations admixed with a common North Caucasian substrate.
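For anyone who wants to poke at the numbers, the table above can be dropped into a short Python script. The percentages are copied from the table; the 10% and 3% cutoffs are arbitrary choices of mine, used only to show which populations sit at the two ends of the distribution.

```python
# Dagestani component (%) per population, copied from the table above.
dagestani = {
    "Urkarah": 93.0, "Lezgins": 47.9, "Stalskoe": 38.7, "Adygei": 16.4,
    "Orcadian (Orkney)": 12.6, "Georgians": 12.4, "White_Utahns": 11.2,
    "Iranian": 10.9, "Scandinavian_D": 10.2, "Armenian_D": 9.9,
    "German_D": 9.1, "Turks": 8.8, "Armenians": 8.4, "French": 7.9,
    "Hungarians": 7.5, "Russian_D": 6.3, "Spanish_D": 4.6,
    "North_Italian": 4.5, "Spaniards": 4.4, "Romanian": 4.1,
    "Finnish_D": 4.1, "Russian": 4.0, "Greek_D": 3.8, "Portuguese_D": 3.6,
    "Tuscan": 3.5, "Tuscans": 3.4, "Lithuanians": 2.9,
    "S_Italian_Sicilian_D": 2.8, "Belorussian": 2.5, "Cypriots": 2.0,
    "Sardinian": 1.5, "French_Basque": 0.7,
}

def at_or_above(threshold):
    """Populations whose Dagestani share is at least `threshold` percent."""
    return [name for name, pct in dagestani.items() if pct >= threshold]

def below(threshold):
    """Populations whose Dagestani share is under `threshold` percent."""
    return [name for name, pct in dagestani.items() if pct < threshold]

high = at_or_above(10.0)   # arbitrary cutoff
low = below(3.0)           # arbitrary cutoff

print("10% or more:", high)
print("under 3%:   ", low)
```

With these cutoffs the high group mixes Caucasus populations with Orcadians, Utah whites and Scandinavians, while the low group runs from Lithuanians and Belorussians down to Sardinians and French Basque.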

Going back to India, let’s revisit the model of a two-way admixture between “Ancestral North Indians,” who were genetically similar to Europeans and West Asians, and “Ancestral South Indians,” who were closer to, but not very close to, East Eurasians. The ANI & ASI. The ASI were probably one of the ancient populations along the fringe of southern Eurasia, all of whom have been submerged by demographic movements from other parts of Eurasia over the past 10,000 years, excepting a few groups such as the Andaman Islanders and some Southeast Asian tribes. The model was admittedly a simplification. But taking that model as a given, and accepting that the Dagestani element is indeed Indo-Aryan, we can infer that the ANI were not Indo-European. It is notable that the South Indian Brahmins have elevated fractions of both the Brahui and Burusho modal components. This is probably indicative of admixture of the Indo-Aryan element in the Indus Valley, prior to their expansion to other parts of India. I assume one of the languages spoken was Dravidian, though if ancient Mesopotamia was linguistically polyglot at the dawn of history I would not be surprised if the much more geographically extensive Indus Valley civilization was as well.

Aishwarya Rai

The irony is that today when someone refers to a “Dravidian” physical type, they’re not talking about someone who looks like a Pakistani. They’re talking about someone who looks South Indian, where most Dravidian languages are spoken. But combining the inference from Dienekes’ model and the previous two-way admixture model, you reach the conclusion that lighter skin and more West Asian features among South Asians may be more due to Dravidian-speaking ancestors in the Indus Valley, not Indo-Aryans! It goes to show the wisdom of differentiating linguistic classes from biological ones when discussing historical population genetics. Unfortunately it is a wisdom that most of us interested in these topics do not show, alas.

As I like to say, interesting times….

Note: If you leave a comment, please don’t be smarter-than-thou in your tone. I have stopped publishing those sorts of comments because the reality is that most of them have not been that smart or informed. At least by my estimation. If you actually are smarter than the average bear, and impress me with your erudition and analytical clarity, I’ll probably let your comment through no matter your attitude. But I wouldn’t bet on it if I were you, so show some class and humility. Most of us are muddling through.

Image Credit: Georges Biard, iStockPhoto
