
Saturday, November 12, 2016

Days of high adventure

I've redesigned and streamlined my Principal Component Analysis (PCA) plot of West Eurasia in anticipation of the arrival of many more ancient samples. Rumor has it we'll not only get stuff from the Balkans, but also finally from the steppes north of the Black Sea and South Asia.
I'd say my new plot does a better job of highlighting relationships between the different prehistoric groups and population shifts across space and time. The datasheet is available here.

It should be pretty clear from this plot how the modern-day European gene pool came about. So I don't expect any major surprises when the new samples come in. Nevertheless, the wait is killing me, and many others I'm sure.
Update 13/11/2017: A year on, I've acquired many new samples and, as a result, streamlined the PCA some more, see...
Who's your (proto) daddy Western Europeans?

235 comments:

In fact, many scholars have hypothesized a connection with the Dardic speakers.

Also, factually speaking, much of what was Gandhara is today occupied by small, isolated Dardic and Nuristani ethnic groups (Kohistanis, the Tirahi, the Pashayi, the Kalasha, the Torwali people, all the Nuristani groups of northeastern Afghanistan, etc.). Pashtuns are very late arrivals in these parts, and still speak of their supposed migration from the Kabul-Ghazni area.

Factually speaking, the Hindkowan have always lived in the far eastern portion of this region, and that too only in the urban centers (Peshawar, Kohat, Swabi, etc).

Hindkowan are culturally very similar to Punjabis, and their language is usually confused with it (even though it is actually closer to Sindhi). They clearly exemplify a history of "Indian" settlement in this region. Their roots here are likely to be almost as recent as Pashtuns, who again are also understood to be from "somewhere else".

Basically, the Dardic and Nuristani people have had a much more ancient history in much of this area, compared to both Pashtuns and Hindkowans.

On a different note, I finally found a modelling setup that is (to my eyes) absolutely perfect.

Once David gets his fundraising project going, I have some new individuals for him to run in his global PCA, which I then want to test with this new setup.

After that, I'll post the results (so, later this week, or early next week).

Very airtight stuff, extremely consistent, and quite unambiguous. I'd say it is a huge improvement on my previous modelling setup.

I am not sure whether you have seen the file fuzzy8.csv in https://www.dropbox.com/sh/4njpy307pui1kpq/AACx32eLw9s363IrqhsW4wifa?dl=0
This fuzzy clustering is done with fclust, so it cannot have exactly the same clusters as in mclust. But the cluster plots are similar. You will understand that these unsupervised clusters are not located at exactly the positions where we ideally expect them. An important factor is the unbalanced sampling of the populations. This fuzzy clustering should more or less get the same results as nMonte. But an important advantage is that all the populations are jointly estimated with a standard specification. That is far better than everybody rolling his own nMonte specification. One might characterize this clustering as a 'big data 8Mix'.

A limitation of fuzzy8 is that it is limited to West_Eurasian populations. At the moment I am studying the cluster structure of global Dstats. It is evident that the SSA pops are gigantic outliers. I really had to drop them. And then new extreme values arise. For instance Papua, Onge and Dravida on dimension 3. But they are too few and too far apart to form their own cluster. I will see what I can do. But the clusters will not be nearly so nice as in fuzzy8.
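For readers who have not used fuzzy clustering, the idea behind a membership table like fuzzy8.csv can be sketched in a few lines. This is a minimal fuzzy c-means in plain Python, not the fclust code used above. The toy points, the function names and the farthest-point initialisation are all assumptions made for illustration. Each row of `U` holds one sample's memberships across the clusters and sums to 1.

```python
import math

def dist(a, b):
    """Euclidean distance between two coordinate sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def memberships(points, centers, m):
    """Membership matrix U: U[i][j] is the degree to which point i
    belongs to cluster j. Each row sums to 1."""
    power = 2.0 / (m - 1.0)
    U = []
    for p in points:
        # Inverse-distance weights; the floor avoids division by zero
        # when a point coincides with a center.
        inv = [max(dist(p, c), 1e-12) ** -power for c in centers]
        total = sum(inv)
        U.append([v / total for v in inv])
    return U

def farthest_point_init(points, k):
    """Deterministic start (an assumption of this sketch): take the first
    point, then repeatedly add the point farthest from the chosen centers."""
    centers = [tuple(points[0])]
    while len(centers) < k:
        nxt = max(points, key=lambda p: min(dist(p, c) for c in centers))
        centers.append(tuple(nxt))
    return centers

def fuzzy_cmeans(points, k, m=2.0, iters=200, tol=1e-9):
    """Alternate membership and center updates until the centers settle."""
    centers = farthest_point_init(points, k)
    dim = len(points[0])
    for _ in range(iters):
        U = memberships(points, centers, m)
        new_centers = []
        for j in range(k):
            # Each center is the mean of all points weighted by u_ij ** m.
            w = [row[j] ** m for row in U]
            total = sum(w)
            new_centers.append(tuple(
                sum(wi * p[a] for wi, p in zip(w, points)) / total
                for a in range(dim)
            ))
        shift = max(dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    return centers, memberships(points, centers, m)

# Toy data: two tight groups plus one sample sitting between them.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),
          (2.5, 2.5)]
centers, U = fuzzy_cmeans(points, k=2, m=2.0)
for p, row in zip(points, U):
    print(p, [round(u, 3) for u in row])
```

The in-between sample ends up with roughly equal membership in both clusters, which is the kind of mixed row that reads like an admixture estimate.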

@Matt, Sein
Maybe I was too pessimistic about the global Dstats clusters. There is a nice Austronesian cluster: Atayal, Dai, Onge, Papuan and Ulchi. Also I find a cluster from SC-Asia plus Caucasus_HG plus Tajiks; plus Ust_Ishim as a satellite. Is there any way this is plausible, or should I remove Ust_Ishim as an outlier, like I did the Africans?

Seems like too much Bonda and CHG. I guess the imputation probably isn't as good at the distance for a far out group like Austroasiatic Bonda as I would've thought from the neighbour joining, and/or there's a difference in the scale of the two PCA being imputed together that causes distortion.

@huijbregts „The only problem was assigning labels to the clusters. Maybe somebody has better ideas.”
I have reproduced your results with mclust and fclust. It is very interesting.

http://s18.postimg.org/t4f6pq1k9/screenshot_93.png

mclust plot:

http://s18.postimg.org/6v19j663t/screenshot_94.png

fclust plot:

http://s18.postimg.org/g5ddmpgtl/fclust.png

Clusters Clus5 in fclust and V5 in mclust are dominated by Slavic and Baltic populations. They correspond to the Baltic component in the K13 or K15 Eurogenes Admixture runs. That Sintashta and Andronovo are also in this cluster corresponds very well with Indo-Iranian, Slavic and Baltic languages being closely related.

The Pashtuns and Baloch seem to be outliers among Iranic speakers in that they have retroflex plosives in their speech. Is this true of other languages of Afghanistan? What about Tajik and Dari? What about Uzbeks and Hazaras?

I'll have time to make more substantive comments later this week, but I don't want to leave you hanging, so I'll quickly chime in.

Balochi, being a Northwestern Iranian language (like the Kurdish dialects/languages), probably developed retroflex plosives under the influence of Indo-Aryan languages.

The same is quite plausible for Pashto. But, I have read papers in which it is argued that this is an archaic feature preserved in Pashto, not a later development.

Besides Pashto, there is another "East Iranian" language which has retroflex plosives, Ormuri/Burki.

This is spoken by an Iranian people in the Pakistani area of northern Waziristan, very close to the Afghanistan border. There used to be substantial Ormuri/Burki communities in Afghanistan, but I've been told they've "gone Pashtun", and the only true Ormuri/Burki speakers are now to be found solely in the tribal region of northwestern Pakistan.

This family of languages/dialects seems to be a substratum across Afghanistan/northwestern Pakistan. For example, many place names as far north as Mohmand, and as far west as Kabul, are words from the Ormuri/Burki cluster of languages. And all the Karlani Pashtuns speak Pashto in a manner that displays a clear imprint from Ormuri/Burki.

In Waziristan, these people are surrounded by some of the most volatile, violence-prone Pashtun tribes around. Yet, they have still managed to preserve their very distinct Iranian language, although it seems the number of speakers is dwindling (I've heard their language, doesn't sound like Pashto, but again, it has retroflexion, and is construed as being East Iranian, just like Pashto).

Fun fact, the Pashto alphabet was developed by a man of Ormuri origins (he was also a religious heretic, and launched a rebellion against the Mughals).

Anyway, besides Pashto, Ormuri, and Balochi, no other non-Dardic/Nuristani languages in Afghanistan have retroflex plosives.

Tajiki/Dari/Hazaragi, and the Turkic language of the Uzbeks, are (to my knowledge) free of retroflex plosives.

@EastPole
This is a nice way to compare the two sets of clustering results. Your mclust results closely match my fclust results in fuzzy8.csv. But your fclust results are different and even implausible. They imply that all the Eastern Europeans are 100% cluster 5! Something must have gone wrong.

I did not know that fuzzy clustering is possible with mclust. After all, in mclust the mixing probabilities are defined on the dataset level, not on the individual level. Also I could not find documentation or references. But your fuzzy mclust data closely match my fclust results, so it seems that I have missed something. How did you do it? Could you post the most relevant mclust lines of your code? Thanks.

@Huijbregts
I didn’t do any fuzzy clustering with mclust.
I removed outliers, averaged the data, did PCA and used the 4 relevant dimensions PC1-PC4 for clustering as a data set named CWEtestPCA. First I did mclust to determine the number of clusters:

Here CWEtestPCAMCLUST$z is a matrix of probabilities defined as: “A matrix whose [i,k]th entry is the probability that observation i in the data belongs to the kth class, for the initial solution (ie before any combining). Typically, the one returned by Mclust/BIC”
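To make the quoted definition concrete, here is a minimal sketch of how a matrix like `z` arises from a fitted Gaussian mixture. It is not mclust's code. The one-dimensional setup, the function names and the parameter values are hypothetical, chosen only so that the arithmetic is easy to follow. The mixing weights are defined once for the whole dataset, while each row of `z` is computed per observation.

```python
import math

def gaussian_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and sd."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def posterior_matrix(xs, weights, means, sds):
    """z[i][k] = probability that observation i belongs to component k,
    given dataset-level mixing weights and per-component parameters."""
    z = []
    for x in xs:
        # Joint weight of each component for this observation.
        joint = [w * gaussian_pdf(x, mu, sd)
                 for w, mu, sd in zip(weights, means, sds)]
        total = sum(joint)
        z.append([j / total for j in joint])
    return z

# Hypothetical fitted mixture: two well-separated components.
weights = [0.5, 0.5]
means = [0.0, 10.0]
sds = [1.0, 1.0]

# Two observations near a component mean and one exactly halfway.
xs = [0.2, 9.7, 5.0]
z = posterior_matrix(xs, weights, means, sds)
for x, row in zip(xs, z):
    print(x, [round(p, 4) for p in row])
```

When the components are tight and far apart, as in this toy case, almost every row is close to 100% for a single component. Only observations that fall between components get a genuinely mixed row.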

The z-matrix of the mclust output is the table on the right hand side, with V1-V8 clusters:

http://s18.postimg.org/t4f6pq1k9/screenshot_93.png

The table on the left hand side is clust$U i.e. the result of fclust applied to the same CWEtestPCA data set:

@EastPole
This is very helpful. So the left table should be identical to fuzzy8.csv because it is also produced by fclust. I understand the definition in the right table as meaning that these are the initial values of an iteration process of mclust. Judging from the many 100% values it is pretty useless. Thanks again.

@aniasi ".. what exactly was the difference between Iran Neolithic and Iran Hotu?"

Iran Neolithic comes from the Central Zagros eastern foothills. Actually, there have been several Central Zagros samples from different periods/cultures tested. Dave in his AdmixQ13 refers to I1945 and I1290 as Iran_Neol. Both are from around 8,000 BC or a bit younger, and represent the pre-ceramic Zagros "Sub-Neolithic", with evidence of goat domestication but few signs of cereal cultivation except for possibly two-rowed barley.
https://en.wikipedia.org/wiki/Ganj_Dareh

Iran_Hotu stems from Hotu Cave on the SE shore of the Caspian Sea, ca. 850 km ENE of the Central Zagros Ganj Dareh site, across the Iranian Plateau and the Alborz mountains. Both should have been a formidable migration barrier during the LGM and later cold phases (Dryas etc.). IIRC, the sample hasn't been directly AMS dated. Based on the stratigraphic context, it is assumed to represent Hotu Cave's "Maritime Mesolithic" (ca. 9,100 - 8,600 BCE), with a diet strongly based on seal hunting and other aquatic resources (fish, waterfowl etc.). Aside from the lack of direct AMS dating, the aquatic diet suggests further caution on the dating due to possible reservoir effects. In any case, the "Sub-Neolithic" (domesticated animals, but little farming) appears to have reached the Caspian coast only during the 7th millennium BC, roughly a millennium after it appears in the Central Zagros.
https://en.wikipedia.org/wiki/Huto_and_Kamarband_Caves

This is really nice. I have done a fuzzy cluster analysis on the Dstats. This time I like the fuzzy clusters better than the mclust. CHG is now in a Caucasian cluster as it should be and the Siberian cluster is broken up in East and West.

Go to the file Dstats_fuzzy7.csv in https://www.dropbox.com/sh/4njpy307pui1kpq/AACx32eLw9s363IrqhsW4wifa?dl=0
You can paste it in a spreadsheet. 139 rows of joint nMonte results. Don't take the labels of the clusters too literally. I made them up to give some idea what is in the clusters.

Pashto has a surprisingly conservative vocabulary (it is one of the most "archaic" of all contemporary Iranian languages), and words of Indic origin are quite small in number. So, retroflexion isn't restricted to the few words of Indo-Aryan origin.

That being said, the Persian borrowings are not affected by retroflexion.

Also, I must take back my comments about retroflex consonants being restricted to Pashto, Ormuri/Burki/Parachi, and Balochi, among the Iranian languages.

Rather, it seems all the Pamiri languages, and Wakhi, also have retroflex consonants.

So, retroflex consonants seem to be found in all East Iranian languages, although I'm sure Ossetian is an exception.

@Dave: "According to the Basal-rich K7, Iran_Hotu packs more ANE and more of some other, perhaps Central Asian, stuff that often looks like Andamanese-related admixture."

Your AdmixQ13 tells a slightly different story. It shows no Andamese-related stuff for either Zagros SubNeolithic or Iran_Hotu. Here is where both differ (Bedouin, Anat_Neol, Andamese and San are <0,01% for both):

So, Iran_Hotu comes across as more "cosmopolitan" in general, but especially more North Eurasian (plus Amerindian) than Iran_Neol. The Steppe_EMBA/Amerindian combo is reminiscent of AG3 (83% Steppe/17% Amerind), but not so much of MA1, given its significant Andamese and Beringian components. Okunevo isn't too bad a fit either here, bringing in Siberian (20%) on top of Steppe (53%) and Amerindian (11%), though also Beringian (9%) that is lacking in Hotu. The additional West Eurasian in Iran Hotu looks somewhat like Karelia/Samara HG, which both comprise approx. 4% Amerindian, though Western (Villabruna) -shifted.

I dare to say that Iran_Hotu incorporates Trans-Caspian Epi-paleolithic arrivals from the Volga and Ural river basins. Remember that during periods of glacier melt, the Caspian Sea extended far further North, up to Samara and Uralsk, giving inhabitants of its glacial shores every reason to sail (paddle) somewhere else before their feet get too wet. The same applies to the Pontic shores, once the overflooded Caspian Sea spilled over into the Black Sea.

From what I understand, the rise in water levels was gradual, not catastrophic. In fact, the Khvalynian region appears to have been locked by ice during the LGM, meaning contact between western Siberia and eastern Europe could only have occurred further north, through the Urals, and indeed there is some evidence for this. The Caspian transgression might have only accentuated the 'no man's land' status of the south Ural-Caspian corridor. After 12,000 YBP, the water regressed, but the north Caspian was left a semi-arid steppe, possibly repopulated from the Caucasus. The 'North Eurasian' sensu 'ANE' probably arrived from north central Asia/southwest Siberia independently in south central Asia/Iran and in eastern Europe (? in Late Glacial), not one from the other. Indeed, it is already present in Satsurblia c. 18kya, before the 'flooding'.

In a previous paper by the same authors, they have examined the Y-DNA of 78 members of Bantu populations in Maputo Province (the southernmost province of Mozambique) and found that most of them belong to the E1b1a1a1-M180 (56/78 = 71.8%) or B2a1a-M109 (11/78 = 14.1%) clades typical of Bantu populations. The other clades observed are as follows:

The members of E2b-M54 (6/78 = 7.7% total) also may be considered typically Bantu, so really only the pair of R-M198 males and the A-M118, B-M112, and E-M34 singletons stand out in this context. They seem quite purely Bantu (at least in the Y-DNA line) in comparison to the Bantus from the Comoro Islands between northern Mozambique and northern Madagascar who have been examined by Said Msaidie et al. (2010); those Bantus had 115/381 = 30.2% probable West Asian/South Asian influence, 39/381 = 10.2% probable Indonesian/(proto-)Malagasy influence, and 8/381 = 2.1% members of subclades of E that are typical of neither Bantus nor West Asians (E*-SRY4064(xM33, M75, M2, M35) and East African/Khoisan E-M293).
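The percentages quoted above are easy to re-derive. The sketch below is a plain arithmetic check, with the Maputo counts taken from this comment: the "pair" of R-M198 males is read as 2 and each "singleton" as 1, which is an interpretation of the wording rather than a table from the paper. The dictionary and function names are made up for illustration.

```python
# Y-DNA counts for the Maputo Province sample, as read from the comment.
maputo = {
    "E1b1a1a1-M180": 56,
    "B2a1a-M109": 11,
    "E2b-M54": 6,
    "R-M198": 2,   # "the pair of R-M198 males"
    "A-M118": 1,   # singleton
    "B-M112": 1,   # singleton
    "E-M34": 1,    # singleton
}

def share(count, n):
    """Percentage of n, rounded to one decimal as in the comment."""
    return round(100.0 * count / n, 1)

n_maputo = sum(maputo.values())
print("Maputo total:", n_maputo)
for clade, count in maputo.items():
    print(clade, count, share(count, n_maputo))

# Comoro Islands figures quoted from Msaidie et al. (2010), n = 381.
n_comoros = 381
comoros = {
    "West Asian/South Asian": 115,
    "Indonesian/(proto-)Malagasy": 39,
    "other E subclades": 8,
}
for label, count in comoros.items():
    print(label, count, share(count, n_comoros))
```

On that reading the seven clades add up to the full sample of 78, so the list of "other clades" accounts for every individual not in the two main Bantu clades.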

So retroflexion is not superficial but quite rooted in Afghanistan overall.

dder is of course ddhEr in Hindi/Bengali etc., meaning much/a lot of/heap/pile. If you notice, the difference is that it's a voiced aspirated retroflex as opposed to just a voiced retroflex.

The word is one of the many that show the deep rootedness of both initial retroflexion and voiced aspirates in north India. I don't think it has a good etymology from Sanskrit. In fact Sanskrit seems more Dravidian in not having much word-initial retroflexion.

Regarding Krram, krri: here the original is a non-retroflex dental and is derived from Sanskrit root

But perhaps Pashto retroflexion here resembles the Jat dialect's penchant for retroflexion of many dentals; even the word for no in Hindi, nA, becomes NA (nnA), a nasal retroflex flap.

Regarding reflexes of English sounds like in television: throughout India, alveolars of English are mapped to retroflexes, and dental affricates like this are mapped to dental plosives. Arabic and Iranian speakers in contrast either map both to dentals or, more rarely, overcompensate and make all of them alveolars.

Looking at the composition of Sintashta and Czech and at the character of the clusters, it seems that Sintashta was shifted towards Central Europe. Did Sintashta people actually come from Central Europe or are there other explanations of such cluster composition of the Sintashta population?

"Direct Steppe admixture" is not very relevant for the core of the West Iranic speakers. Iron Age and Safavid era samples actually indicate that "Steppic admixture" was very low from the beginning. Their ethnogenesis appears to have formed via the Yaz culture in the Northeast and Kura Araxes in the Northwest. At least their genetic make-up looks like Kura Araxes + something else but quite similar, which I predict is the Yaz component. The reason why Kurds score more drift with Steppe populations is not their West Iranic core ancestry but the various Scythian/Sarmato-Alan and Cimmerian settlements in Kurdistan.

Indians, Tajiks, Pashtuns, Pamirs have more direct Steppic admixture, yet based on Fst distances interestingly they are still equidistant (some South Asians even more distant) to Steppic groups because some of their non "Steppe derived" DNA is more divergent than anything found in West Iranics. To give a more extreme example: a half Swede half Kyrgyz will certainly show more direct Viking ancestry via admixture calculators and oracles, but ultimately a Sicilian will autosomally look closer to a real Viking. Not saying that East Iranics and Indo_Aryans are half Swedes half Kyrgyz, far from it. Just using a much more extreme example to make my point clear.

There is a lot of historical background for this genetic difference. The East Iranics were often labeled by the West Iranics as "Turans", in contrast to Aryans (as they called themselves, the noble civilized people), despite the East Iranics actually preferring to be called Aryans over Turans too. Turanians means people of the dark lands (still living nomadic without civilization).

The East Iranics tried to take control of the West Iranic empires quite a few times but most of the time failed and got absorbed into the local populations (mostly among the Medes and their successors the Parthians).

"Turan and Turanian can designate a certain mentality, i.e. the nomadic in contrast to the urbanized agricultural civilizations. This usage probably matches the Zoroastrian concept of the Turya, which is not primarily a linguistic or ethnic designation, but rather a name of the infidels that opposed the civilization."

Even in historic times there was a geopolitical divide, and as we know, geopolitical divides can often turn into, or have, an ethno-cultural background. That explains the genetic difference between West and East Iranics. I mean, South and North Slavs are not very akin either.

So the Steppe admixture has never been very high among the West Iranic groups, because it never really played a big role in their formation and ethnogenesis.

As some other users have also pointed out, it is possible and not unlikely that the Anatolian_Neo admixture in some South_Central Asians is linked to West Iranic admixture there. At the end of the day the Persians expanded their tongue all the way into Tajikistan and half of Afghanistan while the Medes and their successors the Parthians did the same with Pakistan (Balochistan). There is definitely some post pre modern age West Iranic admixture in South_Central Asia.

In my most recent k=8 model I too noticed a relation between Sintashta and Czech. However I prefer a different interpretation. In my model both pops are present in a cluster 2 (Czech=0.897, Sintashta=0.648). I labeled this cluster as a N_European cluster. Several other Bronze Age steppe-like pops are also present in this cluster: Unetice (0.830), Bell_Beaker_Germany (0.605), Hungary_BA (0.526), Corded_Ware_Germany (0.515). Compare, the two Yamnayas have 0.245 and 0.200; they do have greater presence (0.430, 0.558) in my cluster 6, which I labeled Russian. More knowledgeable people than me may put this into a prehistoric perspective. I am not blind to the possibility that we witness an artefact due to the very imbalanced sampling distribution; the N_Euro cluster is both very numerous and very tight. My results are in the file Dstats_fuzzy8c.txt
https://www.dropbox.com/sh/4njpy307pui1kpq/AACx32eLw9s363IrqhsW4wifa?dl=0
Paste them into your spreadsheet.

I also have a few methodological comments. My models have not been very stable. This is caused by the fact that we have less than 150 Dstats to work with. I expect that stability will be better when we have 200+ Dstats. One of the nice things of fuzzy clustering is that you can relax this problem by increasing the amount of fuzziness. Until yesterday I have used the default value m=2 for the fuzzifier parameter; I suspect you have done the same. In my latest model I have increased the fuzzifier to 2.5. I am pleased with the result; I did not see the N_Euro cluster before.
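What raising the fuzzifier does can be seen from the standard fuzzy c-means membership formula alone, without running a full clustering. The sketch below assumes a single sample at hypothetical distances 1 and 2 from two fixed cluster centers and recomputes its memberships for several values of m. It is plain Python written for illustration, not fclust code, and the function name is made up.

```python
def membership_row(distances, m):
    """Fuzzy c-means memberships of one sample, given its distances to
    each cluster center and the fuzzifier m (must be greater than 1)."""
    power = 2.0 / (m - 1.0)
    # Closer centers get larger weights; a larger m flattens the contrast.
    inv = [d ** -power for d in distances]
    total = sum(inv)
    return [v / total for v in inv]

# Hypothetical sample: twice as far from cluster B as from cluster A.
distances = [1.0, 2.0]
for m in (1.5, 2.0, 2.5, 3.0):
    row = membership_row(distances, m)
    print("m =", m, [round(u, 3) for u in row])
```

At m=2 the nearer cluster gets a membership of 0.8. At m=2.5 that share drops and the memberships move closer together, so samples are spread more evenly over neighbouring clusters and the solution depends less on any single sample.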

@huijbregts: Interesting that all of the pops listed by you as clustering with Sintashta (CW_Germ., BB-Germ., Unetice, Hung_BA) are from Central Europe. I had earlier remarked that CW_Poland (misnamed, archeologically it is clearly early Unetice) seems to have quite the admixture required, and the chronological placement, to explain the additional CE-like element in Sintashta. Any chance to get D-Stats on that sample (maybe others here can help..)? In that respect, breaking down Hung_BA into its EBA and LBA components, which differ quite a lot from each other, might also help for a clearer perspective.

Interesting also that Unetice and modern Czechs seem to cluster together closely. Unetice is a northern suburb of Prague. There are various indications that Bohemia ("home of the [Celtic] Boii") has been far less affected by Germanic and Slavic migrations than most other parts of Central Europe.

A note on terminology: The standard Central/ Western European (HU/IT to British Isles) chronology places the CA-BA transition at around 2,100 BC (Hungary a bit earlier, Ireland and Italy a bit later than that). Hence, CW and early BB aren't Bronze Age but Chalcolithic (~late/ final Neolithic in German chronology).

Even though this is two weeks late, I want to say thank you again for answering my inquiry about the West Eurasian admix in SE Asians.

So it seems that although SE Asians like Burmese, Cambodians, Malays and Thais often show "South Asian" in ADMIXTURE components, there is actually very minor to negligible West Eurasian. It makes me think now that the "South Asian" is not really admixture from South Asian populations at all but might be ASI/ASI-like, which is probably an ancient and native component in SE Asia.

Or is it possible that geography like mountainous, remote terrains prevent the ANI gene flow from areas like Bengal to Burma?

If geography is the reason for the little to no West Eurasian admix in SE Asians, this might be the same for many West Africans like Senegambians, Malians and Nigerians, who despite being in close geographical proximity to North Africa have little to no Eurasian admixture from there. And this probably has to do with the Sahara being the geographical barrier preventing most of the gene flow, just as the mountains between Bengal and Burma prevent gene flow there.

These are all my assumptions based on what I understand. If I am wrong, please correct me.