Tuesday, October 13, 2015

New PCA format

From now on, every time a new dataset of ancient West Eurasian samples is made available online, I'll run it like this.

Please note that the plots above include the majority of recently published ancient samples, and yet they are not affected by projection bias, otherwise known as shrinkage. If you're confused by some of the acronyms in the PCA key, see here.

My only nitpick/suggestion would be to use a wider range of colors. For example, it's impossible for me to tell apart Corded_Ware and Sintashta, or Armenia_BA and Europe_EN, etc. Yes I'm colorblind, but I think it would benefit normal color vision people too?

A bit off-topic but isn't there a way to get NW Africans in one PCA, just to check the [dis-]similarity with Early Neolithic European Farmers? It wasn't possible with ANE K8 earlier this year, so it might still not be for now.

The first one looks more familiar, just rotated 90º counterclockwise. But it "fixes" something that Maju has complained about before: the amplified east-west differences vs. the north-south ones (I'm personally neutral to either representation).

A few small questions:
- Are Sardinians and Basques included in this PCA? If so, they seem to cluster with modern populations and quite apart from Neolithic ones.
- I assume that in Western Europe the cut between northern and southern Europeans is French vs. Spaniards/Basques (if present), but moving to the east of Europe, are Hungarians and Croatians northern or southern Europeans?
- The second PCA format seems to be less about ANE (in the Y axis) and more about ENA and SSA? The Caucasus seems to plot "south" of EEFs (or South Europeans).

If anyone has trouble with the colours I recommend just to download it and invert colour http://i.imgur.com/FgPh9D3.png

As noted, with this plot, the underlying West Eurasian modern shape seems quite different. E.g. no Sardinian and Basque distinction if they're included, for one, no real separation of two clines through modern Europe for another.

The plot showing eigenvectors 1 and 2 is very similar to the old one, it's just that the axes are flipped. ENA influence and maybe SSA (or some kind of Arabian signal) are indeed visible in eigenvector 3.

Basques and Sardinians weren't included in this run. I'll update the PCA later today with them, and will also post a flipped version of the 1&2 plot.

Alberto: But it "fixes" something that Maju has complained about before: the amplified east-west differences vs. the north-south ones (I'm personally neutral to either representation).

Looking at these graphs (and the same compressed analogues in the papers), it kind of does strike me that the visual impression is that most European populations should be closer to EHG than WHG, based on position.

Yet when we looked at the stats like D(EHG,WHG)(X,Chimp), the signal is that many populations should be closer to WHG than EHG on this axis.

There could be other reasons why a PCA based on IBS might not match up with that, probably drift, of course. Ultimately, maybe there is no idealised solution that perfectly implies all the relationships, and this sort of presentation with a long PC1 axis and shorter PC2 strikes me as pretty good.

@Matt: It's because PC2 is "squished" compared to PC1. If you read the numbers on the axes, the distance left to WHG is really about the same as the distance up to EHG, even though it looks bigger. Since EHG is also half of that distance to the left, it's really further away. In (rough) numbers: WHG is 0.064 away in PC1 and 0 away in PC2, so it is 0.064 away. EHG is 0.032 away in PC1 and 0.064 away in PC2, so it is *more than* 0.064 away.
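The rough numbers in this comment can be checked with a straight-line distance in the PC1/PC2 plane. This is only a sketch: it assumes both eigenvectors are on the same scale (which the comment takes for granted), and the names `offsets` and `pca_distance` are made up for illustration.

```python
import math

# Rough per-axis offsets quoted in the comment (read off the plot axes).
offsets = {
    "WHG": (0.064, 0.0),    # (distance along PC1, distance along PC2)
    "EHG": (0.032, 0.064),
}

def pca_distance(d_pc1, d_pc2):
    """Euclidean distance in the PC1/PC2 plane, treating both axes as equally scaled."""
    return math.hypot(d_pc1, d_pc2)

distances = {name: pca_distance(*xy) for name, xy in offsets.items()}
for name, dist in distances.items():
    print(f"{name}: {dist:.4f}")
```

With these inputs EHG comes out at roughly 0.072 against 0.064 for WHG, which matches the "more than 0.064" estimate.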

I'm still curious about that phenomenon of dilution of EHG vs. WHG affinity in Europe. By observing it and by the (apparently reasonable) assumption that ANE in Europe (or at least north and western Europe) came from Yamnaya, one could assume also that stats like:

D(Mbuti, X)(Georgian, LBK_EN)

Would show a similar pattern, that is: Yamnaya and CW being clearly closer to Georgian but modern Europeans shifting back to LBK_EN. But while I haven't seen a set of stats showing this, I think that it's clear that all modern European populations (except maybe Sardinians) are going to fall in the side of Georgian (or maybe I'm wrong?). If so, does it mean something? Would there be some genetic explanation for it or we would need a historical one?
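For readers unfamiliar with the notation, here is a minimal sketch of how a statistic like D(Mbuti, X)(Georgian, LBK_EN) can be computed from per-SNP allele frequencies, using one commonly used allele-frequency form of the D-statistic. The function name and the toy frequencies are hypothetical, no standard error (block jackknife) is computed, and sign conventions differ between tools, so treat the sign here as an assumption.

```python
def d_statistic(w, x, y, z):
    """D(W, X)(Y, Z) from per-SNP allele frequencies (equal-length lists of floats in [0, 1])."""
    if not (len(w) == len(x) == len(y) == len(z)):
        raise ValueError("all populations need the same number of SNPs")
    num = 0.0
    den = 0.0
    for pw, px, py, pz in zip(w, x, y, z):
        num += (pw - px) * (py - pz)
        den += (pw + px - 2 * pw * px) * (py + pz - 2 * py * pz)
    if den == 0:
        raise ValueError("no informative SNPs")
    return num / den

# Toy, made-up frequencies at four SNPs (NOT real data).
mbuti    = [0.0, 0.0, 0.0, 0.0]
test_pop = [1.0, 1.0, 0.0, 1.0]
georgian = [1.0, 1.0, 0.0, 0.0]
lbk_en   = [0.0, 0.0, 0.0, 1.0]

d = d_statistic(mbuti, test_pop, georgian, lbk_en)
# Under this sign convention a negative value means the test population
# shares more alleles with the third population (Georgian) than with the fourth (LBK_EN).
print(round(d, 3))
```

Swapping the last two populations flips the sign, which is why the order inside each pair of brackets matters when comparing tables from different sources.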

Off topic: About Steppe ancestry in SC Asia. TreeMix says Bronze Age Steppe genomes are a better proxy for the Western ancestry in SC Asians than anyone around today, but all their Western ancestry can't be Steppe-derived. And I think we don't have good proxies for West Asian or unknown types of Western ancestry in SC Asia. So, even qpAdm estimates are probably off. What do you think? Are estimates like 43% EEF, 47% Yamnaya, and 10% Dai reliable?

@Tobus, I've never been 100% sure about whether the eigenvectors scale that way, but I think you are right. Following your suggestion, I had a go at rescaling the plot - http://i.imgur.com/P8w9oPS.jpg (or in a sort of rotational view - http://i.imgur.com/qQtkAuV.jpg)

@ Alberto, I don't know what that D(Mbuti, X)(Georgian, LBK_EN) would show for most populations, whether most modern European populations would fall closer to Georgian or not. Be interesting to see.

The closest thing I could have a go at doing was plotting the difference between the pairs

D(JHN,EHG)(BedB,X)-D(JHN,WHG)(BedB,X)*
D(JHN,Yamnaya)(BedB,X)-D(JHN,Copper Age Europe)(BedB,X)

on a graph http://i.imgur.com/GFuzOgG.png

Where Europeans have generally got an increase in the level of Yamnaya affinity relative to CE European affinity, compared to BedouinB, and have all at the same time got a greater increase in WHG affinity compared to EHG affinity, compared to BedouinB.

*D(JHN,EHG)(BedB,X)-D(JHN,WHG)(BedB,X) seems fairly linear with D(EHG,WHG)(X,Chimp) - http://i.imgur.com/vrRuRcm.png - although when I compared the pairs including ancient populations some were a little off the line.

NW Africans here appear to be not very similar to even European_EN and instead very close to Near East. It's unlike before when they appeared more distinct from everyone and in a more or less intermediate position between Bedouins and Sardinians.

Yes, in general there is an increase in affinity to Yamnaya vs. Copper Age, and to WHG vs. EHG. Though Corded Ware is missing, so it's hard to say if its affinity to Yamnaya vs. CA would be much higher than that of Lithuanians or not. So I'm not sure we can extrapolate and draw any conclusions for now.

But if we get one of those "Georgian-like" samples it would be interesting to check if the affinity to them vs. EEF has a similar pattern as the one to EHG vs. WHG or not.

@Matt: based on your (very interesting) f4 graphs, which suggest that the WHG/EHG ratio remains constant in the Spain_EN → modern Basques "evolution" but that there is an increase in both the WHG and EHG component in these relative to the former, do you think this can be interpreted as a simple {WHG+EHG} extra admixture (from a mystery population) into modern Basques? At least that's what I would conclude, judging on the lack of Caucasus component among my people, which precludes Kurgan influence, but I'd appreciate second opinions.

@Chad: can you document your claim? Because in everything I've seen that is just not correct: Allentoft gets zero, Günther & Valdiosera just negligible individual tiny bits, etc. Not to mention that, in PCAs (more apparent in Europe-only ones), Basques are always at the opposite polarity to Caucasians (or Eastern Mediterraneans, if the former are missing in the PC2 axis, which seems very much correlated to the Caucasus component).

Sort of on topic, sort of off, may be of interest, I was having a look at the FST table from the revised Mathieson et al paper, and I was interested to see if the FST distances from WHG, EHG and Motala from the paper could recreate the same effect as the D stats in creating a PCA that looks like IBS based PCA of West Eurasia.

So - http://i.imgur.com/yeL5EwG.png

Looks like they sort of do, but it seems like there is an effect from very high drifts in some of the samples, like Remedello for the most extreme example, which may be artefacts of inbreeding or the way samples are recovered, pushing a population away from expected position towards Africa, or pushing populations with low drifts towards the HGs.

So I had a look to see if you could adjust for this effect using a regression equation - http://i.imgur.com/StCH7at.png, which is a regression equation connecting distance from Africa (as in theory the mostly neutral outgroup) to distance from WHG, EHG, and Motala, in other unrelated outgroups, Han, Onge and Papuan (note with this I've also included BedouinB, because weirdly it seems to fit the same regression equation as these unrelated outgroups perfectly).

Result comparing the real FST - predicted, as a signal of greater closeness than predicted based on distance from Africa and the relationship between that and FST with WHG, EHG and Motala is - http://i.imgur.com/2BeHht9.png.

Which looks a bit closer, although you can only get so much information out of these FSTs...
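The adjustment described in this comment (fit a line on the outgroups, then take real FST minus predicted FST) could be sketched roughly as follows. All numbers are placeholders invented for illustration, not values from the Mathieson et al. FST table, and the helper names `fit_line` and `residual` are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Placeholder (FST to Africa, FST to WHG) pairs for the unrelated outgroups.
outgroups = {"Han": (0.16, 0.20), "Onge": (0.18, 0.23), "Papuan": (0.22, 0.29)}
slope, intercept = fit_line(
    [v[0] for v in outgroups.values()],
    [v[1] for v in outgroups.values()],
)

def residual(fst_africa, fst_whg):
    """Real FST minus the FST predicted from distance to Africa alone."""
    return fst_whg - (slope * fst_africa + intercept)

# Placeholder West Eurasian populations (made-up values).
targets = {"PopA": (0.14, 0.10), "PopB": (0.15, 0.185)}
residuals = {name: residual(*v) for name, v in targets.items()}
for name, r in residuals.items():
    print(f"{name}: {r:+.3f}")
```

A negative residual here would read as "closer to WHG than distance from Africa alone predicts". The same fit would be repeated separately for EHG and Motala.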

A 50/50 mix of Iberian Copper Age and Central European Bell Beakers (who had R1b-P312) seems to work well for Basques. I don't think any mysterious SHG-like population could arrive in Iberia after 2000 BC unadmixed, and carrying those same R1b subclades. In the first PCA above they plot where they should: far from the Caucasus, because they have half the ANE of most European populations and they have high WHG ancestry. There's no other mystery to it.

@Alberto: I don't see why Iberian Chalcolithic samples, with their lack of lactose tolerance and very different mtDNA pool from modern Basques, can be considered better partial ancestors than Early Neolithic samples, with their typical EEF baggage which is equally distant from modern Basques. But more crucially I don't see why you and others insist on considering the Bell Beaker epiphenomenon as a homogeneous population (someone said of Mathieson's BBs, who are still all from East Germany, "they are all over the place" - naturally: they are not a population but a sect or fashion), much less East German Bell Beakers as ancestral to anything (the BB phenomenon in fact expanded in the opposite direction, being much older in SW Europe, a well known fact). If there is a BB individual with that lineage, he probably imported it from SW Europe, although obviously not from Chalcolithic inland Iberians from the 'páramo' (they are somewhat close and clearly transitional between Mediterranean farmers and Atlantic ones but they are not substantively direct ancestors of modern Basques, nor probably of modern Iberians either).

So I disagree with your assessment 100%.

"There's no other mystery to it".

Yes there is a mystery of extra EHG without Caucasus, i.e. non-Kurgan, in Basque. And probably also affecting North Europe, from the Baltic to Britain: although in these places there is some Caucasus-type ancestry, they still deviate from the simple Neolithic-Kurgan admixture axis and seem to require extra "pure" EHG (or similar), whose source is not yet known. The deviation is different for Basques and North Europeans and the former also require significant extra WHG relative not just to Spain_EN but also to Spain_CA.

There is also the mystery of modern-like mtDNA pools in Paternabidea (middle zone Neolithic Basques) and Gurgy (Neolithic of the Seine Basin, France). The Gurgy study showed that these were closer to post-Gurgy populations than to pre-Gurgy or contemporaries in other parts of Europe, so we have an "arch of modernity" minimally defined by Paternabidea and Gurgy. Sadly we don't have much data from this huge area, nor from other probably related ones like Britain or West Iberia, just some scattered mtDNA and the peripheral Gökhem sample from Sweden, which is often ignored altogether.

There is a big MYSTERY with capital letters and a long row of question marks after it. It's like a huge black hole of missing info, whose reality we can only infer from very fragmentary data, almost all of it mtDNA.

Ok, I guess I deserve that for over generalizing and simplifying. But I equally think you're not seeing the forest because of the trees.

There was an old study of the Basque Country Chalcolithic that found high levels (for the time) of Lactose Tolerance. So obviously those should be better ancestors than the ones from this cave in Atapuerca. But nothing should make us think that autosomally they were very different (and that's what my point was about, mainly, not about Lactose Persistence or specific mtDNA, though a quick look at the Mathieson samples reveals half of them are H3+H1, isn't that close to Basques? Not too important anyway).

About the Bell Beakers, yes, they were not completely homogeneous, but the ones from Germany are (at least many of them) similar enough to get the idea. And again, the origin of (part of) the culture might be in Iberia and expanded to Europe, but I'm not referring to the culture, but to the people who were in Germany at that time and are classified as Bell Beakers. Somehow they moved to Iberia and had a huge impact on the Y-DNA pool (and a noticeable autosomal impact too, apparently). The idea that R1b-P312 went from SW Europe to Central Europe was ok until a year ago or more. But not now. There's a 95% chance that it was the other way around. Let's wait for 100% confirmation.

About the Caucasus component you're still confused by a purely technical problem. Basques have the amount of it that corresponds to their general genetic structure, which is half of that of most other Europeans. And about the extra WHG you are correct, there is extra WHG around Europe when comparing EEF and Yamnaya, or even when comparing Middle Neolithic + Yamnaya. It must have come from unsampled populations that remained mostly WHG along the Neolithic. But no, no pure EHG anywhere near Iberia after 2000 BC. Not possible and not necessary.

In any case, if you want an answer about from which exact people/place modern Basques descend, their exact mtDNA and Y-DNA, etc... I certainly can't answer that. A lot more samples are needed to find the exact ones. I just think there won't be any mystery about it. They will match the general picture we can already see. The fine details still have to wait.

@Alberto: It's not any "old" study but one from 2012, on Chalcolithic Basque (Upper Ebro) "military" cemeteries, where the allele was very apparently fixated in one subpopulation (plausibly more "Atlantic") but totally absent in the other (probably more typical Neolithic, like Spain_MN for example), with only two individuals being heterozygous: http://forwhattheywereweare.blogspot.com/2012/01/caught-in-act-lactose-intolerant-and.html

This is a most important datum that can't just be ignored (but Mathieson does, for instance). Nowhere else has a homozygous TT population been detected so early, although it is indeed possible that there were others in the Atlantic "information black hole" area, for example, in Britain and Ireland, where massive milk consumption began in the Neolithic and continued uninterrupted for many millennia (http://forwhattheywereweare.blogspot.com/2014/02/neolithic-peoples-from-britain-and.html).

So, as far as I can discern, European LCT originated (or first got fixated) in the Atlantic populations and one of them was present in the Basque Chalcolithic, although we still see a lot of CC individuals, so the spread of the allele was still incipient and various subpopulations had strikingly different genetic pools re. this aspect - and therefore should also be different in other aspects, like greater or lesser WHG or even EHG admixture, mtDNA pools (sadly not directly compared to LCT alleles) and maybe even Y-DNA.

It's not about Basques or not only, but about Atlantic Europeans and Western Europeans in general. It just happens that there are some informative data points from the Basque Country that are missing in most other comparable areas of Atlantic Europe.

"About the Bell Beakers, yes, they were not completely homogeneous"...

LOL, we are talking of BB-labeled people from a small European region like the Upper Elbe and Saale, and they are just wildly different from each other. What will it be like when we finally have BB samples from other European regions? Much much more scattered, no doubt, because there is no "Beaker People" but rather Bell Beakers (pots and related items) in many different ethno-cultural realities. It's a cultural phenomenon, not any "people"! And in any case it clearly spread from south to north, largely as a reaction to the Corded Ware disruption, and not at all the other way around.

"About the Caucasus component you're still confused by a purely technical problem. Basques have the amount of it that corresponds to their general genetic structure, which is half of that of most other Europeans".

Again no sources? Where are you getting that silly idea from: a zombie-based "calculator" or something? I'm baffled.

"But no, no pure EHG anywhere near Iberia after 2000 BC. Not possible and not necessary".

It does seem necessary judging by the data (just check Matt's graphs, which are very informative), particularly in the case of Basques: the WHG/EHG ratio seems to remain constant from Neolithic to present but both WHG and EHG (but not 'Caucasus' component) are greater among Basques than among Spain_EN, hence extra WHG seems clearly not enough.

"In any case, if you want an answer about from which exact people/place modern Basques descend, their exact mtDNA and Y-DNA, etc... I certainly can't answer that".

I actually asked Matt, not you (no offense meant, you are of course free to try), and I was only considering autosomal DNA in my question (although mtDNA and LCT are indeed related issues, as discussed above - I'm totally ignoring Y-DNA for lack of sufficient data, although there is a lot of people unhealthily obsessed with it).

@Matt: could you run your f4 analysis for a vertical "Caucasus" affinity axis? That could clarify much of the discussion Alberto and I just had. I personally don't trust zombie-based calculations, they can only be rough approximates because they rely too much on the priors, which in turn are heavily dependent on the subjective criterion of the "expert" pre-selecting them.

If it's those ones, and if you can explain what you mean by: "the WHG/EHG ratio seems to remain constant from Neolithic to present but both WHG and EHG (but not 'Caucasus' component)", maybe I will understand your point.

@Maju, "This is a most important datum that can't just be ignored (but Mathieson does, for instance)."

They didn't ignore that data. It's that they only looked at the data from their genomes. They didn't use other data in their analysis when looking for selection in the last 8,000 years. If they did they would have seen LCT was at modern frequencies in Poland in 500 BC, which is big news, it means sometime between 2000 and 500 BC the frequency went way up.

I like the conversation you guys are having. You're bringing up a lot of questions that are hard to explain. I doubt small details, like in a PCA, ADMIXTURE test, D-stat, F4, etc. can be taken too literally though. IMO they should be taken relatively. Basques have some type of ANE ancestry; that much is clear.

50% replacement in SW France or Iberia after Copper Age seems unlikely to me because the EEF/WHG-signal is so strong. There's a low chance any EHG were anywhere near Iberia. ANE can only come from eastern Europe and must have also carried Teal.

@Krefter: "They didn't ignore that data. It's that they only looked at the data from their genomes."

Which is exactly the same: ignoring the data.

And they inferred a whole narrative about LCT "selection" out of them, which is my greatest complaint. You can't discuss LCT in Europe ignoring the Upper Ebro Chalcolithic data. Or you can (because of freedom, stupidity and all that) but you will reach conclusions that just don't make sense.

"If they did they would have seen LCT was at modern frequencies in Poland in 500 BC, which is big news, it means sometime between 2000 and 500 BC the frequency went way up".

Much of the same seems to have happened in Hungary (but not in Germany, where modern LCT levels are quite older), so what we are seeing is that, even if selection is implicated, there is a lot of relevant info that cannot be ignored when we try to make sense and build a coherent narrative, i.e. a good theory.

"I like the conversation you guys are having. You're bringing up a lot of questions that are hard to explain".

Thank you.

"ANE can only come from eastern Europe"...

We don't know that. Olalde got that Ma1 had 40% Motala admixture, so what if what we call ANE is just a subset of European genetics into Ma1 (Gravettian flow eastwards)? I personally prefer to disregard ANE and focus on more obvious plausible sources such as EHG. A question to ask is if EHG or something like that (maybe a population equidistant between EHG and WHG) existed somewhere not in Eastern Europe (Western Europe, Northern Europe, Italy, even Portugal, who knows?), or migrated westward some time before Kurgan peoples did. So far there's no direct evidence for that but the data seems to point in that direction and the blanks of aDNA are still huge, allowing for a lot of speculation.

@Rob: partly I replied to your question in the last part of my reply to Krefter. But anyhow, my big frustration is that there is a huge blank of data of great potential interest in Atlantic Europe: from Portugal to Denmark there is nothing at all sequenced for autosomal DNA, Gökhem is the only exception but it is just one site on a very specific area. Some of the blanks are huge: most of France, 2/3 of Iberia, all Britain, Ireland, Belgium, Netherlands, Low Germany, Denmark... and the Basque Country too (we have a lot of mtDNA and some for other markers like LCT but not full nuclear sequences).

From the Basque Neolithic mtDNA one can easily infer a rather abrupt cline in just some 100 Km between the Ebro and the coast: one is roughly like ATP, the other totally Epipaleolithic, and in between there is one site with a totally "modern" mtDNA pool (Paternabidea). Until some months ago it was the only one of that age; now there is another some 1400 Km NE, in Gurgy, North France. In between? No data whatsoever. And in any case we only have mtDNA, not nuclear DNA.

Another datum we get from the fragmentary ancient Basque data is the presence of an LCT+ subpopulation, with the allele clearly fixated (homozygous). So we are getting a lot of tidbits of info that suggest that the oldest area of "modern" genetics was maybe between Pamplona and Paris... among other areas, as the data blank also extends to the north and south along the Atlantic coasts. And these Atlantic areas were clearly central in the Megalithic period particularly but also later in the Bell Beaker and Bronze ages. And in all that time there is nothing obviously Kurgan in the area (except North/East of the Rhine/North Sea).

So nope, I'm not thinking of any mythical island now sunk (actually I suspect Atlantis is on land, right before everyone's eyes, not far from Lisbon, but it's much less impressive than all the romantic hype "made in America" - so just a side note in this discussion) but of real peoples with complex realities of the Atlantic Neolithic and Chalcolithic.

... "you appear to have some pretty deep-seated misunderstandings (based on that Balto-Slavic Genetics paper and impact on the Balkans)".

You may want to discuss on that in that entry instead of placing defamatory labels on me just because you disagree with something I said. That's why there is a comment section.

As for the other aspect, it was certainly not my intention to "defame" you. But your summary on your blog missed a very big point - the Iron Age Bulgarian which clusters near modern Spaniards. On the other hand, modern Balkan groups are shifted toward east Central Europeans. This can only mean a substantial Slavic impact, admixed with Paleo-Balkan groups. I think you mistakenly took modern Greeks as a kind of relict population, not realising that they too absorbed considerable northern elements.

That Balto-Slavic study was strongly criticized on the Russian forums as a low-quality study which is about nothing. Actually they didn't even use any aDNA to compare the pre-Russian population with modern Russians in East Europe.

@Rob: Do you want to discuss Medieval Slavs here or there? I ask because there is a (most reasonable) clause on top of the comment box that says: "Stay strictly on topic at all times", and it is my intention to respect it (within reason). I fail to see how Medieval Slavs are relevant here, really.

@Krefter: you are right about the blanks being more widespread than the notorious Atlantic one I mentioned. But I must also say that, in Europe, this seems to be the most relevant one of all: it's very large and it should be very informative once we begin getting data. In the wider context probably the West Asian Neolithic blank is the other huge blank with major implications (the NW Anatolian new sample does not clarify anything from Central Anatolia to the East).

The Balkans are important but I miss their Epipaleolithic data rather than Neolithic one (Hungary's Starcevo and Turkey's NW early farmers give a good bracket of confidence about what was in between), but then again the Dimini-Vinca intrusion probably altered the genetic landscape and we know nothing about them in the genetic aspect either. Other regions (Italy, West Germany, more Baltic stuff, North Africa even) could also provide useful info but they seem less urgent to me (except Michelsberg culture, that I believe quite important for Central and mid-West European Chalcolithic genetic changes but would loosely link with that wider "Atlantic" blank).

There's a lot of basal R1, R1a, and R1b in Iran. I have huge blanks in data but I doubt anyone will top Iran. It's pretty interesting. I expect "Steppe" R1b-L23 and R1a-M417 to be from EHG but I wouldn't be surprised if it is from West Asia considering the basal R1s in Iran.

Where R1a-M417 and R1b-L23 ultimately originated doesn't matter much, because it seems very likely they both expanded from Russia/Ukraine circa 3000 BC. 90%+ of R1b in West Asia is R1b-L23 (Z2103?), which could have moved south from Russia with some IE languages (Anatolian, Armenian). R1a-Z93 takes up 90%+ of R1a in Asia and quite obviously is from the Steppe.

Thank you. I think this was more or less known, but it's great to have this good summary right there.

An origin of R1 around Iran is indeed the best candidate. I do expect that modern R1a (M417) and R1b (L23) also came from somewhere near the South Caspian area, though they could be from EHGs too. It won't really matter that much (at least for what I've been saying), but I think that for others it will matter a lot if indeed it turns out to have a southern origin. We should know soon enough.