This seems like a good starting point for the new EURASIAN-DNA-CALC I have in the works.

Relative to the existing EURO-DNA-CALC, doubling the number of ancestral populations (from 3 to 6) and increasing the number of SNPs (by 3 orders of magnitude) introduce some obvious computational problems. I have some ideas on how to resolve them, so stay tuned.

APPENDIX

For the sake of completeness here are the ADMIXTURE runs for K=3 to K=5.

At K=3, the three major races (Caucasoid: green, Mongoloid: red, Negroid: blue) emerge.

At K=4, the Caucasoids are split into West Eurasians (red) and Central Asians (purple).

At K=5, the West Eurasians are split into Europeans (yellow) and West Asians (blue).

PS: I will probably do some ADMIXTURE runs for K=7 and higher in the next few days; the results will be posted in this blog post as an update.

UPDATE (K=7)

The Druze get their own cluster (pink), in which Druze individuals have an average membership of 65.4%.

29 comments:

The Central Asian component in the French seems very high. Overall, these admixture levels are similar to those in the admixture analysis of Davidski (Polako). Spain in this experiment would have around ~64-65% European, ~3-4% North African, ~17% West Asian, ~13% Central Asian, and the rest split between SSA and East Asian.

It seems that the population clines are geographical rather than ethnolinguistic, and go back at least to Neolithic times. (Until very recently the vast majority of people were illiterate countryside inhabitants who would, with time, adopt the language of whichever ruling class could afford to build a solid state, except of course in very remote areas and in areas inhabited by the same linguistic group as the ruling-class newcomers.) Bronze Age migration waves, and to a lesser extent historical ones, could only diffuse some newer component along a gradient that decreases with distance from the "Bronze Age newcomers' homeland". Still, all this being said, the disparity in the Central Asian green component between the Indo-European-speaking French and the Vasconic-speaking French Basques is rather suggestive.

My question is: could we determine the spatial as well as the temporal "homeland" of each component? And could we consider the homogeneity of the Yoruba (100% Sub-Saharan African) and the Han (99% East Asian) as "proof" that the Africanid (aka "Negroid") and Mongoloid biocultural "homelands" would be, respectively, Africa and Eastern Asia? Thanks.

It is obvious (once again) from the uniform distribution of a very small and very similar amount of the "Mongoloid component" in most of the West Eurasian populations at K=3 that it isn't actually a Mongoloid component in them but a misleading result of using relatively isolated and drifted populations like Sardinians and Basques. In Mozabites, OTOH, the same allegedly "Mongoloid component" is "pretty diminished" due to the significant Negroid admixture.

At K=7, the Druze split off from the W Asian cluster. I think it will be feasible to do at least one K a night for a while, so let's see how many of these populations will be re-discovered from the data.

What I mean is that at K=6 the other West Europeans have a significant West Asian component that is lacking in the Orcadians and Russians, e.g. French (13%), French Basque (12%).

What I've gathered from Polako's results is that:
1. The Irish/British would be similar to the French, but the West Asian component is under 10%
2. The North Russians have less than 2% East Asian component

So the Orcadians are more similar to North Russians than to any North West European population, and ergo are not a good proxy sample for NW Europeans.

"What I've gathered from Polako's results is that:
1. The Irish/British would be similar to the French, but the West Asian component is under 10%
2. The North Russians have less than 2% East Asian component"

Where have you gathered that?

In Polako's results the Orcadians are similar to the Norse, Swedes, Germans, French and British.

They are nowhere close to the Northern Russians.

The Germans, Swedes, Poles and Belorussians all fall between the Orcadians and the North Russians.

I don't know what he means by "East Asian", but it's impossible to arrive at such a low admixture estimate for this population.

My guess is that he is using a high-K estimate. A good analogy is the Mozabites, who have substantial Sub-Saharan influence at K=3 to 5, which is diminished at higher K when their own cluster emerges. Their own cluster is actually a partially inbred composite of majority Caucasoids and minority Negroids.

This is evidenced from the fact that at K=5 they are 25.2% Negroid, which diminishes to 7.3% at K=6. Where did the rest go? It got incorporated into the Mozabite/N African cluster which emerged at K=6.
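The arithmetic behind that shift can be sketched in a few lines, using only the two figures quoted above. The variable names are illustrative, and treating the whole difference as "absorbed" by the new cluster is the interpretation given here, not something the numbers prove on their own.

```python
# Mozabite "Negroid" membership (%) at successive K, as quoted above.
negroid_at_k5 = 25.2
negroid_at_k6 = 7.3

# Percentage points that no longer show up as Negroid at K=6.
absorbed = round(negroid_at_k5 - negroid_at_k6, 1)

# The same difference expressed as a share (%) of the K=5 estimate.
share_of_k5 = round(100 * absorbed / negroid_at_k5, 1)

print(absorbed, share_of_k5)
```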

Let's go to Russians and examine their East Asian component at successive K in %:

Fantastic blog! It is unrealistic to separate individual human populations of the same region (say, Europe) by ADMIXTURE (or any other similar structure-like approach), just as was done in the cattle paper. It is likely that only populations with a lot of drift, the Kalash in Pakistan for example, will get their own "colour". The populations just are not different enough. The new components at higher Ks will be distributed across populations (with clines). Moreover, at these very high K values the algorithm will probably not converge (meaning that parallel runs will not arrive at the same result). To save calculation time you can thin the dataset by getting rid of some LD, using PLINK for example. See the ADMIXTURE manual.

I also know there is quite an amazingly large difference between the "North Russians" from the reference population and his Russian 23andMe project members, who somehow appear almost as distant from those "North Russians" as they appear from Germans, but appear very close to Belorussian and Polish 23andMe profiles.

"Pconroy" would say: "Northern Russians appear to be not a good proxy for Russians!" ;-)

You recall this map I did from his 23andMe thingy? http://img4.imageshack.us/img4/519/500ksnp.jpg

The 23andMe Russians are all in the region where Russia overlaps with Belorussia. All the other parts, close to the 23andMe Finns and the Chuvash reference, are the "Northern Russians".

"How are e.g., Pathans supposed to fit into this scheme, which seems to be lacking the Central Asian component?"

I think Pathans are more like South Asia. It's totally lacking ;) But that is already visible in the name: "K5 Intra-West-East-North-Asian:"

And the anchor list too. There is no anchor in that region to mark Iranian tribes like the Pathans.

"As a rule of thumb, we have found that 10,000 markers suffice to perform GWAS correction for continentally separated populations (for example, African, Asian, and European populations, FST > 0.05) while more like 100,000 markers are necessary when the populations are within a continent (Europe, for instance, FST < 0.01)."

Dienekes, I see you're running 540,814 markers. A third of them are European. So don't you need more "markers" to pick out populations within Europe, for instance? In any case, it's very interesting that a distinct population for the Druze was resolved.
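The quoted rule of thumb can be written down as a small lookup. This is only a sketch of the two cases the quote states; the function name is hypothetical, and the quote says nothing about FST values between 0.01 and 0.05, so the function returns None there rather than guessing.

```python
from typing import Optional


def suggested_marker_count(fst: float) -> Optional[int]:
    """Marker count suggested by the quoted rule of thumb.

    Returns None where the quote gives no guidance (0.01 <= FST <= 0.05).
    """
    if fst > 0.05:
        # Continentally separated populations.
        return 10_000
    if fst < 0.01:
        # Populations within one continent, e.g. Europe.
        return 100_000
    return None


# Marker count of the run discussed here, compared with the within-continent case.
markers_in_run = 540_814
enough_for_europe = markers_in_run >= suggested_marker_count(0.005)
```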

As K increases, could you let us know the termination criteria for this run? Are you converging or exceeding "N" iterations?

Thinning the marker set for linkage disequilibrium would seem to be laborious. ADMIXTURE is begging to be set up for parallelization on multiple platforms. Anybody know if this is in the works?

Another team seems to have done this. Anybody heard of parLEA? The paper is open source:

Sorry Dienekes. Your marker size is OK and just above the "rule-of-thumb" of > 100,000 markers for the Euro population.

Again, this speaks to a runtime/marker size requirement limitation for ADMIXTURE. At K=7, sounds like you're taking at least a day. So for a minimal data set for Eurasia~=500,000, at K=7, you're taking about a day.

Those admix runs all depend on which ethnic groups are used, and from what geographic region.

Dienekes is using one African reference, the Yoruba, and one East Asian reference, the Han Chinese. West Eurasians, Central South Asians and North Africans probably have admix from Africa, and other parts of Asia other than from the Yoruba, a West African ethnic group, or the Han Chinese, one ethnic group from East Asia. That is the problem 23andMe have encountered with American blacks and mixed race people; using one reference from Africa, and one reference from Asia does not work well in finding out admixture or racial composition. With Europeans, no one has been able to tease out Mesolithic, Neolithic, Indo-European, North African, Islamic Middle Eastern, Central Asian or East Asian contributions to the makeup of Europeans. The best that is done is ambiguous labels like Mediterranean, high in Sardinian Islanders and so on.

My interest in admixture is knowing what is what, and when, and to whom can the admix be attributed. I would like to know how much contribution came from the Neolithics, and separated from later Middle Eastern contributions. After all the immigration events are separated by thousands of years.

"Dienekes, I see you're running 540,814 markers. A third of them are European. So don't you need more "markers" to pick out populations within Europe, for instance? In any case, it's very interesting that a distinct population for the Druze was resolved."

The trouble is that Europeans have relatively low distances from each other, so the subdivisions occur among other populations first.

It's possible to speed it up of course by reducing either the likelihood convergence criteria or pruning the markers. That will no doubt allow most populations to be resolved, but the admixture estimates will suffer (inferred "mixedness" increases as the number of markers decreases).

Hi again. Marnie: thinning the dataset is very easy and quick. See --indep-pairwise in PLINK. After that you just use --extract with the plink.prune.in subset. For a set of 1000 individuals and 600,000 SNPs it takes a few hours.
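As a rough sketch, the two PLINK steps might be assembled like this. The file names are hypothetical, and the window, step and r² values (50, 10, 0.1) are illustrative defaults rather than the settings anyone in this thread used. Note that the second step uses --extract, PLINK's flag for filtering variants by a list. The function only builds the command lines; it does not run PLINK.

```python
from typing import List, Tuple


def ld_prune_commands(
    bfile: str,
    out: str,
    window: int = 50,    # window size in SNPs (illustrative)
    step: int = 10,      # SNPs to shift the window by (illustrative)
    r2: float = 0.1,     # pairwise r^2 threshold (illustrative)
) -> Tuple[List[str], List[str]]:
    """Build the two PLINK command lines for LD-based thinning."""
    # Step 1: find SNPs in approximate linkage equilibrium.
    # PLINK writes plink.prune.in and plink.prune.out by default.
    find_snps = [
        "plink", "--bfile", bfile,
        "--indep-pairwise", str(window), str(step), str(r2),
    ]
    # Step 2: keep only the SNPs listed in plink.prune.in and
    # write a new binary fileset.
    extract_snps = [
        "plink", "--bfile", bfile,
        "--extract", "plink.prune.in",
        "--make-bed", "--out", out,
    ]
    return find_snps, extract_snps


find_cmd, extract_cmd = ld_prune_commands("eurasia", "eurasia_pruned")
```

Each list can be passed to `subprocess.run(cmd, check=True)` on a machine where PLINK is installed.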

"My interest in admixture is knowing what is what, and when, and to whom can the admix be attributed. I would like to know how much contribution came from the Neolithics, and separated from later Middle Eastern contributions. After all the immigration events are separated by thousands of years."

Well, we all want this.

But I doubt that this is possible without knowing the DNA of exactly those mentioned people.

If I imagine mixing tea, coffee, coke, juice etc...

There is no way to sort out at what time the coke was put in.

It's also questionable whether it's possible to sort out if the "sugar" component is from the coke or if sugared coffee was involved.

And sometimes it's even worse. It's still disputed whether something like "Indo-Europeans" ever existed as a people or whether this is only a language and culture that spread. As long as this problem is not sorted out, the question of what the DNA of the Indo-Europeans actually is, is kind of off.

I want to know some other things too: why do all Europeans have the same mothers (mtDNA) but different fathers (Y-DNA)? ;) Yeah, most of us tend to think in Y-DNA because it seems to make sense, and ignore mtDNA because it does not seem to make any sense at all.

Who spread the blue eyes mutation all over the northern half of Europe? It is an externally visible indicator that autosomal DNA must have a North/South split, at least in one of the many layers. And all that in the last 6k years (the blue eyes mutation is meant to have happened 6k-10k years ago in the Ukraine, and ALL modern-day blue-eyed humans are supposed to have inherited it from the same person... kind of unbelievable).

"The trouble is that Europeans have relatively low distances from each other, so the subdivisions occur among other populations first."

Your K=7 run likely indicates that the populations of Europe still have real admixture in them. The Orcadians have both Euro and Central Asian components in them, and the Central Asian component is higher for them than the French. As someone with a partly Scottish background, I think that's real.

The French have a higher subcomponent of Western Asian in them. Perhaps that's a result of either a refugium population infusing into France from Corsica and Sardinia or Greek (Marseilles) influence. Probably both.

Maybe that's too simple, but there's a hint of both in your K=7 run. (Actually, you can see West and Central Asia in Euro populations at K=6.)

I don't mean to get on your case, Dienekes, but it would be good to know runtime, platform type and convergence (yes, no) for these runs.

might be:

Thanks for the recommendation about pruning. What are your observations about loss of accuracy? How much redundancy is there in a data set like the one Dienekes just ran?

Also, given the massive amount of raw computer power out there, again, this problem should be parallelized. Forget about messing around with datasets! ADMIXTURE, where is your option for parallel runs??!!

Marnie: I have not seen any loss of (in fact, any change in) results after pruning, at least for ADMIXTURE. In fact, LD pruning is recommended both by the clustering algorithm writers (see the STRUCTURE or ADMIXTURE manuals) and even more so for smartpca of EIGENSTRAT. About parallel runs of ADMIXTURE: I'm not sure what you mean. Distributing a single run over different cores? If yes, then I don't really see a need for that. In a normal setting you would anyhow have to do, say, 100 runs at each K to see how the log-likelihoods behave, so that's where you do it in parallel. You run, say, 1000 runs (10 different Ks) at the same time on 1000 cores.
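A minimal sketch of that "many independent runs" layout, under some assumptions: the input file name, the K range and the seeds are hypothetical, and the log-likelihood values are expected to be collected separately (for example from each run's log) and passed in as a plain dictionary. The code only builds the job list and picks the best seed per K; dispatching the jobs to cores is left to whatever scheduler is available.

```python
from itertools import product
from typing import Dict, Iterable, List, Tuple


def admixture_jobs(bed: str, ks: Iterable[int], seeds: Iterable[int]) -> List[List[str]]:
    """One ADMIXTURE command line per (K, seed) pair; runs are independent."""
    return [
        ["admixture", f"--seed={seed}", bed, str(k)]
        for k, seed in product(list(ks), list(seeds))
    ]


def best_seed_per_k(logliks: Dict[Tuple[int, int], float]) -> Dict[int, int]:
    """For each K, return the seed whose run reached the highest log-likelihood.

    `logliks` maps (K, seed) to the final log-likelihood of that run.
    """
    best: Dict[int, int] = {}
    for (k, seed), ll in logliks.items():
        if k not in best or ll > logliks[(k, best[k])]:
            best[k] = seed
    return best


# Hypothetical example: 3 values of K, 2 seeds each -> 6 independent jobs.
jobs = admixture_jobs("eurasia_pruned.bed", ks=[5, 6, 7], seeds=[1, 2])
```

Since ADMIXTURE names its output after the input file and K, runs that share a K should each get their own working directory so they do not overwrite one another.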

Dienekes' Anthropology blog is dedicated to human population genetics, physical anthropology, archaeology, and history.

You are free to reuse any of the materials of this blog for non-commercial purposes, as long as you attribute them to Dienekes Pontikos and provide a link to either the individual blog entry or to Dienekes Anthropology Blog.

Feel free to send e-mail to Dienekes Pontikos, or follow @dienekesp on Twitter.