Thursday, October 20, 2011

Comparing different ADMIXTURE runs using Zombies

My idea of using zombies with ADMIXTURE is the gift that keeps on giving. Remember that "zombies" are synthetic individuals created from ADMIXTURE output, representing the K inferred ancestral components. They can be viewed as hypothetical ancestral individuals representing each of these K components without any admixture from any of the others.
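To make the idea concrete, here is a minimal sketch of how zombies could be generated from ADMIXTURE's allele frequency output (the .P file, one row per SNP and one column per component). It assumes each zombie is drawn by sampling two alleles per SNP from a single component's frequencies; the function name `make_zombies` and its parameters are hypothetical, and this is not necessarily the exact procedure used for the zombies in this post.

```python
import random

def make_zombies(freqs, n_per_component, seed=0):
    """Draw synthetic unadmixed individuals from per-component allele frequencies.

    freqs: list of rows, one per SNP; each row holds K allele frequencies
           (the layout of an ADMIXTURE .P file).
    Returns a dict mapping component index -> list of genotype vectors,
    where each genotype is a list of allele counts (0, 1 or 2).
    """
    rng = random.Random(seed)
    k = len(freqs[0])
    zombies = {c: [] for c in range(k)}
    for c in range(k):
        for _ in range(n_per_component):
            # Two independent allele draws per SNP, both from component c only.
            genotype = [
                int(rng.random() < row[c]) + int(rng.random() < row[c])
                for row in freqs
            ]
            zombies[c].append(genotype)
    return zombies
```

Because every allele of a zombie is drawn from one column only, the resulting individual carries no admixture from the other components by construction.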

An interesting problem that often comes up is how to compare results across different ADMIXTURE runs. I can think of at least three different applications of this:

To compare components across different K; for example, how does a "West Asian"-centered component at K=5 differ from a similarly-centered component at K=12?

To compare components across different datasets; for example, how does a "West Asian"-centered component inferred from an existing dataset (e.g., the current Dodecad v3) differ from a "West Asian"-centered one from a new dataset (e.g., the upcoming Dodecad v4, which will also be trained on the valuable new populations of Yunusbayev et al. 2011)?

To compare components across different projects; there has been a proliferation of different ancestry projects since the launch of Dodecad nearly a year ago, and since all of them use slightly different individuals/SNPs/terminology, it is quite useful to be able to gauge how a component from one project maps onto components in other projects.

I then inferred the ancestry of the MDLP zombies using Dodecad v3, and vice versa. Since Dodecad v3 also includes populations (e.g., Africans) not considered by MDLP, I did not try to map those onto MDLP.
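For readers who want to see what "inferring the ancestry of a zombie" with a fixed set of components might look like, here is a minimal sketch. It assumes the target calculator's allele frequencies are held fixed and only the individual's admixture proportions are estimated, using the standard EM update for this model. The function name `estimate_proportions` is hypothetical, and this is not claimed to be what DIYDodecad or supervised ADMIXTURE does internally.

```python
def estimate_proportions(genotype, freqs, n_iter=200, eps=1e-9):
    """Estimate admixture proportions of one individual by EM.

    genotype: list of allele counts (0, 1 or 2), one per SNP.
    freqs:    list of rows, one per SNP; each row holds the K fixed
              allele frequencies of the reference components.
    Returns a list of K proportions that sum to 1.
    """
    k = len(freqs[0])
    q = [1.0 / k] * k                      # start from equal proportions
    n_alleles = 2.0 * len(genotype)
    for _ in range(n_iter):
        new_q = [0.0] * k
        for g, row in zip(genotype, freqs):
            # Clamp frequencies away from 0 and 1 to avoid division by zero.
            p = [min(max(x, eps), 1.0 - eps) for x in row]
            mix1 = sum(qk * pk for qk, pk in zip(q, p))          # P(counted allele)
            mix0 = sum(qk * (1.0 - pk) for qk, pk in zip(q, p))  # P(other allele)
            for c in range(k):
                # Expected number of this SNP's two alleles that come from component c.
                new_q[c] += (g * q[c] * p[c] / mix1
                             + (2 - g) * q[c] * (1.0 - p[c]) / mix0)
        q = [x / n_alleles for x in new_q]
    return q
```

Running this on each zombie of one project against the allele frequencies of another gives the kind of component-to-component mapping discussed here, provided both are restricted to the same SNPs and count the same allele.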

I will comment on the MDLP-to-dv3 mapping:

The MDLP "Scandinavian" component appears to be West/East European with a little Mediterranean and a little Northeast Asian

The MDLP "Volga_Region" component appears to be East European with some Northeast Asian

The MDLP "Altaic" component is West Asian+Northeast Asian+Southeast Asian. Note that in Dodecad v3, the Northeast Asian component peaks at Chukchi, Nganasan, and Koryak, and most other east Eurasian populations have much less of it

The MDLP "Celto-Germanic" component is (surprisingly) Mediterranean-dominated. One possible interpretation is that in the context of MDLP this captures one aspect of the difference between Southwestern and Northeastern Europe (higher Mediterranean in the former), whereas the...

... MDLP "North-Atlantic" component seems to be entirely West European, and is capturing a different aspect of east-west variation in Europe.

The MDLP "Balto-Slavic" component appears to be the reverse of the "Celto-Germanic", with lower Mediterranean and reversed East/West European

Finally, the MDLP "Caucassian_Anatolian_Balkanic" component is predictably mainly West Asian, but with a little Mediterranean and Southwest Asian as well

A different way of comparing the different components is to include them all in a joint MDS plot, or calculate various types of distances between them (e.g., Fst).
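As an illustration of the distance-based approach, here is a minimal sketch that computes pairwise Fst between components directly from their allele frequencies. It assumes a Hudson-style ratio-of-sums estimator without sample-size corrections, which seems reasonable because inferred components have no sample sizes of their own. The function names are hypothetical, and the Fst variant is my assumption for this sketch, not something specified above.

```python
def hudson_fst(p1, p2):
    """Hudson-style Fst between two allele frequency vectors (ratio of sums)."""
    num = sum((a - b) ** 2 for a, b in zip(p1, p2))
    den = sum(a * (1 - b) + b * (1 - a) for a, b in zip(p1, p2))
    # A zero denominator means both vectors are fixed for the same alleles.
    return num / den if den > 0 else 0.0

def fst_matrix(freqs):
    """Pairwise Fst for all components.

    freqs: list of rows, one per SNP; each row holds K allele frequencies.
    Returns a K x K symmetric matrix as a list of lists.
    """
    k = len(freqs[0])
    cols = [[row[c] for row in freqs] for c in range(k)]
    return [[hudson_fst(cols[i], cols[j]) for j in range(k)] for i in range(k)]
```

To compare components from two projects, their frequency columns can be placed side by side over the shared SNPs before calling `fst_matrix`; the resulting matrix can then be fed to any MDS or tree-building routine.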

For example, the first couple of dimensions are dominated by the African/Asian components of Dodecad v3 that are not present in MDLP. Notice, however, the position of "Altaic", right where one might expect to find it between West and East Eurasians.

Limiting ourselves to only European populations, we obtain:

It appears that the "North_Atlantic" component may be centered on a small number of related individuals.

I encourage other genome bloggers to try their own hand at comparing their components with those of other projects, or even their own. This process will be made possible if people using ADMIXTURE follow the simple instructions to convert their output for use with DIYDodecad.

Once Dodecad v4 is off the ground, and if I find time to fully automate the process, I will perhaps try to map all my past calculators (i.e., the initial K=10, Dodecad v3, 'bat', 'euro7', 'weac', 'africa9') onto the new golden standard of the Project.

PS: This analysis was done on ~63k SNPs in common between MDLP and Dodecad v3
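Restricting two calculators to their shared SNPs can be sketched as follows. This assumes each project provides a list of SNP identifiers in the same order as its allele frequency rows, and that both projects report the frequency of the same allele at each SNP; in practice the alleles would need to be checked and flipped where they differ. The function name `align_on_common_snps` is hypothetical.

```python
def align_on_common_snps(snps_a, freqs_a, snps_b, freqs_b):
    """Keep only SNPs present in both projects, in project A's order.

    snps_a, snps_b:   lists of SNP identifiers (e.g., rsIDs).
    freqs_a, freqs_b: allele frequency rows, parallel to the SNP lists.
    Returns (common_snps, rows_a, rows_b) with matching row order.
    """
    index_b = {snp: i for i, snp in enumerate(snps_b)}
    common, rows_a, rows_b = [], [], []
    for i, snp in enumerate(snps_a):
        j = index_b.get(snp)
        if j is not None:
            common.append(snp)
            rows_a.append(freqs_a[i])
            rows_b.append(freqs_b[j])
    return common, rows_a, rows_b
```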

10 comments:

If I find some time to automate the process, I will map all the previous Dodecad calculators and a few from other projects onto the Dodecad v4 platform. It is a bit tedious to do the analysis I did here step-by-step.

As for the zombie concept, it works just fine. I have never seen a major discrepancy between results generated using supervised ADMIXTURE+Zombies vs. those generated using DIYDodecad+allele frequencies.

"Once Dodecad v4 is off the ground, and if I find time to fully automate the process, I will perhaps try to map all my past calculators (i.e., the initial K=10, Dodecad v3, 'bat', 'euro7', 'weac', 'africa9') onto the new golden standard of the Project."

Nice to hear something new is in the works. Looking forward to seeing what you have in store for us :-).

Dienekes, I have just one request. Instead of accounting for ancestries in the "Information about project samples" thread, or at least in addition to that, could you create a [freely editable] Google spreadsheet with columns for the Dodecad ID (DODnnnn), Ethnicity/Nationality, and Y-DNA and mtDNA Haplogroups, the way Zack from the Harappa Ancestry Project has done? That would be very, very useful. Going through the numerous posts in the ancestry thread in order to look up the ethnicity of a participant of interest is rather harrowing.


You may cite, quote, or reproduce articles on this site for non-commercial purposes, provided that you attribute them to Dienekes Pontikos and provide a link either to the main page of this blog or to the individual blog entry you are referring to.