
Wednesday, May 23, 2018

Global25 workshop 2: intra-European variation

Even though the Global25 focuses on world-wide human genetic diversity, it can also reveal a lot of information about genetic substructures within continental regions.
Several of the dimensions, for instance, reflect Balto-Slavic-specific genetic drift. I ensured that this would be the case by running a lot of Slavic groups in the analysis. A useful by-product of this strategy is that the Global25 is very good at exposing relatively recent intra-European genetic variation.
To see this for yourself, download the datasheet below and plug it into the PAST program, which is freely available here. Then select all of the columns by clicking on the empty tab above the labels, and choose Multivariate > Ordination > Principal Components.
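For anyone who would rather script this step than click through PAST, here is a minimal Python sketch of the same kind of analysis. It is not what PAST runs internally. It assumes a plain PCA on the column-centered coordinates (the variance-covariance form, with no rescaling of the dimensions), and it uses random numbers as a stand-in for the datasheet, so the function name `subset_pca` and the demo array are purely illustrative.

```python
import numpy as np

def subset_pca(coords):
    """PCA of a samples x dimensions array via SVD of the column-centered data."""
    coords = np.asarray(coords, dtype=float)
    mean = coords.mean(axis=0)
    centered = coords - mean
    # Rows of vt are the principal axes; s holds the singular values.
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    scores = u * s                          # sample coordinates on each component
    eigenvalues = s**2 / (len(coords) - 1)  # variance along each component
    loadings = vt.T                         # one column per component
    return scores, loadings, eigenvalues, mean

# Synthetic stand-in for the datasheet: 40 samples x 25 Global25 dimensions.
# Replace this with the real coordinates (labels stripped) to reproduce a subset PCA.
rng = np.random.default_rng(0)
demo = rng.normal(size=(40, 25)) * np.linspace(0.05, 0.005, 25)

scores, loadings, eigenvalues, mean = subset_pca(demo)
```

The first two columns of `scores` are what a components 1 and 2 scatter plot would show. The signs of individual components are arbitrary, so a plot made this way may be mirrored relative to one made in PAST.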

You should end up with the plot below. Note that to see the group labels and outlines, you need to tick the appropriate boxes in the panel to the right of the image. To improve the experience, it might also be useful to color-code different parts of Europe, and you can do that by choosing Edit > Row colors/symbols. Of course, if you have Global25 coordinates you can add yourself to the datasheet to see where you plot.

Components 1 and 2 pack the most information and, more or less, recapitulate the geographic structure of Europe. However, many details can only be seen by plotting the lower-ranked components, which capture less of the overall variance. For instance, a plot of components 1 and 3 almost perfectly separates Northeastern Europe into two distinct clusters made up of the speakers of Indo-European and Finno-Ugric languages.
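To put a number on how much information each component packs, and to pull out an arbitrary pair such as components 1 and 3, something like the following Python sketch could be used. The helper names `explained_variance_ratio` and `component_pair` are made up for this example, and the same assumption applies as for any plain PCA on centered, unscaled coordinates.

```python
import numpy as np

def explained_variance_ratio(coords):
    """Share of the total variance captured by each component, largest first."""
    coords = np.asarray(coords, dtype=float)
    centered = coords - coords.mean(axis=0)
    s = np.linalg.svd(centered, compute_uv=False)
    variances = s**2
    return variances / variances.sum()

def component_pair(coords, i, j):
    """Sample coordinates on components i and j (1-based, as numbered in the post)."""
    coords = np.asarray(coords, dtype=float)
    centered = coords - coords.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    scores = u * s
    return scores[:, [i - 1, j - 1]]
```

Calling `component_pair(coords, 1, 3)` on the datasheet coordinates would give the two columns needed for a components 1 and 3 scatter plot.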

Thanks again - I'm really learning a lot from your tutorials. I'm now in the process of loading my own Global25.dat file into these tables (it puts me right in the Anglo-Saxon cline, where I imagined I would be).

Very instructive. Prior to this I really did not know what to make of these charts loaded with coloured dots (and I bet I was not alone) :)

Activating the group labels was a great help.

Keep up the good work - you are winning over PCA newbies like myself, and I am loving it.

One trick you can do with the Global25 data and these subset PCAs: after computing a subset PCA, you can project other populations in the G25 onto it, based on their G25 coordinates.

E.g. let's say you first do a subset PCA on modern Europeans - https://imgur.com/CkZC2Dw

You can then take the loadings from that - https://imgur.com/xEnJ0Yl

And finally use the loadings as a conversion table to put other populations back on that subset PCA - https://imgur.com/a/fg0I1u7

(Actually how you do the conversion table is a bit complex to explain. I do it in spreadsheets, but I think Eren came up with a quick R script that does this better).

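Because this PCA is basically lossless compression, this should all be lossless provided you use all 25 dimensions in the initial subset PCA, then in the conversion, etc. (With all 25 components kept and no rescaling of the dimensions, the subset PCA just recenters and rotates the G25 coordinates, so distances between samples don't change.)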

Of course, this is probably less informative about total distances than simply reprocessing back through PCA, but if you're interested in where an ancient population would "project" back onto a PCA of moderns, etc. it could be quite useful.
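The projection trick described in this comment can be sketched in a few lines of Python. This is not the spreadsheet method or the R script mentioned above, just one way to do the same thing under stated assumptions: the subset PCA is a plain PCA on centered, unscaled G25 coordinates, the "conversion table" is the matrix of loadings, and the reference set has more than 25 samples. The arrays `moderns` and `ancients` are random stand-ins, and the function names are hypothetical.

```python
import numpy as np

def fit_subset_pca(reference):
    """Return the reference mean and the loadings (one column per component)."""
    reference = np.asarray(reference, dtype=float)
    mean = reference.mean(axis=0)
    _, _, vt = np.linalg.svd(reference - mean, full_matrices=False)
    return mean, vt.T

def project(samples, mean, loadings):
    """Place samples on the subset PCA: subtract the reference mean, apply the loadings."""
    return (np.asarray(samples, dtype=float) - mean) @ loadings

def pairwise(x, y):
    """Euclidean distance from every row of x to every row of y."""
    return np.linalg.norm(x[:, None, :] - y[None, :, :], axis=2)

rng = np.random.default_rng(1)
moderns = rng.normal(size=(60, 25)) * 0.02          # stand-in for modern Europeans
ancients = rng.normal(size=(5, 25)) * 0.02 + 0.01   # stand-in for other G25 samples

mean, loadings = fit_subset_pca(moderns)
modern_scores = project(moderns, mean, loadings)
ancient_scores = project(ancients, mean, loadings)

# With all 25 components kept, distances match those in the raw G25 space.
lossless = np.allclose(pairwise(ancients, moderns),
                       pairwise(ancient_scores, modern_scores))
```

Keeping only the first two columns of the scores gives the positions on the usual 2D plot. At that point the projection is no longer lossless, since distances can only shrink when components are dropped.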