I've taken the Stanford HGDP dataset and extracted the markers common to it and to HapMap-3, Behar et al. (2010), Rasmussen et al. (2010) and the 23andMe v2 genotyping platform, or about 500k SNPs in total (I removed C/G and A/T SNPs as a precaution and flipped strand in discordant ones to the HapMap-3 standard when it differed from that of HGDP).

I removed SNPs with less than 99% genotyping rate in any of the four data sources, and about 434k SNPs were retained. Subsequently I applied linkage disequilibrium-based pruning on the HGDP set (PLINK parameter: --indep-pairwise 50 5 0.3) resulting in a final dataset of about 177k SNPs. In all analyses of the HGDP set, I followed the recommendations of Rosenberg et al. (2006) keeping the 940 individuals in common between his 952-individual panel and the Stanford data.

Subsequently I ran multidimensional-scaling (MDS) on the 940 individual/57 population/177k SNP set in PLINK, and then I applied model-based clustering as implemented in mclust over the first 42 MDS dimensions, with a maximum number of clusters = 70. In total there were 64 clusters in the optimal solution suggested by mclust (*)

Before I give the results, it might be worth looking at the pairwise MDS scatterplots for just the first 5 dimensions:

As you can see, clusteredness emerges in different dimensions. Rather than inspecting innumerable 2D combinations visually (and indeed we should 3D, etc. as well, because clusters might emerge in 3D and higher subspaces that are not discernible in 2D projections), we let mclust iterate over k, thenumber of clusters, and different shapes, orientations, and volumes of clusters, using the well-known EM algorithm together with the Bayes Information Criterion to choose a good solution that maximizes detail without sacrficing parsimony.

Below you can see how many individuals are assigned to each of the 64 clusters from each of the 57 populations:

This is rather astonishing. There are many clusters with 100% correspondence to HGDP populations. A few populations, mostly from regions with high levels of inbreeding are split into multiple sub-clusters, perhaps reflecting some type of tribal affinity. And, there are a few populations, such as Tuscans and North Italians that are not split. But, the fact that this was inferred from unlabeled individuals is remarkable.

I remember reading Rosenberg et al. (2002), "The genetic structure of human populations" (pdf) which used structure, a model-based algorithm on raw genetic data to infer the existence of 6 clusters corresponding to continental populations. How is it that so much more detail can be achieved today?

There are three reasons: First, dense genotyping data are much better than the few hundreds of microsatellites used by Rosenberg et al. (2002). Second, the use of dimensionality reduction in the form of MDS allowed us to remove most of the "noise" in the genotyping data and focus on dimensions capturing a lot of distinctions. Third, the use of a sophisticated clustering algorithm such as mclust which can adapt to clusters of different shape, size, and orientation without human input was able to produce this result. mclust is computationally expensive, but it works like a charm (in a few minutes) with a few dozen dimensions and about a thousand individuals, producing a clustering of obviously good value.

How to repeat the experiment

If anyone wants to repeat this experiment they can do it easily. After you've managed to put the HGDP data into PLINK ped/map format, say in files HGDP.ped and HGDP.map (or any other data for that matter), just run

> plink --cluster --mds-plot d --file HGDP

Where d is the number of dimensions you want to retain. This produces a plink.mds file in which there is a header line, and each each line after that corresponds to an individual: the individual's projection in the first d dimensions are in columnsnumber 4 to d+3.

Then, in R, after you install and load the mclust package (see the MCLUST page for limitations on its use and licensing information), you just run:

> MDS <- read.table("plink.mds", header=T)

> maxclust <- 70

> MCLUST <- Mclust(MDS[, 4:(d+3)], G=1:maxclust)

where maxclust is the maximum number of clusters you want to consider.

Then, if you run:

> MCLUST$z

you will see a table in which each line corresponds to an individual and each column to the probability that it belongs to the i-th cluster.

There's much more that you can do in R with the mclust package, but this is enough for anyone wanting to repeat the experiment in its basic form.

(*) The number of clusters in the optimal solution varied between 11 with 2 dimensions retained and 64 with 42 dimensions retained. There was a secondary maximum of 60 clusters with 30 dimensions retained; choosing more dimensions than 42 (up to 50 that I examined), also resulted in a very high number of clusters, but I've decided to keep the one with 42 dimensions and 64 clusters as it is enough to serve the purpose of this post.

13 comments:

North Italians and Tuscans are on the same line, but also 2 French on 28. We know that more than four millions French are of Italian extraction. If the sample is representative, they would be 7%. What a pity that there aren’t Spaniards, otherwise it would be interesting to see their location.

"This is rather astonishing. There are many clusters with 100% correspondence to HGDP populations. A few populations, mostly from regions with high levels of inbreeding are split into multiple sub-clusters, perhaps reflecting some type of tribal affinity".

Uh? After one removes too low samples (I used n=10 as treshold), the mono-clustering populations are just 17, while the multi-clustering ones are 26. Total: 43. Mono-clustering populations represent only a bare 40% of all relevant samples.

Furthermore, some of the mono-cluster populations are clearly "tribal" and often alleged as "inbreeding", such as, in Europe: Basques, Sardinians, North Italians, Tuscans... but not French, Russian, Adigey nor Orcadian, all clear cases of multiple origins.

Valid homogeneous clusters are also found in Africa when the populations are homogeneous (the two Pygmy groups, Mandenka, Yoruba), among some rather isolated East Asian groups (Uygur, Dai, She, Tu, Japanese), but not among more cosmopolitan groups such as Han, Mongola, Cambodians or Yakuts (at least two population layers at their origins).

How do you decide on the maximum number of clusters to consider in mclust (i.e., where does maxclust=70 come from)?

The number of inferred clusters should be lower than maxclust. So, if you set, e.g., maxclust=50, then MCLUST will give you a warning that the number of clusters occurs at Max choice, and thus it will be a good idea to increase maxclust and re-run it.

Old Blog Archive

Dienekes' Anthropology blog is dedicated to human population genetics, physical anthropology, archaeology, and history.

You are free to reuse any of the materials of this blog for non-commercial purposes, as long as you attribute them to Dienekes Pontikos and provide a link to either the individual blog entry or to Dienekes Anthropology Blog.

Feel free to send e-mail to Dienekes Pontikos, or follow @dienekesp on Twitter.