Wednesday, July 10, 2013

Networks and human inter-population variation

I have noted before that there are many situations in which the model of a phylogenetic tree is likely to be inappropriate for analysis of genetic data. The most obvious of these involves the study of intra-population variation (e.g. Why do we still use trees for the dog genealogy?). The within-population genealogy of sexually reproducing species, in particular, is not likely to be tree-like, even at large spatial scales. The iconic species for the study of intra-specific evolutionary history is Homo sapiens, and this is also the species where that history is least likely to be tree-like (e.g. Why do we still use trees for the Neandertal genealogy?). Clearly, a phylogenetic network is called for.

Pemberton et al. (2013, Population structure in a comprehensive genomic data set on human microsatellite variation. Genes Genomes Genetics 3: 891-907) provide an interesting dataset of global human autosomal microsatellite variation, based on merging eight previously published datasets. Microsatellites are a bit retro in this day and age, but that does not make them any less useful for the study of genetic variation.

The biggest issue is getting a large enough sample of loci for detailed study. Different researchers collect data on different microsatellites, and so combining datasets is not straightforward. Nevertheless, Pemberton et al. managed to come up with 5,795 individuals from 267 worldwide populations with genotypes at 645 loci. After removing one member of every intra-population first-degree and second-degree relative pair, and reducing the size of the over-represented Gujarati sample, they added data for 84 chimpanzees. This yielded a dataset of 5,519 individuals from 255 populations sampled at 246 shared loci.

These data were processed as follows:

Using Microsat, we evaluated population-level pairwise allele-sharing distance (one minus the proportion of shared alleles), using all 246 loci ... We constructed a greedy-consensus neighbor-joining tree using the Neighbor and Consensus programs in the Phylip package from 1000 bootstrap resamples across loci.
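To make the distance measure concrete, here is a minimal Python sketch of one common population-level reading of "one minus the proportion of shared alleles": at each locus, sum the smaller of the two populations' frequencies for every allele, average over loci, and subtract from one. This is my assumption about the calculation, not a description of what Microsat actually does internally, and the function names, data layout and toy frequencies are all hypothetical.

```python
import random


def allele_sharing_distance(freqs_a, freqs_b):
    """One minus the mean proportion of shared alleles across loci.

    freqs_a and freqs_b are lists with one entry per locus; each entry is
    a dict mapping an allele label to its frequency in that population.
    (Assumed layout, chosen for illustration only.)
    """
    shared_total = 0.0
    for locus_a, locus_b in zip(freqs_a, freqs_b):
        alleles = set(locus_a) | set(locus_b)
        # Shared proportion at this locus: overlap of the two frequency spectra.
        shared_total += sum(
            min(locus_a.get(allele, 0.0), locus_b.get(allele, 0.0))
            for allele in alleles
        )
    return 1.0 - shared_total / len(freqs_a)


def bootstrap_loci(n_loci, rng):
    """Indices of one bootstrap resample across loci (with replacement)."""
    return [rng.randrange(n_loci) for _ in range(n_loci)]


# Toy data: two populations, two loci.
pop_a = [{"120": 0.5, "124": 0.5}, {"200": 1.0}]
pop_b = [{"120": 0.5, "128": 0.5}, {"200": 1.0}]
distance = allele_sharing_distance(pop_a, pop_b)

# One bootstrap replicate: resample loci, then recompute the distance.
rng = random.Random(1)
sample = bootstrap_loci(len(pop_a), rng)
replicate = allele_sharing_distance(
    [pop_a[i] for i in sample], [pop_b[i] for i in sample]
)
```

Repeating the resampling step 1000 times, for every pair of populations, would give the 1000 bootstrap distance matrices referred to in the quoted methods.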

Note that the original inter-population distances were not calculated; instead, the tree is a consensus, constructed by combining the branches with the highest bootstrap support.
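The idea of a greedy consensus can be sketched in a few lines of Python. Each branch is treated as a split of the populations into two groups; the splits are ranked by how many bootstrap trees contain them, and each one is accepted in turn provided that it does not conflict with any split already accepted. This is a simplified illustration under my own assumptions (the names and the toy counts are invented), not the code of the Phylip Consense program.

```python
def compatible(split_a, split_b, taxa):
    """Two splits can sit on the same tree if one of the four
    intersections of their sides is empty."""
    rest_a = taxa - split_a
    rest_b = taxa - split_b
    return any(
        not part
        for part in (
            split_a & split_b,
            split_a & rest_b,
            rest_a & split_b,
            rest_a & rest_b,
        )
    )


def greedy_consensus(split_counts, taxa):
    """Accept splits in order of decreasing bootstrap count, skipping any
    split that conflicts with one already accepted."""
    accepted = []
    ranked = sorted(split_counts.items(), key=lambda item: -item[1])
    for split, _count in ranked:
        if all(compatible(split, other, taxa) for other in accepted):
            accepted.append(split)
    return accepted


# Toy example: five populations and hypothetical counts out of 1000 trees.
taxa = frozenset("ABCDE")
split_counts = {
    frozenset("AB"): 900,
    frozenset("ABC"): 600,
    frozenset("BC"): 300,  # conflicts with the AB split, so it is skipped
}
consensus_splits = greedy_consensus(split_counts, taxa)
```

Because splits with low support are still accepted if nothing better contradicts them, the resulting tree can be fully resolved even when many of its branches appear in only a minority of the bootstrap trees.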

This tree (reproduced above) does not show a great deal of support for many of the branches, and the authors discuss only seven of them. However, the presentation of a tree does not give much of a visual indication of the poor support for the genealogy, even if the different branch thicknesses do indicate the bootstrap values.

So, I calculated a NeighborNet network from the distance data, by averaging the 1000 distance matrices from the bootstrap analysis. This is the network analogue of the neighbor-joining tree, as shown above. Note that I have used the same colour coding as for the tree (thus making it look like a very colourful hummingbird), and the branch lengths represent support.
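The averaging step is simple enough to sketch in Python: each cell of the averaged matrix is the mean of the corresponding cells across the bootstrap matrices. The matrices below are toy values for three populations, and the function name is hypothetical; the NeighborNet calculation itself is not shown here.

```python
def average_distance_matrices(matrices):
    """Element-wise mean of a list of equally sized square matrices,
    each given as a list of rows."""
    count = len(matrices)
    size = len(matrices[0])
    return [
        [sum(matrix[i][j] for matrix in matrices) / count for j in range(size)]
        for i in range(size)
    ]


# Two toy bootstrap distance matrices for three populations.
replicate_1 = [
    [0.0, 0.25, 0.5],
    [0.25, 0.0, 0.75],
    [0.5, 0.75, 0.0],
]
replicate_2 = [
    [0.0, 0.75, 0.5],
    [0.75, 0.0, 0.25],
    [0.5, 0.25, 0.0],
]
mean_matrix = average_distance_matrices([replicate_1, replicate_2])
```

The averaged matrix stays symmetric with a zero diagonal, so it can be passed to any distance-based method in the same way as a single matrix.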

There is clearly a degree of large-scale geographical clustering of the genotypes, and this corresponds to the larger bootstrap values in the tree. So, the main message from the tree and the network is the same, including the rooting of the human genealogy within the African "group". However, this message is visually much clearer in the network than in the circular version of the tree. Moreover, there is little distinction between the Middle Eastern (yellow) and European (blue) genotypes, and the network makes this more obvious than does the tree.