Wednesday, May 22, 2013

Are phylogenetic trees useful for domesticated organisms?

When looking at the population genetics literature, I have noticed that many papers still present very traditional phylogenetic analyses, particularly in what can broadly be called agricultural studies. For instance, genetic distances might be calculated between the samples, and a "tree of genetic relationships" then presented based on UPGMA clustering.

The problem with this sort of approach to genotype data analysis is that it forces the data into an ultrametric tree, which has long been shown to be inappropriate as a model for evolutionary relationships. Furthermore, there is no indication of the robustness of this tree, nor even whether a tree model is appropriate in the first place.
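To make the "forcing" concrete, here is a minimal sketch in Python of what UPGMA does to a small distance matrix. The four labels and their distances are invented purely for illustration (they are not from any dataset), and the function names are my own. The sketch builds the UPGMA tree, reads off the distances implied by that tree, and checks the ultrametric condition (in every triple of samples, the two largest distances are equal) before and after.

```python
from itertools import combinations

def pair(x, y):
    """Order-independent key for a pair of labels."""
    return frozenset((x, y))

def upgma_tree_distances(labels, d):
    """Return the pairwise distances implied by a UPGMA tree.

    Clusters are merged in order of smallest average between-cluster
    distance; every pair of samples split by a merge receives the
    height of that merge as its tree distance.
    """
    clusters = [(x,) for x in labels]
    tree_d = {}

    def average(a, b):
        # Unweighted mean of all original distances between the two clusters.
        return sum(d[pair(x, y)] for x in a for y in b) / (len(a) * len(b))

    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda p: average(*p))
        height = average(a, b)
        for x in a:
            for y in b:
                tree_d[pair(x, y)] = height
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return tree_d

def is_ultrametric(labels, d, tol=1e-9):
    """True if, in every triple, the two largest distances are equal."""
    for x, y, z in combinations(labels, 3):
        s = sorted([d[pair(x, y)], d[pair(x, z)], d[pair(y, z)]])
        if abs(s[2] - s[1]) > tol:
            return False
    return True

# Invented example distances for four hypothetical samples.
labels = ["A", "B", "C", "D"]
d = {
    pair("A", "B"): 0.2, pair("A", "C"): 0.5, pair("A", "D"): 0.9,
    pair("B", "C"): 0.3, pair("B", "D"): 0.8, pair("C", "D"): 0.35,
}

tree_d = upgma_tree_distances(labels, d)
print("input ultrametric:", is_ultrametric(labels, d))
print("UPGMA tree ultrametric:", is_ultrametric(labels, tree_d))
for key in sorted(d, key=sorted):
    print(sorted(key), "input:", d[key], "tree:", round(tree_d[key], 3))
```

In this toy case the input distances are not ultrametric, but the tree distances are. The tree gives no hint that the A-C distance, for example, has been changed from 0.5 to 0.625 in order to fit the model.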

As a specific example, we can look at the microsatellite data presented by Carimi et al. (2010) for various Sicilian grape cultivars. For grape varieties, where hybridization among cultivars has been the historical norm, an ultrametric tree seems singularly inappropriate.

Wine grapes have been grown on Sicily for more than 2,000 years, and at least 120 grape-vine cultivar names are known in the literature. The authors sampled 82 of the cultivars from the Institute of Plant Genetics (Palermo) germplasm collection, with 1-5 clones sampled per cultivar. They assessed six polymorphic microsatellite loci, producing diploid (co-dominant) data. Only 70 distinct genotypes were detected, which were then subjected to data analysis.

The authors used the "Simple Matching coefficient for co-dominant and multiallelic data" to estimate the genetic distances between samples. Unfortunately, this has been shown to have odd properties for diploid microsatellite data (Kosman and Leonard 2005). Therefore, in my analysis I have instead used the simple metric of Kosman and Leonard (2005), in which genotype distances are calculated from the proportion of shared alleles at each locus (averaged across loci). This was calculated using the mmod R package (Winter 2012).
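As a rough illustration of this kind of distance, here is a short Python sketch of one reading of the description above: at each locus, count the alleles that two diploid genotypes share (0, 1 or 2), convert that to a dissimilarity of 1, 0.5 or 0, and average across loci. This is not the mmod code, and the function names and the allele coding (pairs of integer fragment sizes) are assumptions made for the example.

```python
from collections import Counter

def locus_distance(g1, g2):
    """Dissimilarity at one locus for two diploid genotypes.

    Each genotype is a pair of alleles, e.g. (100, 104). Shared alleles
    are counted with multiplicity, so (100, 100) vs (100, 104) shares one.
    """
    shared = sum((Counter(g1) & Counter(g2)).values())
    return 1.0 - shared / 2.0

def genotype_distance(a, b):
    """Average the per-locus dissimilarities across all loci."""
    if len(a) != len(b):
        raise ValueError("genotypes must be scored at the same loci")
    return sum(locus_distance(x, y) for x, y in zip(a, b)) / len(a)

# Two hypothetical cultivars scored at three loci (invented allele sizes).
cultivar_1 = [(100, 102), (200, 200), (150, 154)]
cultivar_2 = [(100, 104), (200, 202), (150, 154)]

print(genotype_distance(cultivar_1, cultivar_2))
```

Here the two genotypes share one allele at each of the first two loci and both alleles at the third, so the per-locus values are 0.5, 0.5 and 0, which average to one third.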

The authors then used the "UPGMA (Unweighted Pair-Group Method with Arithmetical Averages)" clustering method to produce their ultrametric tree from the distance data. This is the agglomerative hierarchical clustering method most commonly encountered in the literature. Instead, I used a NeighborNet network to evaluate whether the data are tree-like, calculated using the SplitsTree program.

The resulting network is shown in the first graph. Cultivars that are closely connected in the network are similar to each other based on their microsatellite profiles, and those that are further apart are progressively more different from each other.

The network shows that there is very little hierarchical structure to the grape-vine microsatellite data. The data do not clearly distinguish "six main groups", as interpreted by the original authors based on their tree (which is shown below). [Note that one of the authors' groups (cluster E) is more heterogeneous than the others, and to be comparable should be divided into either two or three groups.]

Note that the network emphasizes two things: (1) there are no clear groupings of the grape cultivars, and (2) the data are rather "noisy", as microsatellite data often are (e.g. Leroy et al. 2009), with many incompatible signals.

As far as the phylogenetic history is concerned, there is no evidence of "several origins for Sicilian grape-vine germplasm", as interpreted by the authors. Instead, there seems to have been continuous mixing of the genotypes, probably including cultivars from elsewhere in Italy, and even further afield around the Mediterranean. This type of complex genetic history seems to be quite common in domesticated organisms, and a tree-based analysis is therefore unlikely to be appropriate for studying them; see, for example, Decker et al. (2009) for cows, Leroy et al. (2009) for horses, and Kijas et al. (2012) for sheep.