Wednesday, June 20, 2012

Rooted networks for exploratory data analysis

Leo van Iersel has recently been trying to convince me that rooted networks might also be useful for exploratory data analysis (EDA), in addition to the unrooted networks that I have championed in print (Morrison 2010) and in this blog. I have tried to find a dataset that will support his case, and the one discussed here is the best that I have been able to find.

In infection biology we are interested in the transmission of pathogens from one host to another, possibly in geographically distant locations. It is usually assumed that pathogens (viruses, bacteria, protists, microfungi, helminths) with the same genotype found in different locations represent transmission from a single source location. Conversely, a mixture of genotypes at a single location is assumed to represent multiple sources of infection, possibly at different times. This type of analysis is a combination of population genetics and phylogenetics.
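The two assumptions above amount to a simple tabulation, which can be sketched in a few lines of Python. This is only an illustration: the farm names and sequences are invented toy data, and the function names are hypothetical rather than part of any published analysis.

```python
from collections import defaultdict

def genotypes_by_location(samples):
    """Map each location to the set of distinct genotypes sampled there."""
    by_location = defaultdict(set)
    for location, genotype in samples:
        by_location[location].add(genotype)
    return dict(by_location)

def locations_by_genotype(samples):
    """Map each genotype to the set of locations where it was sampled."""
    by_genotype = defaultdict(set)
    for location, genotype in samples:
        by_genotype[genotype].add(location)
    return dict(by_genotype)

# Toy data (hypothetical): (location, genotype) pairs.
toy_samples = [
    ("FarmA", "ACGT"),
    ("FarmA", "ACGT"),
    ("FarmB", "ACGT"),
    ("FarmB", "ACGA"),
    ("FarmC", "TCGA"),
]

# A mixture of genotypes at one location suggests multiple infection sources.
mixed_locations = sorted(
    loc for loc, genos in genotypes_by_location(toy_samples).items()
    if len(genos) > 1
)

# The same genotype at several locations suggests a shared source.
shared_genotypes = {
    geno: sorted(locs)
    for geno, locs in locations_by_genotype(toy_samples).items()
    if len(locs) > 1
}

print(mixed_locations)
print(shared_genotypes)
```

A tabulation like this says nothing about how the genotypes are related to each other, which is where the tree or network comes in.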

Such transmission studies can produce quite complex results, even to the extent of finding different pathogen genotypes simultaneously in the same host. Data analysis is usually based on either a rooted tree or an unrooted haplotype network, but the data can also conveniently be studied using a rooted reticulation network. I will illustrate the latter with a simple example.

The figure shows a rooted network for 1,544 aligned nucleotides from 72 samples of the nematode Dictyocaulus viviparus, the parasitic lungworm of domestic cattle. The data are concatenated mitochondrial protein (2 genes), rRNA and tRNA gene sequences, from Höglund et al. (2006). The analysis shows the inferred historical relationships among 64 farm samples from Sweden (8 worms from each of Farms 29, 34, 36, 38, 49, 65, 68 and 76) and 8 samples from an isolate that had been maintained in the laboratory (L, used as the outgroup to root the network).

The data have been analyzed using the reticulation network method of Huson et al. (2007), based on splits generated by the Median network. Since the character data are essentially binary (with two exceptions), this produces exactly the same result as for a recombination network.
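To see why binary characters lead naturally to splits, and why some sets of splits cannot be shown on a tree, here is a minimal Python sketch. It is not the method of Huson et al. (2007); it only extracts the split induced by each two-state alignment column and reports the pairs of splits that are mutually incompatible (the classic four-gamete condition). The alignment is invented toy data, and the function names are hypothetical.

```python
from itertools import combinations

def site_split(column):
    """Return the split induced by a two-state column, or None otherwise.

    The split is stored as the set of taxon indices whose state differs
    from that of taxon 0.
    """
    if len(set(column)) != 2:
        return None
    return frozenset(i for i, state in enumerate(column) if state != column[0])

def compatible(a, b, n_taxa):
    """Two splits are compatible if one of the four intersections is empty."""
    everyone = frozenset(range(n_taxa))
    sides_a = (a, everyone - a)
    sides_b = (b, everyone - b)
    return any(not (x & y) for x in sides_a for y in sides_b)

def splits_and_conflicts(alignment):
    """Collect the distinct splits of an alignment and their incompatible pairs."""
    n_taxa = len(alignment)
    splits = []
    for column in zip(*alignment):
        split = site_split(column)
        if split is not None and split not in splits:
            splits.append(split)
    conflicts = [
        (a, b) for a, b in combinations(splits, 2)
        if not compatible(a, b, n_taxa)
    ]
    return splits, conflicts

# Toy alignment (hypothetical): four taxa, three variable sites.
toy_alignment = ["AAA", "ACA", "CAA", "CCG"]
splits, conflicts = splits_and_conflicts(toy_alignment)
print(splits)
print(conflicts)
```

In the toy alignment the first two sites conflict with each other, so no single tree can display both; it is conflicts of this kind that a network has to resolve with a reticulation.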

In the network, most of the samples from within each farm seem to be closely related in a simple divergent fashion through time, as would also be conveniently displayed by a standard tree-based analysis. There are apparently two major clades of genotypes, with 6-7 subclades. We can conclude from the tree-like relationships that four farms show evidence of only a single source of infection (Farms 34, 36, 38 and 76 each have a single genotype), while two farms appear to have at least two genotypes and thus probably two sources of infection (Farms 49 and 68).

However, two of the farms show more complex patterns than these, which would not be revealed by a simple tree analysis. These two farms have groups of samples that descend from reticulation nodes (indicated by the arrows), thus suggesting the pooling of two distinct sources of genetic material. Note that there is no suggestion that these reticulations represent either recombination or hybridization, given that the data are from mitochondrial genes. This analysis is best treated as exploratory (EDA), highlighting genotypic complexity that warrants further biological investigation, rather than providing an explicit hypothesis of evolutionary history.
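In graph terms, a reticulation node is simply a node with more than one parent, and the samples of interest are the leaves below it. The following Python sketch shows how these could be picked out of a rooted network stored as a list of (parent, child) edges. The edge list is a made-up toy network, not the lungworm network in the figure, and the node labels and function names are hypothetical.

```python
from collections import defaultdict

def reticulation_nodes(edges):
    """Return each node with two or more parents, mapped to its sorted parents."""
    parents = defaultdict(set)
    for parent, child in edges:
        parents[child].add(parent)
    return {node: sorted(ps) for node, ps in parents.items() if len(ps) > 1}

def leaves_below(edges, start):
    """Return the sorted leaf labels reachable from the given node."""
    children = defaultdict(list)
    for parent, child in edges:
        children[parent].append(child)
    stack, seen, leaves = [start], set(), set()
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        kids = children.get(node, [])
        if kids:
            stack.extend(kids)
        else:
            leaves.add(node)
    return sorted(leaves)

# Toy rooted network (hypothetical): node "h" has two parents.
toy_edges = [
    ("root", "cladeA"), ("root", "cladeB"),
    ("cladeA", "farmX_1"), ("cladeA", "h"),
    ("cladeB", "farmY_1"), ("cladeB", "h"),
    ("h", "farmZ_1"), ("h", "farmZ_2"),
]

reticulations = reticulation_nodes(toy_edges)
for node, node_parents in reticulations.items():
    print(node, "has parents", node_parents,
          "and leaves", leaves_below(toy_edges, node))
```

Listing the samples below each reticulation in this way is a quick check on which farms carry genotypes of apparently mixed origin.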

Farm 29 is shown as having one unique genotype (5 individuals) plus another genotype (3 individuals) that has elements possibly related to both of the major clades of genotypes. Perhaps these latter 3 individuals represent an earlier infection, given their apparent association with the basal branches of the two clades.

Farm 65 appears to be even more noteworthy. There are 3 individuals that are apparently related to those on Farm 36, plus 3 individuals of somewhat uncertain relationship. Then there are 2 individuals with elements possibly related to the genotypes on Farms 76 and 49. This is clearly a very interesting farm, from the point of view of lungworm infection and transmission, with at least three possible infection sources. This is important information that needs to be taken into account for possible management strategies.

This use of a rooted network analysis for exploratory data analysis seems not to have been considered before. However, it seems to me that it adds considerably to the practical information that can be gleaned from a study of the transmission of pathogens.