Wednesday, September 25, 2013

How do we interpret a rooted haplotype network?

A splits graph is an unrooted phylogenetic network (see How to interpret splits graphs). It can be produced by any of several algorithms, including distance-based methods such as NeighborNet and Split Decomposition, character-based methods such as Median Networks and Parsimony Splits, and tree-based methods such as Consensus Networks and SuperNetworks.

Such graphs can also be produced by methods that conceptually modify Median Networks, such as Reduced Median Networks and Median-Joining Networks. These two methods are popular in population genetics, especially as related to Homo sapiens, where they are used as haplotype networks (or 1-step networks); and it is their use as haplotype networks that I wish to discuss here.

Haplotype networks represent the relationships among the different haploid genotypes observed in the dataset (ie. identical sequences are pooled into a single terminal). They are usually drawn unrooted, which is quite sensible for within-species data, where the root location is often unknown. However, there are occasions when a root is provided, and authors then interpret the splits graph as a directed network. This is directly analogous to starting with an unrooted phylogenetic tree and adding a root (usually via an outgroup), so that the rooted tree can be interpreted as a genealogical history. In moving from an unrooted to a rooted tree, each branch acquires a direction (away from the root), and the internal nodes become hypothetical ancestors.

However, this is problematic for all types of unrooted network. In the case of splits graphs, each edge acquires an unambiguous direction, as for a tree, but not every internal node can necessarily be interpreted as a hypothetical ancestor. How, then, do we interpret the rooted haplotype network?

An example

Let's look at a specific example, taken from the recent paper by Witas et al. (2013).

Figure 4 from this paper shows a haplotype network of four mtDNA HVR1 (hypervariable region 1 of the control region) samples from Ancient Mesopotamia (the middle Euphrates valley between 2500 BC and 500 AD), compared to contemporary samples from five different geographical regions. It shows that the ancient samples fit neatly into modern genetic variation from southern and eastern Asia, rather than from eastern Europe.

However, note that a root is also explicitly indicated. I explain below where this root comes from, but first let's concentrate on what happens if we treat the network as rooted.

This is a Median-Joining Network, and thus it is a splits graph. As such, the root provides unambiguous directions for all of the branches, based on the principle that the network must be a directed acyclic graph with only one root. This is shown by the arrows in the modified figure. Furthermore, all of the internal nodes can be interpreted as hypothetical ancestors, except for the two reticulations in the graph, labelled A and B.
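To make the rooting step concrete, here is a minimal Python sketch of how edge directions might be assigned once a root is chosen. It assumes the network is connected and behaves like a splits graph, so that the two ends of every edge differ by exactly one step in their distance from the root. The function names and the toy network are hypothetical; they are not taken from Witas et al. or from any network software.

```python
from collections import deque

def orient_from_root(edges, root):
    """Direct each undirected edge away from the root by BFS step count."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)

    # Breadth-first search gives the number of edges between root and each node.
    distance = {root: 0}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)

    directed = []
    for u, v in edges:
        if distance[u] == distance[v]:
            # Assumption violated: this edge has no unambiguous direction.
            raise ValueError(f"edge {u}-{v} cannot be directed from {root}")
        directed.append((u, v) if distance[u] < distance[v] else (v, u))
    return directed

def reticulation_nodes(directed):
    """Nodes with more than one incoming edge after rooting."""
    indegree = {}
    for _, child in directed:
        indegree[child] = indegree.get(child, 0) + 1
    return sorted(node for node, count in indegree.items() if count > 1)

# Hypothetical toy network: one square (x, a, b, c) hanging below the root.
toy_edges = [("root", "x"), ("a", "x"), ("x", "b"),
             ("c", "a"), ("b", "c"), ("c", "d")]
directed = orient_from_root(toy_edges, "root")
print(reticulation_nodes(directed))  # ['c']
```

In this toy case every node except c has a single parent and can be read as an ancestor of the nodes below it, while c receives two incoming edges and so plays the role of a reticulation such as A or B.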

These reticulations are created by contradictory patterns involving the characters labelled 16276, 16185 and 16311. In a rooted splits graph, reticulations represent uncertainty about the order of character changes, rather than representing reticulate evolution (eg. recombination, hybridization, etc). In this case, we cannot determine whether character 16311 changes before or after the changes in characters 16185 and 16276.
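One common way to detect this kind of contradiction between two binary characters is to check whether all four combinations of states occur among the haplotypes (often called the four-gamete condition). The sketch below is only an illustration of that idea: the site labels and 0/1 states are invented toy data, not the HVR1 data, and the function name is hypothetical.

```python
from itertools import combinations

def incompatible_pairs(haplotypes, sites):
    """Return pairs of binary sites that show all four state combinations.

    haplotypes: equal-length strings of '0'/'1', one per haplotype.
    sites: one label per column of the haplotype strings.
    """
    pairs = []
    for i, j in combinations(range(len(sites)), 2):
        observed = {(hap[i], hap[j]) for hap in haplotypes}
        if len(observed) == 4:
            # 00, 01, 10 and 11 all occur: no tree fits both sites
            # with a single change each.
            pairs.append((sites[i], sites[j]))
    return pairs

# Invented toy data: siteC conflicts with siteA and siteB,
# while siteA and siteB always change together.
toy_sites = ["siteA", "siteB", "siteC"]
toy_haplotypes = ["000", "110", "001", "111"]
conflicts = incompatible_pairs(toy_haplotypes, toy_sites)
print(conflicts)  # [('siteA', 'siteC'), ('siteB', 'siteC')]
```

Each conflicting pair corresponds to a box in the network, because the order of the two changes cannot be decided from the data alone.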

So, it is important to recognize that a rooted splits graph does not explicitly represent a phylogeny, because reticulations in the graph represent uncertainty, not genealogy.

The simplest interpretation of this type of rooted splits graph is usually that the network represents a set of most-parsimonious trees, rather than a single parsimony tree. The different trees can be obtained by resolving the reticulations (ie. by deciding what order the character changes occur in). This relationship between the rooted haplotype network and a parsimony tree is shown by the following example from Jansen et al. (2002).
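As a rough sketch of what "resolving the reticulations" can mean computationally, the Python code below keeps exactly one incoming edge for every node of a rooted network and lists the resulting trees. This is a simplification made for illustration: whether each such tree is actually most-parsimonious depends on the data, and this is not claimed to be the procedure used by Jansen et al. The function name and the toy network are hypothetical.

```python
from itertools import product

def resolve_reticulations(directed_edges):
    """List every tree obtained by keeping one parent per non-root node."""
    parents = {}
    for parent, child in directed_edges:
        parents.setdefault(child, []).append(parent)

    children = sorted(parents)
    trees = []
    # Nodes with one parent offer one choice; reticulations offer several.
    for choice in product(*(parents[child] for child in children)):
        trees.append(sorted(zip(choice, children)))
    return trees

# Hypothetical rooted toy network with a single reticulation at node c.
toy_network = [("root", "x"), ("x", "a"), ("x", "b"),
               ("a", "c"), ("b", "c"), ("c", "d")]
trees = resolve_reticulations(toy_network)
for tree in trees:
    print(tree)
```

With one reticulation that has two parents there are two trees; in general the number of trees under this scheme is the product of the in-degrees of the reticulation nodes.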

This is a network of 93 mtDNA control-region haplotypes from horses. It is also a Median-Joining Network, although the data were pre-processed using a Reduced Median Network. Node A6 is the root, based on equid outgroups. The solid lines indicate one of the most-parsimonious trees contained within the network: for every reticulation, the authors have selected one particular order of the character changes in order to postulate this particular tree. The non-chosen parts of the network are indicated by dotted lines; these are part of alternative most-parsimonious trees.

Explanation of the human mtDNA root

mtDNA is usually treated as a non-recombining locus, and so it should evolve along a tree. A rooted global tree has therefore been produced for humans, based on parsimony analysis of the mtDNA genome (Torroni et al. 2000; van Oven and Kayser 2009). Groups and subgroups of this tree have been labelled as haplogroups, such as haplogroup M shown in the top figure, and sub-haplogroups, such as M4b, M49 and M61. These are (monophyletic) clades in the mtDNA tree that have been highlighted for convenience. Parsimony analysis has been used to reconstruct the ancestral sequences in the tree (Behar et al. 2012), and these ancestral sequences can be used to assign new sequences to their appropriate place in the rooted tree (Blanco et al. 2011).

The basic limitation of this approach is that the haplogroups and sub-haplogroups are based on a non-unique parsimony tree. There are many equally parsimonious trees for the dataset, any one of which could have been chosen to define the haplogroups. In spite of this limitation, the predefined haplogroups are treated by many people as actually designating specific mitochondrial lineages, rather than merely being groups of convenience, which is what they are.

Witas HW, Tomczyk J, Jędrychowska-Dańska K, Chaubey G, Płoszaj T (2013) mtDNA from the Early Bronze Age to the Roman period suggests a genetic link between the Indian subcontinent and Mesopotamian cradle of civilization. PLoS One 8(9): e73682.

Dear João, I presume that by "gradient of ancestry" you are referring to a lineage of hypothesized ancestors in the graph matching a geographical gradient. If so, then one possible explanation for the pattern would be that the geographic gradient represents a time sequence, in which case the root might be inferable. However, this form of inference relies on substituting space for time (ie. a spatial sequence represents a time sequence), and it does not always lead to the correct conclusion (ie. there might be other explanations). /David

Thank you for your answer. I think that the use of rooted networks can become a very important tool. In rooted networks, I have found several situations in which a sequence of haplotypes with an ancestor-descendant relationship occurs in a north / south or west / east direction. This suggests that the pattern could indicate a route of colonization. I think that the alternative explanations are less likely. It is intuitive to me that a spatial sequence is also a time sequence, because the plants take time to colonize new areas. However, I cannot prove this and, surely, colonization of different areas may occur almost simultaneously. Unfortunately, I have found very few examples of this approach in the literature, and I have not found any experimental confirmation.