Wednesday, September 23, 2015

Uses of MUL-trees for evolutionary networks

Creating evolutionary phylogenetic networks is currently a somewhat ad hoc procedure, with a number of competing strategies based on various models of how gene flow occurs.

One possibility is to use multi-labeled trees (MUL-trees). In a MUL-tree the leaves are not uniquely labeled by the set of species (i.e. each species can appear more than once). This means that multiple gene trees can be represented by a single MUL-tree, with different combinations of the leaf labels representing different gene trees. The MUL-tree can, in turn, be represented as a reticulating network.
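To make the "different combinations of the leaf labels" idea concrete, here is a minimal Python sketch. It assumes a toy MUL-tree written as nested tuples, in which the hypothetical species "B" labels two leaves. The tree and the function names are illustrative only, and are not taken from any of the programs or papers mentioned here. Keeping exactly one copy of each species, and suppressing the nodes that are left with a single child, yields the single-labeled trees that the MUL-tree displays.

```python
from itertools import product

# A hypothetical MUL-tree as nested tuples; species "B" labels two leaves.
MUL_TREE = ((("A", "B"), "C"), ("B", "D"))

def leaves(tree):
    """Return the leaf labels in left-to-right order."""
    if isinstance(tree, str):
        return [tree]
    return [label for child in tree for label in leaves(child)]

def prune(tree, keep, counter=None):
    """Keep only the leaves whose left-to-right index is in `keep`,
    suppressing any internal node that is left with a single child."""
    if counter is None:
        counter = [0]  # mutable counter shared across the recursion
    if isinstance(tree, str):
        index = counter[0]
        counter[0] += 1
        return tree if index in keep else None
    children = [prune(child, keep, counter) for child in tree]
    children = [child for child in children if child is not None]
    if not children:
        return None
    if len(children) == 1:
        return children[0]
    return tuple(children)

def single_labeled_trees(tree):
    """Yield every tree obtained by keeping exactly one leaf per species."""
    positions = {}
    for index, label in enumerate(leaves(tree)):
        positions.setdefault(label, []).append(index)
    for choice in product(*positions.values()):
        yield prune(tree, set(choice))

for gene_tree in single_labeled_trees(MUL_TREE):
    print(gene_tree)
```

For this toy tree the two choices of "B" give two different topologies, which is the sort of conflict that a reticulation would summarize in a network.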

The most obvious use of a MUL-tree is when there are multiple copies of genes within an organism, as each gene copy can be represented independently in the MUL-tree. This applies when there has been gene duplication, for example, or when there has been polyploidy (i.e. multiple copies of the entire genome). Computer programs such as PADRE or MulRF can then be used to derive an optimal single-labeled species network from the MUL-tree.

However, this same strategy can also be used whenever there is conflict among gene trees. In this scenario, the conflicting genes are treated as different leaves in the MUL-tree. One labeled leaf would have the data for the first gene, with the second gene entered as missing data, and the second leaf would then have the inverse situation (the data for gene one are missing and those for gene two are present).
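The construction of such a matrix can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two genes are already aligned, each is held as a dictionary from taxon name to sequence, "?" is the missing-data symbol, and the taxon names, sequences and the "_gene1"/"_gene2" suffixes are all made up for the example.

```python
def duplicate_taxon(gene1, gene2, conflicting, missing="?"):
    """Concatenate two aligned gene matrices (dicts: taxon -> sequence),
    splitting each conflicting taxon into two leaves, one per gene."""
    len1 = len(next(iter(gene1.values())))
    len2 = len(next(iter(gene2.values())))
    combined = {}
    for taxon in sorted(set(gene1) | set(gene2)):
        # A taxon that was not sampled for a gene is padded with missing data.
        seq1 = gene1.get(taxon, missing * len1)
        seq2 = gene2.get(taxon, missing * len2)
        if taxon in conflicting:
            # First leaf: gene 1 present, gene 2 missing.
            combined[taxon + "_gene1"] = seq1 + missing * len2
            # Second leaf: the inverse situation.
            combined[taxon + "_gene2"] = missing * len1 + seq2
        else:
            combined[taxon] = seq1 + seq2
    return combined

# Toy data (hypothetical): sp_X shows conflict between the two genes.
gene1 = {"sp_A": "ACGT", "sp_B": "ACGA", "sp_X": "TCGA"}
gene2 = {"sp_A": "GGC", "sp_B": "GGT", "sp_X": "AGT"}

matrix = duplicate_taxon(gene1, gene2, conflicting={"sp_X"})
for name, sequence in matrix.items():
    print(f"{name:12s}{sequence}")
```

Each duplicated leaf is complete for one gene and entirely missing for the other, so the missing data are concentrated in the duplicated taxa rather than scattered through the matrix.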

This can be illustrated by a recent example from the genus Erica (heathers), from Mugrabi de Kuppler et al. (2015). The authors were interested in whether the observed gene tree conflict in Erica lusitanica could be the result of hybridisation between morphologically dissimilar species, as has previously been suggested.

They collected sequence data for a number of plastid regions as well as the nuclear ribosomal ITS region. The observed conflict was between the plastid (chloroplast) and nuclear sequences. They note:

A targeted supermatrix strategy was employed, whereby more variable ITS and trnL-trnF spacer sequences were obtained for most samples, and the other, mostly less variable chloroplast markers were added for selected taxa in order to improve resolution of deeper nodes in the chloroplast tree.

Where gene tree conflict was identified, the taxa with conflicting phylogenetic signals were duplicated in a combined matrix following the approach of Pirie et al. (2008, 2009) in order to infer a single multi-labelled "taxon duplication" tree. [This occurred for only one species. Thus, one leaf label for E. lusitanica has the data only for the chloroplast sequences, and the other leaf has the data only for the nuclear sequence.]

The figure shows the result of the coalescent BEAST analysis of the multi-labeled data, with E. lusitanica appearing twice in the MUL-tree. Inset is the resulting single-labeled network, with E. lusitanica appearing once, as a reticulation.

This is an interesting application of MUL-trees. However, there are two issues that I wish to highlight about the procedure.

First, the reticulation as shown in the example is not actually time-consistent, given that the horizontal axis of the MUL-tree is scaled to time. This could, for example, be resolved by having "E. lusitanica CP" attached to a ghost lineage.
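The time-consistency requirement can be stated as a simple check: the two lineages joined by a reticulation must coexist at some point in time. The following Python sketch assumes that each branch is given as a pair (start age, end age), measured as time before the present. The ages are hypothetical and are not read from the figure.

```python
def shared_window(branch_a, branch_b):
    """Return the (older, younger) age window in which two branches coexist,
    or None if they never overlap. Each branch is (start_age, end_age),
    with ages measured before the present, so start_age >= end_age."""
    older = min(branch_a[0], branch_b[0])
    younger = max(branch_a[1], branch_b[1])
    return (older, younger) if older >= younger else None

# Hypothetical ages: the donor branch ends before the recipient branch begins.
donor = (10.0, 6.0)
recipient = (4.0, 0.0)
print(shared_window(donor, recipient))  # None, so not time-consistent

# A ghost lineage extends the donor forward in time until it overlaps.
ghost_donor = (10.0, 3.0)
print(shared_window(ghost_donor, recipient))  # (4.0, 3.0)
```

In this toy case the ghost lineage supplies a window in which the reticulation could have occurred.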

Second, the data matrix from which the MUL-tree is created will, by construction, have a non-random distribution of missing data: each duplicated leaf lacks all of the data for one of the conflicting genes. Non-randomly distributed missing data have been reported to adversely affect likelihood analyses (Simmons 2012). In the example, the problem is exacerbated by further non-randomness in the acquisition of the plastid sequences, which were added only for selected taxa. So, if this form of MUL-tree analysis is to be pursued, then this potential limitation should be investigated.