Tuesday, February 7, 2017

Networks, trees and sequence polymorphisms

One of the more obvious bits of evidence that an organismal history may not be entirely tree-like is the presence of sequence polymorphisms. For example, intra-individual site polymorphisms in ITS sequences create considerable conflict in a dataset, if we try to construct a tree-like phylogeny.

This means that people have adopted a range of strategies to try to get a nice neat tree out of their data. This topic is briefly reviewed in this recent paper:

Agnes Scheunert and Günther Heubl (2017) Against all odds: reconstructing the evolutionary history of Scrophularia (Scrophulariaceae) despite high levels of incongruence and reticulate evolution. Organisms Diversity and Evolution in press.

The authors discuss the following strategies, for which they also provide a few literature references.

1. Delete the offending taxa

Pruning the offending taxa is among the most-used tactics. This deletes part of the phylogeny, of course.

2. Delete the polymorphisms

Excluding the polymorphic alignment positions is probably the most common tactic. Similar strategies include the replacement of the polymorphisms with either a missing data code or the most common nucleotide at that position. All of these ideas resolve the polymorphisms in favor of the strongest phylogenetic signal, and thus sweep the conflicting signals under the carpet.
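To make the three variants concrete, here is a minimal Python sketch. It is not taken from the paper: the function names are hypothetical, the alignment is assumed to be a dictionary of equal-length sequences, and intra-individual polymorphisms are assumed to be scored as IUPAC ambiguity codes (with N treated as missing data rather than as a polymorphism).

```python
from collections import Counter

UNAMBIGUOUS = set("ACGT")
GAP_OR_MISSING = set("-?N")  # assumption: N means "missing", not "polymorphic"


def is_polymorphic(base):
    """True for IUPAC ambiguity codes such as R, Y or K."""
    base = base.upper()
    return base not in UNAMBIGUOUS and base not in GAP_OR_MISSING


def drop_polymorphic_columns(alignment):
    """Exclude every alignment position that contains a polymorphism."""
    names = list(alignment)
    length = len(alignment[names[0]])
    keep = [i for i in range(length)
            if not any(is_polymorphic(alignment[n][i]) for n in names)]
    return {n: "".join(alignment[n][i] for i in keep) for n in names}


def mask_polymorphisms(alignment, missing="?"):
    """Replace each polymorphism with a missing-data code."""
    return {n: "".join(missing if is_polymorphic(b) else b for b in seq)
            for n, seq in alignment.items()}


def replace_with_majority(alignment):
    """Replace each polymorphism with the most common nucleotide in its column."""
    names = list(alignment)
    length = len(alignment[names[0]])
    majority = []
    for i in range(length):
        counts = Counter(alignment[n][i].upper() for n in names
                         if alignment[n][i].upper() in UNAMBIGUOUS)
        # Fall back to missing data if the column has no unambiguous base.
        majority.append(counts.most_common(1)[0][0] if counts else "?")
    return {n: "".join(majority[i] if is_polymorphic(b) else b
                       for i, b in enumerate(alignment[n]))
            for n in names}


example = {"t1": "ACGT", "t2": "ACRT", "t3": "ATGT"}
dropped = drop_polymorphic_columns(example)
masked = mask_polymorphisms(example)
majority_rule = replace_with_majority(example)
```

In the toy alignment, the R in t2 is either dropped along with its whole column, masked as "?", or overwritten with G. In all three cases the information that t2 carries both A and G at that site is lost.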

3. Select single gene copies

The polymorphisms become apparent because there are multiple copies of the gene(s) concerned, and therefore selecting a single copy removes the polymorphisms. This can be done by cloning the gene (at the time of data collection), or by statistical haplotype phasing methods (during the data analysis). This also sweeps the conflicting signals under the carpet.

4. Code the polymorphisms

As a preferred alternative, rather than discarding or substituting the polymorphisms, we could include them as phylogenetically informative characters. This would allow the construction of a phylogenetic network, as well as a tree-like history.

One possibility, suggested by Fuertes Aguilar and Nieto Feliner (2003), concentrates on Additive Polymorphic Sites (APS). A sequence site is an APS when each of the nucleotides involved in the polymorphism can also be found separately at the same site in at least one other accession. Other intra-individual polymorphisms are ignored. This approach has been used to detect hybrids, for example.
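The APS definition translates quite directly into a check on alignment columns. The following is a minimal sketch under my own assumptions, not the implementation of Fuertes Aguilar and Nieto Feliner: the function name is hypothetical, polymorphisms are assumed to be scored as IUPAC ambiguity codes, and "found separately" is read as "present as an unambiguous base in some other accession".

```python
# Standard IUPAC ambiguity codes and the nucleotides they stand for.
IUPAC = {
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
}


def additive_polymorphic_sites(alignment):
    """Return {accession: [0-based columns that qualify as APS]}."""
    result = {}
    for name, seq in alignment.items():
        hits = []
        for i, code in enumerate(seq):
            bases = IUPAC.get(code.upper())
            if bases is None:
                continue  # not a polymorphism (plain base, gap, N, ...)
            # Character states shown by all *other* accessions at this site.
            others = {s[i].upper() for n, s in alignment.items() if n != name}
            # APS: every nucleotide of the polymorphism occurs on its own elsewhere.
            if all(b in others for b in bases):
                hits.append(i)
        result[name] = hits
    return result


example = {
    "parentA": "ACGA",
    "parentB": "ATGG",
    "hybrid":  "AYGR",
    "other":   "AYGK",
}
aps = additive_polymorphic_sites(example)
```

In this toy alignment, both polymorphisms in "hybrid" are additive, because C and T (for Y) and A and G (for R) each occur separately in the two putative parents. The K in "other" is ignored, because no other accession shows a plain T at that site.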

An alternative, as used by Scheunert and Heubl to study reticulate evolution in their paper, uses 2ISP (Intra-Individual Site Polymorphisms). All IUPAC codes, including those marking polymorphic sites, are treated as unique characters, by recoding the complete alignment as a standard matrix, which is then analyzed using a multistate analysis option for categorical data. The authors actually use the ad hoc maximum-likelihood implementation from Potts et al. (2014), with additional adaptation of a method for Bayesian inference based on Grimm et al. (2007).
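The recoding step itself is simple to illustrate. The sketch below is only an assumption about what such a recoding could look like, not the scheme of Potts et al. (2014) or of the paper: the function name and the choice of state symbols are hypothetical, and gaps and N are simply passed through as gap and missing data.

```python
import string

# Hypothetical state symbols: digits first, then lowercase letters.
STATE_SYMBOLS = string.digits + string.ascii_lowercase
# The four nucleotides plus the ten IUPAC ambiguity codes, each a separate state.
IUPAC_ORDER = "ACGTRYSWKMBDHV"
CODE_TO_STATE = {code: STATE_SYMBOLS[i] for i, code in enumerate(IUPAC_ORDER)}


def recode_as_standard(alignment):
    """Recode a nucleotide alignment so that every IUPAC code is its own state."""
    recoded = {}
    for name, seq in alignment.items():
        states = []
        for ch in seq.upper():
            if ch in "-?":
                states.append(ch)       # keep gap and missing-data symbols
            elif ch == "N":
                states.append("?")      # assumption: N is missing data
            else:
                # Raises KeyError for symbols outside IUPAC_ORDER (e.g. U).
                states.append(CODE_TO_STATE[ch])
        recoded[name] = "".join(states)
    return recoded


example = {"t1": "ACGT-", "t2": "AYGRN"}
standard = recode_as_standard(example)
```

Here the Y and R in t2 become states 5 and 4, which a multistate model for categorical data would treat as distinct from (and no more closely related to) the states for C, T, A and G than any other pair of states.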