Wednesday, July 25, 2012

Are mathematical constraints biologically realistic?

Mathematicians and other computational scientists have produced their own definitions of phylogenetic networks, independently of biologists. For the evolutionary type of phylogenetic network, the definition usually looks something like this:

A phylogenetic network is a rooted, directed graph (consisting of nodes, plus edges that connect each parent node to its child nodes) such that:
(1) There is exactly one node having indegree 0, the root
- all other nodes have indegree 1 or 2
(2) All nodes with indegree 1 have outdegree 2 or 0
- nodes with outdegree 2 are tree nodes
- nodes with outdegree 0 are leaves, distinctly labelled
(3) The root has outdegree 2, and
(4) Nodes with indegree 2 have outdegree 1, called reticulation nodes.

An obvious question of interest is how (or whether?) this definition connects to what biologists have in mind when they use the term "phylogenetic network". Clearly, this definition places considerable restrictions on the networks that will be inferred by any mathematical algorithm, which in turn affects their use as models for biological inference.

The first thing to note is that unrooted networks are excluded, because the graph is directed. Thus, many (if not most) of the phylogenetic networks that have appeared in the literature are excluded from the discussion. Furthermore, a tree is considered to have all internal nodes with indegree 1 and outdegree 2 (i.e. no reticulation nodes), and we know this to be biologically unrealistic, in general. (Otherwise, this blog would be redundant!)

Biologically, the other parts of the definition imply:
One node of indegree 0- the network has no previous ancestry that is to be inferred
Nodes with outdegree 0 are labelled- observed (contemporary) taxa occur only at the leaves
All nodes with indegree 2 have outdegree 1- reticulation and speciation cannot occur simultaneously
No nodes with indegree >2- reticulation events cannot involve input from more than 2 parents simultaneously
No nodes with outdegree >2- speciation involves only two children at a time.

These do not appear to be onerous biological restrictions. Indeed, he first two have been standard characteristics of tree-building for several decades. The other three are also logical extensions of the restrictions that have previously been placed on trees. However, phylogenetic history is unlikely to have been as simple as implied by these features. Thus, biologists will need to keep a careful eye on whether the simplifications are affecting the networks inferred for their particular group of organisms.

Other restrictions

In addition to the restrictions created by the definition, other topological restrictions have been used to make the mathematical inference algorithms computationally tractable. Thus, only certain sub-families of possible networks are considered by most of the computer programs. These include:

tree-child network, tree-sibling network

level-k network, galled tree

binary input trees for hybridization and HGT networks

binary characters for recombination networks.

These restrictions may be unrelated to each other; so, we can consider them separately.

Tree-child, Tree-sibling

In a tree-child network, every internal node has at least one child node that is a tree node- ie. a reticulation event cannot be followed immediately by another reticulation event
In a tree-sibling network, every reticulation node has at least one sibling node that is a tree node- ie. a parent cannot be directly involved in two separate reticulation events
Note that every tree-child network will also be a tree-sibling network, but not vice versa.

Algorithmically, these two restrictions may involve the addition of extra tree nodes to an inferred network, in order to satisfy the restrictions. Biologically, the question is whether real networks are this simple. Arenas et al. (2008) simulated data under the coalescent with recombination, and found that even at small recombination rates most of the networks produced were already more complex than a tree-sibling network. On the other hand, Arenas et al. (2010) analyzed real population-level data from the PopSet and Polymorphix databases using the TCS program, and found that >98% the resulting networks could be characterized as tree-sibling. So, there is cause for optimism, in the sense that the "optimum" networks algorithmically are not necessarily complex, at least for closely related organisms (ie. within species).

Level-k network, Galled tree

A network has level k if each tangled part of the network (ie. each biconnected component) contains at most k reticulation nodes (see this previous post). This is a generalization of the older notion of galled trees, in which reticulation cycles do not overlap (ie. do not share edges or nodes), as galled trees are level-1 networks. Level-k networks can also be seen as a generalization of networks with k reticulation nodes, although there may be a difference between a network with minimum level and one with a minimum number of reticulations.

Algorithmically, these restrictions have been used to guide the search for (or choice of) the "optimal" inferred network. Biologically, these notions do not seem to have been investigated, but basically they restrict how complex inferred reticulation histories can be. In particular, they restrict the complexity of any given subset of each network. It has been noted that optimizing k can easily lead to networks that look biologically unrealistic (Huson et al. 2011).

Binary input

The requirements for binary input trees and binary characters are restrictions that have been applied in the past, because they greatly reduce the complexity of the input to the network algorithms, but they are now being relaxed. Effectively, the restrictions are to fully dichotomous trees and SNP characters. These are not unusual restrictions in evolutionary analysis, but they are obviously unrealistic.

As I noted in an earlier blog post, non-binary data often reflect uncertainty in the input, rather than a strictly bifurcating history, and this is not taken into account in the network inference if the input is restricted to a binary state. In particular, it may be unnecessarily hard to construct a network (because not all of the data signals relate to reticulation), and the resulting networks may have far too many reticulation nodes.

Conclusion

It is still an open question about the extent to which we can use these topologically restricted families of mathematical networks as a basis for reconstructing biological histories. Clearly, much more work is needed to understand the connections between the mathematical restrictions and the requirements of biological modelling.