09 April 2010

Yesterday I posted an example of how graph-coarsening algorithms can be used to make the high-level patterns of a phylogeny more immediately visible. That example used a phylogenetic hypothesis about placental mammals. The hypothesis involves a lot of nodes (i.e., taxonomic units), but not much branching complexity: each node has only a single parent.

So here's an example where nodes may have multiple parents. This is a phylogeny of Biota, i.e., Life:

Eukaryotes (organisms with cellular nuclei, i.e., plants [Embryophyta, etc.], animals [Metazoa], fungi [Eumycota], and "protists") have been highlighted in yellow. Some nodes have multiple parents due to one of two phenomena:

Endosymbiosis. Some organisms have evolved into organelles within the cells of other organisms, notably mitochondria (descended from proteobacteria related to those that cause typhus) and plastids (photosynthesizing organelles in plants, descended from cyanobacteria). In these cases, the organelle often retains its own DNA, although much of it may have leapt over to the "host's" nuclear DNA. In some cases, all of it may have leapt over (as with mitochondrial descendants like mitosomes).

Both lateral transfer and endosymbiosis are considered valid forms of descent in this hypothesis.

Here's the graph coarsened one step:

We can see the general patterns more clearly here. Eukaryotes share a relationship with archaeans, but also have descent from proteobacteria (via mitochondria). One clade of eukaryotes (Plastida) is also descended from a basal form of cyanobacteria (via plastids). A few cases of lateral transfer are visible, but not in detail. We can also see that there is a lot of bacterial diversity, although the details are not spelled out.

Here's the graph coarsened another step:

The endosymbiosis is made even clearer, although most other relationships are obscured.

Disclaimer: This hypothesis was cobbled together from a number of sources and does not represent any rigorous research on my part. I suspect parts of it are outdated, but this area of the Tree of Life is not my bailiwick. I just wanted to throw something together for a demonstration.

08 April 2010

A new stem-human species, Australopithecus sediba Berger et al. 2010, has just been announced. The paper's supplementary information contains the results of a cladistic analysis of stem-humans. For fun, I thought I'd plug the most parsimonious tree into the in-development version of Names on Nodes:

Stw 53 and SK 847 are specimens that are not readily assignable to named species. (SK 847 might be Homo ergaster.) Our own species, sapiens, is presumably descended from the SK 847-erectus node.

The analysis finds sediba as a sister taxon to Homo (which includes habilis, rudolfensis, SK 847, and erectus), and possibly ancestral to it. Which raises the question: why not place it in Homo? If this hypothesis is correct, it shares more ancestry with the type species of Homo (sapiens) than it does with the type species of Australopithecus (africanus). Even Stw 53, which is here placed outside the sediba-Homo clade, has been attributed to Homo in the past.

Although I've been primarily reining in features on the next version of Names on Nodes, there was a new feature I couldn't resist adding. I think it's coming along pretty well.

A common problem with working with phylogenies is that many of them are gigantic, far too big to view all at once. As an example, consider Figure 1 from Beck et al. (2006). It models a hypothesis about placental mammal phylogeny, at an arbitrary resolution ("family-level"). Here's how the current version of Names on Nodes renders it:

When you look at it "zoomed out", it's almost impossible to know what's going on. When you look at it full size, you can see various local areas, but you lose a sense of what's going on with the larger image. Note that I've highlighted our own species' twig on the tree (Hominidae, the great ape clade) in yellow.

Earlier I used the term "resolution" to refer to the size of the graph's nodes. We can refer to a graph with very small nodes (e.g., each node representing an individual organism) as being "fine" and a graph with very large nodes (e.g., "class-level") as being "coarse". Thinking about the problem from this angle, I had the idea to create a control for coarsening or refining the viewed graph.

I implemented a simple graph-coarsening algorithm*, and then created an algorithm for picking the best name for the new, coarser graph's nodes. And here is the phylogeny at near-maximum coarseness:

This is placental phylogeny boiled down to its basics: rodents, laurasiatheres, and a bunch of other junk (including us). The node labelled "Placentalia*" contains the placental ancestor but not all descendants; it lacks an unnamed clade that includes most non-afrothere placentals. The unnamed greenish node includes all members of that unnamed clade except for rodents and laurasiatheres. (This happens to include Hominidae, which is why it has that greenish color.)

Let's refine it one step:

We're starting to get a better idea of the hypothesis. Finer:

Now we can see the basal split between afrotheres and other placentals, as well as developing complexity in Rodentia and Laurasiatheria. Finer:

Getting a little bit on the big side, now, but we can see more details. There are a lot of unnamed clades within Hystricoidea and Chiroptera—we can see that those clades are diverse, although we can't see details. Finer:

This has about 2/5 as many nodes as the base graph. It's a bit large, but still much easier to view than the base graph. Many important details are visible (e.g., the platyrrhine-catarrhine split), while others are just suggested (e.g., lots of diversity in Caviomorpha).

Obviously this works best if lots of clades have been named. I think it'll be useful for boiling a phylogeny down to an appropriate level: coarser for quick overviews, finer for in-depth discussion.

* Basic summary of the coarsening algorithm:

1. Look through all nodes that have children, and find the ones whose children are all terminal (sinks).

2. Merge each of those nodes with their children to create a "supernode".

3. Merge all overlapping supernodes. (This is important for graphs where nodes may have multiple ancestors, although it doesn't come into play in this example.)

4. Remove the supernodes from the graph and repeat from step 1. Keep going until no nodes are left.

5. Add the supernodes to a new graph. A supernode is ancestral to another supernode if any of its subnodes are ancestral to any of the other supernode's subnodes.
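
The summary above can be sketched in Python. This is a rough illustration, not the actual Names on Nodes code: the function names (coarsen, merge_overlapping) and the dict-of-sets graph representation are assumptions made for the example. Two further assumptions are needed to make it run. First, any isolated node left over (such as a root whose children have all been merged away) becomes its own supernode, since the summary doesn't say what happens to it. Second, the coarse graph records only direct parent-child links that cross between supernodes.

```python
def merge_overlapping(groups):
    """Merge any groups that share a node, returning disjoint sets."""
    merged = []
    for group in groups:
        group = set(group)
        kept = []
        for other in merged:
            if other & group:
                group |= other      # absorb every overlapping group
            else:
                kept.append(other)
        kept.append(group)
        merged = kept
    return merged


def coarsen(children):
    """Coarsen a DAG by one level.

    `children` maps each node to the set of its child nodes (hypothetical
    representation). Returns a dict in the same shape whose nodes are
    frozensets of the original nodes (the "supernodes").
    """
    # Collect every node, including those that only appear as children.
    remaining = set(children)
    for kids in children.values():
        remaining |= set(kids)

    def live_children(node):
        return {c for c in children.get(node, ()) if c in remaining}

    supernodes = []
    while remaining:
        sinks = {n for n in remaining if not live_children(n)}
        # Find nodes that have children, all of which are sinks,
        # and merge each with its children.
        groups = [{n} | live_children(n) for n in remaining
                  if live_children(n) and live_children(n) <= sinks]
        if not groups:
            # Assumption: leftover isolated nodes become singleton supernodes.
            groups = [{n} for n in remaining]
        # Merge overlapping supernodes (matters with multiple parents).
        groups = merge_overlapping(groups)
        # Remove the supernodes from the working graph and repeat.
        for group in groups:
            remaining -= group
            supernodes.append(frozenset(group))

    # Build the coarser graph from links that cross supernodes.
    owner = {node: s for s in supernodes for node in s}
    coarse = {s: set() for s in supernodes}
    for parent, kids in children.items():
        for kid in kids:
            if owner[parent] != owner[kid]:
                coarse[owner[parent]].add(owner[kid])
    return coarse


# A toy tree: one root, two named clades, two tips each.
tree = {"root": {"A", "B"}, "A": {"a1", "a2"}, "B": {"b1", "b2"}}
coarse = coarsen(tree)       # three supernodes
coarser = coarsen(coarse)    # feeding the result back in coarsens it again
```

Because the output has the same shape as the input, calling coarsen repeatedly gives successively coarser views, which is roughly how a coarsen/refine control could step through them.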

05 April 2010

In some ways this works out to be a bit like a query language. You can use it to set up data constructs, and then search them for groups of interest. For example, suppose you wanted a list of all stem-humans from Kenya. Assuming that your dataset included 1) a taxonomic unit called Homo sapiens, 2) a group called extant for all extant taxonomic units, and 3) a group called Kenya for all Kenyan taxonomic units, that query might look like this:

MathML is great for being flexible and extensible enough to cover concepts like this. But ... it's also really verbose. This is fine for my purposes so far, but it may be cumbersome for other purposes. So I've been playing around with a more succinct way to write these expressions. Today I tossed up some rough ideas here:

I've just updated the Names on Nodes website based on these revisions to the project, most notably the MathML Definitions document. Most of the changes have actually been removals: no more mentions of rank-based taxonomy (which may be covered in future versions but not in this one), qualified names as taxonomic identifiers (no longer a necessary feature), etc. So if you didn't read it before because it was too long and dense ... well, it's still pretty long and dense, actually. But less so!