Rants, raves (and occasionally considered opinions) on phyloinformatics, taxonomy, and biodiversity informatics. For more ranty and less considered opinions, see my Twitter feed. ISSN 2051-8188

Wednesday, March 18, 2009

Busy day yesterday, giving two talks, one at The Natural History Museum, one at the British Library. Slides for the NHM talk are below. Karen James pointed out the irony that a talk where I gave the NHM a hard time for being backward about embracing digitisation can't be viewed on most PCs at the NHM because SlideShare requires a recent version of Flash (which users can't install without IT's permission), and the downloaded presentation won't open because the NHM uses an older version of MS Office. So much for my attempts to share the slides. There will also be a video available at some point.

The second presentation was in the British Library's "Talk Science" series; for some background, see the forum on Nature Network. There will be a podcast available of this presentation. In her introduction to my talk, Sarah Kemmitt quoted from a recent paper by Antonio G. Valdecasas ([JACC]1175-5326:1820@41), where he described Vagabundia sci:

Vagabundia comes from the Spanish word 'vagabundo' that means 'wanderer'. It is a feminine substantive; sci refers to Science Citation Index. We pointed out some time ago (Valdecasas et al. 2000) that the popularity of the Science Citation Index (SCI) as a measure of ‘good’ science has been damaging to basic taxonomic work. Despite statements to the contrary that SCI is not adequate to evaluate taxonomic production (Krell 2000), it is used routinely to evaluate taxonomists and prioritize research grant proposals. As with everything in life, SCI had a beginning and will have an end. Before it becomes history, I dedicate this species to this sociological tool that has done more harm than good to taxonomic work and the basic study of biodiversity. Young biologists avoid the 'taxonomic trap' or becoming taxonomic specialists (Agnarsson & Kuntner 2007) due to the low citation rate of strictly discovery-oriented and interpretative taxonomic publications. Lack of recognition of the value of these publications, makes it difficult for authors to obtain grants or stable professional positions.

This is all terribly incomplete and crude, but it gives a sense of where this is going. The plan is to import in bulk the trees and the mappings (from, say, TBMap), as well as the names themselves, and associated literature (including the TreeBASE studies) and then the trees will be embedded in richer data about the taxa.

It's Friday, so time for some random, half-baked ideas. Imagine that we have a database of evolutionary trees, and these overlap for a set of taxa that we are interested in. How do we summarise these trees? One approach is to make a supertree. It would be useful to display the subtrees that went into making this supertree, if only to give an idea of how much they agree with the supertree. How to do this?

One idea I've been toying with is inspired by Photosynth, from Microsoft labs (it only runs on Windows, sigh). Photosynth takes a series of pictures taken from different angles and stitches them together into a 3D model of the object being photographed:

One thing I like about Photosynth is that you can see the original pictures, so when you move around the view you get a sense of how they have contributed to the overall view. This is easier to see than explain:

Now, imagine if we did this with trees. We could create a supertree as a summary of the individual trees, then have the original trees layered on top. Perhaps we could do this in 3D, so that each individual tree is in a plane that is tilted with respect to the supertree in proportion to how much it disagrees with the supertree:

I think this could be a fun way to explore a set of trees, and it would give one the ability to quickly grasp how well the source trees agreed with the supertree. Note that I'm not (necessarily) arguing that the supertree represents the true phylogeny. Think of it as a convenient way to summarise the individual trees.
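To make "tilted in proportion to how much it disagrees" a little more concrete, here is a rough sketch in Python. It assumes each tree is stored as a set of rooted clusters (each cluster being the set of taxa below an internal node), and it measures disagreement with a normalised Robinson-Foulds-style count of unshared clusters after pruning the supertree to the source tree's taxa. The function names, the choice of distance, and the 90 degree maximum tilt are all my own assumptions for illustration, not part of any existing viewer.

```python
def restrict(clusters, taxa):
    """Prune clusters to the given taxa, dropping trivial ones
    (single taxa, or clusters containing every taxon)."""
    restricted = set()
    for cluster in clusters:
        kept = frozenset(cluster) & taxa
        if 1 < len(kept) < len(taxa):
            restricted.add(kept)
    return restricted


def disagreement(source_clusters, supertree_clusters, taxa):
    """Fraction of clusters not shared by the source tree and the
    supertree pruned to the same taxa (0.0 = identical, 1.0 = no overlap)."""
    taxa = frozenset(taxa)
    source = restrict(source_clusters, taxa)
    summary = restrict(supertree_clusters, taxa)
    total = len(source) + len(summary)
    if total == 0:
        return 0.0
    return len(source ^ summary) / total


def tilt_degrees(d, max_tilt=90.0):
    """Map a disagreement score in [0, 1] to a tilt angle (hypothetical scale)."""
    return d * max_tilt


# Hypothetical supertree on taxa A-E and two source trees on A-D.
supertree = [{"A", "B"}, {"A", "B", "C"}, {"D", "E"}]
agreeing_tree = [{"A", "B"}, {"A", "B", "C"}]
conflicting_tree = [{"A", "C"}, {"A", "B", "C"}]
taxa = {"A", "B", "C", "D"}

print(tilt_degrees(disagreement(agreeing_tree, supertree, taxa)))     # 0.0
print(tilt_degrees(disagreement(conflicting_tree, supertree, taxa)))  # 45.0
```

A tree that matches the pruned supertree lies flat, and one that shares no clusters with it stands at the maximum tilt. Any other tree-to-tree distance could be swapped in for the cluster count.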

Part of what attracts me to this approach is that I think most, if not all, 3D phylogeny viewers (such as Paloverde and the Wellcome tree of life) don't make any real use of 3D, beyond the rather gimmicky (and I find ultimately confusing) ability to fly around a 2D tree. Is there a better way to exploit the possibilities of 3D?

A hierarchical query can be visualised as a range query. For example, the diagram below shows a classification where the descendants of Node A correspond to the range 1-8 (this is a simplification of visitation numbers, see my earlier post, and also Chen et al. doi:10.1186/1471-2148-8-90). The three trees can be represented as the ranges 3-8, 6-7, and 2-4, respectively. To find trees for the taxon corresponding to Node A we look for trees whose range intersects 1-8; to find trees corresponding to Node B we look for trees whose range intersects 1-5.
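The intersection test itself is tiny. Here is a minimal Python sketch using the ranges from the example above; the tree names (tree1 to tree3) and the function names are made up for illustration, and a real database would do this with an indexed query rather than a linear scan.

```python
from typing import NamedTuple


class Span(NamedTuple):
    left: int
    right: int


def intersects(a, b):
    """Two closed ranges overlap if each starts before the other ends."""
    return a.left <= b.right and b.left <= a.right


def trees_for_taxon(taxon_span, tree_spans):
    """Names of all trees whose range intersects the taxon's range."""
    return sorted(name for name, span in tree_spans.items()
                  if intersects(span, taxon_span))


# The three trees from the example (names are hypothetical).
trees = {"tree1": Span(3, 8), "tree2": Span(6, 7), "tree3": Span(2, 4)}

print(trees_for_taxon(Span(1, 8), trees))  # Node A: ['tree1', 'tree2', 'tree3']
print(trees_for_taxon(Span(1, 5), trees))  # Node B: ['tree1', 'tree3']
```

Note that the tree spanning 3-8 is returned for Node B even though it extends beyond 1-5: intersection only asks whether the tree includes some taxon in the group, not whether it is contained in it.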

This approach retrieves a list of all trees that include a given taxon, but potentially this list could be very large (for example, a query for all plant trees could return thousands of trees). So the question becomes: how do we order these trees? Some ideas:

order by the number of taxa in the tree.

order by the size of the range ("span") of the tree.

order by set inclusion.

Ordering by size seems attractive, but will favour trees with more taxa over those with a greater taxonomic spread (which is favoured by ordering by the span of the tree). I used span in my challenge entry to display papers with related taxa.
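To see how the two orderings can differ, here is a small Python sketch. It assumes each tree is stored as the list of visitation numbers of its leaves, and that "span" means the width of the range from the smallest to the largest of those numbers; the tree names and numbers are invented.

```python
def taxon_count(leaf_numbers):
    """Number of distinct taxa in the tree."""
    return len(set(leaf_numbers))


def span(leaf_numbers):
    """Width of the range covered by the tree's taxa in the classification."""
    return max(leaf_numbers) - min(leaf_numbers) + 1


# Hypothetical trees: 'dense' has many closely related taxa,
# 'broad' has few taxa spread widely across the classification.
trees = {
    "dense": [3, 4, 5, 6, 7],
    "broad": [1, 12, 20],
    "small": [10, 11],
}

by_size = sorted(trees, key=lambda name: taxon_count(trees[name]), reverse=True)
by_span = sorted(trees, key=lambda name: span(trees[name]), reverse=True)

print(by_size)  # ['dense', 'broad', 'small']
print(by_span)  # ['broad', 'dense', 'small']
```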

What I have in mind for ordering by set inclusion is constructing a directed graph where each node represents a tree, and a pair of nodes (x, y) is connected by an edge x → y if the taxa in tree y are nested inside the taxa in tree x. We could also introduce some additional nodes corresponding to nodes in the classification. If we then topologically sort the graph we have a linear order for the trees. Given that this order could be pre-computed independently of any queries (in much the same way that PageRank is), it could make for faster queries.
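A rough Python sketch of the set inclusion idea follows, using the standard library's graphlib module (Python 3.9 or later). It assumes each tree is reduced to its set of taxa, and it leaves out the extra nodes for the classification. The function name and the example trees are hypothetical.

```python
from graphlib import TopologicalSorter


def inclusion_order(tree_taxa):
    """Order trees so that any tree whose taxa contain another tree's taxa
    comes before it. tree_taxa maps a tree name to its set of taxa."""
    sorter = TopologicalSorter()
    for name in tree_taxa:
        sorter.add(name)  # make sure trees with no edges still appear
    for x, taxa_x in tree_taxa.items():
        for y, taxa_y in tree_taxa.items():
            # Edge x -> y when the taxa of y are strictly nested inside x.
            if x != y and taxa_y < taxa_x:
                sorter.add(y, x)  # x must be ordered before y
    return list(sorter.static_order())


# Hypothetical trees, each reduced to its set of taxa.
tree_taxa = {
    "primates": {"Homo", "Pan"},
    "mammals": {"Homo", "Pan", "Mus", "Rattus"},
    "rodents": {"Mus", "Rattus"},
}

order = inclusion_order(tree_taxa)
print(order)  # 'mammals' comes before both 'primates' and 'rodents'
```

Using a strict subset test means two trees with identical taxa get no edge between them, so the graph stays acyclic. The comparison of every pair of trees is quadratic, which is another reason to pre-compute the order rather than build it at query time.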

It would be useful to explore these and other ordering criteria. Perhaps the best approach would be some measure which combines one or more criteria, in which case we might want to use some form of rank aggregation (see iSpecies clones, and taxonomic intelligence for some links to the relevant literature).
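As one very simple example of rank aggregation, here is a Borda count in Python: each ordering awards points to a tree according to its position, and the points are summed. This is just one of many possible methods, chosen for brevity rather than because the literature linked above recommends it, and the three input rankings are invented.

```python
def borda_aggregate(rankings):
    """Combine several rankings of the same items with a Borda count.
    Each ranking awards n-1 points to its first item, down to 0 for its last.
    Ties in total score are broken alphabetically."""
    scores = {}
    for ranking in rankings:
        n = len(ranking)
        for position, item in enumerate(ranking):
            scores[item] = scores.get(item, 0) + (n - 1 - position)
    return sorted(scores, key=lambda item: (-scores[item], item))


# Hypothetical rankings of three trees under three different criteria.
rankings = [
    ["dense", "broad", "small"],  # by number of taxa
    ["broad", "dense", "small"],  # by span
    ["broad", "dense", "small"],  # by set inclusion
]

print(borda_aggregate(rankings))  # ['broad', 'dense', 'small']
```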