Rants, raves (and occasionally considered opinions) on phyloinformatics, taxonomy, and biodiversity informatics. For more ranty and less considered opinions, see my Twitter feed. ISSN 2051-8188

Tuesday, January 16, 2007

A manifesto

The funding of pPOD mentioned earlier today motivates me to write some notes on what I think "core database technologies for enabling the integration of AToL data" could, or indeed, should be about. Much of what follows I've mentioned elsewhere on the iPhylo blog (for example here and related blogs SemAnt and iSpecies) but it seems useful to bring this together here.

It's not about algorithms

CIPRES seems to me to be basically about putting "go faster" stripes on phylogenetic algorithms. I don't mean this to be as dismissive as it sounds; this is intellectually challenging stuff. It's just that projects such as TreeBASE seem adrift in that environment. See my CIPRES talk for more.

It's about integration

Integrating biodiversity information is a hot topic, and I think this is where most of the easy stuff is. Easy in the sense that much of the intellectual work has been done, most of it elsewhere. We have GUIDs (e.g., DOIs and LSIDs), RDF, triple stores, and query languages (e.g., SPARQL). Provided we avoid getting bogged down in some of the details, I think this stuff is technically pretty straightforward. I've been experimenting with some of these ideas in the context of a project integrating information on ants; see my SemAnt blog for a record of this work. There are also related posts on iSpecies.

It's about new queries

My own view is that much of the work on tree searching in TreeBASE, for example by Jason Wang and collaborators, is interesting but misplaced. I don't get the sense that biologists are really interested in asking the question "find me trees like this". Rather, I think biologists are really interested in questions such as "find me trees that have x more closely related to y than to z", or "find me trees in which group x is/is not monophyletic". I think these are pattern matching queries, or more fundamentally, I think they are all in essence least common ancestor (LCA) queries. Indeed, once stripped of all the rhetoric about bringing classification into the 21st century (and the nonsense about renaming species), the phylocode boils down to named LCA queries.
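To make the LCA point concrete, here is a minimal sketch of how both example questions reduce to LCA lookups. The toy tree, the child-to-parent dictionary, and every function name are my own assumptions for illustration; none of this reflects how TreeBASE stores or queries trees.

```python
# Hypothetical toy tree stored as a child -> parent map (not a TreeBASE schema).
PARENT = {
    "human": "hominini", "chimp": "hominini",
    "hominini": "homininae", "gorilla": "homininae",
    "homininae": "hominidae", "orangutan": "hominidae",
    "hominidae": None,
}

def ancestors(node, parent=PARENT):
    """Return the path from node up to the root, node first."""
    path = []
    while node is not None:
        path.append(node)
        node = parent[node]
    return path

def lca(a, b, parent=PARENT):
    """Least common ancestor: the first node on b's root path that is also on a's."""
    seen = set(ancestors(a, parent))
    for node in ancestors(b, parent):
        if node in seen:
            return node
    raise ValueError("nodes are not in the same tree")

def depth(node, parent=PARENT):
    """Number of edges between node and the root."""
    return len(ancestors(node, parent)) - 1

def closer_to(x, y, z, parent=PARENT):
    """True if x is more closely related to y than to z (their LCA is deeper)."""
    return depth(lca(x, y, parent), parent) > depth(lca(x, z, parent), parent)

def leaves_under(node, parent=PARENT):
    """All leaves that have node on their path to the root."""
    internal = set(parent.values())
    return {n for n in parent
            if n not in internal and node in ancestors(n, parent)}

def is_monophyletic(group, parent=PARENT):
    """A set of leaves is monophyletic if its LCA has no other leaf descendants."""
    group = list(group)
    root = group[0]
    for leaf in group[1:]:
        root = lca(root, leaf, parent)
    return leaves_under(root, parent) == set(group)
```

With this toy tree, `closer_to("human", "chimp", "gorilla")` asks whether the human/chimp LCA sits deeper than the human/gorilla LCA, and `is_monophyletic({"human", "gorilla"})` fails because their LCA also contains chimp. Walking to the root on every call is fine for a sketch; a real database would want a precomputed index.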

What we also need are queries that deal with geography and time. Work on interval queries seems relevant here. Ideally we'd move beyond GIS queries to pattern matching geographically-labelled trees (finally providing tools for cladistic biogeography).
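As a rough illustration of the time half of this, the simplest interval query asks which dated nodes overlap a given time window. The sketch below is a plain linear scan over made-up node ages, not an interval tree or any existing database feature; the `DatedNode` type, the node names, and the ages are all hypothetical.

```python
from typing import NamedTuple

class DatedNode(NamedTuple):
    name: str
    oldest_ma: float    # older bound of the age estimate, in millions of years ago
    youngest_ma: float  # younger bound of the age estimate

# Made-up ages, purely for illustration.
NODES = [
    DatedNode("clade_a", 70.0, 60.0),
    DatedNode("clade_b", 40.0, 30.0),
    DatedNode("clade_c", 66.0, 50.0),
]

def overlapping(nodes, oldest_ma, youngest_ma):
    """Return names of nodes whose age interval overlaps [youngest_ma, oldest_ma]."""
    return [n.name for n in nodes
            if n.youngest_ma <= oldest_ma and n.oldest_ma >= youngest_ma]
```

Two intervals overlap exactly when each one starts before the other ends, which is all the comparison does. An index such as an interval tree would answer the same question without scanning every node.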

It's about branch lengths

Brian O'Meara's comments on this blog reminded me that I'd forgotten about edge (=branch) lengths. Although these are implicit in my discussion of chronograms (see below), as Brian notes:

While systematists may be most interested in relatedness, many biologists will use trees for investigating trait evolution or as ways to control for phylogenetic relatedness (contrasts), and for this they need branch lengths.

Brian also mentions the potential storage issue for Bayesian trees, i.e., storing the results of MCMC runs. If we want to store only topologies, it also seemed to me that there might be some clever ways to store only the difference between successive trees, given that each tree is a perturbation of the previous one (e.g., an NNI). Storing edge lengths complicates this, although they too are related to those in the previous tree. Is there a smart way to store these things, or do we just gzip the tree file and stick it on a server?
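One way the "store only the difference" idea might look, for topologies only: represent each rooted tree as its set of clades, keep the first set in full, and for every later tree record just the clades that disappeared and the clades that appeared. This is a sketch under my own assumptions (nested tuples as trees, hypothetical function names); it ignores edge lengths entirely and is not how any MCMC program writes its output.

```python
def clades(tree):
    """Collect the clades of a nested-tuple tree as a set of frozensets of leaf names."""
    found = set()

    def walk(node):
        if isinstance(node, str):          # a leaf is just its name
            return frozenset([node])
        leaves = frozenset().union(*(walk(child) for child in node))
        found.add(leaves)                  # every internal node defines one clade
        return leaves

    walk(tree)
    return found

def encode(trees):
    """Keep the first tree's clades in full, then only (removed, added) per step."""
    sets = [clades(t) for t in trees]
    deltas = [(prev - cur, cur - prev) for prev, cur in zip(sets, sets[1:])]
    return sets[0], deltas

def decode(first, deltas):
    """Rebuild every clade set from the first one plus the stored deltas."""
    out = [first]
    for removed, added in deltas:
        out.append((out[-1] - removed) | added)
    return out

# Two toy trees that differ by a single NNI: ((a,b),c) becomes (a,(b,c)).
T1 = ((("a", "b"), "c"), ("d", "e"))
T2 = (("a", ("b", "c")), ("d", "e"))
```

For `T1` and `T2` the stored delta is one clade removed (`{a, b}`) and one added (`{b, c}`), however many leaves the trees have. Whether this beats simply gzipping the tree file is an open question; compression may already exploit much of the same redundancy.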

It's about new visualisations

Bill Piel's work on putting phylogenies in Google Earth, and related work by Daniel Janies et al. (coming out soon in Systematic Biology) show the potential of geographic visualisation.

Continuing the theme of visualising phylogenies, one thing which strikes me is the parallel between genome browsers that display annotation "tracks" (such as the UCSC Genome Browser) and illustrations of "chronograms" with geological periods and accompanying data, such as sea levels, isotope levels, etc. In my haste I couldn't find an example with a sea-level track, but I know they exist … In both cases there is a natural co-ordinate system (genome location and time, respectively) going from left to right, and annotations that can be added using the same frame of reference.

Dating phylogenetic trees is currently "hot", but phylogeny databases don't support dated trees.

It's about collaboration

I think there are lots of tools being developed elsewhere, such as Connotea, Flickr, and EditGrid, that can be utilised (or used as sources of inspiration). These provide tools for managing bibliographic data, images, and spreadsheets. Let's not reinvent this stuff. For example, Connotea can be integrated with TreeBASE, Flickr can be used to store images with metadata, and EditGrid can be used to create collaborative data matrices, as well as simple annotations. And, speaking of annotations, blogs seem to provide ideal tools for this. My point here is that developing domain-specific tools for this stuff seems to me to be a huge mistake.

In summary, my own perspective is that one way to tackle this problem is to take advantage of the swarm of community-driven, open API, folksonomy-based tools that are flooding the web.

6 comments:

Interesting post. If anything, I would emphasize the need to store branch lengths more than you do (add an It's about branch lengths section). While systematists may be most interested in relatedness, many biologists will use trees for investigating trait evolution or as ways to control for phylogenetic relatedness (contrasts), and for this they need branch lengths. TreeBase currently forbids inclusion of branch length information in submissions (in bold on the submission page it says "Please do not include branch lengths in the tree file!"). I think there has been a view that if the topology and raw data are provided (as in TreeBase), users could just re-estimate branch lengths using this information, but this ignores the widespread use of fossil and biogeographic information to calibrate trees as well as the increasingly complex models used to turn trees into chronograms (it also assumes that users are comfortable using phylogenetic software). If I remember its description correctly, the next version of TreeBase will continue to ignore branch lengths. Thus, ecologists who want to use trees for independent contrasts, for example, will either have to go somewhere else for dated trees or use one of the various methods for inventing branch lengths (Grafen branch lengths, uniform branch lengths) for topologies pulled from TreeBase.

If this is still the case, I see the need for creating a website that would allow phylogeneticists to submit chronograms or other trees with branch lengths as well as their raw data, skim off the trees, deposit the trees in an accessible database, and then pass the topologies and data to TreeBase. It wouldn't be too hard to do this, though I wonder how much storage would be required (especially with bootstrap or Bayesian analyses resulting in thousands of trees).

Indeed, once stripped of all the rhetoric about bringing classification into the 21st century (and the nonsense about renaming species), the phylocode boils down to named LCA queries.

Eh, eh, eh, eh, eh -- hold on a second.

Firstly, we are not bringing classification into the 21st century, we're making it superfluous. Why waste time on worrying about "how to translate a tree into a classification"? Instead, tie labels to defined places on the tree -- the PhyloCode (about to change its name to the more sensible "International Code for Phylogenetic Nomenclature") is nothing but the body of rules on how to do that.

Secondly, you'll be pleased to hear that the nonsense about renaming species has been dropped at the 2nd ISPN meeting (Yale, last summer); instead the governing of species names will be left to the existing codes. In the cases where the genus part of the binominal will not be a clade name valid under the PhyloCode, it will be ornamented in various ways that are not yet fully worked out (e. g. quotation marks if known to be the name of a paraphyletic group). This is mentioned in the paper cited below, but not yet on the PhyloCode website -- I think the Committee is still debating the details.

It may well be that the PhyloCode is ultimately about LCA queries, but I don't understand how. Could you explain?

Abstract: "The Second Meeting of the International Society for Phylogenetic Nomenclature (ISPN) convened at Yale University in New Haven from June 28 to July 2, 2006. In addition to contributed talks, the conference included symposia on phylogenetic nomenclature of species, phyloinformatics, and implementing phylogenetic nomenclature. Other discussion focused on recent controversial additions to the draft PhyloCode concerning the choice of names for total clades, and the Committee on Phylogenetic Nomenclature (CPN) was encouraged to revisit this issue. A proposal to permit emendation of phylogenetic definitions without CPN approval under certain circumstances was well received, and there was wide support for a proposed mechanism to use Linnaean binomina in the context of phylogenetic nomenclature without extending the PhyloCode to govern species names. The ISPN Council voted to expand the CPN from 9 to 12 members."

(Zoologica Scripta seems to be becoming the unofficial PhyloCode journal. You should probably read more of it.)

OK, I was perhaps being a little sloppy. How about "bringing nomenclature into the 21st century"?

By LCA queries I mean "least common ancestor" queries (or, if you prefer, "most recent common ancestor", it's the same thing). For example, consider this definition of the name Engystomops from Ron et al. (10.1016/j.ympev.2005.11.022):

EditGrid came out of beta on Valentine's Day 2007! On the same day we launched our subscription service for corporate users, but we will keep a free service for personal users, free of charge forever.