Rants, raves (and occasionally considered opinions) on phyloinformatics, taxonomy, and biodiversity informatics. For more ranty and less considered opinions, see my Twitter feed. ISSN 2051-8188

Wednesday, January 18, 2006

Finding good phylogenies using citation relationships

How does a person who is not an expert in a group of organisms find a good phylogeny to use for their work? Think of somebody interested in animal behaviour who needs a tree for their birds or fish. Ignoring the answers "ask a systematist" or (even worse) "become a systematist and build the tree themselves", how do we answer this query?

Google ranks sites using link structure (in essence, pages that many sites link to, especially sites that are themselves widely linked to, score highly). Could we use the same idea for scientific papers? The answer is, of course, that we could, but whether it would generate useful results is an open question. I've been toying with Jon Kleinberg's ideas in "Authoritative sources in a hyperlinked environment". Kleinberg identifies "authorities" and "hubs", which are roughly analogous to highly cited papers and review articles, respectively.

So, the idea is this. For a collection of papers (such as those in TreeBASE, or those being assembled for birds by Katie Davis in my lab), use Google Scholar to extract citation information, build a graph and compute authorities and hubs using Kleinberg's algorithms. Based on a little play with TreeBASE (which I need to finish and write up, sigh), papers with high hub scores tend to be recent reviews, which may be good candidates for a place to start.
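To make the hubs-and-authorities step concrete, here is a minimal sketch of Kleinberg's iterative scoring in plain Python. It assumes the citation data have already been extracted into a dictionary mapping each paper to the papers it cites; the function names (hits, normalise) and the toy paper labels are made up for illustration and are not from TreeBASE or Google Scholar.

```python
from math import sqrt


def normalise(scores):
    """Scale scores so that their squares sum to 1."""
    norm = sqrt(sum(v * v for v in scores.values())) or 1.0
    return {k: v / norm for k, v in scores.items()}


def hits(citations, iterations=50):
    """Compute hub and authority scores for a citation graph.

    citations: dict mapping a paper to the list of papers it cites
    (hypothetical input format). Returns (hub, authority) dicts.
    """
    papers = set(citations)
    for cited in citations.values():
        papers.update(cited)

    hub = {p: 1.0 for p in papers}
    auth = {p: 1.0 for p in papers}

    for _ in range(iterations):
        # Authority score: sum of the hub scores of papers citing this one.
        new_auth = {p: 0.0 for p in papers}
        for citing, cited_list in citations.items():
            for cited in cited_list:
                new_auth[cited] += hub[citing]
        # Hub score: sum of the authority scores of the papers this one cites.
        new_hub = {
            p: sum(new_auth[c] for c in citations.get(p, ()))
            for p in papers
        }
        auth = normalise(new_auth)
        hub = normalise(new_hub)

    return hub, auth


# Toy example with invented labels: a review citing three papers,
# two of which also cite paper "B".
toy = {"review": ["A", "B", "C"], "A": ["B"], "C": ["B"]}
hub, auth = hits(toy)
print(max(hub, key=hub.get), max(auth, key=auth.get))
```

In this toy graph the review gets the top hub score and "B" the top authority score, which matches the rough analogy between hubs and review articles. A fixed number of iterations is used here for simplicity; a convergence check would be a reasonable alternative.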

We could even test this. In the case of Katie's work on bird supertrees, we could compute a measure of fit between input trees and the supertree, and compare that with the score assigned to the paper containing the source tree. If my idea has value, papers that have "good" input trees will also have high scores based on citation structure (e.g., "hubiness", or some other measure).
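One simple way to make that comparison, assuming a rank correlation is an adequate summary, is Spearman's coefficient between the fit of each input tree and the citation score of its source paper. The sketch below is only an illustration: the function names are hypothetical, and the numbers in the example are invented placeholders, not results.

```python
from math import sqrt


def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences of at least 2 items")
    rx, ry = ranks(xs), ranks(ys)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one sequence is constant")
    return cov / sqrt(var_x * var_y)


# Invented placeholder values, one pair per source paper:
# fit of its input tree to the supertree, and its hub score.
tree_fit = [0.91, 0.40, 0.75, 0.62]
hub_score = [0.55, 0.10, 0.48, 0.20]
print(spearman(tree_fit, hub_score))
```

A coefficient near 1 would be consistent with the idea that citation structure tracks tree quality; a value near 0 would suggest it does not.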

"we could compute a measure of fit between input trees and the supertree"

If you want to do this, my program stsupport can now calculate something like this. Based on the concept of support that Mark, Davide, Ian and I published in Syst. Biol. recently, it now reverses the calculation, telling you how many input tree clades support a supertree branch, and how many conflict. It's not the version on my website yet, though. This might be something like what you need.