Rants, raves (and occasionally considered opinions) on phyloinformatics, taxonomy, and biodiversity informatics. For more ranty and less considered opinions, see my Twitter feed. ISSN 2051-8188

Wednesday, June 19, 2013

One of the goals of BioNames is to be more than simply another taxonomic database. In particular, I'm interested in the idea of having a platform for viewing taxonomic publications. One way to think about this is to consider the experience of viewing Wikipedia. For any given page in Wikipedia there will be links to other, related content in Wikipedia. Reading an article about a city, you can go and read about the country the city occurs in. Reading about a battle, you can discover more about the generals who fought it. The ability to discover all this interconnected information in one place is compelling.

I'd like something similar for taxonomy. Given that a taxonomic database is in essence a collection of taxonomic names and publications, and a taxonomic publication is in essence a collection of names and citations of taxonomic publications, why not embed the publication within the database and have the names and citations link to the corresponding entries in the database?

Note the pattern in the URL: just append the DOI for an article to http://bionames.org/labs/zookeys-viewer/?doi=
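As a rough illustration of that URL pattern, here is a minimal Python sketch. The function name `viewer_url` and the prefix stripping are my own assumptions, and the DOI in the usage line is a made-up placeholder, not a real article.

```python
from urllib.parse import quote

# Base URL taken from the post; the DOI is simply appended to it.
VIEWER_BASE = "http://bionames.org/labs/zookeys-viewer/?doi="


def viewer_url(doi: str) -> str:
    """Return the viewer URL for a DOI (hypothetical helper)."""
    doi = doi.strip()
    # Assumption: callers may pass a DOI with a common prefix, so strip it.
    for prefix in ("https://doi.org/", "http://dx.doi.org/", "doi:"):
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    # Keep "/" readable; percent-encode any other unsafe characters.
    return VIEWER_BASE + quote(doi, safe="/")


# Placeholder DOI, for illustration only.
print(viewer_url("doi:10.3897/zookeys.000.0000"))
```

For an ordinary DOI the encoding step changes nothing, so the result is just the base URL followed by the DOI.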

Everything is a bit rough, but it's working well enough for you to get the basic idea. Code is on GitHub. Essentially the viewer grabs the ZooKeys HTML, extracts the URL for the XML file, fetches that, then uses some XSLT style sheets to convert the XML into something viewable. There's a sprinkling of JavaScript to call the BioNames API. Much of the code could be tweaked to accept other NLM XML-based articles, such as content from PLoS and the BMC journals.
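To make the "extract the URL for the XML file" step concrete, here is a small Python sketch. It is not the viewer's actual code. The sample HTML, the `example.org` URLs, and the rule "take the first link whose href contains xml" are all assumptions for illustration. Fetching the page and the XSLT transform are left out, since XSLT needs a library outside the Python standard library.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class XmlLinkFinder(HTMLParser):
    """Collect href values of <a> tags that look like links to an XML file."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Assumption: the XML download link has "xml" somewhere in its href.
        if href and "xml" in href.lower():
            self.links.append(href)


def find_xml_url(html: str, page_url: str):
    """Return the absolute URL of the first XML-looking link, or None."""
    finder = XmlLinkFinder()
    finder.feed(html)
    return urljoin(page_url, finder.links[0]) if finder.links else None


# Made-up stand-in for an article landing page.
SAMPLE_HTML = (
    '<html><body><a href="/article.pdf">PDF</a> '
    '<a href="/download/article.xml">XML</a></body></html>'
)
xml_url = find_xml_url(SAMPLE_HTML, "http://example.org/articles/1")
print(xml_url)
```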

One direction this could go in is to make a viewer like this the default viewer in BioNames for ZooKeys articles, so that instead of being restricted to a PDF you can interactively navigate between the article and the cited literature. Indeed, the very action of locating cited references in BioNames builds citation links. We could imagine extending the approach to content that isn't in NLM XML, such as Zootaxa PDFs, or content from BHL. Eventually I'd like to have the taxonomic literature fully embedded in the database, not as PDF or image silos, but as documents linked to names and literature. The journal becomes a database.

In browsing the GBIF classification in BioNames I keep coming across cases of wholesale duplication of taxa. I recently blogged about a single example, the White-browed Gibbon, but here's a larger example involving frogs.

Consider the frog genera Philautus and Raorchestes. The latter was described in 2010:

and contains a number of species previously in Philautus. The GBIF classification for Philautus still has these species, which means that these taxa appear twice in the GBIF data portal (associated with different occurrences).

To gauge the scale of the problem I've done a crude pairwise plot of species names in the two genera. In the diagram below a dot(●) appears if the species name in the corresponding row and column is identical. The diagonal corresponds to comparisons of a species name with itself.

Note the ●'s that appear off the diagonal. These are species in Philautus and Raorchestes that have the same species name (e.g., Philautus glandulosus and Raorchestes glandulosus). The off-diagonal dots indicate taxa that are duplicated.
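A crude plot like this takes only a few lines of code. The sketch below is an illustration, not the script I used. Apart from the glandulosus pair mentioned above, the names (`alpha`, `beta`) are made-up placeholders, and the helper names are hypothetical.

```python
def epithet(name: str) -> str:
    """Return the species epithet (second word) of a binomial."""
    return name.split()[1]


def dot_matrix(names):
    """Pairwise comparison rows: '●' where epithets match, '·' otherwise."""
    epithets = [epithet(n) for n in names]
    return ["".join("●" if a == b else "·" for b in epithets) for a in epithets]


def off_diagonal_matches(names):
    """Pairs of different names that share an epithet (candidate duplicates)."""
    return [
        (names[i], names[j])
        for i in range(len(names))
        for j in range(i + 1, len(names))
        if epithet(names[i]) == epithet(names[j])
    ]


# "alpha" and "beta" are placeholders; the glandulosus pair is from the post.
names = [
    "Philautus alpha",
    "Philautus glandulosus",
    "Raorchestes beta",
    "Raorchestes glandulosus",
]
matrix = dot_matrix(names)
for row in matrix:
    print(row)
print(off_diagonal_matches(names))
```

Matching on the epithet alone is deliberately naive: it will miss duplicates where the ending changed to agree with the new genus, and it will flag unrelated species that happen to share an epithet.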

Why does GBIF have duplicate frogs? As for the gibbon example, the names come from different sources, and GBIF doesn't have access to (or doesn't use) data that tells it that the names are synonyms. In this case there is a clash between the Catalogue of Life, which doesn't recognise Raorchestes, and IUCN Red List, which does. The end result is a mess.

We clearly need better tools for catching these problems. We also need a decent database of taxonomic names and synonyms. The Catalogue of Life is, frankly, grossly inadequate in this respect, especially for vertebrate taxa. Increasingly it's becoming clear that the classification underlying the GBIF portal needs some serious work.

Thursday, June 13, 2013

My latest tweak to BioNames is to add colour to the phylogenies. Terminal nodes with the same name are labelled with the same background colour. For example, here is a tree for fiddler and ghost crabs: The colours make it easier to see that this tree has a mixture of a few sequences from divergent taxa, and a lot of sequences from the same taxa.
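One simple way to give identical names identical colours is to derive the colour from a hash of the name. The sketch below shows the idea and is not necessarily how BioNames does it; `colour_for` and the leaf labels are hypothetical.

```python
import hashlib


def colour_for(name: str) -> str:
    """Map a taxon name to a stable, light background colour (hex string)."""
    digest = hashlib.md5(name.encode("utf-8")).digest()
    # First three bytes become R, G, B; halving and adding 128 keeps each
    # channel in 128..255 so dark label text stays readable on top.
    r, g, b = (128 + byte // 2 for byte in digest[:3])
    return f"#{r:02x}{g:02x}{b:02x}"


# Hypothetical leaf labels: the same name always gets the same colour.
leaves = ["Uca sp.", "Ocypode sp.", "Uca sp."]
colours = [colour_for(leaf) for leaf in leaves]
for leaf, colour in zip(leaves, colours):
    print(leaf, colour)
```

Because the colour depends only on the name, no lookup table is needed and the same name gets the same colour in every tree. The cost is that two different names can occasionally land on similar colours.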

Note that you can now also download the SVG drawing of the tree. Click on the button and (in at least some browsers, such as Chrome) the SVG will download. Other browsers may open the SVG in a separate window, in which case simply save it to your computer.

Wednesday, June 12, 2013

In 2011 I wrote a short post about DeepDyve, a service where you could rent access to an article. DeepDyve has launched a "5-Minute Freemium" service where you can view an article online for 5 minutes, for free. You have to log in, either with DeepDyve or using Facebook, but no actual money changes hands. If you want to read for longer, or download an article then you have to get out your credit card.

I've added support for DeepDyve to BioNames. If an article is available in DeepDyve, BioNames displays a link (see http://bionames.org/references/6952b806f87de2106669b2412043a4ab for an example). DeepDyve makes it possible to quickly check a fact (for example, the spelling of a taxonomic name). It obviously doesn't tackle bigger issues such as access to text for data mining, but if you just need to check something, or follow a lead, then it's an interesting and useful wrinkle on publishing models.

One reason I built BioNames (and the related digital archive BioStor) was to create tools to help make sense of taxonomic names. In exploring databases such as GBIF and the NCBI taxonomy every so often you come across cases where things have gone horribly wrong, and to make sense of them you have to drill down into the taxonomic literature.

It's becoming increasingly clear to me that large parts of the GBIF classification that underpins their data portal are, well, a mess. There are duplicate taxa, homonyms, orphan genera, and so on. Now, building a global taxonomy on the scale of GBIF is a tough problem. They are merging a lot of individual classifications into an overall synthesis. That would be a challenging problem in itself, but it's compounded by inconsistent use of names for the same taxon. In other words, synonymy. This is the greatest self-inflicted wound in taxonomy: the desire to have names be meaningful in terms of relationships (i.e., species in the same genus should be related). If you require that, then the consequence is a mess (unless you have a really good taxonomic database in place to track name changes, and we don't).

As an example, consider the White-browed Gibbon (shown here in an image from EOL). This taxon occurs in at least three different places in the GBIF classification (each name has occurrence data associated with it):

To keep things simple I've omitted the subspecies (such as Bunopithecus hoolock hoolock). Note that three key resources for names (the Catalogue of Life, Mammal Species of the World, and the IUCN) can't agree on what to call this ape. The names are also not entirely consistent. For example, as written, Bunopithecus hoolock Harlan, 1834 (from Mammal Species of the World, 3rd edition) would imply that this was the original name for this gibbon (because the authority [Harlan, 1834] is not in parentheses). This is incorrect: the original name of the White-browed Gibbon is Simia hoolock, and you can see the original description in BioStor:
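The parentheses convention is easy to check mechanically. Here is a minimal Python sketch that assumes name strings of the simple form "Genus species Author, Year", with optional parentheses around the authority. The regular expression and the function name are illustrative only, and real name strings are far messier.

```python
import re

# Assumed format: "Genus species Author, 1834" or "Genus species (Author, 1834)".
NAME_PATTERN = re.compile(
    r"^(?P<binomial>[A-Z][a-z]+ [a-z]+) (?P<authority>\(?[^()]+, \d{4}\)?)$"
)


def parse_name(name_string: str):
    """Split a name into binomial and authority, and note what it implies."""
    match = NAME_PATTERN.match(name_string.strip())
    if match is None:
        return None
    authority = match.group("authority")
    parenthesised = authority.startswith("(") and authority.endswith(")")
    return {
        "binomial": match.group("binomial"),
        "authority": authority.strip("()"),
        # No parentheses implies the species was described in this genus.
        "implies_original_combination": not parenthesised,
    }


print(parse_name("Bunopithecus hoolock Harlan, 1834"))
print(parse_name("Bunopithecus hoolock (Harlan, 1834)"))
```

A check like this only tells you what a name string claims. Deciding whether the claim is right still needs the original description.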

Harlan R (1834) Description of a Species of Orang, from the north-eastern province of British East India, lately the kingdom of Assam. Transactions of the American Philosophical Society 4: 52–59. http://biostor.org/reference/127799

Since then it has been shuffled around various genera, including a genus (Hoolock) for which it is the type species:

Mootnick A, Groves C (2005) A new generic name for the hoolock gibbon (Hylobatidae). International Journal of Primatology 26(4): 971–976. doi: 10.1007/s10764-005-5332-4.

GBIF regards all three names as being different taxa, despite all being names for the same gibbon. The practical consequence of this is that anyone seeking a comprehensive summary of what GBIF knows about the White-browed Gibbon is going to get different data depending on which name they use. In my experience this is not an uncommon occurrence (bats are another case where the GBIF classification is a terrible hodgepodge).

My goal here is not to berate GBIF; they are trying to aggregate messy, inconsistent data on a massive scale. But we need tools to flag cases like this poor gibbon, and ways to ensure that once we've found a problem it is fixed once and for all.

Friday, June 07, 2013

One of the things that didn't make last week's deadline for launching BioNames was the inclusion of phylogenies. This was disappointing as one of the reasons I built BioNames was to help span what I see as the gulf between classical biodiversity informatics and its emphasis on taxonomic names and classification, and modern phylogenetics where the tree is the primary focus, not some arbitrary way to partition it up.

So, where to get lots of phylogenies? I use the wonderful PhyLoTA database built by Mike Sanderson and colleagues:

I grabbed a dump of the trees, matched them to sequences in GenBank (more accurately, the European version, EMBL), did some post-processing of those sequences, threw them into CouchDB, built an SVG viewer, and voilà.

Here is a tree for the fig wasp family Agaonidae, showing the interactive zoomable tree viewer, and thumbnails for other trees for this taxon:

There's still lots to do on this, but the key parts are in place. Personally I can happily while away the day just browsing through the trees, looking for cases where taxa lack scientific names, obvious cases of synonymy (take a look at this tree for fiddler and ghost crabs, for example), and evidence that "species" have considerable internal phylogenetic structure.

Tuesday, June 04, 2013

I've added a simple "dashboard" to BioNames to display some basic data about what is in the database. Apart from a table of the number of bibliographic identifiers in the database (currently there are 54,422 publications with DOIs, for example), there are some graphic summaries. These are a bit slow to load as they are created on the fly.

Publishers

The first summarises the relative frequency of articles from different publishers (broadly defined to include digital repositories such as DSpace and JSTOR). For most of this information I'm using data returned when I resolve a DOI at CrossRef. The data is incomplete and likely to change as I add more articles, and CouchDB finally catches up and indexes all the data.

The biggest blob is BioStor, which is my project to extract articles from BHL. Magnolia Press publish Zootaxa, then there are some well-known mainstream publishers such as Springer, Wiley, and Taylor & Francis (Informa UK). These publishers have digitised the back catalogues of a number of society journals, so their prominence here doesn't mean that they are actively publishing new taxonomic content. One use for a diagram like this is to think about what content to data mine. BioStor content is open access (via BHL) and so can be readily mined. Some articles in Zootaxa are open access and so could also be downloaded and processed. Then we have the big commercial publishers, who have a significant fraction of taxonomic content behind their paywalls. If the community were to think about mining this data, then this diagram suggests which publishers to start asking first.

Journals

The next diagram shows articles grouped by journal (using the journal's ISSN).

The circles are too small to be labelled usefully. A couple of things strike me. The first is the sheer number of journals! The taxonomic literature is widely scattered across numerous different outlets, which is part of the challenge of indexing the literature (and this diagram includes only those journals that have ISSNs; many smaller or older ones don't). There is no one journal which dominates the landscape (the largest circle on the top right is Zootaxa). But this diagram spans the complete history of taxonomic publication, so includes large journals (such as Annals and Magazine of Natural History) that no longer exist (at least in their present form). It might be useful to slice this diagram by, say, decade to get a clearer picture of patterns of publication.
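Slicing by decade is a simple grouping exercise. The Python sketch below assumes each article is a record with an ISSN and a publication year. The field names, ISSNs, and years are invented placeholders, not data from BioNames.

```python
from collections import Counter


def decade(year: int) -> int:
    """Return the first year of the decade, e.g. 1834 -> 1830."""
    return year // 10 * 10


def articles_by_journal_and_decade(records):
    """Count articles per (ISSN, decade) pair."""
    counts = Counter()
    for record in records:
        counts[(record["issn"], decade(record["year"]))] += 1
    return counts


# Placeholder records; ISSNs and years are invented for illustration.
records = [
    {"issn": "0000-0001", "year": 1834},
    {"issn": "0000-0001", "year": 1839},
    {"issn": "0000-0002", "year": 2005},
    {"issn": "0000-0002", "year": 2013},
]
counts = articles_by_journal_and_decade(records)
for (issn, start), n in sorted(counts.items()):
    print(issn, f"{start}s", n)
```

Each decade's counts could then be drawn as its own diagram, so that journals which dominated one period don't swamp the rest.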