Rants, raves (and occasionally considered opinions) on phyloinformatics, taxonomy, and biodiversity informatics. ISSN 2051-8188

Materials for mining

From my perspective the obvious corpus to mine is the Biodiversity Heritage Library (BHL). Ross repeats the erroneous view that BHL is just "legacy" literature. Apart from the obvious point that everything not published right now is, by definition, legacy, BHL has a lot of modern content (including papers published in the last couple of years).

The PMC OA subset is fantastic and really facilitates this kind of research. I wish ALL of the biodiversity literature were aggregated the way (some of) the open access biomedical literature is: you can literally download a million papers and go do your research. Full machine access to full texts makes rigorous research possible.

So, how can we make BHL content as accessible? For each article I've extracted from BHL and stored in BioStor you can get the full text simply by appending ".text" to the BioStor URL, but this isn't quite the same as grabbing a big dump of text.
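As a sketch of how simple that access is, here is a minimal Python helper, assuming the usual biostor.org/reference/{id} article URLs (the article id used in any real call would of course need to be a genuine BioStor record):

```python
import urllib.request

def biostor_text_url(article_id):
    """Build the plain-text URL for a BioStor article by appending '.text'."""
    return f"http://biostor.org/reference/{article_id}.text"

def fetch_biostor_text(article_id):
    """Fetch the full text of one BioStor article (requires network access)."""
    with urllib.request.urlopen(biostor_text_url(article_id)) as response:
        return response.read().decode("utf-8")
```

Looping this over a list of article ids gets you a corpus, but one HTTP request at a time, which is exactly the contrast with PMC's bulk downloads.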

The other source for mining is GenBank, which has a lot of sequences that have NHM vouchers, but also a weird and wonderful array of ways of recording those specimens. This is one reason I'm building "Material examined", to cope with these codes. For example, sequence KF281084 has the voucher "TRING 1877111743", which more traditionally would be written as "BMNH 1877.11.17.43", and which is "NHMUK 1877.11.17.43" in the NHM database. This is just one example of the horrors of matching specimen codes (for more see the code for Material examined).
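To give a flavour of what "coping with these codes" involves, here is a toy normaliser for just the example above. It assumes the flattened digits split as year (4), month (2), day (2), with the remainder being the lot number; real voucher strings are far messier, so this is illustrative only, not how Material examined actually works:

```python
import re

def normalise_tring(code):
    """Sketch: rewrite a flattened 'TRING <digits>' voucher string into
    the dotted NHMUK register form. Assumes year(4).month(2).day(2).lot,
    which holds for this example but not for all NHM codes."""
    m = re.match(r"TRING\s+(\d{4})(\d{2})(\d{2})(\d+)$", code)
    if not m:
        return None
    year, month, day, lot = m.groups()
    # int() strips any leading zeros, matching the register style
    return f"NHMUK {year}.{int(month)}.{int(day)}.{int(lot)}"
```

Even this tiny rule is institution-specific, which is why a general matcher ends up as a pile of such patterns.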

One reason GenBank is useful is that the sequences are often linked to the literature, which means you get to make the link between specimen and literature without actually needing to mine the text itself (handy if access is problematic).

Bonus question: How should I publish this annotation data?

But if I wanted to publish something a little better & a little more formal, what kind of RDF vocabulary can I use to describe “occurs in” or “is mentioned in”. What would be the most useful format to publish this data in so that it can be re-used and extended to become part of the biodiversity knowledge graph and have lasting value?

Making the output useful is an important question. Despite being a bit clunky, I suspect Darwin Core Archives are the way to go. The core data is a CSV table, so it's easy to generate and easy to use. Let's say you analysed a particular corpus (e.g., PLoS ONE): you could output the data in Darwin Core (making sure both specimen and publication had stable identifiers), then package it up, upload it to Zenodo or Figshare, and get a DOI. For bonus points it would be great to see this data in GBIF, but this would require (a) mapping NHM specimen codes to GBIF ids (the NHM has this), and (b) GBIF being able to recognise that the data you're adding is not new specimens but rather annotations of existing specimens.
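The core table really is that simple to generate. A sketch, using Darwin Core term names for the columns; the identifiers in the record are placeholders I've invented for illustration, not real specimen or publication ids:

```python
import csv
import io

# One row per specimen-publication link; column names are Darwin Core terms.
records = [
    {
        "occurrenceID": "http://data.nhm.ac.uk/specimen/placeholder-id",
        "catalogNumber": "NHMUK 1877.11.17.43",
        "associatedReferences": "https://doi.org/10.9999/placeholder",
    },
]

buffer = io.StringIO()
writer = csv.DictWriter(
    buffer,
    fieldnames=["occurrenceID", "catalogNumber", "associatedReferences"],
)
writer.writeheader()
writer.writerows(records)
print(buffer.getvalue())
```

Zip that CSV with a meta.xml describing the columns and you have the bones of a Darwin Core Archive ready for Zenodo or Figshare.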

Things to think about

Here are a couple of additional things to think about.

Specimen finding as a service

In the same way that we have taxonomic name-finding services, it would be great if we had a specimen code-finding service. I have code that I use in BioStor, but it would be great to have something that is robust, stable, and generalisable across multiple specimen codes. My tool Material examined focusses on parsing a single string rather than parsing a block of text, but adding that functionality is an obvious thing to do.
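A minimal sketch of what such a code-finding pass over a block of text might look like. The two patterns below are an invented subset for illustration; a real service would need a long list of institution-specific rules, plus disambiguation:

```python
import re

# Toy patterns for two styles of specimen code (illustrative only).
PATTERNS = [
    re.compile(r"\b(?:BMNH|NHMUK)\s+\d{4}(?:\.\d{1,4}){1,3}\b"),  # NHM register numbers
    re.compile(r"\b(?:AMNH|USNM)\s+\d{3,7}\b"),                   # simple prefix + number
]

def find_specimen_codes(text):
    """Return sorted (start, end, matched code) triples for each hit."""
    hits = []
    for pattern in PATTERNS:
        for m in pattern.finditer(text):
            hits.append((m.start(), m.end(), m.group()))
    return sorted(hits)

codes = find_specimen_codes("Holotype BMNH 1877.11.17.43, paratype AMNH 459371.")
```

Returning character offsets rather than just the matched strings matters for the next point.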

Markup as output

One concern I have with work that involves mining text is that we hardly ever store the intermediate step of text + located elements. Instead we get to see summary output (e.g., this page has these three scientific names, and these 10 specimen codes). As Terry Catapano (@catapanoth) once wisely pointed out, "indexing is markup": if you find a substring in some text, you have in effect marked up the text. Can we preserve the marked-up text so that we can go back, look at it, and improve our text-mining methods, or make that markup available to others to build upon? All sorts of things could be built on this information; for example, imagine if the results were given to BHL so that people could search by specimen code.
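Preserving that intermediate step can be as simple as keeping the offsets (standoff markup), from which inline markup can be regenerated at any time. A sketch, where the tag names are invented for illustration:

```python
def to_inline(text, annotations):
    """Turn standoff (start, end, tag) spans into inline <tag>...</tag>
    markup. Spans are applied right to left so earlier offsets stay valid."""
    out = text
    for start, end, tag in sorted(annotations, reverse=True):
        out = out[:start] + f"<{tag}>" + out[start:end] + f"</{tag}>" + out[end:]
    return out

marked = to_inline(
    "Holotype BMNH 1877.11.17.43 of Ninox rumseyi.",
    [(9, 27, "specimen"), (31, 44, "taxon")],
)
```

The standoff triples are the durable artefact: they can be stored alongside the unmodified text, diffed as mining methods improve, and handed to others (such as BHL) to build on.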

Thursday, May 14, 2015

This is a quick writeup of an analysis I did to make the case that the list of names held by the Index of Organism Names (ION) (part of Thomson Reuters) would be very useful for GBIF. I must declare a bias, in that I've spent a good chunk of the last 3-4 years exploring the ION database and investigating ways to link the taxonomic names it contains to the primary taxonomic literature, culminating in building BioNames.

What makes ION special is its scope (it endeavours to have all names covered by the ICZN), and that many of its names have associated citation information (i.e., details on the publication that published the name). Like any name database it has duplications and errors, and some of the older content is a bit ropey, but it's a tremendous resource and from my perspective nothing else in zoology comes close.

But rather than rely on anecdote, I decided to do a quick analysis to see what ION could potentially add to GBIF. I've been doing some work on bird names recently, so as an exercise I searched GBIF for holotype specimens for birds. The search (13 May 2015) returned 11,664 records. I then filtered those on taxonomic names that GBIF could not match exactly (TAXON_MATCH_FUZZY) or names that GBIF could only match to a higher rank (TAXON_MATCH_HIGHERRANK). The query URL is:

This query found 6,928 records, so over half the bird holotype specimens in GBIF do not match a taxonomic name in GBIF. What this means is that GBIF can't accurately place these names in its own taxonomic hierarchy. It also makes it hard to do meaningful analyses of things such as "how long does it take from when a bird specimen is collected to when it is described as a new species?", because if you can't match the name then you can't get the date the name was published.

To explore this further, I downloaded the results of the query (the download has DOI http://doi.org/10.15468/dl.vce3ay). I then wrote a script to parse the specimen records and extract the GBIF occurrence id, catalogue number, and scientific name. I then used the GBIF API to retrieve (where available) the verbatim record for each specimen (using the URL http://api.gbif.org/v1/occurrence/{id}/verbatim, where {id} is the occurrence id). This gives us the original name on the specimen, which I then looked up in BioNames using its API. If I got a hit I extracted the identifier of the name (the LSID in the ION database) and the corresponding publication id in BioNames (if available). If there was a publication associated with the name I then generated a human-readable citation using BioNames's citeproc API. The code for all this is on GitHub.
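The per-specimen lookup step can be sketched like this (the GBIF verbatim endpoint is real; the occurrence id in any actual call would be one taken from the download):

```python
import json
import urllib.request

def verbatim_url(occurrence_id):
    """Build the GBIF API URL for an occurrence's verbatim record."""
    return f"http://api.gbif.org/v1/occurrence/{occurrence_id}/verbatim"

def fetch_verbatim(occurrence_id):
    """Fetch the verbatim (as-supplied) record for a GBIF occurrence
    as a dict (requires network access)."""
    with urllib.request.urlopen(verbatim_url(occurrence_id)) as response:
        return json.load(response)
```

The verbatim record matters because it preserves the name exactly as the museum supplied it, rather than GBIF's interpretation of it.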

The complete result of this mapping can be viewed here. Of the 6,392 holotypes with names not recognised by GBIF, nearly half (3,165, 49.5%) exactly matched a name in ION. Many of these are also linked to the publication that published that name.

So, adding ION helps us find half the missing holotype names. This is before doing anything more sophisticated, such as approximate string matching, resolving synonyms, etc. Hence, I'd argue that the names in ION would add a lot to GBIF's ability to interpret the occurrence records it receives from museums.

I've not had time for further analysis, but at first glance a lot of the missed names are subspecies, there are quite a few fossils, and many names are in the relatively older literature. However, there are also some recently described taxa, such as the hawk-owl Ninox rumseyi Rasmussen et al., 2012, and a bunting subspecies from Tristan da Cunha (Nesospiza acunhae fraseri Ryan, 2008), that are missing from GBIF.

There are no requirements for signing up. A signature is first and foremost a statement of support for open data. Each signatory can determine how best to make progress towards the goal. Some recommendations are included in the declaration. We hope that signatories will become early adopters of the open access approach, that they will promote change in their institutions, societies and journals, and will position themselves and their institutions as leaders. (from http://www.bouchoutdeclaration.org/faqs/)

I've put off writing this post about the Bouchout Declaration for a number of reasons. I attended the meeting that launched the declaration last year, and from my perspective that was a frustrating meeting. Much talk about "Open Biodiversity Knowledge Management" with nobody seemingly willing or able to define it (see The vision thing - it's all about the links for some comments I made before attending the meeting), and as much as the signing of the Bouchout Declaration provided good theatre, it struck me as essentially an empty gesture. Public pronouncements are all well and good, but are ultimately of little value unless backed up by action. We have institutions that have signed the declaration yet have much of their intellectual output locked behind paywalls (e.g., JSTOR Global Plants). So much for being open.

So, since Donat challenged me, here's what I'd like to see happen. I'd like to see metrics of "openness" that we can use to evaluate just how open the signatories actually are. These metrics could be viewed as ways to persuade institutions to share data and other information, as a league table we can use to apply pressure, or as a way to survey the field and see what the impediments to being open are (financial, legal, cultural, resource-related, etc.).

Below are some of the things we could use to "score" the openness of biodiversity institutions.

Is the collection digitised and in GBIF?

A simple criterion that is easy to measure. If an institution has specimens or other biological material, are data and/or metadata on the collection freely available? What fraction of the collection has been digitised? How good is that digitisation (e.g., what fraction has been georeferenced)? We could define digitisation more broadly to include imaging and sequencing (both are methods of converting analogue specimens into digital objects).

Are the institutional publications digitised? Are they open access?

Some institutions have a history of digitising their in-house publications and making them freely available online (e.g., the AMNH), and some even make them fully citable with CrossRef DOIs (e.g., the Australian Museum). But some institutions have, sadly, signed over their publications to commercial publishers or archives that charge for access (e.g., Kew's publications have been digitised by JSTOR, which limits their accessibility). As a footnote, I suspect that those institutions that lost confidence in their in-house publishing operations and outsourced them are the ones who have ended up losing control of their intellectual output, some of which is now closed off (e.g., some of the NHM London's journals are now the property of Cambridge University Press). Those institutions that maintained a culture of in-house publishing are the ones at the vanguard of digitising and opening up those publications.

Does the institution take part in the Biodiversity Heritage Library?

There are at least two ways to participate in the Biodiversity Heritage Library (BHL). One is to become a member and start scanning books from institutional libraries. The other is to grant BHL permission to scan institutional publications. BHL is often viewed as an archive of "old" literature, but in fact it has some very recent content. Some farsighted organisations have let BHL scan their journals, contributing to BHL becoming an indispensable resource for biodiversity research.

These are just some of the more obvious things that could be used to measure openness. At the same time, it would be useful to develop ways to show the benefits of being open. For example, I've long argued that we could develop citation tracking for specimens. This gives researchers a means to track the provenance of information (who said what about the identity of a specimen), and it also gives institutions a way to measure the impact of their collections. Doing this at scale is only going to be possible if collections are digitised, specimens have identifiers of some sort, and we can text mine the literature and associated data for those identifiers (in other words, the data and publications need to be open). So, perhaps one way to help make the case for being open is to develop metrics that are useful for the institutions themselves.

I guess I would have been much more enthusiastic about the Bouchout Declaration if these sorts of things had been in place at the start. Anyone can sign a document. Ideas are cheap, execution is everything.