Rants, raves (and occasionally considered opinions) on phyloinformatics, taxonomy, and biodiversity informatics. For more ranty and less considered opinions, see my Twitter feed.ISSN 2051-8188 View this blog in Magazine View.

Friday, August 14, 2015

Yet another barely thought out project, although this one has some crude code. If some 16,000 new taxonomic names are published each year, then that is roughly 40 per day. We don't have a single place that aggregates these, so any major biodiversity projects is by definition out of date. GBIF itself hasn't had an update list of fungi or plant names for several years, and at present doesn't have an up to date list of animal names. You just have to follow the Twitter feeds of ZooKeys and Zootaxa to feel swamped in new names.

And yet, most nomenclators are pumping out RSS feeds of new names, or have APIs that support time-based queries (i.e., send me the names added in the last month). Won't it be great to have a single aggregator that took these "name streams", augmented them by adding links to the literature (it could, for example, harvest RSS feeds and Twitter streams of the relevant journals), and provided the biodiversity community with a feed of new names and associated supporting information. We could watch new discoveries of new biodiversity unfold in near real time, as well as provide a stream of data for projects such as GBIF and others to ingest and keep their databases up to date.

PubMed Central has pioneered the use of XML to archive and publish scientific articles, specifically JATS XML. But the biodiversity literature comes in all sorts of formats, including several flavours of XML (such as SciElo XML, XML from OCR literature such as DjVu, ABBYY, and TEI, etc.)

While Europe PMC is doing nice things with ORCIDs and entity extraction, it doesn't deal with the kinds of entities prevalent in the taxonomic literature (such as geographic localities, specimen codes, micro citations, etc.). Nor does it deal with extracting the core scientific statements that a taxonomic paper makes.

Given that much of the taxonomic literature will be derived from OCR we need mechanisms to be able to edit and fix OCR errors, as well as markup errors in more recent XML-native documents. For example, we could envisage having XML stored on GitHub and open to editing.

We need to embed taxonomic literature in our major biodiversity databases, rather than have it consigned to ghettos of individual small-scale digitisation project, or swamped by biomedical literature whose funding and goals may not align well with the biodiversity community (Europe PMC is funded primary by medical bodies).

I hope to flesh this out a bit more later on, but I think it's time we started treating the taxonomic literature as a core resource that we, as a community, are responsible for. The NIH did this with biomedical research, shouldn't we be doing the same for biodiversity?

Monday, August 10, 2015

One of the limitations of the Biodiversity Heritage Library (BHL) is that, unlike say Google Books, its search functions are limited to searching metadata (e.g., book and article titles) and taxonomic names. It doesn't support full-text search, by which I mean you can't just type in the name of a locality, specimen code, or a phrase and expect to get back much in the way of results. In fact, in many cases when I Google a phrase that occurs in BHL content I'm more likely to find that phrase in content from the Internet Archive, and then it's a matter of following the links to the equivalent item in BHL.

So, as an experiment I've created a live demo of what full-text search in BHL could look like. I've done this using the same infrastructure the new BioStor is built on, namely CouchDB hosted by Cloudant. Using BHL's API I've grabbed some volumes of the British Ornithological Club's Bulletin and put them into CouchDB (BHL's API serves up JSON, so this is pretty straightforward to do). I've added the OCR text for each page, and asked Cloudant to index that. This means that we can now search on a phrase in BHL (in the British Ornithological Club's Bulletin) and get a result.

I've made a quick and dirty demo of this approach and you can see it in the "Labs" section on BioStor, so you can try it here. You should see something like this:

The page image only appears if you click on the blue labels for the page. None of this is robust or optimised, but it is a workable proof-of-concept of how fill-text search could work.

Note that I've not completed uploading all the page text and XML. Once I do I'll have a better idea of how scalable this approach is. But the idea of having full-text search across all of BHL (or, at least the core taxonomic journals) is tantalising.

Technical details

Initially I simply displayed a list of the pages that matched the search term, together with a fragment of text with the search term highlighted. Cloudant's version of CouchDB provides these highlights, and a "group_field" that enabled me to group together pages from the same BHL "item" (roughly corresponding to a volume of a journal).

This was a nice start, but I really wanted to display the hits on the actual BHL page. To do this I grabbed the DjVu XML for each BHL page for British Ornithological Club's Bulletin, and used a XSLT style-sheet that renders the OCR text on top of the page image. You can't see the text because it I set the colour of the text to "rgba(0, 0, 0, 0)" (see http://stackoverflow.com/a/10835846) and set the "overflow" style to "hidden". But the text is there, which means you can select with the mouse and copy and paste it. This still leaves the problem of highlighting the text that matches the search term. I originally wrote the code for this to handle species names, which comprise two words. So, each DIV in the HTML has a "data-one-word" and "data-two-words" attribute set, which contains the first (and forst plus second) word in the search term, respectively. I then use a JQuery selector to set the CSS of each DIV that has a "data-one-word" or "data-two-words" attribute that matches the search term(s). Obviously, this is terribly crude, and doesn't do well if you've more than two word sin your search query.

As an added feature, I use CSS to convert the BHL page scan to a black-and-white image (works in Webkit-based browsers).

One of the less glamorous but necessary tasks of data cleaning is mapping "strings to things", that is, taking strings such as "George A. Boulenger" and mapping them to identifiers, such as ISNI: 0000 0001 0888 841X. In case of authors such as George Boulenger, one way to do this would be through Wikipedia, which has entries for many scientists, often linked to identifiers for those people (see the bottom of the Wikipedia page for George A. Boulenger and look at the "Authority Control" section).

How could we make these mappings? Simple string matching is one approach, but it seems to me that a more robust approach could use bibliographic data. For example, if I search for George A. Boulenger in BioStor, I get lots of publications. If at least some of these were listed on the Wikipedia page for this person, together with links back to BioStor (or some other external identifier, such as DOIs), then we could do the following:

Search Wikipedia for names that matched the author name of interest

If one or more matches are found, grab the text of the Wikipedia pages, extract any literature cited (e.g., in the {cite} tag), get the bibliographic identifiers, and see if they match any in our search results.

If we get one or more hits, then it's likely that the Wikipedia page is about the author of the papers we are looking at, and so we link to it.

Once we have a link to Wikipedia, extract any external identifier for that person, such as ISNI or ORCID.

For this to work, it requires that the Wikipedia page cites works by the author in a way that we can harvest, and uses identifiers that we can match to those in the relevant database (e.g., BioStor, Crossef, etc.). We might also have to look at Wikipedia pages in multiple languages, given that English-language Wikipedia may be lacking information on scholars from non-English speaking countries (this will be a significant issue for many early taxonomists).

Based on my limited browsing of Wikipedia, there seems to be little standardisation of entries for people, certainly little in how their published works are listed (the section heading, format, how many, etc.). The project I'm proposing would benefit from a consistent set of guidelines for how to include a scholar's output.

What makes this project potentially useful is that it could help flesh out Wikipedia pages by encouraging people to add lists of published works, it could aid bibliographic repositories like my own BioStor by increasing the number of links they get from Wikipedia, and if the Wikipedia page includes external identifiers then it helps us go from strings to things by giving us a way to locate globally unique identifiers for people.

Sunday, August 09, 2015

Following on from Testing the GBIF taxonomy with the graph database Neo4J I've added a more complex test that relies on linking taxa to names. In this case I've picked some legume genera (Coursetia and Poissonia) where there have been frequent changes of name. By mapping the GBIF taxa to IPNI names (and associated LSIDs) we can build a graph linking taxa to names, and then to objective synonyms (by resolving the IPNI LSIDs and following the links to the basionym), see http://gist.neo4j.org/?4df5af75d42e0f963e5d.

In this example we find species that occur twice in the GBIF taxonomy, which logically should not happen as the names are objective synonyms. We can detect these problems if we have access to nomenclatural data. in this case, because IPNI has tracked the names changes, we can infer that, say, Coursetia heterantha and Poissonia heterantha are synonyms, and hence only one of these should appear in the GBIF classification. This is an example that illustrates the desirability of separating names and taxa, see Modelling taxonomic names in databases.

Imagine a web site where researchers can go, log in (easily) and get a list of all the species they have described (with pretty pictures and, say, GBIF map), and a list of all DNA sequences/barcodes (if any) that they've published. Imagine that this is displayed in a colourful way (e.g., badges), and the results tweeted with the hastag #itaxonomist.

Imagine that you are not a taxonomist, but if you have worked with one (e.g., published a paper), you can go to the site, log in, and discover that you “know” a taxonomist. Imagine if you are a researcher who has cited taxonomic work, you can log in and discover that your work depends on a taxonomist (think six degrees of Kevin Bacon).

Imagine that this is run as a campaign (hashtag #itaxonomist), with regular announcements leading up to the release date. Imagine if #itaxonomist trends. Imagine the publicity for the work taxonomists do, and the new found ability for them to quantitatively demonstrate this.

How does it work?

#itaxonomist relies on three things:

People having an ORCID

People having publications with DOIs (or otherwise easily identifiable) in their ORCID profile

A map between DOIs (etc.) and the names in the nomenclators (ION, IPNI, Index Fungorum, ZooBank)

For a subset of people and names this we could build this very quickly. Some some taxonomists already have ORCIDs , and some nomenclators have limited numbers of DOIs. I am currently building lists of DOIs for primary taxonomic literature, which could be used to seed the database.

The “i am a taxonomist” query is simply a map between ORCID to DOI to name in nomenclator. The “i know a taxonomist” is a map between ORCID and DOI that you share with a taxonomist, but there are no names associated with that DOI (e.g., a paper you have co-authored with a taxonomist that wasn’t on taxonomy, or at least didn’t describe a new species). The “six degrees of taxonomy” relies on the existence of open citation data, which is trickier, but some is available in PubMed Central and/or could be harvested from Pensoft publications.

Friday, August 07, 2015

I've been playing with the graph database Neo4J to investigate aspects of the classification of taxa in GBIF's backbone classification. Neo4J is a graph database, and a number of people in biodiversity informatics have been playing with it. Nicky Nicolson at Kew has a nice presentation using graph databases to handle names Building a names backbone, and the Open Tree of Life project use it in their tree machine.

One of the striking things about Neo4J is how much effort has gone in to making it easy to play with. In particular, you can create GraphGists, which are simple text documents that are transformed into interactive graphs that you can query. This is fun, and I think it's also a great lesson in how to publicise a technology (compare this with RDF and SPARQL, which is in no way fun to work with).

I created some GraphGists that explore various problems with the current GBIF taxonomy. The goal is to find ways to quickly test the classifications for logical errors, and wherever possible I want to use just the information in the GBIF classification itself.

As an example, I've grabbed the classification for the bat family Molossidae, converted it to a Neo4J graph, and then tested for the existence of species in different genera that have the same specific epithet. This is a useful (but not foolproof test) of whether there are undetected synonyms, especially if the generic placement of a set of species has been in flux (this is certainly true for these bats). If you visit the gist you will see a list of species that are potential synonyms.

Another GraphGist tests that the genus name for a species matches the genus it is assigned too. This seems obvious (the species Homo sapiens belongs in the genus Homo) but there are cases where GBIF's classification fails this test, such as the genus Forsterinaria. Typically this test fails due to problematic generic names (e.g., homonyms), incorrect spellings, etc.

The last test is slightly more pedantic, but revealing nevertheless. It relies on the convention in zoology that when you write the authorship of a species name, if the name is not in the original genus then you enclose the authorship in parentheses. For example, it's Homo sapiens Linnaeus, but Homo erectus (Dubois, 1894) because Dubois originally called this species Pithecanthropus erectus.

Because you can only move a species to a genus that has been named, it follows that if a species is described before the genus name was published, then if the species is in that newer genus the authorship must be in parentheses. For example, the lepidopteran genus Heliopyrgus was published in 1957, and includes the species willi Plötz, 1884. Since this species was described before 1957, it must have been originally placed in a different genus, and so the species name should be Heliopyrgus willi (Plötz, 1884). However, GBIF has this as Heliopyrgus willi Plötz, 1884 (no parentheses). The GraphGist tests for this, and finds several species of Heliopyrgus that are incorrectly formed. This may seem pendantic, but it has practical consequences. Anyone searching for the original description of Heliopyrgus willi Plötz, 1884 might think that they should be looking for the text string "Heliopyrgus willi" in literature from 1884, but the name didn't exist then and so the search will be fruitless.

I think there's a lot of scope for deveoping tests like these, inclusing some that m make use of external data as well. In an earlier post (A use case for RDF in taxonomy ) I mused about using RDF to perform tests like this. However Neo4J is so much easier to work with I suspect that it makes better sense to develop standard queries in it's query language (CYPHER) and use those.

describes a method for inferring a hierarchy from a set of tags (and cites related work that is of interest). I've grabbed the code and data from http://hiertags-beta.elte.hu/home/ and put it on GitHub.

Possible project

Use Tibély et al. method (or others) on taxonomic names extracted from BHL text (or other) and see if we can reconstruct taxonomic classifications. ow do classifications compare to those in databases? Can we enhance existing databases using this technique (e.g., extract classifications from literature for groups pporly represented in existing databases)? Could be part of larger study of what we can learn from co-occurrence of taxonomic names, e.g. Automatically extracting possible taxonomic synonyms from the literature.

Note to anyone reading this: if this project sounds interesting, by all means feel free to do it. These are just notes about things that I think would be fun/interesting/useful to do.