Rants, raves (and occasionally considered opinions) on phyloinformatics, taxonomy, and biodiversity informatics. For more ranty and less considered opinions, see my Twitter feed. ISSN 2051-8188

Friday, January 18, 2013

Continuing the theme of trying to map specimens cited in the literature to the equivalent GBIF records, consider the GBIF record http://data.gbif.org/occurrences/685591320, which according to GBIF is specimen "ZFMK 188762" (a [sic] holotype of Praomys hartwigi).

Note that this URL includes the number 188762, which GBIF treats as the catalogue number (i.e., "ZFMK 188762"). So it seems that in the data provided by SysTax the primary key in that database (188762) has become the catalogue number in GBIF (I tried to verify this by clicking on the original provider message on the GBIF page, but it failed to produce anything). This means any naive attempt to locate the specimen "ZFMK-68.7" in GBIF is going to fail, because the harvesting and indexing has conflated a local primary key with the catalogue number that appears in publications that refer to this specimen.
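Even if we normalise the formatting variants of a catalogue number ("ZFMK-68.7", "ZFMK 68.7", "ZFMK68.7"), no amount of string munging will recover a match when the provider has substituted a database primary key for the catalogue number. A minimal sketch of the kind of normalisation a matcher might attempt (the heuristic is hypothetical, not anything GBIF itself does):

```python
import re

def normalise_catalogue_number(raw):
    """Collapse punctuation and spacing so 'ZFMK-68.7', 'ZFMK 68.7'
    and 'ZFMK68.7' all map to the same key (a crude heuristic)."""
    # Split the institution code (letters) from the rest
    m = re.match(r"^([A-Za-z]+)[\s\-:]*(.+)$", raw.strip())
    if not m:
        return raw.strip().upper()
    inst, num = m.groups()
    return f"{inst.upper()} {num.strip()}"

print(normalise_catalogue_number("ZFMK-68.7"))  # "ZFMK 68.7"
```

This handles punctuation and spacing, but "ZFMK 68.7" and "ZFMK 188762" normalise to different keys, so the conflation above defeats any such matcher.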

Sometimes I think we are doing our level best to make retrieving data as hard as possible...

Here's some probably worthless speculation to add to the mix. Disclosure: I use Mendeley to manage 100,000's of references, and use the API for various projects. I'm not a paying customer (but I do pay for some Internet services such as DropBox, BackBlaze, and Spotify, so it's not that I won't pay, it's just that the service Mendeley charge for doesn't interest me). I've published in Elsevier journals (most recently a couple of papers that, thanks to the efforts of Paul Craze, editor of TREE, are "free" in the sense that you can download the PDF for free), and I took part in the Elsevier Grand Challenge.

So, given that I'm suitably compromised, here are some thoughts.

Elsevier suck

Elsevier are big, ugly, and at the corporate level are doing things that actively make researchers angry (see The Cost of Knowledge).

Elsevier rocks

Elsevier are one of the most innovative science publishers around. They fund challenges, are investing heavily in interactive and semantic markup of papers (for example, interactive phylogenies), and have built an app ecosystem on their publishing platform.

Mendeley sucks

Mendeley is suffering from some serious failings, most of which could be addressed with sufficient resources. The API sucks, mostly because Mendeley themselves don't actually use it. The Desktop client communicates with Mendeley's database using a different protocol, hence the API lacks the functionality needed to make truly great apps on the platform. The algorithms Mendeley use to de-duplicate their catalogue are flawed, occasionally creating entirely fictional entries.

Mendeley rocks

The way Mendeley engineered the creation of a bibliographic database in the cloud is genius, as is their recognition that the object around which scientists will cluster is the article, not the author. They helped foster the altmetrics movement, and have a great presence on Twitter and at conferences (i.e., you can talk to actual people who write code).

What happens next?

Let's assume that Elsevier does, indeed, buy Mendeley, wants to do interesting things with it, and that Mendeley doesn't become one of the many startups that have a successful "exit" for the founders but end up dying in the bosom of a larger company. Here are some possibilities.

Mendeley becomes iTunes for papers

Forget the "Last.fm of papers", what about the "iTunes of papers"? Big publishers are facing a revolt over the cost of institutional subscriptions, and journals are increasingly irrelevant as aggregations. The literature that people read is widely scattered across different outlets. Journals are archaic in the same way that music albums are: albums are mostly a thing of the past, and people mix and match singles.

In the recent fight between UC Davis and Nature, Nature estimated that "CDL will be paying roughly $0.56 per download". So, why not charge a buck a paper? Mendeley's web interface is practically crying out for a "BUY THIS PAPER" button. Under this model, Elsevier has an outlet for its content that doesn't force people to subscribe to large amounts of stuff they don't want. Mendeley could be used to establish a relationship directly with paying customers, rather than institutions.

Mendeley becomes the de facto measure of research impact

By combining Mendeley's readership data with citations, Elsevier could construct powerful measures of research impact, bringing altmetrics into the mainstream. Couple this with links to institutions, and Elsevier could provide universities with all the data they need to evaluate academic performance (gulp).

Mendeley becomes an authoring tool

Managing references and inserting citations into manuscripts is one of the basic tasks facing an academic author. Authoring tools are evolving in the direction of being online, and embedding more semantic markup (e.g., these are taxon names, this is a chemical compound, this is a statement of causality). In a sense reference lists are the one form of structured markup we are already familiar with. Why not build on that and create an authoring platform?
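To make the idea of semantic markup concrete, here is a deliberately crude sketch that wraps strings looking like binomial names in a hypothetical <taxon> tag (real taggers such as those used for taxon-name finding rely on dictionaries and context, not a bare regex, and the tag name here is invented):

```python
import re

# Match "Genus species": capitalised word followed by a lowercase word.
# This is a toy pattern and will both miss names and produce false hits.
BINOMIAL = re.compile(r"\b([A-Z][a-z]+ [a-z]{3,})\b")

def tag_taxa(text):
    """Wrap candidate binomial names in a hypothetical <taxon> element."""
    return BINOMIAL.sub(r"<taxon>\1</taxon>", text)

print(tag_taxa("Praomys hartwigi was described from Cameroon"))
```

An authoring platform could apply this kind of markup as you write, in the same way reference managers already recognise and format citations.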

Mendeley becomes the focus of post-publication review

Publishers have failed to crack the problem of post-publication review. Several provide the ability for readers to comment on an article online, but this has failed to take off. I think this is because the sociology is wrong: if you want a conversation you need to go where the people are, not expect them to come to you. Given that people are bookmarking papers in Mendeley, the next step is to get them to comment, or aggregate their annotations (in the same way that Amazon's Kindle can show you passages that others have highlighted).

Thursday, January 17, 2013

Following on from my previous post bemoaning the lack of links between biodiversity data sets, it's worth looking at different ways we can build these links. Specifically, data can be tightly or loosely coupled.

Tight coupling

Tight coupling uses identifiers. A good example is bibliographic citation, where we state that one reference cites another by linking DOIs. This makes it easy to store these links in a database, such as the Open Citations project, which is exploring citation networks based on data from PubMed Central. Tight coupling also makes it easy to aggregate information from multiple sources. For example, one database may record citations of a paper, another may record citations of GenBank sequences, a third may record publication of taxonomic names. If all three databases use the same identifiers for the same publications (e.g., DOIs) we can combine them and potentially discover new things (for example, we could answer the question "how many descriptions of new species include sequence data?").
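A sketch of why shared identifiers matter: if three independent sources key their data on the same DOIs, combining them is trivial (the DOI and values below are invented for illustration):

```python
# Three hypothetical sources, all keyed by the same (invented) DOI
cited_by = {"10.1234/example-1": 12}                   # citation counts
cites_sequences = {"10.1234/example-1": ["AB000001"]}  # GenBank accessions cited
publishes_names = {"10.1234/example-1": ["Aus bus"]}   # new names published

# Shared identifiers make the join a simple set intersection:
# descriptions of new species that also cite sequence data
with_both = set(cites_sequences) & set(publishes_names)
print(sorted(with_both))
```

Without shared identifiers, the same question requires fuzzy matching of citation strings across all three databases before any join is possible.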

I've mapped many of the references in AFD to standard identifiers such as DOIs, or to digital libraries such as BioStor, and this tightly-coupled mapping is available in AFD on CouchDB. To date these mappings haven't been imported into AFD itself, which means that users of the original site don't have easy access to the literature that appears on that site (basically they'll have to Google each reference). However, if they have a browser extension (or the Javascript bookmarklet available from http://iphylo.org/~rpage/afd/openurl) that supports COinS, they will now see a clickable link that, in many cases, will take them to the online version of the corresponding reference.

This is an example of loose linking. The AFD site provides OpenURL links which can be resolved "just in time". Users of the AFD site can get some of the benefits of the tight linking stored in my CouchDB version of AFD, but the maintainers of AFD itself don't need to add code to handle these identifiers.
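For readers unfamiliar with COinS: it embeds an OpenURL ContextObject (the Z39.88-2004 key-value format) in the title attribute of an otherwise empty <span>, where browser extensions can find it and render a link. A minimal sketch (the article metadata is invented):

```python
from urllib.parse import urlencode

def coins_span(metadata):
    """Render a COinS <span> for a journal article: an OpenURL
    ContextObject hidden in the title attribute."""
    kev = {
        "ctx_ver": "Z39.88-2004",
        "rft_val_fmt": "info:ofi/fmt:kev:mtx:journal",
        "rft.genre": "article",
    }
    kev.update(metadata)
    return '<span class="Z3988" title="%s"></span>' % urlencode(kev)

# Example with hypothetical metadata
span = coins_span({"rft.atitle": "A new species of Aus",
                   "rft.jtitle": "Zootaxa",
                   "rft.date": "2012"})
print(span)
```

The span is invisible to ordinary readers, which is exactly why a site like AFD can carry it without needing any resolver logic of its own.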

A lot of linking of biodiversity data shares this pattern. Instead of linking identifiers, one site links to another through a query. For example, NCBI taxonomy links to GBIF using URLs of the form "http://data.gbif.org/search/<taxon name>". Linking by query is potentially more robust than simply linking by URLs, especially if the target of the link doesn't ensure its identifiers are stable (GBIF, I'm looking at you). But there may be multiple ways to construct the same search query, which makes them poor candidates for use as identifiers. COinS are perhaps an extreme example, where there are at least two versions of the OpenURL standard in the wild, and the key-value pairs that make up the query can be in any order.
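The key-order problem is easy to demonstrate: the same OpenURL serialised in two different orders compares unequal as a string, so anyone wanting to treat queries as identifiers has to canonicalise them first. A sketch:

```python
from urllib.parse import parse_qsl, urlencode

def canonicalise_query(qs):
    """Parse a query string and re-serialise its key-value pairs in
    sorted order, so two orderings of the same OpenURL compare equal."""
    return urlencode(sorted(parse_qsl(qs)))

a = "rft.jtitle=Zootaxa&rft.date=2012&rft.genre=article"
b = "rft.genre=article&rft.jtitle=Zootaxa&rft.date=2012"
print(a == b)                                          # False
print(canonicalise_query(a) == canonicalise_query(b))  # True
```

This works for simple cases, but with two versions of the OpenURL standard in the wild even canonicalisation only gets you part of the way.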

If the goal is to integrate data then having the same identifiers for the same thing makes life a lot simpler, and means that we can switch from endless data cleaning and matching ("is this citation the same as that one?") to building systems that can tackle some of the scientific questions we are interested in. But in their absence we are left with a kind of defensive programming where we expect the links to fail. Loose linking creates "soft links" that may work for humans (we get to click on a link and, with luck, see a web page) but they are less useful for mechanised tools trying to aggregate data.

When tight=loose

Although I've distinguished between tight and loose coupling, the distinction is not absolute. Indeed, one could argue that the best "tight" coupling is a form of "loose" coupling. For example, the most obvious form of tight linking is to use URLs for the things of interest. This is simple and direct, but has drawbacks for both publisher and consumer. For the consumer, we are now at the mercy of the publisher's ability to keep the URLs stable. If they change (for example, the publishing firm is bought by another firm, or adopts a new publishing platform that generates different URLs) then the links break (not to mention that URLs for some resources, such as articles, are often conditional on how you are accessing the article, and may contain extraneous cruft such as session ids, etc.).

Likewise, the publisher is now constrained by a decision it made at the time of publication. If it decides to adopt better technology, or if circumstances otherwise change, it may find itself having to break existing identifiers. Some of this can be avoided if we designed clean URLs, such as this example http://data.rbge.org.uk/herb/E00001195 given by Roger Hyam. However, I wonder how persistent the ".uk" part of this URL will be if the Royal Botanic Garden Edinburgh finds itself in a Scotland that is no longer part of the United Kingdom.

One solution is our old friend indirection, where we put an identifier in between the consumer and the actual URL of the resource, and the consumer uses that identifier. This is the rationale for DOIs. The user gets an identifier that is unlikely to change, and hence can build systems upon that identifier. The publisher knows that they can change how they serve the corresponding data without disrupting their users, so long as they update the URL that the DOI points to. Indirection gives users the appearance of tight coupling without imposing the constraints of tight coupling on publishers.
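A toy model of this indirection (the DOI and URLs below are invented): consumers hold the identifier, and the resolver maps it to whatever URL the publisher currently uses.

```python
# The resolver is the only place that knows the current URL
resolver = {"10.1234/example-1": "http://oldpublisher.example/article/123"}

def resolve(doi):
    """Look up the current location of the thing the DOI identifies."""
    return resolver[doi]

# The publisher migrates to a new platform and updates the resolver...
resolver["10.1234/example-1"] = "http://newplatform.example/doi/10.1234/example-1"

# ...and every link built on the DOI keeps working, unchanged
print(resolve("10.1234/example-1"))
```

In the real system the resolver is the Handle system behind doi.org and "resolving" is an HTTP redirect, but the principle is the same: consumers never hard-code the publisher's URL.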

This paper contains a diagram that seems innocuous enough but which I find worrying:

The nodes in the graph are "biodiversity megascience platforms", the edges are "cross-linkages and data exchange". What bothers me is that if you view biodiversity informatics through this lens then the relationships among these projects become the focus. Not the data, not the users, nor the questions we are trying to tackle. It is all about relationships between projects.

I want a different view of the landscape. For example, below is a very crude graph of the kinds of things I think about, namely kinds of data and their interrelationship:

What tends to happen is that this data landscape gets carved up by different projects, so we get separate databases of taxonomic names, images, publications, and specimens (these are the "megascience platforms" such as CoL, EOL, GBIF). This takes care of the nodes, but what about the edges, the links between the data? Typically, lots of energy is expended on what to call these links, in other words, on developing the vocabularies and ontologies such as those curated by TDWG. This is all valuable work, but it doesn't tackle what for me is the real obstacle to progress, which is creating the links themselves. Where are the "megascience platforms" devoted to linking stuff together?

When we do have links between different kinds of data these tend to be within databases. For example, GenBank explicitly links sequences to publications in PubMed, and to taxa in the NCBI taxonomy database. All three (sequence, publication, taxon) have identifiers (accession number, PubMed id, taxon id, respectively) that are widely used outside GenBank (and, indeed, are the de facto identifiers for the bioinformatics community). Part of the reason these identifiers are so widely used is because GenBank is the only real "megascience platform" in the list studied by Triebel et al. It's the only one that we can readily do science with (think BLAST searches, think of the number of databases that have repurposed GenBank data or built on NCBI services).

Many of the questions we might ask can be formulated as paths through a diagram like the one above. For example, if I want to do phylogeography, then I want the path phylogeny -> sequence -> specimen -> locality. If I'm lucky the phylogeny is in a database and all the sequences have been georeferenced, but often the phylogeny isn't readily available digitally, I need to map the OTUs in the tree to sequences, I then need to track down the vouchers for those sequences, and obtain the localities for those sequences from, say, GBIF. Each step involves some degree of pain as we try and map identifiers from one database to those in another.
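The phylogeography path can be sketched as a chain of lookups across what are, today, separate databases (all identifiers and coordinates below are invented):

```python
# Toy tables standing in for separate databases
otu_to_sequence = {"OTU1": "AB000001"}            # tree tip -> GenBank accession
sequence_to_voucher = {"AB000001": "ZFMK 68.7"}   # accession -> voucher specimen
voucher_to_locality = {"ZFMK 68.7": (6.2, 10.1)}  # specimen -> lat/long from GBIF

def locality_for_otu(otu):
    """Walk the path phylogeny -> sequence -> specimen -> locality;
    one missing link anywhere and the chain breaks."""
    try:
        return voucher_to_locality[sequence_to_voucher[otu_to_sequence[otu]]]
    except KeyError:
        return None

print(locality_for_otu("OTU1"))  # (6.2, 10.1)
print(locality_for_otu("OTU2"))  # None: the chain is broken
```

In practice each arrow is a separate identifier-mapping exercise, and the `except KeyError` branch is where most of the pain lives.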

If I want to do classical alpha taxonomy I need information on taxonomic names, concepts, publications, attributes, and specimens. The digital links between these are tenuous at best (where are the links between GBIF specimen records and the publications that cite those specimens, for example?).

Focussing on so-called "platforms" is unfortunate, in my opinion, because it means that we focus on data and how we carve up responsibility for managing it (never mind what happens to data that lacks an obvious constituency). The platforms aren't what we should be focussing on, it is the relationships between data (and no, these are not the same as the relationships between the "platforms").

If I'd like to see one thing in biodiversity informatics in 2013 it is the emergence of a "platform" that makes the links the centre of their efforts. Because without the links we are not building "platforms", we are building silos.

It has been agreed by the iDigBio community that the identifier represents the digital record (database record) of the specimen not the specimen itself. Unlike the barcode that would be on the physical specimen, for instance, the GUID uniquely represents the digital record only. (emphasis added)

My heart sank. There's nothing wrong with having identifiers for metadata (apart from inviting the death spiral that is metadata about metadata), but surely the key to integrating specimens with other biodiversity data is to have globally unique identifiers for the specimens.

Now, identifiers for metadata can be useful. For example, there is a specimen of Parathemisto japonica in the National Museum of Natural History, Smithsonian Institution with the label "USNM 100988". The NMNH web site has a picture of the index card for this specimen:

This is an image of the metadata, not the specimen itself. We could link the metadata to this image, but of course we also want to link it to the actual specimen.

Specimens are the things we collect, preserve, dissect, measure, sequence, photograph, and so on. I want to link a specimen to the sequences that have been obtained from that specimen, I want to list the publications that cite that specimen, I want to be able to aggregate data on a specimen from multiple sources, and I want to be able to add annotations including misidentifications, simple typos, or missing georeferencing.

Key to this is having identifiers for specimens. Identifiers for metadata about those specimens are not good enough. By analogy with bibliographic citation, one of the important decisions CrossRef made was that DOIs for articles identify the article, not the metadata about the article, nor any of the different formats (HTML, PDF, print) an article may appear in. This means we can build databases about things and relationships (this article cites that one, these articles were authored by this person, etc.).
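The payoff of identifying the thing rather than the record can be sketched as a set of statements all hanging off one specimen identifier (the identifier scheme and values below are invented for illustration):

```python
# Statements about a specimen, keyed by an identifier for the
# specimen itself, not for any one database record about it
statements = [
    ("specimen:usnm-100988", "citedBy",     "doi:10.1234/example-1"),
    ("specimen:usnm-100988", "depictedBy",  "http://images.example/card.jpg"),
    ("specimen:usnm-100988", "sequencedAs", "genbank:AB000001"),
]

def about(subject):
    """Aggregate everything asserted about one thing, from any source."""
    return [(p, o) for s, p, o in statements if s == subject]

for predicate, value in about("specimen:usnm-100988"):
    print(predicate, value)
```

If each source instead minted an identifier for its own record of the specimen, these three statements would have three different subjects and nothing would aggregate.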

As it stands, if we don't have identifiers for specimens then we can't link data together. For example, the frog specimen "USNM 195785" is depicted in the image below (from EOL):

I confess I'm flabbergasted that iDigBio has avoided tackling the issue of specimen identifiers. If any museum wants to discover how its collection is being used to support science, it will want to find the citations of its specimens in scientific papers and databases. This requires identifiers for specimens.