Rants, raves (and occasionally considered opinions) on phyloinformatics, taxonomy, and biodiversity informatics. For more ranty and less considered opinions, see my Twitter feed. ISSN 2051-8188

Thursday, November 24, 2016

Continuing my on-again off-again relationship with the Semantic Web, I stumbled across a cool approach to visualising the results of SPARQL queries. Toshiaki Katayama (@tktym) has put together d3sparql, a set of Javascript scripts that takes SPARQL queries and formats the results graphically using D3.

For example, given the SPARQL endpoint http://togostanza.org/sparql, the following query retrieves the NCBI classification for the tardigrade family Hypsibiidae:

The ability to quickly generate trees, charts, and maps from SPARQL queries makes things a lot easier. We can play around a little and explore things. The strength (and challenge) of SPARQL is that it is very open-ended: you can more or less develop queries to do anything. Being able to visualise the results will help guide that exploration.
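If you want to poke at the same endpoint outside the browser, here is a minimal Python sketch of how a query can be packaged for a SPARQL endpoint. It relies only on the standard SPARQL protocol convention of passing the query in a `query` parameter and asking for JSON results. The function name and the placeholder query are my own inventions for illustration (this is not the Hypsibiidae query), and the sketch only builds the request rather than sending it.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_sparql_request(endpoint, query):
    """Build (but do not send) a GET request for a SPARQL endpoint."""
    params = urlencode({"query": query})
    return Request(
        f"{endpoint}?{params}",
        # Ask for the standard SPARQL JSON results format.
        headers={"Accept": "application/sparql-results+json"},
    )


# Placeholder query for illustration only: any five triples.
PLACEHOLDER_QUERY = "SELECT * WHERE { ?s ?p ?o } LIMIT 5"

request = build_sparql_request("http://togostanza.org/sparql", PLACEHOLDER_QUERY)
print(request.full_url)
```

Passing `request` to `urllib.request.urlopen` would actually run the query, assuming the endpoint is up and accepts GET requests.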

The code for d3sparql is on GitHub. One "gotcha" is that the cached examples and external Javascript libraries aren't included. I've forked the repository here and added the missing files, so that if you grab that version it works straight out of the box.

Monday, November 14, 2016

Willi Egloff, Donat Agosti, Puneet Kishor, David Patterson, and Jeremy A. Miller have published an interesting preprint entitled “Copyright and the Use of Images as Biodiversity Data” (DOI:10.1101/087015) in which they argue that taxonomic images aren't copyrightable. I'm not convinced, and have commented on the bioRxiv site. Frustratingly, bioRxiv puts comments into a moderation queue (in my opinion the stupidest thing to do if you want to enable conversation), so I've posted my comment here.

It seems to me that there are two deeply problematic aspects to this claim. The first is that taxonomic illustration is not creative. This seems, at best, arguable. I've illustrated new species, and it sure felt like I was doing creative work. Arguably every creative work adheres to the conventions of a discipline; how does this by itself make copyright irrelevant?

Secondly, I'm unconvinced that a legal opinion that hasn't been tested in a court is worth much. We can assert whatever interpretation of copyright we want, but I doubt that would stop legal action by a person or organisation that felt it could benefit from such action. The real question will be whether treating taxonomic images as outside of copyright would be considered a sufficient threat to someone's business model for them to take action.

I completely support the idea that the images (and all taxonomically relevant data) should be free and open, but simply asserting that they should be doesn't make it so.

TraitBank is available in JSON-LD, and so is potentially part of the Semantic Web. Unfortunately, the JSON-LD provided by TraitBank is broken, to the point that it's hard to believe that anyone's actually consuming the JSON-LD. I know that Google is using EOL data for their knowledge panels, but anyone using TraitBank JSON-LD in a semantic web client is going to run into problems.

URIs as strings

In several places EOL outputs URIs as simple strings rather than as URIs. For example, to indicate that the parent taxon of Potos flavus is the genus Potos the JSON-LD has:

"dwc:parentNameUsageID": "http://eol.org/pages/14191",

But this is simply saying that the string value for "dwc:parentNameUsageID" is "http://eol.org/pages/14191". In JSON-LD this should be:

"dwc:parentNameUsageID":{"@id":"http:\/\/eol.org\/pages\/14191"}

This syntax ensures that http://eol.org/pages/14191 is interpreted as a URI, which also means clients "know" that they can resolve that URI to get more information.
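To make the difference concrete, here is a minimal Python sketch of the repair a consumer would have to apply before treating the value as a link. It uses only the standard library. The helper name and the `URI_VALUED` list are hypothetical, and guessing "it's a URI" from an `http://` prefix is a crude heuristic of mine, not something JSON-LD itself does.

```python
import json


def as_node_reference(value):
    """Wrap a plain URI string as a JSON-LD node reference; leave other values alone."""
    if isinstance(value, str) and value.startswith(("http://", "https://")):
        return {"@id": value}
    return value


# The fragment quoted above: the parent taxon is a plain string literal.
broken = json.loads('{"dwc:parentNameUsageID": "http://eol.org/pages/14191"}')

# Hypothetical list of properties whose values should be links, not literals.
URI_VALUED = ["dwc:parentNameUsageID"]

fixed = {
    key: as_node_reference(value) if key in URI_VALUED else value
    for key, value in broken.items()
}

print(json.dumps(fixed))
# {"dwc:parentNameUsageID": {"@id": "http://eol.org/pages/14191"}}
```

Having to maintain a list like `URI_VALUED` by hand is exactly the kind of out-of-band knowledge that the `@id` syntax is meant to make unnecessary.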

Predicates with missing (hence wrong) namespace

A number of predicates in the JSON-LD don't have a namespace specified, hence they default to being part of the schema.org vocabulary. For example, this statement:

"scientificName": "Potos flavus (Schreber, 1774)",

results in "scientificName" being interpreted as "http://schema.org/scientificName" (because "http://schema.org/" is set as the default @vocab). This is incorrect: "scientificName" should be "dwc:scientificName".
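Here is a rough Python sketch of why this happens. It is a deliberately simplified stand-in for JSON-LD term expansion (a real processor handles many more cases), and the context is my assumption of what the relevant part looks like: the default @vocab described above plus the standard Darwin Core terms namespace for the "dwc" prefix.

```python
def expand_term(term, context):
    """Expand a JSON-LD key to a full IRI (simplified: only prefix:name and @vocab)."""
    prefix, sep, local = term.partition(":")
    if sep and prefix in context:
        # Compact IRI such as "dwc:scientificName": swap the prefix for its namespace.
        return context[prefix] + local
    if sep:
        # Unknown prefix or already an absolute IRI: leave untouched.
        return term
    # Bare term: falls back to the default vocabulary.
    return context.get("@vocab", "") + term


# Assumed context, for illustration only.
context = {
    "@vocab": "http://schema.org/",
    "dwc": "http://rs.tdwg.org/dwc/terms/",
}

print(expand_term("scientificName", context))      # what TraitBank's key becomes
print(expand_term("dwc:scientificName", context))  # what was presumably intended
```

The bare key ends up in the schema.org namespace, while the prefixed key ends up in the Darwin Core namespace, so a client looking for the Darwin Core term will never find it.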

Then there are predicates such as "predicate" and "value" in the data sections that are missing the correct namespace (in this case http://www.w3.org/1999/02/22-rdf-syntax-ns#). For example,

Anyone constructing, say, SPARQL queries on this data is going to be using terms such as predicate that don't exist.

Some data records have the predicate "units" - I haven't yet figured out what, if any, vocabulary that predicate comes from.

Summary

I think TraitBank has a lot of potential, and I welcome the use of JSON-LD and the schema.org vocabulary. These are both steps forward in the goal of interoperable biodiversity data. But this data will only become interoperable if we take care to ensure that the data we output is what we say it is. EOL TraitBank JSON-LD isn't valid JSON-LD. This also illustrates a bigger problem: we are continually building systems that don't have users. If anyone was using TraitBank JSON-LD with standard Semantic Web clients, they would be up in arms about this. The best way to avoid these situations is for the developers to be users as well (see GBIF, biodiversity informatics and the "platform rant"). Until we "dog food" our own services we will continue to produce data and services that are less than useful. If EOL was itself built on TraitBank, I doubt we'd have these problems.