Rants, raves (and occasionally considered opinions) on phyloinformatics, taxonomy, and biodiversity informatics. For more ranty and less considered opinions, see my Twitter feed. ISSN 2051-8188.

Thursday, April 25, 2013

Things are finally coming together, at least enough to have a functioning demo. It looks awful, but shows the main things I want BioNames to do. One thing I'm most concerned about at this stage is the possible confusion users might experience between taxon names and concepts. For example, there are two pages about Pteropus, one about the name Pteropus, the other about the bat that bears this name (as understood by GBIF).

Monday, April 22, 2013

Over on Google Plus (yeah, me neither) Donat Agosti is giving me a hard time regarding the quality of some data that I am using. I've responded to Donat directly, but here I just want to quickly outline two different approaches to cleaning and reconciling bibliographic metadata.

The problem addressed by Donat is the issue of multiple strings for the same journal (e.g., the plethora of different abbreviations and permutations people use to refer to the same journal). In trying to make sense of this mess there are a couple of strategies we can use. One is to cluster the strings into sets that we think refer to the same thing, e.g.:

We could then synthesise the preferred journal name from this set. We could make some sort of consensus string, for example. There are also some quite nice Bayesian methods for combining contradictory metadata.
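As a rough sketch of the clustering approach, here's the simplest possible version in Python: group strings that share a normalised key. The normalisation is deliberately crude, and note its limits: abbreviations won't share a key with the full journal title, so a real implementation would need fuzzy matching.

```python
import re
from collections import defaultdict

def normalize(s):
    """Crude key: lowercase, keep only letters."""
    return re.sub(r"[^a-z]", "", s.lower())

def cluster(strings):
    """Group journal strings that share the same normalised key."""
    clusters = defaultdict(set)
    for s in strings:
        clusters[normalize(s)].add(s)
    return list(clusters.values())

# Two abbreviation variants collapse together; the full title does not,
# which is exactly why crude normalisation isn't enough on its own.
clusters = cluster(["J. Linn. Soc.", "j linn soc",
                    "Journal of the Linnean Society"])
```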

Another approach, which I use, is to map the strings to a third party identifier, in this case an ISSN:

Once I've done this I can use the identifier to refer to the journal, hence ultimately I don't particularly care what string is best for the journal (indeed, I can defer to a third party for this decision).
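In code, the mapping approach is just a lookup table from observed strings to ISSNs, built once (however painfully) and then reused. The entries below are illustrative:

```python
# Illustrative mapping table; in practice each string is reconciled
# against an ISSN registry once, and the table does the work thereafter.
JOURNAL_TO_ISSN = {
    "Proc. Zool. Soc. Lond.": "0370-2774",
    "Proceedings of the Zoological Society of London": "0370-2774",
}

def issn_for(journal_string):
    """Return the ISSN for a journal string, or None if unmapped."""
    return JOURNAL_TO_ISSN.get(journal_string)
```

Two different strings, one identifier: once both resolve to the same ISSN, the question of which string is "correct" no longer matters for linking.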

The point is obsessing with clean, "correct" bibliographic metadata is something of a fool's errand. Obviously, it's nice to have clean metadata if you can get it, but in many cases there is no exact answer to what is the correct metadata. Some journals have multiple names (e.g., in different languages), some run different volume numbering schemes in parallel, and date of publication can be rather problematic (see my Mendeley group on publication dates). If we can map a publication to a globally unique identifier, such as a DOI, then we can sidestep this issue and focus on what I think really matters - linking data together.

Leaving all those constraints behind, and waving arms wildly, here's one take on the future of biodiversity informatics. I see three themes.

1. Knowing what we know

We have a limited grasp of how much we actually know, and crap tools to summarise this knowledge. I want a Google Analytics for biodiversity data where I can see at a glance the current state of our knowledge (e.g., what is the rate of sequencing of environmental samples in the Mediterranean? How much of Indonesia's amphibian fauna is in protected areas?). These are fairly trivial queries. If Google can analyse web traffic from sites being hit over a million times per day ( ~ 365 million hits per year) we can do the same thing on GBIF-scale databases. There is huge scope here for cool visualisation of the growth of our knowledge, such as this:
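To show these really are trivial queries, a question like "how many amphibian occurrence records does GBIF have for Indonesia?" is a single call to GBIF's public occurrence API. A sketch that just builds the query URL (the taxonKey for Amphibia, 131, is from the current GBIF backbone and could change):

```python
from urllib.parse import urlencode

GBIF_API = "https://api.gbif.org/v1/occurrence/search"

def count_query(**params):
    """Build a GBIF occurrence-search URL that returns only a count
    (limit=0 asks for the total without any actual records)."""
    params["limit"] = 0
    return GBIF_API + "?" + urlencode(sorted(params.items()))

# Amphibian occurrences in Indonesia; taxonKey 131 = Amphibia
# in the GBIF backbone at the time of writing.
url = count_query(country="ID", taxonKey=131)
```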

Terrible title, but this is where we monitor change, both "organic" and anthropogenic. This is where we use data mining to do a sentiment analysis of the biosphere, looking to detect changes such as outbreaks of disease, invasive species, etc. This builds on 1 but focusses on change. Imagine a "news service" for biology along the lines of tools available to financial markets (e.g., Silobreaker):

This is where we interface with decision makers. If Braulio Dias is right that "the lack of adequate biodiversity monitoring is at the heart of our difficulties to make convincing arguments", then this is the theme that tackles that problem.

3. Modelling the biosphere

Time to model all life on Earth (http://dx.doi.org/10.1038/493295a) is our equivalent of a moon shot (oh how I hate that analogy). Purves et al. have made the case; this is the task that will galvanise people outside the taxonomy/biodiversity community. This is real megascience (theme 1 is data collection, theme 2 is data mining and analysis). Climate modellers and oceanographers get to do this:

This service is fairly crude; in particular, I make no attempt to score the matches that VIAF returns, because that would require parsing and normalising author names. This could be added if needed. If you want some example names to try, here are some taxonomists:

Open Data should be normal practice and should embody the principles of being accessible, assessable, intelligible and usable.

Seems obvious, but data providers are often reluctant to open "their" data up for reuse.

Data encoding should allow analysis across multiple scales, e.g. from nanometers to planet-wide and from fractions of a second to millions of years, and such encoding schemes need to be developed. Individual data sets will have application over a small fraction of these scales, but the encoding schema needs to facilitate the integration of various data sets in a single analytical structure.

No, I don't know what this means either, but I'm guessing that it's relevant if we want to attempt this: doi:10.1038/493295a

Infrastructure projects should devote significant resources to market the service they develop, specifically to attract users from outside the project-funded community, and ideally in significant numbers. To make such an investment effective, projects should release their service early and update often, in response to user feedback.

Put simply, make something that is both useful and easy to use. Simples.

Build a complete list of currently used taxon names with a statement of their interrelationships (e.g. this is a spelling variation; this is a synonym; etc.). This is a much simpler challenge than building a list of valid names, and an essential pre-requisite.

One of the simplest tasks, first tackled successfully by uBio, now moribund. The Global Names project seems stalled, intent on drowning in acronym soup (GNA, GNI, GNUB, GNITE).
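To make concrete how little structure this actually needs, here's a toy version: every name string gets an entry, and relationships between names are typed edges. The names and relationships below are invented for illustration:

```python
# Invented example records: each name maps to a list of
# (relationship_type, target_name) statements.
names = {
    "Pteropus": [],
    "Pteropos": [("spelling_variant_of", "Pteropus")],
    "Spectrum": [("synonym_of", "Pteropus")],
}

def variants_of(target):
    """All names linked to target by any relationship."""
    return sorted(n for n, rels in names.items()
                  if any(t == target for _, t in rels))
```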

Attach a Persistent Identifier (PID) to every resource so that they can be linked to one another. Part of the PID should be a common syntactic structure, such as ‘DOI: ...’ so that any instance can be simply found in a free-text search.

DOIs have won the identifier wars, and everything citable (publications, figures, datasets) is acquiring one. The mistake to avoid is forgetting that identifiers need services built on top of them (see http://labs.crossref.org/ for some DOI-related tools). The core service we need is reverse lookup: given this thing (publication, specimen, etc.) what is its identifier?
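As a sketch of what reverse lookup can look like, CrossRef's REST API accepts free-text bibliographic queries: hand it a citation string, get back candidate DOIs. The code below only constructs the query URL (endpoint and parameter names are CrossRef's, current at the time of writing):

```python
from urllib.parse import urlencode

def crossref_lookup_url(citation):
    """Build a CrossRef works query for free-text reverse lookup:
    given a citation string, ask for the best-matching record."""
    return "https://api.crossref.org/works?" + urlencode(
        {"query.bibliographic": citation, "rows": 1})

url = crossref_lookup_url("Darwin 1859 On the Origin of Species")
```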

Implement a system of author identifiers so that the individual contributing a resource can be identified. This, in combination with the PID (above), will allow the computation of the impact of any contribution and the provenance of any resource.

This is a solved problem, assuming ORCID continues to gain momentum. For past authors VIAF has identifiers (which are being incorporated into Wikipedia).

Make use of trusted third-party authentication measures so that users can easily work with multiple resources without having to log into each one separately.

Again, a solved problem. People routinely use third parties such as Google and Facebook for this purpose.

Build a repository for classifications (classification bank) that will allow, in combination with the list of taxonomic names, automatic construction of taxonomies to close gaps in coverage.

Let's not, let's focus on the only two classifications that actually matter because they are linked to data, namely GBIF and NCBI. If we want one classification to coalesce around make it GBIF (NCBI will grow anyway).

Develop a single portal for currently accepted names - one of the priority requirements for most users.

Yup, we still haven't got this; clearly we didn't get the memo about point 3.

Standards and tools are needed to structure data into a linked format by using the potential of vocabularies and ontologies for all biodiversity facets, including: taxonomy, environmental factors, ecosystem functioning and services, and data streams like DNA (up to genomics).

The most successful vocabulary we've come up with (Darwin Core) is essentially an agreed way to label columns in Excel spreadsheets. I've argued elsewhere that focussing on vocabularies and ontologies distracts from the real prerequisite for linking stuff together, namely reusable identifiers (see 5). No point developing labels for links if you don't have the links.
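To illustrate the "labelled columns" point, here is essentially all it takes to consume a Darwin Core-style occurrence file; the vocabulary is, in effect, an agreed set of column names (the record below is invented):

```python
import csv
import io

# A tiny Darwin Core-ish occurrence "file": the vocabulary amounts to
# agreed column labels such as occurrenceID, scientificName, country.
dwc = io.StringIO(
    "occurrenceID,scientificName,country\n"
    "urn:uuid:example-1,Pteropus vampyrus,ID\n"
)
rows = list(csv.DictReader(dwc))
```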

Mechanisms to evaluate data quality and fitness-for-purpose are required.

A next-generation infrastructure is needed to manage ever-increasing amounts of observational data.

Not our problem; see doi:10.1038/nature11875 (by which I mean lots of people need massive storage, so it will be solved).

Food for thought. I suspect the gaggle of biodiversity informatics projects will seek to align themselves with some of these goals, carving up the territory. Sadly, we have yet to find a way to coalesce critical mass around tackling these challenges. It's a cliché, but I can't help thinking "what would Google do?" or, more precisely, "what would a Google of biodiversity look like?"

Thursday, April 11, 2013

Quick notes on "taxon concepts". In order to navigate through taxon names I plan to have at least one taxonomic classification in BioNames. GBIF makes the most sense at this stage.

The model I'm adopting is that the classification is a graph where nodes have the id used by the external database (in this case GBIF). Each node has one or more names attached, and where possible the names are linked to the original description. Where we have synonyms it would be nice to link the synonymy to publication(s) that proposed that relationship.
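A minimal sketch of this model in Python; the ids and the publication link are placeholders, not real GBIF keys:

```python
# Nodes keyed by the external database's id (GBIF usage-key style;
# the numbers here are illustrative). Each node records its parent and
# one or more names, optionally linked to the original description.
taxonomy = {
    1000: {"parent": None, "names": [{"string": "Chiroptera"}]},
    1001: {
        "parent": 1000,
        "names": [
            # publication link is a placeholder, not a real identifier
            {"string": "Pteropus", "published_in": "doi:..."},
        ],
    },
}

def children(node_id):
    """All nodes whose parent is node_id."""
    return [k for k, v in taxonomy.items() if v["parent"] == node_id]
```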

The idea is that a taxonomy, such as the GBIF backbone taxonomy, could be placed in GitHub where people could clone it, annotate it, correct it, edit it, or otherwise mess with it, then GBIF could pull in those edits and release an updated, cleaner taxonomy. If software version control seems a bit esoteric, it's worth noting that use of GitHub is rapidly becoming much more mainstream in science, and not just for software development. People are using it to store versions of data analysis (e.g., https://github.com/dwinter/Fungal-Foray) and collaboratively write manuscripts (e.g., https://github.com/weecology/data-sharing-paper). The journal eLife is depositing articles there (e.g., https://github.com/elifesciences/elife-articles). In addition to all the infrastructure GitHub provides (the ability to identify who did what and when, to roll back changes, to fork classifications, etc.) there is also the attraction of not creating yet more software, but simply editing a classification by moving folders around on your local filesystem. The idea seems irresistible…
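A sketch of the folders idea: given a toy classification (the names and ids below are invented), materialising it on disk as nested directories, one per taxon, takes only a few lines, after which a file manager and git do the rest.

```python
import os
import tempfile

# Toy classification: each node has a name and a list of child ids.
classification = {
    "1": {"name": "Chiroptera", "children": ["2"]},
    "2": {"name": "Pteropus", "children": []},
}

def write_tree(node_id, root):
    """Materialise the classification as nested folders, one per taxon,
    so it can be edited by moving folders and versioned with git."""
    path = os.path.join(root, classification[node_id]["name"])
    os.makedirs(path, exist_ok=True)
    for child in classification[node_id]["children"]:
        write_tree(child, path)
    return path

root = tempfile.mkdtemp()
write_tree("1", root)
```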

Despite QR Codes being uncool, there's something appealing about the idea of compressing a DNA barcode sequence into a small image. Imagine having a specimen label with a QR Code, pointing a smart phone at the label using an app that converts the QR Code to a sequence, sends it to BLAST and returns a phylogeny that includes DNA from that specimen (perhaps using a service like http://iphylo.org/~rpage/phyloinformatics/blast).
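The compression step is straightforward: with a four-letter alphabet each base needs only two bits, so a ~650 bp COI barcode packs into ~163 bytes, comfortably inside a QR Code's binary capacity. A sketch of the packing (a real encoder would also need to record the sequence length and handle ambiguity codes like N):

```python
# Two bits per base, four bases per byte.
BITS = {"A": 0, "C": 1, "G": 2, "T": 3}

def pack(seq):
    """Pack a DNA string into bytes. Sequences whose length isn't a
    multiple of 4 are padded with 'A'; real use would store the length
    separately so padding can be stripped on decode."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        b = 0
        for base in seq[i:i + 4].ljust(4, "A"):
            b = (b << 2) | BITS[base]
        out.append(b)
    return bytes(out)
```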

Thursday, April 04, 2013

I'm working on displaying OCR text from BHL using SVG, and these are just some quick notes on font size: specifically, how SVG font-size corresponds to the size of letters, and how to work out what point size was used to print text on a BHL page.

SVG font-size corresponds to the EM square of the font. Hence, if I specify a font-size of 100px then text looks like this (you'll need a browser that supports SVG to see this):

The yellow box is the EM square (in this example 100px by 100px). The height of the letter "M" is set by the properties of the font which in this case is Times-Roman which has a capheight of 662. This value (and others) are defined in the font description file (Adobe-Core35_AFMs-314.tar.gz).

Below is a diagram showing attributes of Times Roman with respect to the 1000 x 1000 EM square:

A couple of things to note. The first is that the height of a digital font is not given by simply adding the capheight and the descender; the height of the font is the EM square. The second is that if you know the capheight and the font metrics you can compute the size of the EM square (for Times Roman, capheight / 0.662 gives you the EM square). Hence it is possible to reproduce printed text fairly accurately in SVG. I had hoped that I could then go on to infer the actual point size used on the printed page (being able to say "this is 10pt" seemed more elegant than "this font is x pixels"). It turns out that "point size" is a terribly elusive concept, see Point Size and the Em Square: Not What People Think. I've clearly got lots to learn about typography. BHL would be a gold mine for anyone interested in the development of typefaces and printing technology over time.
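The capheight arithmetic above, in code form (the 662 value is Times-Roman's capheight per 1000-unit EM square, from the Adobe AFM file cited earlier):

```python
# Times-Roman capheight per 1000-unit EM square (from the Adobe AFM file).
CAPHEIGHT = 662

def em_from_capheight(capheight_px):
    """Given the measured capital-letter height in pixels, recover the
    EM square, i.e. the SVG font-size that reproduces the text."""
    return capheight_px / (CAPHEIGHT / 1000.0)
```

So if the capital letters on a scanned page measure 66.2px high, the text should be set at font-size 100px.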