iPhylo

Rants, raves (and occasionally considered opinions) on phyloinformatics, taxonomy, and biodiversity informatics. For more ranty and less considered opinions, see my Twitter feed. ISSN 2051-8188.

Wednesday, January 24, 2018

Nico Franz and Beckett Sterner created a stir last year with a preprint in bioRxiv about expert validation (or the lack of it) in the "backbone" classifications used by aggregators. The final version of the paper was published this month in the OUP journal Database (doi:10.1093/database/bax100).

To see what effect "backbone" taxonomies are having on aggregated occurrence records, I've recently been auditing datasets from GBIF and the Atlas of Living Australia. The results are remarkable, and I'll be submitting a write-up of the audits for formal publication shortly. Here I'd like to share the fascinating case of the genus Not Chan, 2016.

I found this genus in GBIF. A Darwin Core record uploaded by the New Zealand Arthropod Collection (NZAC02015964) had the string "not identified on slide" in the scientificName field, and no other taxonomic information.

GBIF processed this record and matched it to the genus Not Chan, 2016, which is noted as "doubtful" and "incertae sedis".
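To see how a junk string can become a "genus", you can query GBIF's public name-matching service directly. The `/v1/species/match` endpoint is a real GBIF API, but the response handling below is an illustrative sketch, not GBIF's internal matching code:

```javascript
// Build a request to GBIF's species-match API, which fuzzily matches whatever
// lands in the scientificName field of a Darwin Core record.
function gbifMatchUrl(scientificName) {
  const base = "https://api.gbif.org/v1/species/match";
  return `${base}?verbose=true&name=${encodeURIComponent(scientificName)}`;
}

async function matchName(scientificName) {
  const response = await fetch(gbifMatchUrl(scientificName));
  const result = await response.json();
  // For junk strings the matchType may be something other than NONE, which is
  // how "not identified on slide" can end up resolved to the genus "Not".
  return { matchType: result.matchType, canonicalName: result.canonicalName };
}

console.log(gbifMatchUrl("not identified on slide"));
```

Pasting that URL into a browser shows what the matcher makes of the string; auditing a dataset is largely a matter of running every scientificName value through this kind of check and eyeballing the odd results.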

There are 949 other records of this genus around the world, carefully mapped by GBIF. The occurrences come from NZAC and nine other datasets. The full scientific names and their numbers of GBIF records are:

Number  Name
2       Not argostemma
14      not Buellia
1       not found, check spelling
1       Not given (see specimen note) bucculenta
1       Not given (see specimen note) ortoni
1       Not given (see specimen note) ptychophora
1       Not given (see specimen note) subpalliata
1       not identified on slide
1       not indentified
1       Not known not known
1       Not known sp.
1       not Lecania
4       Not listed
873     Not naturalised in SA sp.
18      Not payena
5       not Punctelia
18      not used
6       Not used capricornia Pleijel & Rouse, 2000

GBIF cites this article on barnacles as the source of the genus, although the name should really be Not Chan et al., 2016. A careful reading of this article left me baffled, since the authors nowhere use "not" as a scientific name.

Next I checked the Catalogue of Life. Did CoL list this genus, and did CoL attribute it to Chan? No, but "Not assigned" appears 479 times among the names of suprageneric taxa, and the December 2017 CoL checklist includes the infraspecies "Not diogenes rectmanus Lanchester, 1902" as a synonym.

The Encyclopedia of Life also has "Not" pages, but these have in turn been aggregated on the "EOL pages that don't represent real taxa" page, and under the listing for the "Not assigned36" page someone has written:

This page contains a bunch of nodes from the EOL staff Scratchpad. NB someone should go in and clean up that classification.

"Someone should go in and clean up that classification" is also the GBIF approach to its "backbone" taxonomy, although they think of that as "we would like the biodiversity informatics community and expert taxonomists to point out where we've messed up". Franz and Sterner (2018) have also called for collaboration, but in the direction of allowing for multiple taxonomic schemes and differing identifications in aggregated biodiversity data. Technically, that would be tricky. Maybe the challenge of setting up taxonomic concept graphs will attract brilliant developers to GBIF and other aggregators.

Meanwhile, Not Chan, 2016 will endure and aggregated biodiversity records will retain their vast assortment of invalid data items, character encoding failures, incorrect formatting, duplications and truncated data items. In a post last November on the GitHub CoL+ pages I wrote:

Being old and cynical, I can speculate that in the time spent arguing the "politics" of aggregation in recent years, a competent digital librarian or data scientist would have fixed all the CoL issues and would be halfway through GBIF's. But neither of those aggregators employs digital librarians or data scientists, and I'm guessing that CoL+ won't employ one, either.

Monday, December 11, 2017

These notes are the result of a few events I've been involved in over the last couple of months, including TDWG 2017 in Ottawa, a thesis defence in Paris, and a meeting of the Science Advisory Board of the Natural History Museum in London. For my own benefit if no one else's, I want to sketch out some (less than coherent) ideas for how a natural history museum becomes truly digital.

Background

The digital world poses several challenges for a museum. In terms of volume of biodiversity data, museums are already well behind two major trends, observations from citizen science and genomics. The majority of records in GBIF are observations, and genomics databases are growing exponentially, through older initiatives such as barcoding, and newer methods such as environmental genomics. While natural history collections contain an estimated 10⁹ specimens or "lots" [1], less than a few percent of that has been digitised, and it is not obvious that massive progress in increasing this percentage will be made any time soon.

Furthermore, for citizen science and genomics it is not only the amount of data but the network effects that are possible with that data that make it so powerful. Network effects arise when the value of something increases as more people use it (the classic example is the telephone network). In the case of citizen science, apart from the obvious social network that can form around a particular taxon (e.g., birds), there are network effects from having a large number of identified observations. iNaturalist is using machine learning to suggest identifications of photos taken by members. The more members join and add photos and identifications, the more reliable the machine identifications become, which in turn makes it more desirable to join the network. Genomics data also shows network effects. In effect, a DNA sequence is useless without other sequences to compare it with (it is no accident that the paper describing BLAST is one of the most highly cited in biology). The more sequences a genomics database has the more useful it is.

For museums the explosion of citizen science and genomics begs the question "is there any museum data that can show similar network effects"? We should also ask whether there will be an order of magnitude increase in digitisation of specimens in the near future. If not, then one could argue that museums are going to struggle to remain digitally relevant if they remain minority biodiversity data providers. Being part of organisations such as GBIF certainly helps, but GBIF doesn't (yet) offer much in the way of network effects.

Users

We could divide the users of museums into three distinct (but overlapping) communities. These are:

Scientists

Visitors

Staff

Scientists make use of research and data generated by the museum. If the museum doesn't support science (both inside and outside the museum) then the rationale for the collections (and associated staff) evaporates. Hence, digitisation must support scientific research.

Visitors in this sense means both physical and online visitors. Online visitors will have a purely digital experience, but in-person visitors can have both physical and digital experiences.

In many ways the most neglected category is the museum staff. Perhaps the best way to make progress towards a digital museum is to have the staff committed to that vision, and this means digitisation should wherever possible make their work easier. In many organisations going digital means a difficult transition period of digitising material, dealing with crappy software that makes their lives worse, and a lack of obvious tangible benefits (digitisation for digitisation's sake). Hence outcomes that deliver benefits to people doing actual work should be prioritised. This is another way of saying that museums need to operate as "platforms": the best way to ensure that external scientists will use the museum's digital services is if the research of the museum's own staff depends on those services.

Some things to do

For each idea I sketch a "vision", some ways to get there, what I think the current reality is (and, let's be honest, what I expect it to still be like in 10 years time).

Vision: Anyone with an image of an organism can get an answer to the question "what is this?"

Task: Image the collection in 2D and 3D. Computers can now "see", and can accomplish tasks such as identify species and traits (such as the presence of disease [2]) from images. This ability is based on machine learning from large numbers of images. The museum could contribute to this by imaging as many specimens as possible. For example, a library of butterfly photos could significantly increase the accuracy of identifications by tools such as iNaturalist. Creating 3D models of specimens could generate vast numbers of training images [3] to further improve the accuracy of identifications. The museum could aim to provide identifications for the majority of species likely to be encountered/photographed by its users and other citizen scientists.

Reality: Imaging is unlikely to be driven by identification and machine learning; the biggest use will be to provide eye-catching images for museum publicity.

Who can help: iNaturalist has experience with machine learning. More and more research is appearing on image recognition, deep learning, and species identification.

Vision: Anyone with a DNA sequence can get an answer to the question "what is this?"

Task: DNA sequence the collection, focussing first on specimens that (a) have been identified and (b) represent taxonomic groups that are dominated by "dark taxa" in GenBank. Many sequences being added to GenBank are unidentified and hence unnamed. These will only become named (and hence potentially connected to more information) if we have sequences from identified material of those species (or close relatives). Often discussions of sequences focus on doing the type specimens. While this satisfies the desire to pin a name to a sequence in the most rigorous way, it doesn't focus on what users need - an answer to "what is this?" The number of identified specimens will far exceed the number of type specimens, and many types will not be easily sequenced. Sequencing identified specimens puts the greatest amount of museum-based information into sequence space. This will become even more relevant as citizen science starts to expand to include DNA sequences (e.g., using tools like MinION).

Reality: Lack of clarity over what taxa to prioritise, emphasis on type specimens, concerns over whether DNA barcoding is out of date compared to other techniques (ignoring importance of global standardisation as a way to make data maximally useful) will all contribute to a piecemeal approach.

Vision: A physical visitor to the museum has a digital experience deeply informed by the museum's knowledge

Task: The physical walls of the museum are not barriers separating displays from science but rather interfaces to that knowledge. Any specimen on display is linked to what we know about it. If there is a fossil on a wall, we can instantly see the drawings made of that specimen in various publications, 3D scans to interact with, information about the species, the people who did the work (whether historical figures or current staff), and external media (e.g., BBC programs).

Reality: Piecemeal, short-lived gimmicky experiments (such as virtual reality), no clear attempt to link to knowledge that visitors can learn from or create themselves. Augmented reality is arguably more interesting, but without connections to knowledge it is a gimmick.

Who could help: Many of the links between specimens, species, and people fall into the domain of Wikipedia and Wikidata, hence lots of opportunities for working with the GLAM Wiki community.

Vision: A museum researcher can access all published information about a species, specimen, or locality via a single web site.

Task: All books and journals in the museum library that are not available online should be digitised. This should focus on materials post-1923, as pre-1923 material is being digitised by BHL. The initial goal is to provide the museum's researchers with the best possible access to knowledge; the secondary goal is to open that up to the rest of the world. All digitised content should be available to researchers within the museum using a model similar to the HathiTrust, which manages content scanned by Google Books. The museum aggressively pursues permission to open as much of the digitised content up as it can, starting with its own books and journals. But it scans first, sorts out permissions later. For many uses, full access isn't necessarily needed, at least for discovery. For example, by indexing text for scientific names, specimen codes, and localities, researchers could quickly discover whether a text is relevant, even if ultimately direct physical access is the only possibility for reading it.
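The "index first, open later" idea can be sketched very simply: even without releasing full text, a crude index of name-like and code-like strings lets a researcher discover whether a scanned work is worth tracking down. The regexes below are simplistic placeholders, not a real taxonomic name parser:

```javascript
// Extract candidate scientific names ("Genus species") and specimen codes
// (acronym + number) from a page of OCR text. Both patterns are deliberately
// naive; real name-finding tools are far more sophisticated.
const NAME_PATTERN = /\b[A-Z][a-z]+ [a-z]{3,}\b/g;    // e.g. "Papilio machaon"
const SPECIMEN_PATTERN = /\b[A-Z]{3,6}\s?\d{4,}\b/g;  // e.g. "NHMUK 014325"

function indexPage(pageText) {
  return {
    names: [...new Set(pageText.match(NAME_PATTERN) || [])],
    specimens: [...new Set(pageText.match(SPECIMEN_PATTERN) || [])],
  };
}

const page = "Specimen NHMUK 014325 of Papilio machaon was collected near Oxford.";
console.log(indexPage(page));
```

Run over every page of a scanned volume, an index like this supports "is this book relevant to my species?" queries without any of the copyright headaches of serving the text itself.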

Reality: Piecemeal digitisation hampered by the chilling effects of copyright, combined with limited resources means the bulk of our scientific knowledge is hard to access. A lack of ambition means incremental digitisation, with most taxonomic research remaining inaccessible, and new research constrained by needing access to legacy works in physical form.

Who could help: Consider models such as HathiTrust, work with BHL and publishers to open up more content, and with text-mining researchers to help maximise use even for content that can't be opened up straight away.

Vision: The museum as a "connection machine" to augment knowledge

Task: While a museum can't compete in terms of digital volume, it can compete for richness and depth of linking. Given a user with a specimen, an image, a name, a place, how can the museum use its extensive knowledge base to augment that user's experience? By placing the thing in a broader context (based on links derived from image -> identity tools, sequence -> identity tools, names to entities e.g., species, people and places, and links between those entities) the museum can enhance our experience of that thing.

Reality: The goal of having everything linked together into a knowledge graph is often talked about, but generally fails to happen, partly because things rapidly descend into discussions about technology (most of which sucks), and squabbling over identifiers and vocabularies. There is also a lack of clear drivers, other than "wouldn't it be cool?". Hence expect regular calls to link things together (e.g., Let’s rise up to unite taxonomy and technology), demos and proof of concept tools, but little concrete progress.

Who can help: The Wikidata community, and initiatives such as Big Data Europe and BBC Things (some of these are no longer alive but are useful to investigate). The BBC's defunct Wildlife Finder is an example of what can be achieved with fairly simple technology.

Summary

The fundamental challenge the museum faces is that it is analogue in an increasingly digital world. It cannot be, nor should it be, completely digital. For one thing it can't compete, for another its physical collection, physical space, and human expertise are all aspects that make a museum unique. But it needs to engage with visitors who are digitally literate, it needs to integrate with the burgeoning digital knowledge being generated by both citizens and scientists, and it needs to provide its own researchers with the best possible access to the museum's knowledge. Above all, it needs to have a clear vision of what "being digital" means.

Tuesday, December 05, 2017

David Attenborough’s latest homage to biodiversity, Blue Planet II is, as always, visually magnificent. Much of its impact derives from the new views of life afforded by technological advances in cameras, drones, diving gear, and submersibles. One might hope that the supporting information online reflected the equivalent technological advances made in describing and sharing information. Sadly, this is not the case. Instead the BBC offers a web site with video clips and a poster... a $%@£ poster.

This is a huge missed opportunity. Where do people go to learn more about the organisms featured in an episode? How do we discover related content on the BBC and elsewhere? How do we discover the science underpinning each episode that has been so exquisitely filmed and edited?

Perhaps the lack of an online resource reflects a lack of resources, or expertise? Yet one look at the series (and the "Into the blue" epilogues) tells us that resources are hardly limiting. Furthermore, the BBC has previously constructed rich, informative web sites to support natural history programming. The now deprecated BBC Nature Wildlife site had an extensive series of web pages for the organisms featured in BBC programmes, with links to individual clips. For each organism the corresponding web page listed key traits such as behaviours, habitats, and geographic distribution, and each of these traits had its own web page listing all organisms with those traits (see, for example, the page for Steller's Sea Eagle).

Underlying all this information was a simple vocabulary (the Wildlife Ontology), and the entire corpus is also available in RDF: in other words, the BBC used Semantic Web technologies to structure this information. To get this data you simply append ".rdf" to the URL for a web page. The RDF for Steller's Sea Eagle, for example, is not pretty, but it is a great example of machine-readable data that enables all sorts of interesting things to be built.

For some reason, this web site is now deprecated. As an exercise I grabbed the RDF from the web site, did a little cleaning, and merged it together, resulting in a set of around 94,500 triples (statements of the form “subject”, “predicate”, “object”). One of these triples, for example, says that Steller's Sea Eagle is monogamous.

One reason the Semantic Web has struggled to gain widespread adoption is the long list of things you need to get to the point where it is usable. You need data consistently structured using the same vocabulary. You need identifiers that everyone agrees on (or at least can map their own identifiers to). And you need a triple store, which is essentially a graph database, a technology that is still unfamiliar to many. But in this case the BBC has done a lot of the hard work by cleverly minting identifiers based on Wikipedia URLs (“slugs”), and developing a vocabulary to express relationships between organisms, traits, and habitats. All that’s needed is a way to query this data. Rather than use a triple store (most of which are not much fun to install or maintain) I’ve used the delightfully simple approach of employing a Hexastore. Hexastores provide fast querying of graphs by indexing all six permutations of the subject, predicate, object triple (hence “hexa”). The approach is sufficiently simple that for moderately sized databases we can implement it in Javascript and run it in a web browser.
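A minimal hexastore can be written in a few lines of Javascript. This is a toy version of the idea, not the code behind my demo, and the triples are made-up stand-ins for the BBC's richer vocabulary:

```javascript
// A minimal hexastore: index every triple under all six orderings of
// (subject, predicate, object) so that any query pattern with one unknown
// is a direct hash lookup rather than a scan.
class Hexastore {
  constructor() {
    this.index = {};
    for (const order of ["spo", "sop", "pso", "pos", "osp", "ops"]) {
      this.index[order] = {};
    }
  }
  add(s, p, o) {
    const parts = { s, p, o };
    for (const order of Object.keys(this.index)) {
      const [a, b, c] = order.split("").map(k => parts[k]);
      const level1 = (this.index[order][a] ??= {});
      (level1[b] ??= new Set()).add(c);
    }
  }
  // Look up the missing third element given the first two, e.g.
  // lookup("spo", subject, predicate) returns matching objects.
  lookup(order, first, second) {
    const level1 = this.index[order][first];
    return level1 && level1[second] ? [...level1[second]] : [];
  }
}

const store = new Hexastore();
store.add("StellersSeaEagle", "mating", "Monogamy");
store.add("StellersSeaEagle", "habitat", "Coast");
store.add("MuteSwan", "mating", "Monogamy");

// What is the eagle's mating system? (Monogamy)
console.log(store.lookup("spo", "StellersSeaEagle", "mating"));
// Which species are monogamous? (StellersSeaEagle, MuteSwan)
console.log(store.lookup("ops", "Monogamy", "mating"));
```

The space cost is six copies of the data, which is the trade that makes every query pattern fast; for a ~94,500-triple dataset that is easily affordable in a browser.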

Once you load the page there are no further server requests, other than fetching images. Every query is “live” but takes place in the browser. You can click on the image for a species and get some textual information, as well as images representing traits of that organism. Click on a trait and you discover what organisms share those traits. This example is trivial, but surprisingly rich. I’ve found it fascinating to simply bounce around the images discovering unexpected facts about different species. There’s lots of potential for serendipitous discovery, as well as an enhanced appreciation for just how rich the BBC’s content is. If the Encyclopedia of Life were this engaging I’d be its biggest fan.

The question, then, is why a similar approach was not taken for Blue Planet II? It can’t be a lack of resources, this series has amazing production values. And yet a wonderful opportunity has been missed. Why not build on the existing work and create an interactive resource that encourages people to explore more deeply and learn more? Much of the existing data could be used, as well as adding all the new species and behaviours we see on our TV screens. Blue Planet also highlights the impacts humans are having on the marine environment; these could be added as categories as well, to show what organisms are susceptible to different impacts (e.g., plastics).

That the BBC thinks a poster is an adequate form of engagement in the digital age speaks of a corporation that, in spite of many triumphs in the digital sphere (e.g., iPlayer), has not fully grasped the role the web can play in making its content more widely useful and relevant, beyond enthralling viewers on a Sunday evening. It also seems oblivious to the fact that it already knows how to deliver rich, informative online content (as evidenced by the now deprecated Wildlife application). So please, BBC, can we have a resource that enables us to learn more about the organisms and habitats that are the subjects of the grandeur and beauty we see on our TV screens?

Follow up

Below is some of the discussion this post generated on Twitter.

Very cool, your hexastore page has now become a great tool, using it for my revision now

Also going to contradict this view slightly (since the /programmes proposition sits in my portfolio). There's a lot more than a poster, for a start: https://t.co/xbHwsaCUHO and there are some assumptions about resource and available content that may not be as simple as suggested

“For some reason, this web site is now deprecated.” “That the BBC thinks a poster is an adequate for (sic) of engagement in the digital age” Have a look at this item, especially the comment by George Osborne at the bottom. https://t.co/OmHiYienzB

I took part in some SW pilots several years ago. The problem is the need to reach critical mass, particularly at a time when budgets are dropping. Public-facing providers like BBC and OU will always prioritise reach: anything additional needs to be off-the shelf to get adoption.

The Biodiversity Literature Repository (BLR) is a repository of taxonomic papers hosted by Zenodo. Where possible Plazi have extracted individual images and added those to the BLR, even if the article itself is not open access. The justification for being able to do this is presented here: DOI:10.1101/087015. I'm not entirely convinced by their argument (see Copyright and the Use of Images as Biodiversity Data) but rather than rehash that argument I decided it would be much more fun to get a sense of what is in the BLR. I built a tool to scrape data from Zenodo and store it in CouchDB, put a simple search engine on top (using the search functionality in Cloudant) to search within the figure captions, and wrote some code to use a cloud-based image server to generate thumbnails for the images in Zenodo (some of which are quite big). The tool is hosted at Heroku, you can try it out here: https://zenodo-blr-interface.herokuapp.com/.
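The harvesting side is straightforward because Zenodo exposes a REST API. The `/api/records` endpoint and its `communities` filter are real Zenodo features; "biosyslit" is the community name I believe BLR uses on Zenodo, but treat that, and the response handling, as assumptions in this sketch rather than a description of my actual scraper:

```javascript
// Build a search request against Zenodo's records API, restricted to the
// BLR community, with simple paging.
function blrSearchUrl(query, page = 1) {
  const params = new URLSearchParams({
    communities: "biosyslit",
    q: query,
    page: String(page),
    size: "20",
  });
  return `https://zenodo.org/api/records?${params}`;
}

async function searchCaptions(query) {
  const response = await fetch(blrSearchUrl(query));
  const hits = (await response.json()).hits.hits;
  // Keep just enough metadata to build a thumbnail gallery.
  return hits.map(h => ({ doi: h.doi, title: h.metadata.title }));
}

console.log(blrSearchUrl("NHMUK"));
```

Paging through results like this and pushing each record into CouchDB gives you a local copy you can index and search however you like.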

This is not going to win any design awards, I'm simply trying to get a feel for what imagery BLR has. My initial reaction was "wow!". There's a rich range of images, including phylogenies, type specimens, habitats, and more. Searching by museum codes, e.g. NHMUK is a quick way to discover images of specimens from various collections.

Based on this experiment there are at least two things I think would be fun to do.

Adding more images

BLR already has a lot of images, but the biodiversity literature is huge, and there's a wealth of imagery elsewhere, including journals not in BLR, and of course the Biodiversity Heritage Library (BHL). Extracting images from articles in BHL would potentially add a vast number of additional images.

Machine learning

Machine learning is hot right now, and anyone using iNaturalist is probably aware of their use of computer vision to suggest identifications for images you upload. It would be fascinating to apply machine learning to images in the BLR. Even basic things such as determining whether an image is a photo or a drawing, how many specimens are included, what the specimen orientation is, what part of the organism is being displayed, is the image a map (and of what country) would be useful. There's huge scope here for doing something interesting with these images.

The toy I created is very basic, and merely scratches the surface of what could be done (Plazi have also created their own tool, see http://github.com/punkish/zenodeo). But spending a few minutes browsing the images is well worthwhile, and if nothing else is a reminder of both how diverse life is, and how active taxonomists are in trying to discover and describe that diversity.

Wednesday, October 04, 2017

Day three of TDWG 2017 highlighted some of the key obstacles facing biodiversity informatics.

After a fun series of "wild ideas" (nobody will easily forget David Bloom's "Kill your Darwin Core darlings") we had a wonderful keynote by Javier de la Torre (@jatorre) entitled "Everything happens somewhere, multiple times". Javier is CEO and founder of Carto, which provides tools for amazing geographic visualisations. Javier provided some pithy observations on standards, particularly the fate of official versus unofficial "community" standards (the community standards tend to be simpler, easier to use, and hence win out), and the potentially stifling effects standards can have on innovation, especially if conforming to standards becomes the goal rather than merely a feature.

Identifiers, identifiers, identifiers, identifiers

It's also striking to compare Javier de la Torre's work with Carto where there is a clear customer-driven focus (we need these tools to deliver this to users so that they can do what they want to do) versus the much less focussed approach of our community. Many of the things we aspire to won't happen until we identify some clear benefits for actual users. There's a tendency to build stuff for our own purposes (e.g., pretty much everything I do) or build stuff that we think people might/should want, but very little building stuff that people actually need.


What struck me was how similar this was to the now deprecated TDWG LSID vocabulary, still used by most of the major taxonomic name databases (the nomenclators). This is an instance where TDWG had a nice, workable solution, it lapsed into oblivion, only to be subsequently reinvented. This isn't to take anything away from Frank's work, which has a thorough discussion of the issues, and has a nice way to handle the difference between asserting that two taxa are the same (owl:equivalentClass) and asserting that a taxon/name hybrid (which is what many databases serve up because they don't distinguish between names and taxa) and a taxon might be the same (linking via the name they both share).

The fate of the RDF served by the nomenclators for the last decade illustrates a point I keep returning to (see also EOL Traitbank JSON-LD is broken). We tend to generate data and standards because it's the right thing to do, rather than because there's actually a demonstrable need for that data and those standards.

We like open data because it's free and it makes it easy to innovate, but we struggle to (a) get it funded and (b) demonstrate its value (hence pleas for credit/attribution, and begging for funding).

The alternative of closed data, such as paying a subscription to access a database, limits access and hence use and innovation, but it generates an income to support the database, and the value of the database is easy to measure (it's how much money it generates).

What if we have a "third model" where we pay small amounts of money to access data (micropayments)?

Micropayments as a way to pay creators is an old idea (it was part of Ted Nelson's Xanadu vision). Now that we have cryptocurrencies such as Bitcoin, micropayments are feasible. So we could imagine something like this:

Access to raw datasets is free (you get what you pay for)

Access to cleaned data comes at a cost (you are paying someone else to do the hard, tedious work of making the data usable)

Micropayments are made using Bitcoin

To help generate funds any spare computational capacity in the biodiversity community is used to mine Bitcoins

After the talk Dmitry Mozzherin sent me a link to Steem, and then this article about Steemit appeared in my Twitter stream:

Clearly this is an idea that has been bubbling around for a while. I think there is scope for thinking about ways to combine a degree of openness (we don't want to cripple access and innovation) with a way to fund that openness (nobody seems interested in giving us money to be open).

TDWG is a great opportunity to find out what is going on in biodiversity informatics, and also to get a sense of where the problems are. For example, sitting through the Financial Models for Sustaining Biodiversity Informatics Products session you couldn't help being struck by (a) the number of different projects all essentially managing specimen data, and (b) the struggle they all face to obtain funding. If this was a commercial market there would be some pretty drastic consolidation happening. It also highlights the difficulty of providing services to a community that doesn't have much money.

Chatting to Andrew at the evening event at the Canadian Museum of Nature, I think there's a lot of potential for developing tools to provide collections with data on the use and impact of their collections. Text mining the biodiversity literature on a massive scale to extract (a) mentions of collections (e.g., their institutional acronyms) and (b) citations of specimens could generate metrics that would be helpful to collections. There's a great opportunity here for BHL to generate immediate value for natural history collections (many of which are also contributors to BHL).
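A back-of-envelope version of this metric is just tallying acronym mentions across a corpus. The acronym list and matching below are placeholders; real specimen-citation mining needs much more care (ambiguous acronyms, code formats that vary by collection, OCR errors), but even this crude count hints at what BHL-scale text mining could give collections:

```javascript
// Scan article texts for institutional acronyms and tally mentions per
// collection. Word-boundary matching only; no disambiguation.
function tallyCollections(texts, acronyms) {
  const counts = Object.fromEntries(acronyms.map(a => [a, 0]));
  for (const text of texts) {
    for (const acronym of acronyms) {
      const hits = text.match(new RegExp(`\\b${acronym}\\b`, "g"));
      if (hits) counts[acronym] += hits.length;
    }
  }
  return counts;
}

const papers = [
  "Holotype NHMUK 1901.2.3, paratypes in AMNH and NHMUK.",
  "Material examined: AMNH 45678.",
];
console.log(tallyCollections(papers, ["NHMUK", "AMNH"]));
// → { NHMUK: 2, AMNH: 2 }
```

Aggregated over the literature and broken down by year, counts like these become exactly the kind of use-and-impact evidence collections could take to their funders.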

The final session I attended was Towards robust interoperability in multi-omic approaches to biodiversity monitoring. The overwhelming impression was that there is a huge amount of genomic data, much of which does not easily fit into the classic, Linnean view of the world that characterises, say, GBIF. For most of the sequences we don't know what they are, and that might not be the most interesting question anyway (more interesting might be "what do they do?"). The extent to which these data can be shoehorned into GBIF is not clear to me, although doing so may result in some healthy rethinking of the scope of GBIF itself.