DISQUS and Crowdsourcing Academic Data

I am not sure what exactly compelled me to write this post today, aside from the fact that I had a few spare minutes to do so. I will briefly outline what we have done with a botanical database I work on and the need to generate feedback on over a million objects, objects that need scientific scrutiny at scale. So, this post is basically what we did without a model to follow, how we invented a process to suit our needs, and what worked for us. There aren't many lessons learned to share here, as DISQUS made it quite easy all around.

The Environment

The project and database in question is JSTOR Plant Science. The standard text we use to describe it, which I wrote, is as follows: JSTOR Plant Science is an online environment that brings together content, tools, and people interested in plant science. It provides access to foundational content vital to plant science – plant type specimens, taxonomic structures, scientific literature, and related materials – making them widely accessible to the plant science community as well as to researchers in other fields and to the public.

Most important for the purposes of this discussion is that it contains over a million objects (slated to top out at 2.2 million) that describe the known plant biodiversity of our world. Their metadata is especially important for identifying, classifying, and preserving our dwindling plant life. And the metadata ranges in quality from perfect to poor. We needed a mechanism for getting more eyes on the problem.

The workflow is generally sturdy enough to handle the influx of new materials contributed weekly by partners (easily 7,000 images a week). Essentially, the materials are contributed via hard drive to our production team, given a quick QC, the metadata is corrected if needed, and the objects are then published to the live site. The only problem with this approach is that the metadata cannot be corrected centrally, as doing so exceeds our resource allocation for such a process (i.e., our one production person); it needs a wider scale of participation to fully understand and authenticate. In short, we need more eyes, and scientific ones at that, to correct this data.

Case Study: Syntype of Astilbe thunbergii Miq. var. aethusifolius

So, we have the following specimen, one collected in 1908 by T. Taquet in Quelpaert (modern-day Chejudo) in Korea. All the metadata fields are linked when possible. All are possible areas of contention, particularly the Identifications field at bottom. The Locality field in general tends to have wide variation, as it is an open field and these specimens range from the 18th century to modern times. As such, place names change (Formosa=Taiwan, Ceylon=Sri Lanka, etc.) relatively frequently. Which is fine, but it makes it difficult to enact a large cross-linking presentation of this data with any accuracy. All of this metadata is provided by the partner, in this case the Royal Botanic Garden Edinburgh (as one of our best, I suspect this particular metadata is flawless).

Enter DISQUS

We needed interaction, discussion, and collaboration around these materials to take place on the site (which is always nice for illustrating impact), primarily because that feedback could begin a communications loop resulting in corrected metadata whenever the feedback did indeed prove valid. So we turned to DISQUS to get the discussion part rolling.

We installed DISQUS on each and every object, turned it on, and said very little. People managed to find it and began commenting: roughly 6,000 comments rolled in over about a year, all with metadata corrections, suggested edits, and even contested identifications, which was, in our eyes, remarkable. Six thousand comments might not seem like a lot, but these were 6,000 scientifically valid observations that would greatly enhance the scientific record if they could be reflected on the objects themselves.

So, we created a workflow that revolved around some assets we had at our disposal:

Each object had a DOI (I am building here so bear with me)

Each contributing organization had a unique partner code (for example, Kew=K, Edinburgh=E, New York Botanical Garden=NYBG)

Each object's DOI had the partner code embedded in it. For example, the above specimen's URL is http://plants.jstor.org/specimen/e00313675, with 00313675 being the DOI for the object and e referring to Edinburgh, the contributing organization

Each partner had contacts dedicated to maintaining their collections
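Taken together, these assets make routing mechanical: the partner code embedded in a specimen ID tells you which organization owns the metadata, and therefore which contact should receive the feedback. A minimal sketch of that lookup, in Python (the partner codes come from the list above, but the contact addresses and the exact ID format are hypothetical placeholders, not JSTOR's actual system):

```python
import re

# Illustrative partner-code table. Codes are from the post (Kew=K,
# Edinburgh=E, New York Botanical Garden=NYBG); the contact addresses
# are made-up placeholders.
PARTNER_CONTACTS = {
    "k": "herbarium@kew.example",
    "e": "herbarium@rbge.example",
    "nybg": "herbarium@nybg.example",
}

def route_specimen_id(specimen_id):
    """Split an ID like 'e00313675' into (partner_code, object_doi, contact).

    Assumes IDs are a run of letters (the partner code) followed by
    digits (the object DOI), as in the example URL from the post.
    """
    match = re.fullmatch(r"([a-z]+?)(\d+)", specimen_id.lower())
    if match is None:
        raise ValueError("unrecognized specimen id: %r" % specimen_id)
    code, doi = match.groups()
    if code not in PARTNER_CONTACTS:
        raise ValueError("unknown partner code: %r" % code)
    return code, doi, PARTNER_CONTACTS[code]
```

For the specimen above, `route_specimen_id("e00313675")` yields the code `e`, the DOI `00313675`, and Edinburgh's (placeholder) contact address.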

The workflow looked like this, and this is where DISQUS helped quite a bit.

Using DISQUS allowed us to record comments, display them on the site, and simultaneously route the feedback to the contributing organizations with instructions on how to change the metadata if they felt it was valid. Our team didn't change anything, nor should we; we aren't scientists, and our role is strictly that of facilitator. The contributing organizations, the partners themselves, had full autonomy over their scientific data. If they felt the feedback was valid, and more often than not it was, they corrected it. Otherwise, they discarded it. DISQUS creates links directly to individual comments, a boon on busy pages full of scientific data.
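The routing step above can be sketched as a simple grouping pass: take a batch of comments, read the partner code off each specimen ID, and bucket them per partner so each organization receives its own digest with direct comment links. This is an illustration only; the `Comment` record and its fields are simplified stand-ins, not the actual DISQUS export format:

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Comment:
    """A simplified stand-in for an exported DISQUS comment."""
    specimen_id: str   # e.g. "e00313675"; the leading letters are the partner code
    author: str
    body: str
    permalink: str     # DISQUS's direct link to the comment

def group_feedback_by_partner(comments):
    """Bucket comments by the partner code embedded in each specimen ID."""
    batches = defaultdict(list)
    for c in comments:
        # Strip the trailing DOI digits, leaving the partner code.
        code = c.specimen_id.rstrip("0123456789").lower()
        batches[code].append(c)
    return dict(batches)
```

Each bucket can then be sent to the partner's collection contact, with the permalinks letting them jump straight to the relevant comment on a busy specimen page.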

The system has worked amazingly well so far and allows us to demonstrate the following:

a networking of an existing, often disparate community

evidence of real cross-organizational collaboration

improved scientific data

highly transparent process for scientific/academic crowdsourcing

This system worked so well for us that it has led me to believe it is exportable (it wasn't novel by any means) to large-scale digitization projects. The New York Public Library has used a similar approach amazingly well for its What's on the Menu? collection, a perfect example of how logical, even playful design can stimulate interaction. For our purposes, we could not afford to allow everyone to correct data on the site, precisely because it wasn't our data; it was maintained with scientific scrutiny by our contributing partners. Our concern revolved around getting feedback to the partners and back to the site as edited data in a predictable communication loop. With some minor glitches, we generally succeeded.