Rants, raves (and occasionally considered opinions) on phyloinformatics, taxonomy, and biodiversity informatics. For more ranty and less considered opinions, see my Twitter feed. ISSN 2051-8188

A rose by any other name might still smell as sweet, but an animal with two scientific monikers can wreak havoc for researchers trying to study it. Since 1895, the International Commission on Zoological Nomenclature (ICZN) has helped ensure animal names are unique and long-lasting, with a panel of volunteer commissioners who maintain naming rules and resolve conflicts when they arise. But the U.K.-based charitable trust that supports all this is slated to run out of money before the year's end—and that could spell trouble. "If the trust ceases to exist it will be very difficult for the commissioners to do their work," says Michael Dixon, chair of the trust's board and director of the Natural History Museum in London. If ICZN disappeared "it would be something akin to anarchy in animal naming."

The sums of money are not huge:

The nonprofit organization that formed in 1947 to raise funds and administer the ICZN code and the journal—the International Trust for Zoological Nomenclature—has weathered other crises. But net income from its journal is only about $47,000 a year, and the trust's annual expenses now top $155,000. So reserves are about to be exhausted, Dixon says.

A few weeks ago, he sent an e-mail plea to directors of natural history museums around the world for emergency relief. In it, he proposed establishing a committee that would come up with a new financial model for the troubled organization. "This is not unlike GenBank," the database of genome sequences that receives government support, Coddington says. "It's the same distributed goods [situation], that everyone needs and nobody wants to pay for."

...

Dixon estimates the trust needs $78,000 or more to make it through the year. No single organization may be able to fund it long-term, but a network of 10 or 20 institutions might be able to kick in enough to sustain it, he says.

Thursday, February 21, 2013

Somehow I get the feeling that botanists haven't got the "open data" religion. Not only is the list of plant names behind a really bad license, but the Global Plants Initiative (GPI) hides its type images behind a JSTOR Plant Sciences paywall. Why is botany determined to keep its data under wraps?

For example, the first specimen on the JSTOR site is GOET008353, the isotype of Aa achalensis Schltr. You can see a thumbnail of the specimen (shown on the right), but if you want the full image you need to have a subscription, otherwise you see this message:

The resource you are attempting to access is part of JSTOR Plant Science. JSTOR Plant Science is currently being offered free of charge for all JSTOR participants and not for profit institutions. To learn more about JSTOR Plant Science, please contact plants@jstor.org.

So, without a subscription you don't get to see this in high resolution (the JSTOR site features a higher resolution image and associated viewer):

So, the only place I can see this image is on JSTOR, for which I need a subscription. I'm also puzzled by the fact that JSTOR refers to this specimen as "GOET008353", whereas the original herbarium refers to it as "4966". GBIF also has this specimen, which it refers to as GOET GOET-Typen 4966. GOET008353 is a barcode assigned to types as part of the GPI digitisation programme; unfortunately, neither the originating herbarium nor GBIF seems to know about it.

In summary, we have three databases with data on this specimen, each with a different specimen identifier, none of which link to each other, and the available imagery is behind a paywall.

This is the number of new animal species described each year, estimated by parsing taxonomic names and extracting the date in the taxonomic authority. There are two prominent "spikes" which are worrying. Sarkar et al. discuss the peak in 1994:

For example, the analyzed data indicate that a significant portion of the 1994 peak is due to an increase in descriptions of the family Cerambycidae, a large group of beetles.

So, 1994 was a bumper year for describing new species of Cerambycidae? Not quite. Taxatoy is based on names in uBio, and I have a local copy of most of these names. The Cerambycidae names contain lots of duplicates that differ only in taxon authority; searching for the name Ancylocera macrotela on uBio, for example, returns multiple records for the same species under different authorities.

Why the spike in 1994? I suspect that this is due to the publication in 1994 of "Checklist of the Cerambycidae and Disteniidae (Coleoptera) of the Western Hemisphere" by Miguel A Monné and Edmund F Giesbert. At least 8552 names from that checklist seem to have ended up in uBio, all with the date "1994". So the spike is an artefact. Similarly, the other peak (1912) corresponds to the publication of a checklist by Per Olof Christopher Aurivillius, which contributes over 3000 names.
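The artefact is easy to reproduce. Here is a minimal sketch of the kind of parsing involved (the authority strings below are hypothetical examples, not actual uBio records): extract the trailing year from each name's authority and tally names per year. A checklist that stamps thousands of names with its own publication date produces exactly this kind of spike.

```python
import re
from collections import Counter

def authority_year(name):
    """Extract a plausible 4-digit year (1500-2099) from a taxonomic authority string."""
    m = re.search(r"\b(1[5-9]\d{2}|20\d{2})\b", name)
    return int(m.group(1)) if m else None

# Hypothetical records: the same species listed twice, once under an
# (invented) original authority and once re-dated by a 1994 checklist entry.
records = [
    "Ancylocera macrotela Somebody, 1885",
    "Ancylocera macrotela Monne & Giesbert, 1994",
]

# Duplicates that differ only in authority inflate the checklist's year.
counts = Counter(authority_year(r) for r in records)
```

The same species then contributes a "new description" to both 1885 and 1994, which is how a checklist year ends up looking like a bumper year for taxonomy.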

The data for this chart is on figshare http://dx.doi.org/10.6084/m9.figshare.156862. ION is an index of all new animal names, based on Zoological Record. I place more confidence in its data than in data derived from uBio, but ION clearly has its own issues (such as the gap after 1850, and the uneven sampling of the early years of taxonomy). The key point is that arguments about the temporal distribution of taxonomic descriptions (and the value of legacy literature) need to be aware that the underlying data is in pretty poor shape.

Update 2013-02-23: Jose Antonio Gonzalez Oreja pointed out in an email that the values for ION that I used were a little higher than those that appear on the ION web site. My script for retrieving those values hadn't quite worked. I've uploaded the corrected data to figshare http://dx.doi.org/10.6084/m9.figshare.156862, updated the diagram above, and put the web calls I used to fetch the data on GitHub https://gist.github.com/rdmpage/5019153. The story doesn't change, but it helps to have the correct data.

I've just come back from a pro-iBiosphere Workshop at Leiden where the role of "legacy literature" became the subject of some discussion. This continued on Twitter as Ross Mounce (@rmounce) and I went back and forth:

@rdmpage but ~700,000 papers were published in 2009. Were there even 70,000 published in 1920? 2000-2012 contains *a lot*— Ross Mounce (@rmounce) February 13, 2013

Ross was wondering whether we should invest much effort in extracting information from legacy literature, suggesting that this literature was of most interest to taxonomists, whereas other biologists will be more likely to find what they want in the ever-growing recent literature. I was arguing that because many taxa are poorly studied, the chances that you will find data on your organism in the recent literature are likely to be low, unless you study an economically or medically important taxon, or a model organism (many of which fall into the first two categories). My view is based on papers such as Bob May's 1988 paper:

In table 3 May lists the average number of papers per species in the period 1978-1987 across various taxonomic groups. Mammals averaged 1.8 papers per species, beetles averaged 0.01. This means that if you study a beetle species you have a 1/100 chance (on average) of finding a paper on your species in any given year (assuming all beetles are equal, which is clearly false).

At this point perhaps we should define "legacy literature". In many ways the issue is not so much the age of the literature as whether it was "born digital", that is, whether the document has been in digital form all the way from authoring to publication, so that the output is in a format (e.g., HTML, XML, or a PDF that contains the document text) from which we can readily extract and mine the text. In contrast, documents that have been digitised from a physical medium (e.g., scans of pages) are less tractable, because the text has to be extracted by OCR, an error-prone process. Given these errors, is the effort worth it? (At this point I should say that BHL is not using the best OCR technology available; my own experience suggests that ABBYY Online is much better. Nor is our community making use of research on automating OCR correction.) But the question is worth asking. In an effort to answer it, I've done a quick analysis of the PanTHERIA database:

PanTHERIA is a database assembled by Kate Jones (@ProfKateJones) and colleagues for comparative biologists (not taxonomists), and collects fundamental biological data about the best studied animal group on the planet (see May's paper above). In the metadata for the database there is a list of the 3143 publications they consulted to populate the database. Below is a table showing the distribution of the year in which these publications appeared:

Decade starting    Publications
1840                          1
1860                          1
1890                          1
1900                         10
1910                          4
1920                         14
1930                         48
1940                         61
1950                        114
1960                        295
1970                        527
1980                        865
1990                       1019
2000                        183

The bulk of the papers came from the second half of the 20th century, and many of these are "legacy" in the sense that they are in archives like JSTOR, and hence the PDFs are based on scanned images and OCR. The oldest papers are from the 19th century, which is legacy by anyone's definition. My interpretation of this data is that even for a well-studied group such as mammals, the basic organismal-level data sought by comparative biologists is in the "legacy" literature. My suspicion is that if we attempt to build PanTHERIA-style databases for other, less well-studied taxa, the data (if it exists at all) will be found not in the modern literature (where the focus has long since moved on from the organism to genomics and systems biology) but in the corpus of taxonomic and ecological literature that is being scanned and stored in digital archives.
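A quick bit of arithmetic on the table above makes the point concrete: the decade counts sum to the stated 3143 publications, and roughly 94% of them predate the 2000s.

```python
# Publications per decade, copied from the PanTHERIA table above.
counts = {1840: 1, 1860: 1, 1890: 1, 1900: 10, 1910: 4, 1920: 14,
          1930: 48, 1940: 61, 1950: 114, 1960: 295, 1970: 527,
          1980: 865, 1990: 1019, 2000: 183}

total = sum(counts.values())
pre_2000 = sum(n for decade, n in counts.items() if decade < 2000)

print(total)                       # 3143, matching the stated total
print(round(pre_2000 / total, 3))  # 0.942: ~94% predate the 2000s
```

In other words, almost all of the literature behind the best-documented comparative database we have is "legacy" by the born-digital definition used above.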

Update: I've put the articles cited as data sources by the PanTHERIA database in a Mendeley group.