Rants, raves (and occasionally considered opinions) on phyloinformatics, taxonomy, and biodiversity informatics. For more ranty and less considered opinions, see my Twitter feed.ISSN 2051-8188 View this blog in Magazine View.

Saturday, May 30, 2009

Not much information on the other entries yet, except for the eBiosphere Citizen Science Challenge, by Joel Sachs and colleagues, which will demonstrate a "global human sensor net". Their plan is to aggregate observations posted on Flickr, Twitter, Spotter, and email. It might be fun to make use of some of this for my own entry (by default already will, because we are both using EOL's Flickr pool).

Wednesday, May 27, 2009

Over on the EOL blog is a summary of a meeting Visualizing the Evolutionary Tree of Life. This sounds like it was a fun meeting, but part of me is suffering from déjà vu. Our community has tossed this subject around for a while now. I recall Tamara Munzner wowing us with the H3 hyperbolic browser at a meeting at UC Davis in 2000 (part of the original NSF TOL workshops, archived herel) -- the image of Joel Cracraft excitedly running up to the display screen to find various birds is forever etched into my brain. TreeJuxtaposer has been around for a while, to not much effect.

Tuesday, May 26, 2009

Prepare and present a real-time demonstration during the days of the Conference of the capabilities in your community of practice to discover, disseminate, integrate, and explore new biodiversity-related data by:

Indexing, linking and/or automatically submitting the new data records to other relevant databases;

Integrating the data with other databases and data streams;

Making these data available to relevant audiences;

Make the data and links to the data widely accessible; and

Offering interfaces for users to query or explore the data.

Originally I planned to enter the wiki project I've been working on for a while, but time was running out and the deadline was too ambitious. Hence, I switched to thinking about RSS feeds. The idea was to first create a set of RSS feeds for sources that lack them, which I've been doing over at http://bioguid.info/rss, then integrate these feeds in a useful way. For example, the feeds would include images from Flickr (such as EOL's pool), geotagged sequences from GenBank, the latest papers from Zootaxa, and new names from uBio (I'd hoped to include ION as well, but they've been spectacularly hacked).

It's a lot easier to simply look at it rather than describe what it does, but here's a quick sketch of what's under the hood.

Firstly, I take RSS feeds, either the raw geoFeed from Flickr, or from http://bioguid.info/rss. The bioGUID feeds include the latest papers in Zootaxa (most new animal species are described in this journal), a modified version of uBio's new names feed, and a feed of the latest, geotagged sequences in GenBank (I'd hoped to use only DNA barcodes, but it turns out rather few barcode sequences are geotagged, and few have the "BARCODE" keyword). The Flickr feeds are simple to handle because they include locality information (including latitude, longitude, and Yahoo Where-on-Earth Identifiers (WOEIDs)). Similarly, the GenBank feed I created has latitude and longitudes (although extracting this isn't always as straightforward as it should be). Other feeds require more processing. The uBio feed already has taxonomic names, but no geotagging, so I use services from Yahoo! GeoPlanet™ to find localities from article titles. For the Zootaxa feed that I created I use uBio's SOAP service to extract taxonomic names, and Yahoo! GeoPlanet™ to extract localities.

I've tried to create a useful display popup. For Zootaxa papers you get a thumbnail of the paper, and where possible an icon of the taxonomic group the paper talks about (the presence of this icon depends on the success of uBio's taxonomic name finding service, the Catalogue of Life having the same name, and my having a suitable icon). The example above shows a paper about copepods. Other papers have a icon for the journal (again, a function of my being able to determine the journal ISSN and having a suitable icon). Flickr images simply display a thumbnail of the image.

What does it all mean? Well, I could say all sorts of things about integration and mash-ups but, dammit, it's pretty. I think it's a fun way to see just what is happening in digital biodiversity. I've deliberately limited the demo to items that came online in the month of May, and I'll be adding items during the conference (June 1-3rd in London). For example, if any more papers appear in Zootaxa, or in the uBio feeds I'll add those. If anybody uploads geotagged photos to EOL's Flickr group, I'll grab those as well. It's still a bit crude, but it shows some of the potential of bringing things together, coupled with a nice visualisation. I welcome any feedback.

Friday, May 22, 2009

For the last two days I've been participating in a NESCent meeting on Dryad, a "repository of data underlying scientific publications, with an initial focus on evolutionary biology and related fields". The aim of Dryad is to provide a durable home for the kinds of data that don't get captured by existing databases such as GenBank and TreeBASE (for example, the Excel spreadsheets, Word files, and tarballs of data that, if they are lucky, make it on to a journal's web site as supplementary material (like this example). These data have an alarming tendency to disappear (see "Unavailability of online supplementary scientiﬁc information from articles published in major journals" doi:10.1096/fj.05-4784lsf).

Perhaps it was because I was participating virtually (via Adobe Connect, which worked very well), but at times I felt seriously out of step with many of the participants. I got the sense that they regard the scientific article as primary, data as secondary, and weren't entirely convinced that data needed to be treated in the same way as a publication. I was arguing that Dryad should assign DOIs to data sets, join CrossRef, and ensure data sets were cited in the same way as papers. For me this is a no brainer -- by making data equivalent to a publication, journals don't need to do anything special, publishers know how to handle DOIs, and will have fewer qualms than handling URLs, which have a nasty tendency to break (see "Going, Going, Gone: Lost Internet References" doi:10.1126/science.1088234).

Furthermore, being part of CrossRef would bring other benefits. Their cited-by linking service enables publishers to display lists of articles that cite a given paper -- imagine being able to do this for data sets. Dryad could display not just the paper associated with publication of the data set, but all subsequent citations. As an author, I'd love to see this. It would enable me to see what others had done with my data, and provide an incentive to submit my data to Dryad (providing incentives to authors to archive data is a big issue, see Mark Costello's recent paper doi:10.1525/bio.2009.59.5.9).

Not everyone saw things this way, and it's often a "reality check" to discover that things one takes for granted are not at all obvious to others (leading to mutual incomprehension). Many editors, understandably, think of the the journal article as primary, and data as something else (some even struggle to see why one would want to cite data). There's also (to my mind) a ridiculous level of concern about whether ISI would index the data. In the age of Google, who cares? Partly these concerns may reflect the diversity of the participants. Some subjects, such as phylogenetics, are built on reuse of previous data, and it's this reuse that makes data citation both important and potentially powerful (for more on this see my papers hdl:10101/npre.2009.3173.1 and doi:10.1093/bib/bbn022). In many ways, the data is more important than the publication. If I look at a phylogenetics paper published, say 5 or more years ago, the methods may be outmoded, the software obsolete (I might not be able to run it on a modern machine), and the results likely to be outdated (additional data and/or taxa changing the tree). So, the paper might be virtually useless, but the data continues to be of value. Furthermore, the great thing about data (especially sequence data) is that it can be used in all sorts of unexpected ways. In disciplines such as phylogenetics, data reuse is very common. In other areas in evolution and ecology, this might not be the case.

It will be clear from this that I buy the idea articulated by Philip Bourne (doi:10.1371/journal.pcbi.0​010034) that there's really no difference between a database and a journal article and that the two are converging (I've argued for a long time that the best thing that could happen to phylogeneics would be if Molecular Phylogenetics and Evolution and TreeBASE were to merge and become one entity). Data submission would equal publication. In the age of Google where data is unreasonably effective (doi:10.1109/mis.2009.36, PDF here), privileging articles at the expense of data strikes me as archaic.

So, whither Dryad? I wish it every success, and I'm sure it will be a great start. There are some very clever people behind it, and it takes a lot of work to bring a community on board. However, I think Dryad's use of Handles is a mistake (they are the obvious choice of identifier given Dryad is based on DSpace), as this presents publishers with another identifier to deal with, and has none of the benefits of DOIs. Indeed, I would go further and say that the use of Handles + DSpace marks Dryad as being basically yet another digital library project, which is fine, but it puts it outside the mainstream of science publishing, and I think that is a strategic mistake. An example of how to do things better is Nature Precedings, which assigns DOIs to manuscripts, reports, and presentations. I think the use of DOIs in this context demonstrated that Nature was serious, and valued these sorts of resource. Personally, I'd argue that Dryad should be more ambitious, and see itself as a publisher, not a repository. In fact, it could think of itself as a journal publisher. Ironically, maybe the editors at the NESCent meeting were well advised to be wary, what they could be witnessing is the formation of a new kind of publication, where data is the article, and the article is data.

Thursday, May 07, 2009

Continuing with RSS feeds, I've now added wrappers around IPNI that will return for each plant family a list of names added to the IPNI database in the last 30 days. You can see the list at here.

One thing which is a constant source of frustration for me is the disconnect between nomenclators (lists of published names for species) and scientific publishing. The unit of digitisation for a publisher is the scientific article, but nomenclators often cite not the article in which a name was published, but the page on which the name appears.

Very detailed, and great if I have access to a physical library that has the Edinburgh Journal of Botany -- I just find volume 66 on the shelf and turn to page 105. But, I want this on my computer now ("library" - who they?). How do I find this reference on the web? The answer, is not easily. Tools such as OpenURL, which could be used, assume that I know at least the starting page of the article, but IPNI doesn't tell me that. Nor do I have an article title, which might help, but a Google search on "Begonia ozotothrix" finds the article:

Note the DOI! This article exists on the web, so why can't IPNI give me the DOI? They've gone to a lot of trouble to describe the citation in great detail, but adding the DOI brings the record into the 21st century and the web (the DOI is even printed on the article!).

I think nomenclators need to make a concerted effort to integrate with the digital scientific literature, otherwise they will remain digital backwaters that make the implicit assumption that their users have access to libraries such as that at the Royal Botanic Gardens, Edinburgh (pictured).

For recently published articles there's absolutely no reason not to store the DOI. Finding these retrospectively is a pain, but I need these for my RSS feed (and other projects) so one thing I added a while ago to bioGUID's OpenURL resolver is the ability to search for an article given an arbitrary page. For example,

I've been playing with RSS on and off for a while, but what reignited my interest was the swine flu timemap I made last week. The neatest thing about the timemap was how easy it was to make. Just take some RSS that is geotagged and you get the timemap (courtesy of Nick Rabinowitz's wonderful Timemap library).

So, I began to think about taking RSS feeds for, say journals and taxonomic and genomic databases and adding them together and displaying them using tools such as timemap (see here for an earlier mock up of some GenBank data). Two obstacles are in the way. The first is that not every data source of interest provides RSS feeds. To address this I've started to develop wrappers around some sources, the first of which is ZooBank.

The second obstacle is that integration requires shared content (e.g., tags, identifiers, or localities). Some integration will be possible geographically (for example, adding geotagged sequences and images to a map), but this won't work for everything. So, I need to spend some time trying to link stuff together. In the case of Zoobank there's some scope for this, as ZooBank metadata sometimes includes DOIs, which enables us to link to the original publication, as well as bookmarking services such as Connotea. I'm aiming to include these links within the feed, as shown in this snippet (see the <link rel="related"...> element):