Rants, raves (and occasionally considered opinions) on phyloinformatics, taxonomy, and biodiversity informatics. For more ranty and less considered opinions, see my Twitter feed.ISSN 2051-8188 View this blog in Magazine View.

Friday, September 28, 2012

Thinking about accessing the taxonomic literature I started revisiting previous ideas. One is DeepDyve (see DeepDyve - renting scientific articles). Imagine not having to pay large sums for an article, but being able to rent it. Yes, open access would be great, but ultimately it's all a question of money (who pays and when), the challenge is to find the mix of models that encourage people to digitise the relevant literature. Instead of publishers insisting we pay $US30 for an article, how about renting it for the short time we actually need to read it?

Another model is unglue.it, a Kickstarter-like company that seeks to raise funds to digitise and make freely available e-Books. unglue.it has campaigns where people pledge donations, and if sufficient pledges are made the book's rights-holder has the book digitised and released DRM-free.

Looking at unglue.it I stumbled across Readmill, "a curious community of readers, highlighting and sharing the books they love." Readmill has an iPad app where you can highlight passages of text and add your own annotation. These annotations can be shared, and multiple people can read and comment on the same book. Imagine doing this on BHL content. You could highlight parts of the text where the OCR has failed, and provide a correction. You could highlight taxonomic names that automatic parsers have missed, geographic localities, cited literature, etc. All within a nice, social app.

Even better, Readmill has an API. You can retrieve highlights and comments on those highlights. So, if someone flags a sentence as mangled OCR and provides a correction, that correction could be harvested and feed back to, say, BHL. These corrections could be used to improve searches, as well as the text delivered when generating searchable PDFs, etc.

You can even add highlights via the API, so we could upload a ePub book then add all the taxonomic names found by uBio or NetiNeti, enabling users to see which bits of text are probably names, correcting any mistakes along the way. Instead of giving readers a blank canvas they could already have annotations to start with.

Building an app from scratch to read and annotate BHL content would be a major undertaking. From my cursory initial look I wonder if Readmill might just provide the platform we need to clean up and annotate key parts of the BHL corpus?

Monday, September 24, 2012

We all have a "past" that we might not advertise widely, and my past includes flirting with panbiogeography. Indeed my PhD thesis hdl:2292/1999 is entitled "Panbiogeography: a cladistic approach." Shortly after graduating I moved on to host-parasite cospeciation and the gene tree/species tree problem ("reconciled trees", see Katz et al. http://dx.doi.org/10.1093/sysbio/sys026 for a recent example of this approach), but part of me misses the glory days of vicariance, dispersal, and panbiogeography.

One thing which strikes me is how little use large-scale historical biogeography makes of GBIF data. One of the things that made Croizat's panbiogeography so interesting was the way he exposed similar distribution patterns in unrelated groups of organisms. He did this by hand, producing map after map, some embellished with all manner of annotations ("gates", "nodes", "massings", etc.). In some ways, Croizat as an early data miner. Now we are awash in distributional data, where are the people revisiting global scale historical patterns? In particular, wouldn't it be cool to have a biogeographic search engine that could pull out taxa with particular distribution patterns that we could then analyse.

For example, while working on a project to map taxonomic names to literature and genomics data, I embedded a widget to display GBIF maps. Every so often I come across taxa have the classic "Gondwana" distribution pattern. For example, below is a map for stoneflies of the family the Notonemouridae from GBIF.

Another family of stone flies, the Gripopterygidae, show a similar pattern:

What I'd like is to be able to query a database like GBIF for patterns such as these Gondwanic distributions, then be able to pull out associated phylogenetic information (e.g., via sequences in GenBank) so that we could determine the antiquity of these patterns, and whether they are consistent with geological models. We could begin to do large-scale testing of biogeographic hypotheses in a (semi-)automated way. At present we generally rely on a few well-studied examples that are either broadly consistent with

the hypothesis that the history of biota of the southern hemisphere has been largely structured by the break-up of Gondwana.

A first step might be to index distributions at, say, family level and above, and provide a series of polygons representing different distribution patterns. We then search for distributions that are largely concordant with those patterns, and query GenBank (or TreeBASE) for sequences (or phylogenies) for those taxa. We then ask the questions "how old are these taxa?" and "what biogeographic histories do they have?"

Saturday, September 22, 2012

Prompted by a conversation with Vince Smith at the recent Online Taxonomy meeting at the Linnean Society in London I've been revisiting touch-based displays of large trees. There are a couple of really impressive examples of what can be done.

Perceptive Pixel

I've blogged about this before, but came across another video that better captures the excitement of touch-based navigation of a taxonomy. Perceptive Pixel's (recently acquired by Microsoft) Jeff Han demos browsing an animal classification. The underlying visualisation is fairly sttaightforward, but the speed and ease with which you can interact with it clearly makes it fun to use.

Saturday, September 08, 2012

The release of the ENCODE (ENCyclopedia Of DNA Element) project has generated much discussion (see Fighting about ENCODE and junk). Perhaps perversely, I'm more interested in the way Nature has packaged the information than the debate about how much of our DNA is "junk."

Nature has a website (http://www.nature.com/encode/) that demonstrates the use of "threads" to navigate through a set of papers. Instead of having to read every paper you can pick a topic and Nature has collected a set of extracts on that topic (such as a figure and its caption) from the relevant papers and linked them together as a thread. Here is a video outlining the rationale behind threads.

Threads can be viewed on Nature's web site, and also in the iPad app. The iPad app is elegant, and contains full text for articles from Nature, Genome Research, Genome Biology, BMC Genetics. Despite being from different journals the text and figures from these articles are displayed in the same format in the app. Curious as to how this was done I "disassembled" the iPad app (see Extract and Explore an iOS App in Mac OS X for how to do this. If you've downloaded the app on your iPad and synced the iPad with your Mac, then the apps are in the folder "iTunes/iTunes Media/Mobile Applications" folder inside your "Music" folder. The app contains a file called encode.zip, and inside that folder are the articles and threads, all as ePub files. ePub is the format used by a number of book-reading apps, such as Apple's iBooks. Nature has a lot of experience with ePub, using it in their iPhone and iPad journal apps (see my earlier article on these apps, and my web-based clone for more details).

ePub has several advantages in this context over, say, PDFs. Because it ePUb is essentially HTML, the text and images can be reflowed, and it is possible to style the content consistently (imagine how much clunkier things would have looked if the app had used PDFs of the articles, each in the different journals' house style). Having the text in ePub also makes creating threads easy, you simply extract the relevant chunks and combine them into a new ePub file.

Threads are an interesting approach, particularly as they cut across the traditional boundaries of individual articles to create a kind of "mash up." Of course, in the ENCODE app these are preselected for you, you can't create your own thread. But you could imagine having an app that would enable you to not just collect the papers relevant to a topic (as we do with bibliographic software), but enable you to extract the relevant chunks and create a personalised mash up across papers from multiple journals, each linked back to the original article (much like Ted Nelson envisioned for the Xanadu project). It will be interesting to see whether thread-like approaches get more widely adopted. Whatever happens, Nature are consistently coming up with innovative approaches to displaying and navigating the scientific literature.

Wednesday, September 05, 2012

Quick note that as much as I like that the Biodiversity Heritage Library is using DOIs, they are generating them for publications that already have them (or are acquiring them from other sources). For example, here are the two DOIs for the same article (formatted using the DOI Citation Formatter), one from BHL and one from the Smithsonian:

The BHL DOI resolves to a page in BHL, the other DOI resolves to the a page in the Smithsonian Digital Repository (this article also has the handle hdl:10088/5222).

Now this is a problem, because DOIs are meant to be unique: one article, one DOI. I've encountered duplicates elsewhere, but in these cases one should be an alias of the other. In the example above, the DOIs resolve to different locations. If you are just after the content this isn't a huge problem, but if, say, you were using the DOI to uniquely identify the publication (say, in a database) you have a problem: which DOI to choose? If you and I choose differently then we will make statements about the same article but be unaware of that sameness.

Much of this problem arises because BHL has no concept of articles. Most articles are likely to reside within scanned volumes of a journal, but some articles (e.g., monographs) may be treated a single title by BHL, and each BHL title now gets a DOI.

I know that handling articles is on BHL's radar, but it because it hasn't tackled it yet we are going to have cases where BHL DOIs duplicate existing DOIs. In these cases, BHL may have to make their DOI an alias of the other DOI.