Rants, raves (and occasionally considered opinions) on phyloinformatics, taxonomy, and biodiversity informatics. For more ranty and less considered opinions, see my Twitter feed. ISSN 2051-8188

Thursday, December 18, 2008

Success is the ability to go from failure to failure without losing your enthusiasm -- Winston Churchill

I learnt today that my Elsevier Challenge entry didn't make the final cut. This wasn't unexpected. In the interests of "open science" (blame Paulo Nuin) here is the feedback I received from the judges:

Strengths: Beautiful presentation, lovely website. Page clearly made his case for open access to metadata/full articles in order to allow communities to build the tools they want. The judges would have enjoyed seeing more elements from the original abstract (tree of life). Great contribution so far to the discussion; Page made his point very well.

Weaknesses: Given that no specific tool was proposed, this submission is somewhat out of scope for the competition. Nonetheless, in support of his point, Page could have elaborated on the kinds of open formats and standards for text and data and figures that would support integrated community-wide tool-building. Alternatively, if the framework and the displayed functionalities were to be the submission, there could have been more discussion of how others can integrate their plug-ins and make them cross-referential to the plug-ins of others. The proposal for Linked Data should utilize Semantic Web standards.

Elements to Consider for Development: How many, and which types of, information substrates? How much work for a new developer to create a new one, and to make this work? How to incentivize authors to produce the required metadata? Or to make the data formats uniform?

I think this is a pretty fair evaluation of my entry. I was making a case for what could be done, rather than providing a specific bit of kit that could make this happen right now. I think I was also a little guilty of not following the "underpromise but overdeliver" mantra. My original proposal included harvesting phylogenies from images, and that proved too difficult to do in the time available. I don't think having trees would have ultimately changed the result (i.e., not making the cut), but it would have been cool to have them.

Anyway, time to stomp around the house a bit, and be generally grumpy towards innocent children and pets. Congratulations to the bastards fellow contestants who made it to the next round.

Tuesday, December 16, 2008

One advantage of flying to the US is the chance to do some reading. At Newark (EWR) I picked up Guy Kawasaki's "Reality Check", which is a fun read. You can get a flavour of the book from this presentation Guy gave in 2006.

While at MIT for the Elsevier Challenge I was browsing in the MIT book shop and stumbled across "Google and the Myth of Universal Knowledge" by Frenchman Jean-Noël Jeanneney. It's, um, very French. I have some sympathy with his argument, but ultimately it comes across as European whining about American success. And the proposed solution involves that classic European solution -- committees! In many ways it's really a librarian complaining about Google (again), which librarians just need to get over.

OK, I'm not really doing the arguments justice, but I'm getting a little tired of European efforts that are essentially motivated by "well the Americans are doing this, so we need to do something as well."

Lastly, I also bought Linda Hill's "Georeferencing: The Geographic Associations of Information", which is a little out of date (what, no Google Maps or Google Earth?), but is nevertheless an interesting read, and has lots of references to georeferencing in biodiversity informatics. Given that my efforts for the challenge in this area were so crude, it's something I need to think about a bit more deeply.

Quick post about the Elsevier Challenge, which took place yesterday in the wonderful Stata Center at MIT. It was a great experience. Cool venue, interesting talks, probing questions (having a panel of judges ensured that everybody got feedback/queries). Some talks (like mine) were more aspirational (demos of what could be done), others, such as Sean O'Donoghue's talk on Reflect, and Stephen Wan's on CSBIS (see "In-Browser Summarisation: Generating Elaborative Summaries Biased Towards the Reading Context") were systems that Elsevier could plug in to their existing Science Direct product (and hence are my picks to go forward to the last round).

I was typically blunt in my talk, especially about how useless Science Direct's "2collab" and "Related articles" features were. Rafael Sidi is not unsympathetic to this, and I think despite their status as the Microsoft of publishing (for the XBox crowd, that's a Bad Thing™), the Elsevier people at the meeting were genuinely interested in changing things, and exploring how best to disseminate knowledge. There's hope for them yet! Oh, and special thanks to Anita de Waard and Noelle Gracy for organising the meeting, and the smooth running of the Challenge.

I'm in the US on UK time, so it is probably a bad idea to write this, but the paper by Malte Ebach et al. ("O Cladistics, Where Art Thou?", doi:10.1111/j.1096-0031.2008.00225.x) in the latest Cladistics just annoys me too much. Rather than the call to arms that the authors intend, I think they've provided one more example of the death throes of cladistics (in the narrow parsimony is all, statistical methods are evil, molecular systematics is phenetics, barcoding is killing taxonomy sense).

Associations, such as the Willi Hennig society, and journals, such as Cladistics, were erected in order to tackle the growing problem of pheneticists, purveyors of overall similarity, clustering and divergence rates. Rather than challenge molecular systematists and their numerical taxonomic methods, we take part. Where is our integrity?

Gosh, maybe people realised that molecular data are useful, that molecular data benefit from statistical analysis, and that divergence rates (and times) were of great biological interest? Fancy that!

What happened to the Cladistic Revolution? Today, students appear to have no knowledge of that Revolution. They graduate as students did so before the Revolution, with a sound knowledge of phenetics, ancestor worship and a healthy dose of molecular genetics. What happened to taxonomy and cladistics?

I suspect the real drivers in the "Revolution" were: the development of methods that could be implemented in computer software (I include parsimony in this); computer hardware that was becoming cheaper and more powerful; and the growth of molecular data (i.e., data that was easily digitised). I don't mean to imply that everything was technologically driven, but I suspect it was a combination of a desire to infer evolutionary trees coupled with plausible means of doing so that drove the "revolution", rather than any great conceptual framework.

Such matters as the Phylocode, DNA taxonomy and barcoding, for example, have risen to prominence despite criticism of their many flaws and illogical conclusions. The attempts of these applied technologies to derail almost 250 years of scholarship are barely even questioned by our own peers with only a few taking a stand (e.g., Will and Rubinoff, 2004; Wheeler, 2005).

Barcoding is happening, get over it. There are technical issues with its ability to identify "species", but to object to it on ideological grounds (as papers published in Cladistics tend to do) is ultimately futile. If the authors dealt with bacteria they wouldn't bat an eyelid. Besides, I suspect that the ability to identify organisms, or discover clusters of similar sequences will be among the least interesting applications of barcoding. There will be a wealth of standardised, geotagged data from across life around the planet. People not blinkered by ideology will do interesting things with these data.

Barcoding is understood as a "solution" (to what, one might ask?), systematics journals are infested with phenetics and population genetics (cladistics has vanished), both, seemingly, directing the course and future of taxonomy. Where are the scholars?

Personally I use the term "phenetics" as a litmus test. If anybody says that a method is "phenetic" then I pretty much switch off. Almost always, if somebody uses this term they simply don't understand what they are talking about. If you describe a method as "phenetic" then that tells me that you either don't understand the method, or you're too lazy to try and understand it.

In some ways all this saddens me. I was an undergraduate student around the time of the heyday of the New York school, thought Systematics and Biogeography: Cladistics and Vicariance was a great (if flawed) book (and I still do), and did my first post doc with Gary Nelson at the AMNH. It was a great time to be a student. Phylogenetic trees were appearing in all sorts of places, and systematists were tackling big topics such as biogeography, diversification, coevolution, and development. There was a sense of ambition, and excitement. Yet now it seems that Cladistics has become a venue for reactionary rants by people unable to break out of the comforting (but ultimately crippling) coherence of the hard-core cladist's world view.

Saturday, December 13, 2008

The case of the red lionfish exemplfies how EOL can provide information for science-based decision making. Red lionfish are native to coral reef ecosystems in the Indo-Pacific. Yet, probably due to human release of the fish from aquariums, a large population has found itself in the waters near the Bahamas.

Nope, I suggest it demonstrates just how limited EOL is. If I view the page for the red lionfish I get an out of date map from GBIF that shows a very limited distribution, and doesn't show the introductions in Florida and the Bahamas (I have to wade through text to find reference to the Florida introduction, and the page doesn't mention the Bahamas!). The blog entry states that

In this senerio[sic], EOL and its data partners provide up to date information about the lionfish, or pterois[sic] volitans, in a species page.

In other words, EOL in its present state is serving limited, out of date information. The gap between hype and delivery shows no sign of narrowing. How can this help "science-based decision making"? Surely there will come a point when people will tire of breathless statements about how EOL will be useful, and they will start to ask "where's the beef?"

Tuesday, December 09, 2008

Quick note to say how much I like the programmers' Q & A site Stack Overflow. I've only asked two questions, but the responses have been rapid and useful. I found out about Stack Overflow by listening to the Stack Overflow podcast episodes on IT Conversations (which carry a lot of other podcasts as well). For a wannabe geek, these podcasts are a great source of ideas.

The idea is to display a table in a fixed space. As you mouse over a cell, the contents of the cell, and the relevant row and column labels become visible. This enables you to get an overview of the full table, but still see individual items:

There are some things to fix. Firstly, I group all sequences by NCBI taxon and gene "features". If there's more than one sequence for the same gene and taxon, I just show one of them (an obvious solution is to add a popup menu if there's more than one sequence). Secondly, the gene "names" are extracted from GenBank feature tables, and will include synonyms and duplicates (for example, a sequence may have a gene feature "RAG-1" and a CDS feature "recombination activating protein 1"). I've stored all of these as not every sequence is consistently labelled, so excluding one class of feature may lose all labels from a sequence. At some point it would be useful to cluster gene names (a task for another day).
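The grouping step can be sketched roughly as below. This is a minimal illustration, not the code behind the site: the accession numbers, the record layout, and the function name `group_by_taxon_and_gene` are all made up for the example.

```python
from collections import defaultdict

# Hypothetical records: (accession, NCBI taxon id, gene label). All values are invented.
sequences = [
    ("ACC0001", 9606, "RAG-1"),
    ("ACC0002", 9606, "RAG-1"),
    ("ACC0003", 9606, "cytb"),
    ("ACC0004", 8355, "RAG-1"),
]


def group_by_taxon_and_gene(records):
    """Map each (taxon, gene) table cell to the accessions that fall in it."""
    cells = defaultdict(list)
    for accession, taxon_id, gene in records:
        cells[(taxon_id, gene)].append(accession)
    return dict(cells)


cells = group_by_taxon_and_gene(sequences)
for (taxon_id, gene), accessions in sorted(cells.items()):
    shown = accessions[0]        # the single sequence displayed in the cell
    extra = len(accessions) - 1  # the rest would be candidates for a popup menu
    print(taxon_id, gene, shown, f"(+{extra} more)" if extra else "")
```

Keeping the full list per cell, rather than only the first accession, is what would make the popup menu easy to add later.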

The girl is Carmen Electra, which is understandable given the Yahoo image search was for "Electra" (a genus of bryozoan). However, what are the wild men (and women) doing at the top? Turns out this is the result of searching for the genus Homo. But why, you ask, does a paper on bryozoans have human sequences? Well, looks like the table in the paper has incorrect GenBank accession numbers. The sequences AJ711044-50 should, I'm guessing, be AJ971044-50.

Ironically, although it was Carmen Electra's photo that initially made me wonder what was going on, it's really the hairy folks above her image that signal something is wrong. I've come across at least one other example of a paper citing incorrect sequences, so it might be time to automate this checking. Or, what is probably going to be more fun, looking at treemaps for obviously wrong images and trying to figure out why.
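A first step towards automating the check might be to expand accession ranges as they are written in papers, so that each accession can be looked up individually and its organism compared with the taxa the paper is about. The sketch below covers only the expansion; the lookup is left out, and the function name and the assumption that ranges look like "AJ711044-50" are mine.

```python
import re


def expand_accession_range(text):
    """Expand a range such as 'AJ711044-50' into individual accession numbers."""
    match = re.fullmatch(r"([A-Z]+)(\d+)-(\d+)", text)
    if match is None:
        return [text]  # not a range, so return it unchanged
    prefix, start, end_suffix = match.groups()
    if len(end_suffix) >= len(start):
        end = end_suffix  # the end of the range was written out in full
    else:
        # The digits after the hyphen replace the last digits of the start number.
        end = start[: len(start) - len(end_suffix)] + end_suffix
    width = len(start)
    return [f"{prefix}{n:0{width}d}" for n in range(int(start), int(end) + 1)]


print(expand_accession_range("AJ711044-50"))
```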

Monday, November 24, 2008

One of the things I've struggled with most in putting together a web site for the challenge is how to summarise the taxonomic content of a study. Initially I was playing with showing a subtree of the NCBI taxonomy, highlighting the taxa in the study. But this assumes the user is familiar with the scientific names of most of life. I really wanted something that tells you "at a glance" what the study is about.

I've settled (for now, at least) on using a treemap of images of the taxa in the study. I've played with treemaps before, and have never been totally convinced of their utility. However, in this context I think they work well. For each paper I extract the taxonomic names (via the Genbank sequences linked to the paper), group them into genera, and then construct a treemap where the size of each cell is proportional to the number of species in each genus. Then I harvest images from Flickr and/or Yahoo's image search APIs and display a thumbnail with a link to the image source.
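The sizing rule (cell area proportional to the number of species in each genus) can be sketched as follows. This is only an illustration under simplifying assumptions: the genus is taken to be the first word of each name, the species list is invented for the example, and the layout is a single row of strips rather than a proper squarified treemap.

```python
from collections import Counter


def genus_counts(species_names):
    """Count distinct species per genus, taking the genus as the first word of each name."""
    return Counter(name.split()[0] for name in set(species_names))


def slice_layout(counts, width, height):
    """Lay genera out as vertical strips whose areas are proportional to species counts.

    Returns (genus, x, y, cell_width, cell_height) tuples, largest genus first.
    """
    total = sum(counts.values())
    x = 0.0
    cells = []
    for genus, n in counts.most_common():
        cell_width = width * n / total
        cells.append((genus, x, 0.0, cell_width, height))
        x += cell_width
    return cells


# Illustrative names only, not taken from any of the studies mentioned here.
names = ["Rana temporaria", "Rana arvalis", "Rana dalmatina", "Bufo bufo"]
for cell in slice_layout(genus_counts(names), 400, 300):
    print(cell)
```

Each cell would then be filled with a thumbnail for that genus, linked to the image source.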

I'm hoping that these treemaps will give the user an almost instant sense of what the study is about, even if it's only "it's about plants". The treemap above is for Frost et al.'s The amphibian tree of life (doi:10.1206/0003-0090(2006)297[0001:TATOL]2.0.CO;2), the one to the right is for Johnson and Weese's "Geographic distribution, morphological and molecular characterization, and relationships of Lathrocasis tenerrima (Polemoniaceae)".

Note that the more taxa a study includes the smaller and more numerous the cells (see below). This may obscure some images, but gives the user the sense that the study includes a lot of taxa. The image search isn't perfect, but I think it works well enough for my purposes.

Saturday, November 22, 2008

Elsevier have released this video about the challenge, featuring a few of the contestants. I couldn't get my act together in time to send anything useful, and having seen the 16 gigabytes song (full version here), I'm glad I didn't -- there's just no way I could compete with Michael Greenacre and Trevor Hastie.

A key cosmetic (and philosophical) difference between OpenURL and OpenRef/ResolveRef URLs is that OpenURL uses HTTP GET fields, eg ?title=bla&issn=12345, while OpenRef/ResolveRef uses the URL path itself eg, somejournalname/2008/4/1996. It’s a bit like one scheme was designed in the age of CGI scripts, while the other was designed for web applications capable of more RESTful behaviour. In my mind OpenURL is more versatile but much uglier, while OpenRef is cleaner and simpler but can only reference journal articles.

Of course, it is straightforward to add openref-style URLs to an OpenURL resolver by using URL rewriting, for example:

I've done this for my resolver. One limitation of OpenRef is that there are many different ways to write a journal's name, so you can't determine whether two OpenRef's refer to the same journal by simply string matching (as you can with a DOI, for example -- if the DOI's are different the article is different). For example I might write BMC Bioinformatics and you might write BMC Bioinf.. One way around this is to have unique identifiers for journals, which of course is the approach Robert Cameron advocated with Universal Serial Item Names and JACC's. The obvious candidate for journal identifier is the ISSN. I guess the problem here is that it's easier to use the journal name rather than require the user to know the ISSN. OpenRefs are certainly easier to write. Hence, I think they are great as a simple way for people to construct a resolvable URL for an article, but not so great as an identifier.
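The mapping from a path-style reference to OpenURL-style query fields can be sketched outside the web server as well. The version below is an illustration only: it assumes the path segments are journal/year/volume/starting page, and it assumes the conventional OpenURL key names `genre`, `title`, `date`, `volume` and `spage`; neither assumption is taken from the resolver itself.

```python
from urllib.parse import urlencode


def openref_to_openurl(path):
    """Turn 'journal/year/volume/page' into an OpenURL-style query string."""
    journal, year, volume, page = path.strip("/").split("/")
    params = {
        "genre": "article",
        "title": journal,
        "date": year,
        "volume": volume,
        "spage": page,
    }
    return "?" + urlencode(params)


print(openref_to_openurl("somejournalname/2008/4/1996"))
```

Note that the journal name passes through untouched, which is exactly where the string-matching problem described above comes from.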

Elsevier Labs is inviting creative individuals who have wanted the opportunity to view and work with journal article content on the web to enter the Elsevier Article 2.0 Contest. Each contestant will be provided online access to approximately 7,500 full-text XML articles from Elsevier journals, including the associated images, and the Elsevier Article 2.0 API to develop a unique yet useful web-based journal article rendering application. What if you were the publisher? Show us your preference!

Elsevier are clearly looking for ideas (they also have their Grand Challenge), and there's been some interesting commentary on the Article 2.0 contest. The site provides some sample applications (written in XQuery), which you can play with by going to the list of journals that are included in the challenge and clicking down through volume and issue until you get to individual articles.

Saturday, November 15, 2008

CBS News Sunday Morning Segment on the EOL. All fun stuff (Paddy skewering the interviewer who fails to recognise an echidna), but still long on promises and short on actual product.

Monday, November 10, 2008

One problem with my cunning plan to use Mediawiki REDIRECTs to handle DOIs is that some DOIs, such as those that BioOne serves based on SICIs, contain square brackets, [ ], which conflict with wiki syntax. For example, doi:10.1206/0003-0090(2006)297[0001:TATOL]2.0.CO;2. I want to enable users to enter a raw DOI, so I've been playing with a simple URL rewrite in Apache httpd.conf, namely:

This rewrites the [ and ] in the original DOI, then forces a new HTTP request (hence the [NC,R] at the end of the line). This keeps Mediawiki happy, at the cost of the REDIRECT page having a DOI that looks slightly different from the original. However, it means the user can enter the original DOI in the URL, and not have to manually edit it.
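The same substitution is easy to reproduce when generating the REDIRECT page titles, so that the title matches whatever the rewritten URL asks for. In this sketch the replacement character is an assumption (a hyphen), as is the function name; whatever is chosen has to agree with the rewrite rule actually in use.

```python
def wiki_safe_doi(doi, replacement="-"):
    """Replace the square brackets that clash with wiki link syntax."""
    return doi.replace("[", replacement).replace("]", replacement)


print(wiki_safe_doi("10.1206/0003-0090(2006)297[0001:TATOL]2.0.CO;2"))
```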

Bibliographic coupling is a term coined by Kessler (doi:10.1002/asi.5090140103) in 1963 as a measure of similarity between documents. If two documents, A and B, cite a third, C, then A and B are coupled.

I'm interested in extending this to data, such as DNA sequences and specimens. In part this is because within the challenge dataset I'm finding cases where authors cite data, but not the paper publishing the data. For example, a paper may list all the DNA sequences it uses (thus citing the original data), but not the paper providing the data.

To make this concrete, the paper "Towards a phylogenetic framework for the evolution of shakes, rattles, and rolls in Myiarchus tyrant-flycatchers (Aves: Passeriformes: Tyrannidae)" (doi:10.1016/S1055-7903(03)00259-8) lists the sequences used, but does not cite the source of three of these, which is the Science paper "Nonequilibrium Diversity Dynamics of the Lesser Antillean Avifauna" (doi:10.1126/science.1065005). As a result, if I was reading "Nonequilibrium Diversity Dynamics of the Lesser Antillean Avifauna" and wanted to learn who had cited it I would miss the fact that the paper "Towards a phylogenetic framework for the evolution of shakes, rattles, and rolls..." had used the data (and hence, in effect, "cited" the paper). In some cases, data citation may be more relevant than bibliographic citation because it relates to people using the data, which seems a more significant action than simply reading the paper.

Note that I'm not interested in the issue of credit as such. In the above example, the authors of the Science paper are also coauthors of the "shakes, rattles, and rolls" paper, and hence show commendable restraint in not citing themselves. I'm interested in the fate of the data. Who has used it? What have they done with it? Has anybody challenged the data (for example, suggesting a sequence was misidentified)? These are the things that a true "web of data" could tell us.
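Coupling extended to data can be sketched by treating sequences as just another kind of cited item: two papers are coupled if the sets of things they cite overlap, whether those things are papers or accession numbers. The identifiers below are placeholders invented for the example, not real citation data.

```python
from itertools import combinations

# Hypothetical citation data: paper -> set of cited items (papers or sequences).
cites = {
    "paperA": {"seq1", "seq2", "paperC"},
    "paperB": {"seq2", "seq3", "paperC"},
    "paperD": {"seq9"},
}


def coupling_strength(cites):
    """Return {(p, q): number of shared cited items} for every coupled pair of papers."""
    strengths = {}
    for p, q in combinations(sorted(cites), 2):
        shared = cites[p] & cites[q]
        if shared:
            strengths[(p, q)] = len(shared)
    return strengths


print(coupling_strength(cites))
```

Here paperA and paperB are coupled both through a paper and through a shared sequence, which is the kind of link that a purely bibliographic analysis would undercount.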

Many scientists now manage the bulk of their bibliographic information electronically, thereby organizing their publications and citation material from digital libraries. However, a library has been described as “thought in cold storage,” and unfortunately many digital libraries can be cold, impersonal, isolated, and inaccessible places. In this Review, we discuss the current chilly state of digital libraries for the computational biologist, including PubMed, IEEE Xplore, the ACM digital library, ISI Web of Knowledge, Scopus, Citeseer, arXiv, DBLP, and Google Scholar. We illustrate the current process of using these libraries with a typical workflow, and highlight problems with managing data and metadata using URIs. We then examine a range of new applications such as Zotero, Mendeley, Mekentosj Papers, MyNCBI, CiteULike, Connotea, and HubMed that exploit the Web to make these digital libraries more personal, sociable, integrated, and accessible places. We conclude with how these applications may begin to help achieve a digital defrost, and discuss some of the issues that will help or hinder this in terms of making libraries on the Web warmer places in the future, becoming resources that are considerably more useful to both humans and machines.

It's an interesting read, and it also <shameless plug>cites my bioGUID project</shameless plug>.

The resolver gets the sequence from NCBI, does a little post-processing, then displays the result. Post-processing includes parsing the latitude and longitude coordinates (something of a mess in GenBank, see my earlier metacrap rant), extracting specimen codes, adding bibliographic GUIDs (such as DOIs, Handles, or URLs), finding uBio namebankID's for hosts, etc. Note that some records have a key called "taxonomic_group". This is to provide clues for resolving museum specimens -- often the DiGIR provider needs to know what kind of taxon you are searching for.

The aim is to have a simple service that returns somewhat cleaned up GenBank records that I (and others) can play with.
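As an illustration of the coordinate clean-up, here is a minimal sketch that handles only the tidy case, where the value is written as decimal degrees followed by a hemisphere letter (for example "47.94 N 28.12 W"). It is not the resolver's own parser; the messy free-text variants are precisely what it does not cover, and for those it simply returns None.

```python
import re

# Decimal degrees plus hemisphere letters, e.g. "47.94 N 28.12 W".
LAT_LON = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*([NS])\s+(\d+(?:\.\d+)?)\s*([EW])\s*$")


def parse_lat_lon(value):
    """Convert '47.94 N 28.12 W' to signed decimal degrees, or None if it doesn't match."""
    match = LAT_LON.match(value)
    if match is None:
        return None
    lat, ns, lon, ew = match.groups()
    latitude = float(lat) * (1 if ns == "N" else -1)
    longitude = float(lon) * (1 if ew == "E" else -1)
    return latitude, longitude


print(parse_lat_lon("47.94 N 28.12 W"))
print(parse_lat_lon("somewhere in Peru"))
```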

Thinking more and more about using Mediawiki (or, more precisely, Semantic Mediawiki) as a platform for storing and querying information, rather than write my own tools completely from scratch. This means I need ways of modelling some relationships between identifiers and objects.

The first is the relationship between document identifiers such as DOIs and metadata about the document itself. One approach which seems natural is to create a wiki page for the identifier, and have that page consist of a #REDIRECT statement which redirects the user to the wiki page on the actual article.

This seems a reasonable solution because:

The user can find the article using the GUID (in effect we replicate the redirection DOIs make use of)

The GUID itself can be annotated

It is trivial to have multiple GUIDs linking to the same paper (e.g., PubMed identifiers, Handles, etc.).

Taxon names present another set of problems, mainly because of homonyms (the same name being given to two or more different taxa). The obvious approach is to do what Wikipedia does (e.g., Morus), namely have a disambiguation page that enables the user to choose which taxon they want. For example:

In this example, there are two taxon names Pinnotheres, so the user would be able to choose between them.

For names which had only one corresponding taxon name we would still have two pages (one for the name string, and one for the taxon name), which would be linked by a REDIRECT:

The advantage of this is that if we subsequently discover a homonym we can easily handle it by changing the REDIRECT page to a disambiguation page. In the meantime, users can simply use the name string because they will be automatically redirected to the taxon name page (which will have the actual information about the name, for example, where it was published).
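The rule described above (one matching taxon name gives a REDIRECT, several give a disambiguation page) can be sketched as a small function that generates the wikitext for the name-string page. The `#REDIRECT [[...]]` line is standard Mediawiki syntax; the page titles and the wording of the disambiguation page are placeholders made up for this example.

```python
def page_for_name_string(name, taxon_pages):
    """Return wikitext for a name-string page, given the taxon-name pages it maps to."""
    if len(taxon_pages) == 1:
        # Only one taxon name: send the reader straight to it.
        return f"#REDIRECT [[{taxon_pages[0]}]]"
    # Homonyms: list every candidate so the reader can choose.
    links = "\n".join(f"* [[{page}]]" for page in taxon_pages)
    return f"'''{name}''' may refer to:\n{links}"


# Hypothetical page titles for the two homonyms.
print(page_for_name_string("Pinnotheres", ["Pinnotheres (1)", "Pinnotheres (2)"]))
print(page_for_name_string("Enhydris", ["Enhydris (1)"]))
```

Discovering a new homonym then only means regenerating the name-string page with a longer list; the taxon-name pages themselves stay where they are. The same one-line REDIRECT works for pointing a DOI or other GUID page at the page for the article.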

Of course, we could do all of this in custom software, but the more I look at it, the more the power to edit the relationships between objects as well as the metadata, and also to make inferences, makes Semantic Mediawiki look very attractive.

Friday, October 24, 2008

Following on from the previous post, I wrote a simple Mediawiki extension to insert a Google Book into a wiki page. Written in a few minutes, not tested much, etc.

To use this, copy the code below and save in a file googlebook.php in the extensions directory of your Mediawiki installation.

<?php
# rdmp

# Google Book extension
# Embed a Google Book into Mediawiki
#
# Usage:
# <googlebook id="OCLC:4208784" />
#
# To install it put this file in the extensions directory
# To activate the extension, include it from your LocalSettings.php
# with: require("extensions/googlebook.php");

Now you can add a Google book to a wiki page by adding a <googlebook> tag. For example:

<googlebook id="OCLC:4208784" />

The id gives the book identifier, such as an OCLC number or an ISBN (you need to include the identifier prefix). By default, the book will appear in a box 425 × 400 pixels in size. You can add optional width and height parameters to adjust this.

Wednesday, October 22, 2008

I've started to come across more taxonomic books in Google Books, such as Catalogue of the specimens of snakes in the collection of the British museum by John Edward Gray. Google books provides a nice widget for embedding views of books. There is a tool for generating the Javascript code. Note that in Blogger (which I use to create this blog) you need to make sure that the Javascript occurs on a single line with no line breaks for it to work.

The Javascript used (with linebreaks that must be removed before using) is:

I stumbled across this book whilst searching for the original record for the snake Enhydris punctata. Confusingly, the Catalogue of Life lists this snake as Enhydris punctata GRAY 1849, implying that Gray's original name still stands, whereas in fact it should be Enhydris punctata (Gray, 1849) as Gray's original name for the snake was Phytolopsis punctata. It's little things like this that drive me nuts, especially as the Catalogue of Life has no obvious, quick means of fixing this (Wiki, anyone?).

I was also interested in using the OCLC number as a GUID for the book, but there are several to choose from (including two related to the Google Book). Unlike DOIs, a book may have multiple OCLCs (sigh). Still, it's a GUID, and it's resolvable, so it's a start. Hence, one could link GUIDs for the names published in this book to the book itself.

Tuesday, October 21, 2008

As part of the slow rebuild of bioguid.info, and as part of the Challenge, I've started making an OpenURL resolver for specimens. Partly this is just a wrapper around DiGIR providers, but it's also a response to the lack of GUIDs for specimens. In the same way that I think OpenURL for papers only really makes sense in a world without GUIDs for literature (DOIs pretty much take care of that), given the lack of specimen GUIDs we are left to resolve specimens based on metadata.

Thursday, October 09, 2008

The latest issue of Wired has an article on DNA barcoding, entitled "A Simple Plan to ID Every Creature on Earth". The article doesn't say much that will be new to biologists, but it's a nice intro to the topic, and some of the personalities involved.

Tuesday, October 07, 2008

The rather frail nature of biodiversity services (some of the major players have had service breaks in the last few weeks) has prompted me to revisit Dave Vieglais's BigDig and extend it to other services, such as uBio, EOL, and TreeBASE, as well as DSpace repositories and tools such as Connotea.

The result is at http://bioguid.info/status/. The idea is to poll each service once an hour to see if it is online. Eventually I hope to draw some graphs for each service, to get some idea of how reliable it is.
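The polling idea can be sketched in a few lines. This is a rough illustration rather than the code behind the status page: the function names are mine, "online" is taken to mean an HTTP response with a status below 400, and the hourly scheduling is left to something like cron.

```python
import urllib.error
import urllib.request


def is_online(url, opener=urllib.request.urlopen, timeout=10):
    """Return True if the service answers with an HTTP status below 400."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.status < 400
    except (urllib.error.URLError, OSError):
        # Covers HTTP errors, DNS failures, refused connections and timeouts.
        return False


def poll(services, check=is_online):
    """Check every service once and return {name: True/False}."""
    return {name: check(url) for name, url in services.items()}
```

Calling `poll({"some service": "http://example.org/"})` once an hour and appending the result to a log with a timestamp would give the raw material for the reliability graphs.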

Much of my own work depends on using web sites and services, and I'm constantly frustrated when they go offline (sometimes for months at a time).

My aim is to be constructive. I am well aware that reliability is not easy, and some tools that I've developed myself have disappeared. But I think as a community we need to do a lot better if biodiversity informatics is to deliver on its promise.

The list of services is biased by what I use. I'm also aware that some of the DiGIR provider information is out of date (I basically lifted the list from the BigDig, I'll try and edit this as time allows).

Comments (and requests for adding services) are welcome. There is a comment box at the bottom of the web page, which uses Disqus, a very cool comment system that enables you to keep track of your comments across multiple sites. It also supports OpenID.

Monday, October 06, 2008

D. Ross Robertson has published a paper entitled "Global biogeographical data bases on marine fishes: caveat emptor" (doi:10.1111/j.1472-4642.2008.00519.x - DOI is broken, you can get the article here). The paper concludes:

Any biogeographical analysis of fish distributions that uses GIS data on marine fishes provided by FishBase and OBIS 'as is' will be seriously compromised by the high incidence of species with large-scale geographical errors. A major revision of GIS data for (at least) marine fishes provided by FishBase, OBIS, GBIF and EoL is essential. While the primary sources naturally bear responsibility for data quality, global online providers of aggregated data are also responsible for the content they serve, and cannot side-step the issue by simply including generalized disclaimers about data quality. Those providers need to actively coordinate, organize and effect a revision of GIS data they serve, as revisions by individual users will inevitably lead to confused science (which version did you use?) and a tremendous expenditure of redundant effort. To begin with, it should be relatively easy for providers to segregate all data on pelagic larvae and adults of marine organisms that they serve online. Providers should also include the capacity for users to post readily accessible public comments about the accuracy of individual records and the overall quality of individual data bases. This would stimulate improvements in data quality, and generate 'selection pressures' favouring the usage of better quality data bases, and the revision or elimination of poor-quality data bases. The services provided to the global science community by the interlinked group of online providers of biodiversity data are invaluable and should not be allowed to be discredited by a high incidence of known serious errors in GIS data among marine fishes, and, likely, other marine organisms. (emphasis added)

As I've noted elsewhere on this blog, and as demonstrated by Yesson et al.'s paper on legume records in GBIF (doi:10.1371/journal.pone.0001124) (not cited by Robertson), there are major problems with geographical information in public databases. I suspect there will be more papers like this, which I hope will inspire database providers and aggregators to take the issue seriously. (Thanks to David Patterson for spotting this paper).

Friday, September 26, 2008

Among the many ways to display trees, degree of interest (DOI) trees strike me as one potentially useful way to display trees such as the NCBI taxonomy. For background see, e.g. doi:10.1145/1133265.1133358 (or Google "degree of interest trees").

The thing that would make this really useful is if an application was written that, like Google Earth, supported a simple annotation file format. Hence, users could create their own annotation files (e.g., taxa of a certain size, those with eyes, etc.) and upload those files, creating their own annotation layers, in much the same way as we can load sets of geographical annotations into Google Earth. I think it's this feature which makes Google Earth what it is, so my question is whether we can replicate this for classifications/phylogeny?

Next few weeks will be busy with term starting, kids visiting, and other commitments, so time to jot down some ideas. The first is to have a Wiki for taxonomic names. Bit like Wikispecies, but actually useful, by which I mean useful for working biologists. This would mean links to digital literature (DOIs, Handles, etc.), use of identifiers for names and taxa (such as NCBI taxids, LSIDs, etc.), and having it pre-populated with data. Imagine merging the NCBI taxonomy, Catalogue of Life, Index Fungorum, and IPNI, say, and having it automatically updated with sources such as WoRMS and uBio RSS. Why a Wiki? Well, partly to capture all the little textual annotations that are needed to flesh out the taxonomy, and partly to make it easy to correct the numerous mistakes that litter existing databases.

As an initial target, I'd aim for a comprehensively annotated NCBI taxonomy, as this is probably the most important taxonomic database that we have.

Julia Clarke and I were advocating data mining, not entirely successfully. At one point I started ranting about post-phylogenetics (i.e., what to do when we've basically got the tree of life). For a brief moment I thought this might be a cool new term to use, although Googling finds that W. Ford Doolittle has used it in the title of talks given at the Wenner-Gren Foundations International Symposium at Stockholm in 2003, and at Penn State in 2006. However, the 2006 talk title (Postphylogenetics: The Tree of Life in the Light of Lateral Gene Transfer) suggests a different meaning (i.e., there isn't a tree of life to be found). I prefer to think of it in the same sense as "postgenomics" -- now that we have all this information, how can we make the best use of it?

Thursday, September 04, 2008

I've been using ISSNs (International Standard Serial Numbers) to uniquely identify journals, both to generate article identifiers, and as a parameter to send to CrossRef's OpenURL resolver. Recently I've come across journals that change their ISSN, which has fairly catastrophic effects on my lookup tools. For example, the Canadian Journal of Botany has the ISSN 0008-4026, or at least this is what JournalSeek tells me. However, the journal web site tells me that it has been renamed as Botany, with ISSN 1916-2804. The thing is, if I want to look up DOIs for articles published in the Canadian Journal of Botany, I have to use the ISSN for Botany if I want to get a result. Hence, I can't rely on looking up the ISSN for the Canadian Journal of Botany. I've come across this in other journals as well.

The problem with these changes is that they make ISSNs more fragile. Ideally, the original ISSN would be preserved, and/or CrossRef would have a table mapping old ISSNs onto new ones. At the rate things are going, I may have to create such a table myself.
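A home-made mapping table could be as simple as the following Python sketch. The table and function names are hypothetical, and the only entry is the one pair of ISSNs mentioned above; a real table would have to be compiled by hand or from some other source.

```python
# Hypothetical table mapping superseded ISSNs to their replacements.
# Only the single pair mentioned in the post is filled in.
ISSN_SUCCESSORS = {
    "0008-4026": "1916-2804",  # Canadian Journal of Botany -> Botany
}


def current_issn(issn):
    """Follow renames until reaching an ISSN that has no recorded successor."""
    seen = set()
    while issn in ISSN_SUCCESSORS and issn not in seen:
        seen.add(issn)  # guard against accidental cycles in the table
        issn = ISSN_SUCCESSORS[issn]
    return issn
```

The lookup tool would call `current_issn()` on whatever ISSN it found for a journal before building the OpenURL query, so that articles from the Canadian Journal of Botany are searched for under the ISSN for Botany.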

Wednesday, September 03, 2008

Starting to get serious about the Grand Challenge. First step is to parse the XML data Elsevier made available. Sadly this is only for Molecular Phylogenetics and Evolution for 2007; I would have liked the whole journal in XML to avoid hassles with parsing PDF. However, XML is not without its own problems. I'm slowly getting my head around Elsevier's XML (which is, it has to be said, documented in depth). Two tools I find invaluable are the oXygen XML editor, and Marc Liyanage's TextXSLT application.

As a first attempt, I'm converting Elsevier XML into JSON (being a much simpler format to handle). I'm just after what I regard as the core data, namely the bibliography, and the tables (rich with GenBank accession numbers, specimen codes, and geocoordinates). There are a few "gotchas", such as missing namespaces to add, and HTML entities that need to be added. Then there's the fact that the XML describes both the document content and its presentation. Tables can get complicated (cells can span more than one row or column), which makes tasks such as identifying cell contents by using the heading of the corresponding column a bit harder. I hope to put an XSLT style sheet online once I'm happy that it can handle most, if not all, of the tables I've come across. Then the fun of trying to extract the information can begin.
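To illustrate why spanning cells are awkward, here is a minimal Python sketch that expands them into a plain rectangular grid, so that every cell lines up under its column heading. The input representation (a list of rows, each cell a `(text, rowspan, colspan)` tuple) is a simplification invented for this example; it is not Elsevier's table markup, and this is not the XSLT approach described above.

```python
def expand_spans(rows):
    """Expand cells with row/column spans into a rectangular grid.

    `rows` is a list of rows; each cell is a (text, rowspan, colspan) tuple.
    The text of a spanning cell is copied into every position it covers.
    """
    grid = []
    pending = {}  # (row, col) -> text carried into that position by a span
    for r, row in enumerate(rows):
        out = []
        cells = iter(row)
        c = 0
        while True:
            if (r, c) in pending:
                # Position already covered by an earlier spanning cell
                out.append(pending.pop((r, c)))
                c += 1
                continue
            cell = next(cells, None)
            if cell is None:
                break
            text, rowspan, colspan = cell
            for dr in range(rowspan):
                for dc in range(colspan):
                    if (dr, dc) != (0, 0):
                        pending[(r + dr, c + dc)] = text
            out.append(text)
            c += 1
        grid.append(out)
    return grid


# Made-up table: a heading spanning two columns, a taxon spanning two rows
table = [
    [("Taxon", 1, 1), ("Accession", 1, 2)],
    [("sp. A", 2, 1), ("gene1-a", 1, 1), ("gene2-a", 1, 1)],
    [("gene1-b", 1, 1), ("gene2-b", 1, 1)],
]
grid = expand_spans(table)
```

Once the grid is rectangular, the heading for any cell is simply the entry in the same column of the header row.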

Friday, August 29, 2008

In case I forget how to do this, and as an example of how easy it is to get sucked into a black hole of programming micro-details, I spent an hour or more trying to figure out how to handle Japanese characters.

If I want to harvest bibliographic metadata, I can parse the resulting HTML. I could follow the links to formats such as BibTex, but there's enough information in the link itself. For example, there's a link to the BibTex format that looks like this:

Note the percent-encoded fields, such as %B7%A6%CC%DA+%B4%B4%C9%D7. This string represents the author's name, 窪木 幹夫. It took me a little while to figure out how to convert %B7%A6%CC%DA+%B4%B4%C9%D7 to 窪木 幹夫. Eventually I discovered this table, which shows that there are a number of ways to represent Japanese characters, including JIS, SJIS, and EUC-JP. Given that C9D7 = 夫, the string is EUC-JP encoded. What I want is UTF-8. After some fussing, it turns out that all I need to do (in PHP) is:

rawurldecode decodes the percent-encoding to EUC-JP, then mb_convert_encoding gives me UTF-8. As an example, here is the above reference displayed by the bioGUID OpenURL resolver. A small victory, but it is nice to display the Japanese title. The English title of this article is "A New Subgenus of the Genus Pidonia MULSANT (Coleoptera: Cerambycidae)". It's perhaps the major triumph of Linnean taxonomy that even though I can't read a word of Japanese, I know the paper is about Pidonia.
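For comparison, the same two steps (undo the percent-encoding, then interpret the bytes as EUC-JP) can be sketched in Python using only the standard library. This is a rough equivalent rather than a translation of the PHP: the function name is made up, and `unquote_plus` also turns the `+` into a space, which PHP's `rawurldecode` does not do.

```python
from urllib.parse import unquote_plus


def decode_euc_jp(field):
    """Percent-decode a URL field and interpret the bytes as EUC-JP."""
    # Returns a Python (Unicode) string; call .encode("utf-8") for UTF-8 bytes
    return unquote_plus(field, encoding="euc_jp")


author = decode_euc_jp("%B7%A6%CC%DA+%B4%B4%C9%D7")
utf8_bytes = author.encode("utf-8")
```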

For Vince the award brings kudos, recognition, and €30,000 (just a little less than the fortune implied by dechronization ;) ). For me, it's an opportunity for unseemly basking in reflected glory (Vince is a former PhD student of mine, and also spent a Wellcome Trust Fellowship in my lab in the heady days when I cared about lice). If you haven't seen it, check out Vince's blog, and the Scratchpads.

Saturday, August 23, 2008

OMG. Playing with extracting identifiers from text, I have a regular expression for GenBank accession numbers that looks something like this: (A[A-Z])[0-9]{6} | (U)[0-9]{5} | (D[A-Z])[0-9]{6} | (E[A-Z])[0-9]{6} | (NC_)[0-9]{6}. OK, it won't get everything, but what is more worrying are the things it will pick up that aren't GenBank accession numbers.

For example, I ran Robert Mesibov's 2005 paper "The millipede genus Lissodesmus Chamberlin, 1920 (Diplopoda: Polydesmida:Dalodesmidae) from Tasmania and Victoria, with descriptions of a new genus and 24 new species" [PDF here] through a script, and out came loads of GenBank accession numbers ... which is a worry as there aren't any sequences in this paper.

Turns out, Mesibov uses UTM grid references to describe localities, and these look just like GenBank accessions. There is a nice web site here which describes how UTM grid references are determined in Tasmania (from which the image below is taken). Not all the "accession numbers" in Mesibov (2005) exist in GenBank, but some do, for example grid reference DQ402119 (41°26'31''S 146°17'02''E) is also a sequence DQ402119 and, you guessed it, it's not from a millipede. So, I need to be a little bit careful in extracting identifiers from text.
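The problem is easy to reproduce. Below is a minimal Python sketch using a pattern along the lines of the one above (it is an approximation with word boundaries added, not the exact expression from my script, and the function name is made up). A pattern like this can only say that a string has the shape of an accession number, not that it is one.

```python
import re

# Rough approximation of the accession-number pattern discussed above:
# A?, D? or E? plus six digits, U plus five digits, or NC_ plus six digits.
ACCESSION = re.compile(r"\b(?:[ADE][A-Z][0-9]{6}|U[0-9]{5}|NC_[0-9]{6})\b")


def accession_candidates(text):
    """Return every substring that has the shape of an accession number."""
    return ACCESSION.findall(text)


# A locality written as a UTM grid reference, as in Mesibov (2005)
locality = "grid reference DQ402119 (41°26'31''S 146°17'02''E)"
hits = accession_candidates(locality)
```

Here `hits` contains `DQ402119`, even though the text is describing a place, so anything the pattern finds is best treated as a candidate to be checked against its context or against GenBank itself.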

Thursday, August 21, 2008

Elsevier recently announced the 10 semi-finalists for their Grand Challenge. To my consternation, I'm one of them. I wrote a proposal entitled "Towards realising Darwin’s dream: setting the trees free" (I have uploaded a copy to Nature Precedings, it should be available shortly; see doi:10.1038/npre.2008.2217.1). The "setting the trees free" is a reference to my oft expressed view that much of our knowledge of evolutionary history is locked up in the pages of Molecular Phylogenetics and Evolution.

Of course, writing a proposal is one thing, making something useful is quite another. I envision something along the lines of this, but *cough* better. Meantime, the other semi-finalists look scarily good.

Wednesday, August 20, 2008

Time for some fun. In between some tedious text mining I've been meaning to explore some visualisations of NCBI. Here's the first, inspired by Jörn Clausen's wonderful Live Earthquake Mashup (thanks to Donat Agosti for telling me about this). What I've done is take all the frog sequences in GenBank that are georeferenced, add the date those GenBank records were created, generate a KML file, and use Nick Rabinowitz's timemap to plot the KML. The result is here:

By dragging the time line you can see collections of sequences and where the frog samples came from. Clicking on a marker on the Google Map displays a link to the GenBank record. It's all pretty crude, but fun to play with. What I'm toying with is trying to do something like this for new taxa, i.e., a timemap showing where and when new species are described. Sort of a live biodiversity map like the earthquake mashup, albeit not quite so rapidly moving.
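The KML-generating step can be sketched in a few lines of Python. This is not the script I used: the record fields (`accession`, `created`, `lat`, `lon`) and the function name are hypothetical, and I am assuming the plotting library reads the standard KML `TimeStamp` element.

```python
import xml.etree.ElementTree as ET

KML_NS = "http://www.opengis.net/kml/2.2"


def records_to_kml(records):
    """Build a KML document with one time-stamped Placemark per record."""
    ET.register_namespace("", KML_NS)

    def tag(name):
        return f"{{{KML_NS}}}{name}"

    kml = ET.Element(tag("kml"))
    doc = ET.SubElement(kml, tag("Document"))
    for rec in records:
        placemark = ET.SubElement(doc, tag("Placemark"))
        ET.SubElement(placemark, tag("name")).text = rec["accession"]
        stamp = ET.SubElement(placemark, tag("TimeStamp"))
        ET.SubElement(stamp, tag("when")).text = rec["created"]
        point = ET.SubElement(placemark, tag("Point"))
        # KML writes coordinates as longitude,latitude (not the other way round)
        ET.SubElement(point, tag("coordinates")).text = f'{rec["lon"]},{rec["lat"]}'
    return ET.tostring(kml, encoding="unicode")


# One made-up record; real input would come from georeferenced sequence records
kml_text = records_to_kml(
    [{"accession": "EXAMPLE01", "created": "2008-08-20", "lat": -41.5, "lon": 146.5}]
)
```

The same structure would work for newly described species: swap the accession for a taxon name and the record creation date for the publication date.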

ZooKeys (ISSN 1313-2970) is a new journal for the rapid publication of taxonomic names, rather like Zootaxa. At first glance it has some nice features, such as being Open Access (using the Creative Commons Attribution license), DOIs, and RSS feeds -- although these don't validate, partly due to an error at the bottom of the feeds:

The RSS feeds are reasonably informative, although they don't include the DOI, which somewhat defeats the point of having them. DOIs need to be first class citizens in taxonomic literature.

But these are technical matters; the real question is why? Why create a new journal when Zootaxa is pumping out new taxonomic papers at an astonishing rate? Why not combine forces (DOIs and RSS for Zootaxa, yay!)? There is an editorial doi:10.3897/zookeys.1.11 that is rather coy about this. Yes, Open Access is a Good Thing™, but Zootaxa has some Open Access articles. Why dilute the effort to transform zoological taxonomy by creating a new journal?

While biodiversity informatics putters along, generating loads of globally unique identifiers that nobody else uses, perhaps it's time to take a look at the bigger picture. DBpedia is an effort to extract data from Wikipedia and make it available as linked data. At the heart of this effort is the use of HTTP URIs to identify resources, and reusing those URIs. Hence, for many concepts DBpedia URIs are the default option.

This approach has several advantages. For one, it embeds taxonomic authorities in the broader ocean of linked data. It also makes use of Wikipedia to provide biographical details on taxonomic authorities (many of whom are sufficiently noteworthy to appear in Wikipedia). Until we start linking to other data sources, taxonomic data will remain in its own little ghetto.

Thursday, August 14, 2008

The proceedings of the BNCOD 2008 Workshop on "Biodiversity Informatics: challenges in modelling and managing biodiversity knowledge" are online. This workshop was held in conjunction with the 25th British National Conference on Databases (BNCOD 2008) at Cardiff, Wales. The papers make interesting reading.

This paper describes visualisation as a means to explore data from the International Plant Names Index (IPNI). Several visualisations are used to display large volumes of data and to help data standardisation efforts. These have potential uses in data mining and in the exploration of taxon concepts.

Nicky explores some visualisations of the IPNI plant name database. Unfortunately only one of these (arguably the least exciting one) is shown in the PDF. The visualisations of citation history using Timeline, and social networks using prefuse are mentioned, but not shown.

Taxonomists have been slow to adopt the web as a medium for building research communities. Yet, web-based communities hold great potential for accelerating the pace of taxonomic research. Here we describe a social networking application (Scratchpads) that enables communities of biodiversity researchers to manage and publish their data online. In the first year of operation 466 registered users comprising 53 separate communities have collectively generated 110,000 pages within their Scratchpads. Our approach challenges the traditional model of scholarly communication and may serve as a model to other research disciplines beyond biodiversity science.

This is a short note describing Scratchpads, which are built using the Drupal content management system (CMS). Scratchpads provide a simple way for taxonomists to get their content online. Based in large measure on the success of scratchpads, EOL will use Drupal as the basis of their "Lifedesks". There are numerous scratchpads online, although the amount and quality of content is, um, variable.

Managing Biodiversity Knowledge in the Encyclopedia of Life by Jen Schopf et al. [PDF]

The Encyclopedia of Life is currently working with hundreds of Content Providers to create 1.8 million aggregated species pages, consisting of tens of millions of data objects, in the next ten years. This article gives an overview of our current data management and Content Provider interactions.

This is a short note on EOL itself. I've given my views on EOL's progress (or, rather, lack thereof) elsewhere (here, here and here). The first author on this paper has left the project, and at least one of the other authors is leaving. It seems EOL has yet to find its feet (it certainly has no idea of how to use blogs).

A core mission in biodiversity informatics is to build a computing infrastructure for rapid, real-time analysis of biodiversity information. We have created the information technology to mine, analyze, interpret and visualize how diseases are evolving across the globe. The system rapidly collects the newest and most complete data on dangerous strains of viruses that are able to infect human and animal populations. Following completion, the system will also test whether positions in the genome are under positive selection or purifying selection, a useful feature to monitor functional genomic characteristics such as, drug resistance, host specificity, and transmissibility. Our system’s persistent monitoring and reporting of the distribution of dangerous and novel viral strains will allow for better threat forecasting. This information system allows for greatly increased efficiency in tracking the evolution of disease threats.

In this paper we describe a GBIF/TDWG-funded project in which LSIDs have been deployed in the Catalogue of Life’s Annual and Dynamic Checklist products as a means of identifying species and higher taxa in these large species catalogues. We look at the technical infrastructure requirements and topology for the LSID resolution process and characteristics of the RDF (Resource Description Framework) metadata returned by the resolver. Such characteristics include the use of concepts and relationships taken from the TDWG (Taxonomic Database Working Group) ontology and how a given taxon LSID relates to others including those issued by database providers and those above and below it in the taxonomic tree. Finally we evaluate the project and LSID usage in general. We also look to the future when the CoL LSID infrastructure will have to deal changing taxonomic information, annually in the case of the Annual Checklist and possibly much more frequently in the case of the Dynamic Checklist.

Ironically, when I was talking to Frank Bisby earlier this year, he implied that LSIDs would change with each release if the information about a name changed, thus failing to solve the existing, fundamental design flaw in the Catalogue of Life, namely the lack of stable identifiers! So, at first glance we are stuck with hideous-looking identifiers that may be unstable. Hmmm...

In this paper we discuss the potential that scientific workflow systems have to support biodiversity researchers in achieving their goals. This potential comes through their ability to harness distributed resources and set up complex, multi-stage experiments. However, there remain concerns over the usability of existing workflow systems and research still needs to be done to help match the functionality of the software to the needs of its users. We discuss some of the existing concerns regarding workflow systems and propose three potential interfaces intended to improve workflow usability. We also outline the software architecture that we have adopted, which is designed to make our proposed workflow interface software interoperable across key workflow systems.

Not sure what to make of this paper. Workflows seem to generate an awful lot of publications, and few tools that people actually use.

All aspects of organismal biology rely on the accurate identification of specimens described and observed. This is particularly important for ecological surveys of biodiversity, where organisms must be identified and labelled, both for the purposes of the original research, but also to allow reinterpretation or reuse of collected data by subsequent research projects. Yet it is now clear that biological names in isolation are unsuitable as unique identifiers for organisms. Much modern research in ecology is based on the integration (and re-use) of multiple datasets which are inherently complex, reflecting any of the many spatial and temporal environmental factors and organismal interactions that contribute to a given ecosystem. We describe visualization tools that aid in the process of building concept relations between related classifications and then in understanding the effects of using these relations to match across sets of classifications.

The second contribution was published in the conference proceedings, but there is also a free version available here from the project's blog. The paper describes TaxVis, a project developing visualisation techniques for comparing multiple taxonomic hierarchies. The paper discusses taxonomic concepts and the difficulty of establishing what a taxonomist meant when they used a particular name. As much as I understand the argument, I can't shake the feeling that obsessing about taxonomic concepts is ultimately a dead end. It won't scale, and in an age of DNA barcoding, it becomes less and less relevant.

Releasing the content of taxonomic papers: solutions to access and data mining by Chris Lyal and Anna Weitzman [PDF]

Taxonomic information is key to all studies of biodiversity. Taxonomic literature contains vast quantities of that information, but it is under-utilised because it is difficult to access, especially by those in biodiverse countries and non-taxonomists. A number of initiatives are making this literature available on the Web as images or even as unstructured text, but while that improves accessibility, there is more that needs to be done to assist users in locating the publication; locating the relevant part of the publication (article, chapter etc) and locating the text or data required within the relevant part of the publication. Taxonomic information is highly structured and automated scripts can be used to mark-up or parse data from it into atomised pieces that may be searched and repurposed as needed. We have developed a schema, taXMLit that allows for mark-up of taxonomic literature in this way. We have also developed a prototype system, INOTAXA that uses literature marked up in taXMLit for sophisticated data discovery.

This is a nice overview of the challenge of extracting information from legacy literature. There are numerous challenges facing this work, including tasks that are trivial for people, such as determining when an article starts and ends, but which are challenging for computers (see Lu et al. doi:10.1145/1378889.1378918, free copy here -- there is a job related to this question available now). Related efforts are the TaxonX markup being used by Plazi. My own view is that for legacy literature heavy markup is probably overkill; decent text mining will be enough. The real challenge is to stop the rot at source, and enable new taxonomic publications to be marked up as part of the authoring and publishing process.

The present biodiversity distributed solution using DiGIR / TAPIR protocols and the Darwincore2 schema has been very valuable in the centralized portals, which that can provide distributed information in a very quickly way. Using the same concept this paper presents an architecture based on the case study of pollinators to bring the centralization of the relational information to those portals. This architecture is based on a technological structure to facilitate the implementation and extraction from the providers of that relational information, and proposes a model to make this information reliable to be used with the present specimens information on the portal database.

This is a short note on extending DarwinCore to include information about pollination relationships. The wisdom of doing this has been questioned (see Roger Hyam's comment on the proposal).

A Pan-European Species-directories Infrastructure (PESI) by Charles Hussey and Yde de Jong [PDF]

This communication introduces the rationale and aims of a new Europe-wide biodiversity informatics project. PESI defines and coordinates strategies to enhance the quality and reliability of European biodiversity information by integrating the infrastructural components of four major community networks on taxonomic indexing, namely those of marine life, terrestrial plants, fungi and animals, into a joint work programme. This will include functional knowledge networks of both taxonomic experts and regional focal points, which will collaborate on the establishment of standardised and authoritative taxonomic (meta-) data. In addition PESI will coordinate the integration and synchronisation of the European taxonomic information systems into a joint e-infrastructure and the creation of a common user-interface disseminating the pan-European checklists and associated user-services results.

This paper describes PESI, yet another mega-science project in biodiversity, complete with acronyms, work packages, and vacuous, buzzword-compliant statements. Just what the discipline needs...