Rants, raves (and occasionally considered opinions) on phyloinformatics, taxonomy, and biodiversity informatics. For more ranty and less considered opinions, see my Twitter feed. ISSN 2051-8188

Wednesday, September 30, 2009

One thing about the Encyclopedia of Life which bugs me no end is the awful way it displays the bibliography generated from the Biodiversity Heritage Library (BHL). The image on the right shows the bibliography for the frog Hyla rivularis Taylor, 1952. It's one long, alphabetical list of pages. How can a user make sense of this? It's even more annoying because the BHL is one of the cornerstones of EOL, and one could argue that BHL content is one of the few things EOL pages offer that distinguishes them from cheap and cheerful mashups such as iSpecies. Can't we do something a little better?

BHL has an API (documented here), so I decided to experiment. As I mentioned in an earlier post (Biodiversity Heritage Library, Google books, and metadata quality), a key piece of metadata about a bibliographic reference is its date. This is especially so for the taxonomic literature, where the earliest reference that contains a name may (depending on how complete BHL scanning is) be the first description of that name. So, it would be nice to order the BHL bibliography by date. Turns out it's possible to get dates from quite a few BHL items, providing one fusses with regular expressions long enough.

So, in principle we could sort BHL content by dates. But, we could go one better and visualise them. As an experiment, I've put together a demo that uses the SIMILE Timeline widget to display the BHL bibliography for a taxon. Here's a screenshot of the bibliography for Hyla rivularis:

You can generate others at http://bioguid.info/bhl/. The demo has been thrown together in haste, but here's what it does:

1. Takes a taxon name and tries to find it in uBio. This gives us the NamebankID BHL needs.

2. Calls the BHL API and retrieves the bibliography for the NamebankID found in step 1.

3. The previous step generates a JSON document which can be displayed by Timeline.

If you click on an item you get a list of pages; clicking on any of those takes you to the page in BHL. Items that have a range of dates are displayed as horizontal lines, while items with a well-defined date are shown as points. Note that my code for working out the date of an item will probably fail on some items, and some items don't have any dates at all. Hence, not every item in BHL will appear in the timeline.
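The mapping from dated items to timeline events could be sketched roughly like this. It is a minimal illustration in Python, not the code behind the demo. The input dictionaries and the helper `to_timeline_events` are hypothetical, and the event fields (`start`, `end`, `title`, `durationEvent`) reflect my understanding of the JSON event source that the SIMILE Timeline widget loads, so they should be checked against the Timeline documentation.

```python
import json

def to_timeline_events(items):
    """Convert dated items to a Timeline-style event source.

    `items` is a hypothetical list of dicts with keys 'title',
    'start_year' and 'end_year' (either year may be None).
    """
    events = []
    for item in items:
        start = item.get("start_year")
        end = item.get("end_year")
        if start is None:
            # No usable date: the item cannot be placed on the timeline.
            continue
        event = {"title": item["title"], "start": str(start)}
        if end is not None and end != start:
            # A range of dates is drawn as a horizontal line.
            event["end"] = str(end)
            event["durationEvent"] = True
        else:
            # A single well-defined date is drawn as a point.
            event["durationEvent"] = False
        events.append(event)
    return {"dateTimeFormat": "iso8601", "events": events}

# Illustrative input only; the first and last entries are made up.
items = [
    {"title": "Hypothetical item dated 1952", "start_year": 1952, "end_year": None},
    {"title": "University of Kansas publications, Museum of Natural History",
     "start_year": 1961, "end_year": 1966},
    {"title": "Hypothetical item with no date", "start_year": None, "end_year": None},
]
timeline = to_timeline_events(items)
print(json.dumps(timeline, indent=2))
```

The undated item is dropped, which is why not every BHL item shows up in the display.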

It would be nice to embellish the results a little (for example, group pages into articles, refine the dates, etc.) but I think this goes a little way to demonstrating what can be done. We could also add articles obtained from other sources (e.g., Google Scholar, PubMed) to the same display, providing an overview of published knowledge about a taxon.

Now, Google Scholar isn't perfect, but it's come to play a key role in a variety of bibliographic tools, such as Mendeley, and Papers. These tools do a delicate dance with Google Scholar who, strictly speaking, don't want anybody scraping their content. There's no API, so Mendeley, Papers (and my own iSpecies) have to keep up with the HTML tweaks that Google introduces, pretend to be web browsers, fuss with cookies, and try to keep the rate of queries below the level at which the Google monster stirs and slaps them down.

Jacsó's critique also misses the main point. Why do we have free (albeit closed) tools like Google Scholar in the first place? It's largely because scientists have ceded the field of citation analysis to commercial companies, such as Elsevier and Thomson Reuters. To echo Martin Kalfatovic's comment:

Over the years, we've (librarians and the user community) have allowed an important class of metadata - specifically the article level metadata - migrate to for profit entities.

For me, this is the one thing the ridiculously over-hyped Mendeley could do that would merit the degree of media attention it is getting -- be the basis of an open citation database. It would need massive improvement to its metadata extraction algorithms, which currently suck (Google Scholar's, for all Jacsó's complaints, are much better), but it would generate something of lasting value.

Saturday, September 19, 2009

I've been playing recently with the Biodiversity Heritage Library (BHL), and am starting to get a sense for the complexities (and limitations) of the metadata BHL stores about publications. The more I look at BHL the more I think the resource is (a) wonderfully useful and (b) hampered by some dodgy metadata.

The BHL data model has three kinds of entities, "Titles", "Items", and "Pages". Pages are individual pages in an item, where an item corresponds to a physical object that has been scanned (such as a book or a bound volume of a journal). A title may comprise a single item, such as a book, or many items, such as volumes of a journal. Most of the metadata BHL has relates to physical items (books and bound volume issues), as opposed to article-level metadata, which is basically absent (see But where are the articles?).

This model reflects the sources of the BHL metadata (library catalogues) and the mode of operation (bulk scanning of bound volumes). But it can make working out dates somewhat challenging.

To give an example, I did a search on the frog name Hyla rivularis Taylor, 1952 (NameBankID 27357), currently known as Isthmohyla rivularis. I wanted to find the original description of this frog. A BHL search returns 34 pages containing the name Hyla rivularis, distributed over 5 titles (a title in BHL may be a book, or a journal). Given that the name was published in 1952, it would be nice if I could sort these results by date, and then look at items from 1952. Unfortunately I can't. BHL has limited information on dates, especially at the level I would need to find a document published in 1952.

For the five titles returned in the search, I have dates for four of them, albeit two are ranges (University of Kansas publications, Museum of Natural History, 1946-1971, and The University of Kansas science bulletin, 1902-1996). At the level of individual items, only item 25858 (University of Kansas publications, Museum of Natural History) has dates (1961-1966). If I look at the VolumeInfo field for an item (you can get this from the database dump, or using the JSON web service) I sometimes get strings like this "v.35:pt.1 (1952)". This item (25857) is the one I'm after, but the date is buried in the VolumeInfo string. So, the information I need is there, but it's going to need some parsing.
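As a rough idea of the parsing involved, here is a minimal Python sketch that pulls a year, or a range of years, out of a VolumeInfo-style string. The pattern, the year bounds (1500 to 2099), and the function name `parse_volume_years` are my own assumptions for illustration. Only the first test string comes from the example above; the other two are made up.

```python
import re

# A four-digit year from 1500 to 2099, optionally followed by a second
# year to form a range such as "1961-1966". The bounds are an assumption.
YEAR_RE = re.compile(
    r"\b(1[5-9]\d{2}|20\d{2})(?:\s*-\s*(1[5-9]\d{2}|20\d{2}))?\b"
)

def parse_volume_years(volume_info):
    """Return (start_year, end_year) from a VolumeInfo string, or None."""
    match = YEAR_RE.search(volume_info)
    if match is None:
        return None
    start = int(match.group(1))
    # A single year is returned as a range of length zero.
    end = int(match.group(2)) if match.group(2) else start
    return (start, end)

print(parse_volume_years("v.35:pt.1 (1952)"))  # string quoted in the post
print(parse_volume_years("v.14 (1961-1966)"))  # hypothetical range
print(parse_volume_years("v.35:pt.1"))         # hypothetical, no year
```

A pattern this simple will misfire on some strings (a four-digit volume number such as "v.1900" would be read as a year), which is the kind of fussing the real data demands.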

Another issue is that of duplicates. Searching for publications on Rana grahamii, I found items 41040 and 45847. Although one item is treated as a book, and the other as a volume of the journal Records of the Indian Museum, these are the same thing. Having duplicates is a complication, but it might also be useful for quality control and testing (for example, do taxon name extraction algorithms return the same names from OCR text from both copies?). Nor is having duplicate copies and/or identifiers unique to BHL. The Records of the Indian Museum has a series-level identifier (ISSN 0537-0744), and this article ("A monograph of the South Asian, Papuan, Melanesian and Australian frogs of the genus Rana") also has the ISBN 8121104327.

There are parallels with the Google books scanning project, which has been the subject of criticism on several fronts, including the quality of the metadata they have for each book. Geoff Nunberg has an entertaining post entitled Google Books: A Metadata Train Wreck which lists many examples of errors. This blog post also contains a detailed response from Jon Orwant of Google books. In essence, Google books is riddled with metadata errors (such as books on the Internet with publication dates predating the birth of their authors), but most of these errors have come from library catalogues (not unexpected given the scale of the task), not Google.

What could BHL do about its metadata? One thing is crowdsourcing. BHL does a little of this already, for example capturing user-provided metadata when PDFs are created, but I wonder if we could do more. For example, imagine dumping metadata for all 39,000 items into a semantic wiki and inviting people to edit and annotate the metadata. This could be extended to adding article boundaries (i.e., identifying which page corresponds to the start of an article). There is also considerable scope for trying to find article boundaries using existing metadata from bibliographies assembled by individual scientists.

But we should watch closely what Google does with its book project. Eric Hellman has argued that, far from creating the metadata mess, Google is ideally positioned to sort it out. He writes:

What if Google, with pseudo-monopoly funding and the smartest engineers anywhere, manages to figure out new ways to separate the bird shit from the valuable metadata in thousands of metadata feeds, thereby revolutionizing the library world without even intending to do so?

Thursday, September 17, 2009

At the start of this week I took part in a biodiversity informatics workshop at the Naturhistoriska riksmuseets, organised by Kevin Holston. It was a fun experience, and Kevin was a great host, going out of his way to make sure myself and other contributors were looked after. I gave my usual pitch along the lines of "if you're not online you don't exist", and talked about iSpecies, identifiers, and wikis.

I also ran a short, not terribly successful exercise using iTaxon to demo what semantic wikis can do. As is often the case with something that hasn't been polished yet, the students found the mechanics of doing things less than intuitive. I need to do a lot of work making data input easier (to date I've focussed on automated adding of data, and forms to edit existing data). Adding data is easy if you know how, but the user needs to know more than they really should have to.

The exercise was to take some frog taxa from the Frost et al. amphibian tree (doi:10.1206/0003-0090(2006)297[0001:TATOL]2.0.CO;2) and link them to GenBank sequences and museum specimens. The hope was that by making these links new information would emerge. You could think of it as an editable version of this. With a bit of post-exercise tidying, we got some way there. The wiki page for the Frost et al. paper now shows a list of sequences from that paper (not all, I hasten to add), and a map for those sequences that the students added to the wiki:

Although much remains to be done, I can't help thinking that this approach would work well for a database like TreeBASE, where one really needs to add a lot of annotation to make it useful (for example, mapping OTUs to taxon names, linking data to sequences and specimens). So, one of the things I'm going to look at is dumping a copy of TreeBASE (complete with trees) into the wiki and seeing what can be done with it. Oh, and I need to make it much, much easier for people to add data.

Alex gives the background to the argument about whether Pyramica is a synonym of Strumigenys, and investigates the issue using the surprisingly small amount of data available in GenBank. The tree he found (shown below) suggests this issue will require some work to resolve:

The diagram shows the two occasions when the page has been stripped of content (and subsequently restored) as contributors dispute whether Pyramica is a synonym of Strumigenys. It would be useful to have one or more metrics of how controversial a page (and/or a contributor) was, both to identify controversial pages and to see how controversial taxonomic pages were compared to other Wikipedia topics. The paper On Ranking Controversies in Wikipedia: Models and Evaluation by Ba-Quy Vuong et al. (doi:10.1145/1341531.1341556) would be a good place to start (a video of the presentation of this paper is available here).

Friday, September 11, 2009

Here's the take home message: in terms of online gene annotation resources, Gene Cards is the most common top-ranked resource, followed closely by the Gene Wiki / Wikipedia, with NCBI in a very distant third (note the log scale).

This result is interesting in that an existing resource (Gene Cards) beats Wikipedia, but only just. There are various ways we could interpret this, but from the point of view of biodiversity resources I suspect it emphasises that if there is a good, existing resource that has a lot of traction (i.e., Gene Cards) it will do well in Google Searches. If there is no single dominant resource (as is the case for biodiversity), then it leaves the field open to be dominated by Wikipedia.

Imagine a scenario where three people will make contributions to a Wiki page at different points in time. Each person edits the page and then saves their changes to what becomes the latest version of that page.

History Flow connects text that has been kept the same between consecutive versions. Pieces of text that do not have correspondence in the next (or previous) version are not connected and the user sees a resulting "gap" in the visualization; this happens for deletions and insertions. (animated GIF from Jeff Atwood's post).

There's a nice paper describing history flow (doi:10.1145/985692.985765, free PDF here). Inspired by this I decided to try and implement history flow in PHP and SVG. Here's a preliminary result:

This is the edit history for the Afrotheria page. Click on the image above (or here to see the SVG image -- you need a decent web browser for this, IE users will need an SVG plugin).

The SVG image is clickable. The columns represent revisions, click on those to go to that revision. The columns are evenly spaced (i.e., the gaps don't correspond to time). The bands between revisions trace individual blocks of text (in this case lines in the Wikipedia page source). If you click on a band you get taken to that Wikipedia user's page.
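The core step, connecting blocks of lines that survive from one revision to the next, can be illustrated with a short sketch. This is not the PHP used for the image above; it is a Python stand-in that uses the standard library's `difflib.SequenceMatcher` as the diff, and the function name `connect_revisions` and the two tiny revisions are invented for the example.

```python
from difflib import SequenceMatcher

def connect_revisions(old_lines, new_lines):
    """Return (old_index, new_index, length) runs of lines kept unchanged."""
    matcher = SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    # get_matching_blocks() ends with a zero-length sentinel, dropped here.
    return [
        (m.a, m.b, m.size)
        for m in matcher.get_matching_blocks()
        if m.size > 0
    ]

# Two made-up revisions of a page, one line of wiki source per entry.
rev1 = ["{{Taxobox}}", "Afrotheria is a clade of mammals.", "See also"]
rev2 = [
    "{{Taxobox}}",
    "A new opening sentence.",
    "Afrotheria is a clade of mammals.",
    "See also",
]
bands = connect_revisions(rev1, rev2)
print(bands)
```

Each tuple would become one band between adjacent columns. Lines in `rev2` that belong to no tuple (here the inserted sentence) are the gaps in the visualisation.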

This is all done in a rush, but it gives an idea of what can be done. The history flow carries all sorts of information about how an article has developed over time, major changes (such as the introduction of Taxoboxes), and makes the content of a page traceable, in the sense that you can see who contributed what to a page.

Thursday, September 03, 2009

One response to my post on Fungi in Wikipedia was to say that fungi are also charismatic, so maybe I should try [insert unsexy taxon name here]. So, I've now looked at all the species I extracted from Wikipedia (nearly 72,000), run the Google searches, and here are the results:

Site                              How many times is it the top hit?
en.wikipedia.org                  42515
www.birdlife.org                  2125
commons.wikimedia.org             1522
plants.usda.gov                   1496
species.wikimedia.org             1487
animaldiversity.ummz.umich.edu    1419
amphibiaweb.org                   851
www.calflora.org                  770
www.fishbase.org                  727
ibc.lynxeds.com                   699
davesgarden.com                   659
www.arkive.org                    510
ukmoths.org.uk                    414
zipcodezoo.com                    368
www.itis.gov                      304
calphotos.berkeley.edu            294
www.floridata.com                 234
www.planetcatfish.com             234
www.eol.org                       226
www.arthurgrosset.com             213

The table lists the top twenty sites, based on the number of times each site occupies the number one place in the Google search results. Surprise, surprise, Wikipedia wins hands down.

What is interesting is that the other top-ranking sites tend to be taxon-specific, such as FishBase, Amphibia Web, and USDA Plants. To me this suggests that the argument that Wikipedia's dominance of the search results is because it focusses on charismatic taxa doesn't hold. In fact, the truly charismatic taxa are likely to have their own, richly informative web sites that will often beat Wikipedia in the search rankings. If your taxon is not charismatic, then it's a different story. This suggests one of two strategies for making taxon web sites that people will find. Either go for the niche market, and make a rich site for a set of taxa that you (and ideally some others) like, or add content to Wikipedia. Sites that span all taxa will always come up against Wikipedia's dominance in the search rankings. So, it's a choice of being a specialist, or trying to compete with an über-generalist.

Wednesday, September 02, 2009

One response to the analysis I did of the Google rank of mammal pages in Wikipedia is to suggest that Wikipedia does well for mammals because these are charismatic. It's been suggested that for other groups of taxa Wikipedia might not be so prominent in the search results.

As a quick test I extracted the 1552 fungal species I could find in Wikipedia and repeated the analysis. If anything, the results are more dramatic:

If fungi are less "charismatic" than mammals, the implication is that the less charismatic the taxon, the better Wikipedia does (perhaps there is less competition from other sites). Of course, Wikipedia is severely underpopulated with fungal pages, so one could argue that for fungi not in Wikipedia, sites like EOL may do better (relative to other sites), but that would need to be tested. I suspect that sites that provide more broadly useful information (such as APSnet) will continue to dominate the search rankings, followed by scientific articles (for the fungi in Wikipedia the publishers Springer, Wiley, and Elsevier all appear among the top sites in the Google rankings).

Tuesday, September 01, 2009

Playing a bit more with the Wikipedia mammal data, there are some interesting patterns to note. The first is that if we rank the mammal pages by size (here defined as the number of characters in the source for the page) and plot size against rank, then we get a graph that looks very much like a power law:

There are a few large pages on mammals (these are on the left), and lots of small pages (the long tail on the right). If we do a log-log plot we get this:

The straight line is characteristic of a power law. The dip at the far right reflects the fact that Wikipedia pages have a minimum size (for example, they must include a Taxobox). Now, this is a bit crude (I should probably look at "Power-law distributions in empirical data" arXiv:0706.1062v2 before getting too carried away), but power laws are characteristic of the link structure of the web (a few big sites with huge numbers of links, huge numbers of sites with few links), and indeed of at least parts of Wikipedia, such as the Gene Wiki project (see doi:10.1371/journal.pbio.0060175).
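To make "straight line on a log-log plot" concrete, here is a small Python sketch that ranks page sizes and fits a least-squares line to log(size) against log(rank). The function name `fit_loglog_slope` and the data are invented: the sizes are synthetic, generated from an exact power law, not the Wikipedia mammal data. This is also the crude approach; a regression on logged values is a quick check rather than a rigorous test for a power law.

```python
import math

def fit_loglog_slope(sizes):
    """Least-squares slope and intercept of log(size) against log(rank)."""
    ranked = sorted(sizes, reverse=True)  # rank 1 is the largest page
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(size) for size in ranked]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Synthetic sizes following size = 100000 * rank^-0.8 (made-up numbers).
sizes = [100000 * rank ** -0.8 for rank in range(1, 501)]
slope, intercept = fit_loglog_slope(sizes)
print(round(slope, 3), round(math.exp(intercept)))
```

For data that really follow a power law the fitted slope recovers the exponent. With real page sizes, the minimum page size would bend the right-hand end of the line, as in the plot.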

In this context, the diagrams are showing that even if mammals are "charismatic megafauna", most of them aren't that charismatic. Wikipedia mammal pages are mostly small. This raises the question of whether the high frequency with which Wikipedia mammal pages appeared at the top of Google searches might be attributed to those large pages on (presumably) charismatic mammals. If this were the case, then we'd expect that small pages wouldn't rank highly in Google searches. So, I plotted page size against Google search rank for the Wikipedia mammal pages:

This is a box plot, where the grey boxes represent 50% of the distribution of page size (the horizontal black line is the median), and extreme values are shown as circles. Note that "0" is the highest rank (i.e., the first hit in Google), and 9 is the lowest.

While, not surprisingly, most large Wikipedia pages do well in Google searches, and rarely are large pages low down the rankings, my sense is that small pages can have any rank, from top (0) to bottom (9). If page size (which is a crude measure of the effort put into editing a Wikipedia page) is a measure of "charisma" (contributors are more likely to edit pages on animals that lots of people know about), then charisma isn't a great predictor of where you come in Google's search results. It's not about size, it's about being in Wikipedia.

One assumption I've been making so far is that when people search for information on an organism using its scientific name, Wikipedia will dominate the search results (see my earlier post for an example of this assumption). I've decided to quantify this by doing a little experiment. I grabbed the Mammal Species of the World taxonomy and extracted the 5416 species names. I then used Google's AJAX search API to look up each name in Google. For each search I took the top 10 hits and recorded for each hit the site URL and the rank in the search results (i.e., 1-10). Below is a table of how many mammal species had a hit in the top 10 Google results (showing just the top 20 most frequent sites).

Things get more interesting if we look at the ranking of search results. The graph below plots the cumulative rank of search results for some of the web sites listed above.

Wikipedia dominates things. For 48% of all mammal species Wikipedia is the first result returned by Google. For just under three quarters of all mammal species, Wikipedia is either the first or second hit in Google. The next best sites are Animal Diversity Web and Wikispecies, which get a small share of first place for some species (19% and 7% respectively). Note that EOL pages manage to make it into the top 10 for only 11% of all mammal species.
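The cumulative figures are simple to compute once the search results have been recorded. The sketch below shows one way to do it in Python. It does not call Google at all. The `results` structure (species name mapped to an ordered list of result URLs), the function name `cumulative_rank_share`, and the four made-up searches are assumptions for illustration.

```python
from collections import defaultdict
from urllib.parse import urlparse

def cumulative_rank_share(results, site, max_rank=10):
    """Fraction of searches in which `site` appears at or above each rank.

    `results` maps a species name to its ordered list of result URLs
    (a hypothetical structure; the first URL is the top hit, rank 1).
    """
    best_rank_counts = defaultdict(int)
    for urls in results.values():
        for rank, url in enumerate(urls[:max_rank], start=1):
            if urlparse(url).netloc == site:
                best_rank_counts[rank] += 1
                break  # count only the best-ranked hit per search
    total = len(results)
    shares, running = [], 0
    for rank in range(1, max_rank + 1):
        running += best_rank_counts[rank]
        shares.append(running / total)
    return shares

# Made-up search results for four placeholder species.
results = {
    "Species a": ["http://en.wikipedia.org/wiki/A", "http://www.itis.gov/a"],
    "Species b": ["http://example.org/b", "http://en.wikipedia.org/wiki/B"],
    "Species c": ["http://example.org/c", "http://www.itis.gov/c"],
    "Species d": ["http://en.wikipedia.org/wiki/D"],
}
shares = cumulative_rank_share(results, "en.wikipedia.org", max_rank=3)
print(shares)
```

The first element of `shares` is the fraction of species for which the site is the top hit, the second is the fraction for which it is first or second, and so on.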

What does this all mean? Well, it seems clear that if people are using Google to find information about an organism, then Wikipedia is more likely than anything else to be the first result they see. It is also interesting that for all the energy (and funds) being expended on biodiversity databases (doi:10.1126/science.324_1632), ITIS is the only classical biodiversity database that routinely gets found in these searches (albeit in only a quarter of the searches).

I know I tend to go on a bit about EOL, but if I was running (or funding) EOL, I'd be worried. EOL barely figures in these search results, and is being taken to the cleaners by a volunteer effort (Wikipedia). Furthermore, it seems difficult to envisage what EOL can do to improve things. Sure it can link to (and make use of) content in sites such as Animal Diversity Web, ITIS (and maybe even, gasp, Wikipedia), but that just adds "link love" to those sites. Ironically, perhaps the single thing that would improve EOL's ranking would be if Wikipedia spread some of its link love over EOL, by linking all its taxon pages to the corresponding EOL page.

But there are bigger issues at stake. Site popularity on the web tends to follow a power law, where a very few web sites grab the vast majority of eyeballs. In an old blog post Clay Shirky wrote:

Now, thanks to a series of breakthroughs in network theory by researchers ... we know that power law distributions tend to arise in social systems where many people express their preferences among many options. We also know that as the number of options rise, the curve becomes more extreme. This is a counter-intuitive finding - most of us would expect a rising number of choices to flatten the curve, but in fact, increasing the size of the system increases the gap between the #1 spot and the median spot.

So, creating new and improved biodiversity web sites is likely to have the effect of only increasing the gap between Wikipedia and the rest.

Lastly, as I've mentioned before regarding Wikipedia and citations of taxonomic work, the graph above suggests to me that for anybody wanting to make basic biodiversity information available on the web, and attract readers to basic taxonomic literature, there really is only one game in town.