Digital library pipeline for a million books.

This entry was published on
16/03/2008

I was pleased to be invited by Brian Fuchs to a 'Million Books Workshop' at Imperial College, London last Friday. A fascinating day, in the company of what was, for me, an unusual group of 20-30 linguists, classical scholars and computer scientists. The morning session consisted of three presentations (following an introduction from Gregory Crane which I missed thanks to the increasingly awful transport system between London and the South West) which brought us up to speed with some advances in OCR, computer aided text analysis and translation, and classification. The presentations were intended to form an ordered progression:

Listening to these presentations, I quickly found myself well outside of my comfort zone, in terms of both the science and the domain (classical literature), so it was a challenging and exhilarating morning! It was difficult to take comprehensive notes as I had to really concentrate on the presentations at times in order to follow them - the pace was pretty smart, with jargon and 'in jokes' galore.

David Smith, Johns Hopkins University gave a fascinating and entertaining presentation which outlined some of the challenges, and advances, in language parsing and translation. He pointed out that although the structured view of the semantic web is a seductive one, even the newer online, digital genres such as email, blogs mostly use unstructured or semi-structured text. However, parsing free text is very difficult, especially with the growing scale and diversity of texts available on the web. To illustrate this he employed a series of (sometimes amusing) translations from the Google translation service. The best available technology today uses supervised machine learning techniques to build a treebank. An alternative approach employs semi-supervised, modelling techniques. Parallel texts in different languages are useful but, for some languages, only the Universal Declaration of Human Rights exists as a parallel text! As an aside, David pointed to the potential advantage in search engines searching several languages: if you enter your query in English for example, by searching resources in other languages, the search engine automatically expands the search, utilises synonyms etc. 'for free'. This can then be more effective than monolingual searching. David offered a future based in pragmatism: translation support rather than fluent translation.

David Mimno presented on classification, sequences & topic modelling. In an interesting talk, it was the visualisation (as a topical transfer graph) of topic relations extrapolated from citations in a set of scholarly communications which really got the audience engaged - a series of questions ensued before David could move on. He illustrated his work with accessible examples: for example, it turned out from one experiment that the single term most likely to identify email spam was, believe it or not, the word "red" showing in the markup, owing to the fact that "only spammers use red text"! Apparently, he had a system which could classify any of Shakespeare's plays as tragedy, comedy or history…. with the exception of Romeo and Juliet, which comes out as a comedy for some reason….

The takeaway for me was that some of the technology in these spaces is maturing. Thomas Breuel, for instance, made a compelling case for really effective OCR (Optical Character Recognition) in his description of the open-source OCRopus project, which he leads and which is sponsored by Google. Building on previous systems like the character recognition system Tesseract, OCRopus employs a modular design with components which offer the following workflow, focussing on the processing of scanned books:

The project is heading towards a beta release this year, and the team plan to create a deployment 'bundle' in the form of an Amazon EC2 AMI. I didn't quite catch the details but I think they have found a way to monetise this through the Amazon referral program, which sounds interesting. In any case, the idea is that one could take the AMI, deploy it, run it for a few hours to process a particular scan, and then shut it down again - potentially a very cost-effective way of proceeding. Thomas made the point that, as OCR technology continues to improve, we are likely to want to process scans of books several times. He explained how the project was aiming towards a "full digital library pipeline", a system which could be deployed from a connected laptop: with the new affordability of powerful digital cameras, a researcher might photograph a book's pages themselves before feeding the resulting image into the OCRopus workflow OCRopus can handle the distortion effect of non-flat pages very effectively). Another interesting aspect of this work is the distributed parallel training which underpins the statistical language modelling: a large model is achieved by combining many little models created by many people, through the web. If you are interested in this area, then you should also check out the hOCR format specification and tools.

I had been invited to this workshop because of my role and interest in the deployment of services at a community and network level. I joined a panel at the the very end of the day where we were invited to consider what services and infrastructure might be required, in the UK, to support the digitisation and useful processing of a 'million books'. We didn't get very far with this because we had run out of time and, I suspect, energy by this point, but the question remains…. I'll be picking this up with some colleagues in due course.

Fascinating day, and topped of with a quick pint standing outside a packed London pub in a light drizzle, which was actually a refreshing and pleasant way to conclude!