How to Digitize a Million Books

Fifteen months after Google announced a book-scanning project of biblical proportions – an effort to digitize the entire book collections of the New York Public Library and Harvard University libraries, among others – the company is still secretive about how it is solving key technical problems and won’t say how much it has accomplished so far.

However, a similar if smaller project – the Million Book Project at Carnegie Mellon University in Pittsburgh – has been underway for about seven years. It could provide some clues. The project’s director, computer scientist Raj Reddy, says he and his colleagues have no more knowledge about Google’s methods or progress than anyone else, but they are tackling many of the same challenges.

The goal of Google Book Search is to make all offline books – currently invisible to Google’s eye – searchable. This means physically scanning hundreds of millions of pages bound between the covers of an estimated 18 million books, recognizing around 430 languages and all sorts of fonts, making the results available for text searches, and replicating the traditional library browsing experience when it’s all done. Daniel Clancy, engineering director for Google Book Search, says he cannot comment on what the company has accomplished so far.

In the CMU project, though, the scanning technology is off the shelf. The team uses readily available Minolta PS 7000 book scanners set up at 40 scanning stations in India and China, where the local governments are helping to keep the costs low for the nonprofit project. In this setup, workers manually turn each page. Seven years into the project, around 600,000 books (mostly public-domain works shipped from around the world) have been scanned, and every day another 100,000 pages join the digital corpus. At this rate, it could take just under five years to complete the CMU project.
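The five-year estimate can be checked with rough arithmetic. The sketch below is only an illustration: the article gives the number of books scanned and the daily page rate, but not the average length of a book or the number of scanning days per year, so `pages_per_book` and `days_per_year` are assumptions. An average of roughly 450 pages per book, scanned every day of the year, would be consistent with the figure quoted above.

```python
def years_remaining(books_left, pages_per_day, pages_per_book, days_per_year=365):
    """Estimate the years needed to scan the remaining books.

    pages_per_book and days_per_year are assumptions; the article
    does not state either figure.
    """
    pages_left = books_left * pages_per_book
    return pages_left / pages_per_day / days_per_year


# Figures from the article: a million-book goal, about 600,000 scanned,
# and 100,000 pages scanned per day.
books_left = 1_000_000 - 600_000

# Try a few assumed average book lengths to see how sensitive the estimate is.
for pages_per_book in (300, 450, 600):
    estimate = years_remaining(books_left, 100_000, pages_per_book)
    print(f"{pages_per_book} pages/book: about {estimate:.1f} years")
```

The estimate scales linearly with the assumed book length, so doubling the average page count doubles the time remaining.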

In contrast, Clancy says Google has developed its own scanning technology. But the company is mum about the technical details of the hardware, optical-character recognition (OCR) software, and scanning rate at its five scanning centers near cooperating library partners at Harvard, Stanford, the University of Michigan, the University of Oxford, and the New York Public Library.

Reddy says commercially available software for recognizing English works well for the Million Book Project. The challenges they face with OCR are being addressed by their Chinese partners, who are developing specific software to better recognize unconventional fonts and calligraphic scripts often found in older books. Additionally, their partners in Egypt are developing OCR for Arabic. Right now, Reddy says, OCR is an active research area in which many countries are contributing expertise.

Once books are scanned and their texts accessible, the major challenge is making the text useful for searching. The inconsistent physical quality of books can cause problems, Clancy says, particularly with page numbering. For instance, full pages can be missing, or dog-eared corners could reveal an incorrect page number. And if pagination is wrong in one part of the book, the error propagates throughout the work.

This problem is being overcome, CMU’s Reddy says, by designing software that does not rely on page numbers. Instead, it creates “structural metadata,” which are basically tags that summarize the meaning of information within a book, so that researchers can link words in the table of contents with corresponding chapters. Additionally, indexed terms can be linked to the correct passages. Unfortunately, says Reddy, establishing the links is still a manual process; no one has developed software that can establish these hyperlinks with more than about 90 percent accuracy. If the technique can be honed, though, it could make text searches more meaningful.
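To make the idea concrete, here is a minimal sketch of linking table-of-contents entries to chapter headings by comparing their text rather than trusting printed page numbers. It is not the Million Book Project’s software, whose details the article does not describe. The `Heading` record, the `link_toc` function, and the 0.8 similarity threshold are all hypothetical, and the fuzzy matching uses Python’s standard `difflib` module to tolerate small OCR errors. Entries with no confident match are left unlinked for a person to resolve, in keeping with Reddy’s point that the process still needs manual work.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class Heading:
    scan_index: int  # position of the page image in the scan sequence, not the printed page number
    text: str        # OCR text of a chapter heading


def normalize(text):
    """Lowercase and collapse whitespace so formatting differences are ignored."""
    return " ".join(text.lower().split())


def similarity(a, b):
    """Return a 0..1 similarity score between two strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def link_toc(toc_entries, headings, threshold=0.8):
    """Map each table-of-contents entry to the scan index of its best-matching heading.

    Entries whose best match scores below the (assumed) threshold map to None,
    flagging them for manual review.
    """
    links = {}
    for entry in toc_entries:
        best = max(headings, key=lambda h: similarity(entry, h.text), default=None)
        if best is not None and similarity(entry, best.text) >= threshold:
            links[entry] = best.scan_index
        else:
            links[entry] = None
    return links


# Hypothetical example: the second heading contains a typical OCR error ("rn" for "m").
toc = ["The Voyage Out", "A Storm at Sea", "Index"]
headings = [Heading(12, "THE VOYAGE OUT"), Heading(57, "A Storrn at Sea")]
links = link_toc(toc, headings)
print(links)
```

Because the links point at positions in the scan sequence, a missing or misnumbered page does not shift every later reference the way a page-number lookup would.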