Digitizing the Harvard library

It’s now been over 12 years since I created one of the first books using optical scanning: 80 Readings. You can actually still get the book on Amazon.com. At the time, the procedure I used to produce the book was quite radical: I scanned the readings into a Macintosh computer, used Optical Character Recognition software to translate the digital bits into formattable text, then applied consistent formatting (using good ol’ Microsoft Word) to make it look like a “real” book. Since I only had a 300 DPI printer, I printed out the pages at 150 percent of their final book size, which the book printer then shrank down to give an effective resolution of 450 DPI. The process was painstaking; the OCR software was nowhere near perfect, so the book had to be carefully proofread to suss out any errors.
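The resolution trick is simple arithmetic: shrinking a page packs the same printer dots into less space. Here is a minimal sketch of that calculation; the function name is my own invention for illustration, and it assumes the reduction is perfectly uniform.

```python
def effective_dpi(printer_dpi: float, enlargement: float) -> float:
    """Dots per inch after a page printed at `enlargement` times its
    final size is photographically reduced back to final size.

    Hypothetical helper; assumes a uniform, lossless reduction.
    """
    return printer_dpi * enlargement


# A 300 DPI printer with pages printed at 150 percent of final size
print(effective_dpi(300, 1.5))  # 450.0
```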

A few years later, as I was explaining the process to a friend of mine (Dan Cerutti), he wondered if this same process could be used to put every book on the Internet. This was back before Google was even a glimmer in Sergey Brin’s eye, when AltaVista was the search engine of choice. At that point, comparatively few resources were available online, and all “serious” research was still being done in libraries. What if, say, Harvard University started scanning every book in its collection, the same way I had done with 80 Readings? The problem would not be the time it took to physically scan the pages; with a high-speed scanner and some sort of automatic page-turning device, this might only take a few minutes per book. The problem would be the proofreading, which would take a minimum of ten minutes per page. Multiply by, say, 300 pages per book, and then by Harvard’s 15 million books, and you’re talking some serious hours logged.
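To put a rough number on those hours, here is a back-of-the-envelope sketch using the figures above. The 2,000-hour working year is my own assumption (it is not one of the original estimates), and the function name is made up for illustration.

```python
# Figures from the estimate above
MINUTES_PER_PAGE = 10
PAGES_PER_BOOK = 300
BOOKS = 15_000_000

# Assumption, not from the original estimate: a full-time
# proofreader works roughly 2,000 hours a year.
HOURS_PER_WORK_YEAR = 2_000


def proofreading_hours(books: int, pages_per_book: int, minutes_per_page: int) -> float:
    """Total proofreading time in hours for the whole collection."""
    return books * pages_per_book * minutes_per_page / 60


total_hours = proofreading_hours(BOOKS, PAGES_PER_BOOK, MINUTES_PER_PAGE)
person_years = total_hours / HOURS_PER_WORK_YEAR

print(f"{total_hours:,.0f} hours")          # 750,000,000 hours
print(f"{person_years:,.0f} person-years")  # 375,000 person-years
```

Even if the per-page estimate is off by a factor of ten, the total is still tens of thousands of person-years.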

But Dan pointed out that the main reason you wanted the books proofread at all was that you needed them to be searchable. After all, most of the volumes spend most of their time just sitting on the shelves, and no one has the time to look through all of them to see what’s inside. Even if there were a considerable number of errors, users would still be able to search the volumes and figure out which ones were relevant to their research. Then they could go find the real books on the shelves to confirm the results of their search. Or even better: Harvard could provide the original, raw scans for readers to compare, and even correct the results of the automated scanning process! Then researchers wouldn’t even need to venture into the stacks to find a copy of the original text.

Fast forward ten years. We now have Distributed Proofreaders, an organization dedicated to doing this same thing, in order to place books in the Project Gutenberg archive. But this project is still not as ambitious as the one Dan and I imagined, because it covers only texts in the public domain. For now, this is mainly texts published before 1923. With ever more restrictive copyright legislation, this means that much of the most valuable research material will be locked up until its value to researchers becomes negligible.

Now, according to a New York Times article, Google will be working with a group of universities including Harvard to digitize millions of books, and not just the public domain materials. The Times article speculates that for books still under copyright, Google will publish only extracts and not the complete text, thus getting around copyright law. This is the way Google Print works now. For instance, a search for Romeo and Juliet only allows you to view the first three pages of the text. The words in your search are highlighted in the text, but printing is disabled. Google wants to make money with this service, and they do so by encouraging you to buy the book, rather than just use the online service.

But of course, Romeo and Juliet is in the public domain. Why should we have to pay to get the whole book? The reason is that Cambridge University Press still owns the copyright for the extensive notes to its edition of the play, which is why, even for classic works of literature, Project Gutenberg is always careful to use editions that are safely out of copyright.

The Times article implies that the complete text of works in the public domain will be made available, but, of course, there is no assurance that it will. Google seems to be quite close-mouthed about its plans.

Eventually these works will make their way online, and students’ dream of writing a term paper without ever leaving their dorm rooms will be realized. What then? What will change about the way we do research? First of all, I suspect the three-page limit will have a substantial and recognizable effect on the way materials get found and used in research. Long, rambling texts may be cited less frequently than concise ones, because lazy researchers won’t be willing to put forth the effort required to obtain complete texts.

Second, I think publishers of reference works will litigate or negotiate with Google to be omitted from the index. After all, who would subscribe to the Oxford English Dictionary when you can easily get the OED definition of any word just by using Google? The three-page limit would mean the Google version is just as valuable as the complete 30-volume (or whatever) set. Or reference publishers might opt in to Google, realizing that Google would otherwise make them obsolete. But with declining revenues, reference publishers may be forced to make their works less thorough.

Finally, I wonder where academic journals would fit into this model. Would Google scan those as well? And again, what would happen to the library subscriptions that form the basis of their revenue streams? A typical journal article might run 30 pages or so, depending on the discipline. Would there be a convenient way to order a reprint of the entire article? Perhaps this will be the new way for journals to cover their expenses.

Ultimately, I hope that most academic journals and other resources move to an open-access model such as that espoused by the Public Library of Science. In the meantime, we’ll have to settle for the dribs and drabs Google gives us.