Thursday, October 4, 2007

The Problems of Mass Digitization

Like many others whose research focuses on the 19th century, I have been thrilled with the enhanced access to discover, search, and read texts that Google’s massive digitization project, Google Book Search, has made possible. One quickly discovers, however, that there are serious quality issues with Google’s scans and metadata.

The following statement appears at the beginning of downloaded copies of Google’s books:

This is a digital copy of a book that was preserved for generations on library shelves before it was carefully scanned by Google as part of a project to make the world’s books discoverable online.

It has survived long enough for the copyright to expire and the book to enter the public domain. A public domain book is one that was never subject to copyright or whose legal copyright term has expired. Whether a book is in the public domain may vary country to country. Public domain books are our gateways to the past, representing a wealth of history, culture and knowledge that’s often difficult to discover.

Marks, notations and other marginalia present in the original volume will appear in this file - a reminder of this book’s long journey from the publisher to a library and finally to you.

A few comments: First, one hopes, and would hasten to add, that the original books will still be preserved for generations on library shelves even though they have been “carefully” scanned. Second, given Google’s praise of the public domain, it is unfortunate that they are defining the public domain rather narrowly. Finally, it is worth adding that distorted images, images of thumbs, and other oddities resulting from digitization are reminders of these books’ quick journeys from libraries to Google.

The image above is from Google’s copy of the University of Michigan’s copy of Archibald Alexander’s The Duty of Catechetical Instruction (Philadelphia: Presbyterian Tract and Sunday School Society, 1836). The Google copy is incomplete, ending with page 14 (in the gutter, you can see part of the page that follows). Page 11 appears twice, at the beginning (before anything else) and in its proper place. Google’s metadata claims that this tract is 12 pages long.