Digital Libraries: Mass Digitization

In my very first LJ column, I wrote, “Only a very small fraction of the millions of print items currently held by the world’s libraries will ever be in digital form” (LJ 11/15/97). In 1997, the company that would prove me wrong was starting up in a Stanford lab. Now, near-weekly announcements of massive digitization projects by Google and the Open Content Alliance (OCA) promise to change forever the way people find and use books.

According to Google, we are only a few years away from having the entire collections of large research libraries completely digitized and searchable. Owing to copyright restrictions, only “snippets” of many books will be displayed, but they will be newly discoverable, and many full-text books will be browsable online.

These efforts are likely to be transformational in ways we have yet to realize. Mass digitization, like many heralded new technologies, will take longer than predicted to make a serious mark and will create unforeseen impacts and enable unpredicted kinds of interactions with books. Whatever the outcome, libraries will be affected. We just don’t know exactly how yet.

Discovery

Chip Nilges, vice president of new services for OCLC (and an LJ Mover & Shaker 2005), is working on getting information about digitized books into Open WorldCat to make them more discoverable. If the book is available at the Google Book Search web site in snippet view only, owing to copyright barriers, Open WorldCat stands ready to show users which nearby libraries hold the book.

Down the line, OCLC plans to provide related services. “I absolutely want to do this in such a way that libraries can download records [of digitized books] to their local catalog. It’s definitely the next step,” Nilges said. OCLC also intends to work with the OCA to make sure its records appear in WorldCat. In addition, libraries can use OCLC’s collection analysis service to discover which items in their collections have been digitized by these projects.

Access

Some books digitized by Google can be browsed in full-page image and downloaded in PDF format from the Google Book Search site, along with publisher-provided texts in either “snippet” or “limited preview.” Since Google Book Search does not provide access to the thousands of public domain books available via Project Gutenberg and similar efforts, users are better off searching regular Google for them.

The library catalog promises to provide access to digitized books as well as print books. The University of Michigan (UM) is the first library participating in the Google Library project to offer such access. MBooks allows users to search for books that have been scanned by Google as well as other content digitized by the UM libraries. Once users get to a specific digitized title, they can search within that item.

Whereas Google makes the books it digitizes available for downloading only in Adobe Acrobat format, the OCA makes all the files available for downloading, from the raw images to the metadata. Books digitized by the OCA are freely available for browsing and downloading on the Internet Archive site.

Issues

The libraries involved with mass digitization must deal with a logistical nightmare: thousands of books each day must be paged, packed, and shipped off-site for scanning. The same number must be received and reshelved. The scans themselves are sometimes poorly done or are missing pages. Projects scanning massive numbers of books sometimes sacrifice quality.

In the end, the sheer number of books involved is amazing. Perry Willett at the University of Michigan estimates they now have around 100,000 books from the Google project digitized and online. “I think we are seeing the first glimpses of what a real digital research collection will look like,” he said. “It’s definitely going to change the way we think about everything we do in libraries. I hope this starts a lot of discussion about ‘what this all means,’ because the potential is just limitless.”