Universities launch elephantine 78 terabyte digital library

A group of the nation's large research libraries has launched a new online digital repository for scanned books and other documents, essentially making the digital holdings of each institution available to all of the others. Called the HathiTrust, the new group already holds 78 terabytes of information—that's over 731 million pages of content.

It's (not) Google Book Search

If it sounds more than a bit like Google Book Search, it should. Many of the core libraries included in the new project (such as the University of Michigan) have long been partners with Google and have used the search giant's money and expertise to scan millions of volumes in various scanning centers around the country. While the Big G gained access to all that digital goodness for use on its own projects, the libraries also got to keep their own copies. Rather than limiting access to the University of Michigan archive to users who visit Michigan's libraries or search Michigan's online catalog, the HathiTrust wants to make all of the archives available to all schools.

John Wilkin, a University of Michigan librarian who now heads HathiTrust, described the problem this way: "Before this collaboration, the collections in each library existed in isolation. Now we are bringing them together, pooling resources and eliminating redundancies, and producing a valuable research tool that will be greater than the sum of its parts."

What makes the project different from Google Book Search is that HathiTrust isn't limited by Google's desire to provide access to all the world's information on its own terms. Instead, HathiTrust hopes to offer information scanned by the Yahoo-backed Open Content Alliance, along with nonbook digitization projects from special collections and archives around the country. And then there's the emerging class of "born-digital" works that need archiving.

Copyright, as usual, may prove to be the sticky wicket here. Google has famously gotten itself into massive legal battles with the publishing industry over Google Book Search, though the issue may be coming to a conclusion. The search giant scans (and displays brief snippets from) copyrighted books, as well as public domain titles; HathiTrust will also archive both kinds of works.

Only public domain works will be shown to users in full-text form, which means that people unable to get to the library stacks to look at a paper copy of a book can view only 17 percent of the titles in HathiTrust's archive. Page scans of all public domain works are already accessible through sites like the University of Michigan's online catalog (no login or registration is required). PDF display and full-text search are both available from HathiTrust, which provides the display infrastructure as libraries link out to its electronic archives.

[Image: Accessing a HathiTrust scanned book]

Knowing which works are in the public domain is complicated, and it varies by country, so HathiTrust uses IP address filtering and other tools to apply rules based on each user's apparent country. For US users, everything published in the US before 1923 is fair game; for later works, the rules get far more complicated.
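To make that rule concrete, here is a minimal sketch of the one case the article spells out: a US reader looking at a US work published before 1923. It is an illustration only, not HathiTrust's actual code. The `Volume` class, the `can_show_full_text` function, and the two-letter country codes are all hypothetical names chosen for this example, and the viewer's country is assumed to have already been derived from an IP address by some separate lookup.

```python
from dataclasses import dataclass

# From the article: US works published before this year are public domain for US users.
US_PUBLIC_DOMAIN_CUTOFF = 1923


@dataclass
class Volume:
    """Hypothetical record for a scanned book (not a real HathiTrust schema)."""
    title: str
    pub_country: str  # two-letter code for the country of publication (assumed)
    pub_year: int


def can_show_full_text(volume: Volume, viewer_country: str) -> bool:
    """Return True only for the single case the article describes.

    viewer_country is assumed to come from an IP-based lookup done elsewhere.
    Every other combination falls through to False, a conservative default,
    because the article does not state the rules for those cases.
    """
    if viewer_country == "US" and volume.pub_country == "US":
        return volume.pub_year < US_PUBLIC_DOMAIN_CUTOFF
    return False


if __name__ == "__main__":
    old_book = Volume("An 1890 US title", "US", 1890)
    newer_book = Volume("A 1950 US title", "US", 1950)
    print(can_show_full_text(old_book, "US"))    # pre-1923 US work, US reader
    print(can_show_full_text(newer_book, "US"))  # later work, rules not covered here
```

Defaulting to "no full text" for anything the rule does not cover mirrors the cautious stance described above; a real system would need many more cases, including renewal status and per-country copyright terms.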

In control

HathiTrust, named after the Hindi word for "elephant," is pronounced either "HAH-tee" or "hah-TEE," depending on whether one reads the press release or the web site. (Update: the online version of the press release now confirms it is "hah-TEE.") Why elephants? Because they don't forget, because they're strong, and because they evoke "wisdom." Also, they are lumbering and slow, which is probably not what HathiTrust has in mind (though it is a common criticism of these sorts of academic IT projects when compared with industry initiatives).

The move puts libraries in charge of their own digital destiny. Schools could simply sit on their own archives, providing researchers with free access to Google Book Search, and call it a day. But this wouldn't allow them to innovate in ways especially beneficial to their academic users; it wouldn't allow them to add non-Googlized content, and it wouldn't give them any backup plan should Google decide to drop a bazillion ugly ads into Book Search or shut down the service altogether. Right now, sites like the University of Michigan catalog point to both resources.

[Image: Michigan's current catalog offers options]

Given the prestige (and associated financial resources) of the schools involved in the project, HathiTrust certainly sounds like it has the potential to become the main national digital archive for universities. The University of California system is on board, as are the entire Big Ten (Wisconsin, Michigan, Illinois, etc.), the University of Chicago, and the University of Virginia.