How the HathiTrust Digital Library Handles 11 Million Digitized Volumes

Jennifer Zaino is a New York-based freelance writer specializing in business and technology journalism. Her work appears in publications including The Semantic Web Blog, RFID Journal, Smart Enterprise Exchange, and more.

Digitization is a widely used means of preservation reformatting for print and analog materials, especially with the large-scale capabilities that efforts such as the Google Books Project and the Internet Archive are bringing to many research and academic libraries.

But large-scale digitization means that libraries increasingly require large-scale, preservation-grade infrastructure that’s also suitable for providing access to materials at scale. The HathiTrust Digital Library is answering that call. Launched by the 12-university Committee on Institutional Cooperation (CIC) and the 11 university libraries of the University of California system, HathiTrust is collectively undertaking preservation with access. Today it has more than 80 partners, more than two dozen of which are depositing content in its repository.

“Member institutions saw the benefit of working collaboratively to preserve and provide access to their collections in digital form, and the impact this could have on ways that they store and manage their print collections as well,” says Jeremy York, assistant director for HathiTrust.

Collaboration among partners helps lower the cost of running the infrastructure and preserving materials while providing better access to them.

HathiTrust is fast heading to 11 million digitized volumes, up from 2 million at its formal debut in October 2008. It comes as no surprise that it takes a lot of storage to accommodate all that information. The repository started out with two instances of 80 terabytes of storage at each of two data centers within the CIC system: Indiana University and the University of Michigan. Today each site is at 750 terabytes, and HathiTrust will add 45 terabytes more to each before year’s end, York says. At the same time, it will replace 250 terabytes of existing storage at each location.

What It Takes

The IT staff at the University of Michigan manages both sites, with some on-site assistance from Indiana University IT staff. With expectations of rapid growth from the start, and with just four storage administrators to help him, it was important that Cory Snavely, Library IT Core Services manager at the University of Michigan, invest in storage that is as hands-off as possible.

“I knew we’d grow quickly, so I wanted to buy a system that would scale easily — one that would continue to perform well but not require more administrative staff overhead as we expanded it,” says Snavely, who’d had experience with digitizing library collections for the university before HathiTrust was formed. Even at just that effort’s single-digit terabyte scale, he and his team had to devote a lot of time to moving data around manually and conducting complicated upgrades.

“I knew from my experience at a smaller scale that data migration at the scale of HathiTrust would be simply unworkable,” he says.

Snavely and his team deployed an EMC Isilon scale-out NAS storage system, which offered the advantage of being a true cluster architecture with complete modularity. Adding more capacity to the system is simply a matter of attaching more nodes, an ease of scalability that is critical, given that the project has been adding storage every year. HathiTrust has completed six capacity expansions to date, and three hardware replacement cycles in conjunction with the last three expansions, with no manual data migration.

“There isn’t any manual migration of data from one piece of hardware to another,” Snavely says. “It’s all handled transparently by the system.”

What is digitized (so far) at HathiTrust?

10,832,619 total volumes

5,673,207 book titles

283,184 serial titles

3,791,416,650 pages

128 miles (books standing side by side, as on a shelf)

3,476,775 volumes in the public domain

Source: HathiTrust Digital Library

At this project’s scale, Snavely says, Isilon’s capabilities are essential in order to avoid growing staff overhead. The single file system, which unifies and enables access to expanding file-based data stores, has proved to be the right fit for the project’s goal of creating one massive library.

“We wanted one big pool of storage that we could easily expand and manage, and that’s what the system provides,” Snavely says.

Isilon also has evolved to better support the HathiTrust repository’s data integrity requirements. Snavely’s team performs fixity checking, inspecting all archive files about every three months to confirm that they are the same as when they were deposited, which underpins the trustworthiness of the repository. Those efforts are now complemented by similar capabilities EMC introduced within the Isilon system.

“As it is being used, it will constantly do those kinds of integrity checks,” Snavely says, and if it finds something wrong, it fixes it on the fly. “That perfectly aligns with the needs of a digital preservation system.”
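For readers unfamiliar with the term, the periodic fixity check described above can be illustrated with a minimal Python sketch. It is not HathiTrust's actual tooling, which the article does not describe. The use of SHA-256 and the function names `sha256_of`, `record_manifest`, and `check_fixity` are assumptions made for illustration only: a digest is recorded for each file at deposit time, and a later pass recomputes the digests and reports any file that is missing or has changed.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def record_manifest(root: Path) -> dict[str, str]:
    """Record a digest for every file under root (hypothetical deposit step)."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def check_fixity(root: Path, manifest: dict[str, str]) -> list[str]:
    """Return the relative paths that are missing or whose digest changed."""
    failures = []
    for rel_path, expected in manifest.items():
        target = root / rel_path
        if not target.is_file() or sha256_of(target) != expected:
            failures.append(rel_path)
    return failures
```

An empty list from `check_fixity` means every file still matches what was deposited. A scheduled job could run it on whatever cycle the repository chooses, such as the roughly quarterly pass mentioned above.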

The HathiTrust Digital Library, York notes, helps its partner institutions with their critical preservation and access work but also serves the public good, providing access to close to 3.5 million volumes in the public domain.

“There is a growing recognition that we simply cannot afford not to collaborate in areas where we can,” York says. “HathiTrust is a demonstration of the value of library services in aggregate — of what we can achieve when working together.”