Technological Profile

HathiTrust is intended to provide persistent and high-availability storage for deposited files. In order to facilitate this, the partnership uses a storage architecture with a rich set of features designed for fault tolerance and long-term data retention.

Central to the storage architecture is the use of two synchronized instances of storage with wide geographic separation (located in data centers in Ann Arbor, MI and Indianapolis, IN) and an encrypted tape backup with 6 months of previous-version retention (located in a third data center several miles from the Ann Arbor storage instance). All data centers meet the requirements for Uptime Institute Tier II classification. All storage is physically secure, locked in racks that are accessible only to authorized IT personnel.

The need for continuous integrity checking is fundamental to HathiTrust's data management strategy and underlies the choice of online (spinning magnetic disk) media for primary storage. Internally, each storage instance uses N+3 Reed-Solomon parity redundancy, which is analogous to but more fault-tolerant than conventional RAID 5 storage due to the additional parity redundancy. The storage system internally performs in-flight data integrity checks as well as periodic integrity checks of all at-rest data, and makes use of parity redundancy to permanently repair any errors encountered. External to the storage system, HathiTrust also conducts periodic validation of data with stored checksums to ensure that data has been ingested correctly and remains intact.
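The external validation against stored checksums described above can be sketched as follows. This is a minimal illustration, not HathiTrust's actual tooling: the manifest layout (relative path mapped to hex digest), the function names, and the choice of SHA-256 are all assumptions, since the text does not name a checksum algorithm or file format.

```python
import hashlib
import tempfile
from pathlib import Path


def file_checksum(path: Path, algorithm: str = "sha256", chunk_size: int = 1 << 20) -> str:
    """Return the hex digest of a file, reading in chunks to bound memory use."""
    digest = hashlib.new(algorithm)
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def validate(manifest: dict[str, str], root: Path, algorithm: str = "sha256") -> list[str]:
    """Return relative paths that are missing or whose digest differs from the stored one.

    `manifest` is a hypothetical mapping of relative path -> stored hex digest.
    """
    failures = []
    for relative_path, stored_digest in manifest.items():
        target = root / relative_path
        if not target.is_file() or file_checksum(target, algorithm) != stored_digest:
            failures.append(relative_path)
    return failures


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        page = root / "00000001.txt"
        page.write_bytes(b"page one")
        manifest = {"00000001.txt": file_checksum(page)}
        print(validate(manifest, root))  # no failures while the file is intact
        page.write_bytes(b"page 0ne")    # simulate silent corruption
        print(validate(manifest, root))  # the altered file is reported
```

Running such a check periodically, independent of the storage system's own parity repair, is what lets errors introduced at ingest be distinguished from errors that arise at rest.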

Storage equipment replacement is an ongoing annual process and assumes that equipment has a useful lifetime of 3-4 years. The storage system is modular and virtualized, with files split into blocks that are distributed across nodes of a cluster and automatically redistributed as needed to balance storage utilization equally. Storage replacement therefore requires no manual movement of data, as this balancing is a normal housekeeping function of the system. Storage nodes that have reached retirement age may be removed from the cluster with an administrative command, and new nodes may be added, with all movement of data managed internally while employing the in-flight integrity checks described earlier. The remove and add processes neither disrupt services nor diminish the N+3 redundancy.

The following links provide more detailed information about our storage, backup, and disaster planning:

HathiTrust Digital Library Profile

The profile below is based on the Evaluation of Open-Source Electronic Publishing Systems (Cyzyk and Choudhury, 2008[5]) and on the framework developed at Johns Hopkins University as part of the Mellon-funded grant A Technology Analysis of Repositories and Services[6] (2006). Links to information about specific components of HathiTrust's technological infrastructure are included.

1) Institutional affiliation and other indicators of the viability of the project

Name of system

HathiTrust Digital Library

Current version of system

HathiTrust comprises multiple applications, each with components (e.g., Perl, MySQL, Linux) that are versioned, but as of January 2010 there are no overall versions of these applications or of HathiTrust as a whole.

HathiTrust is a collaboration of major research institutions (see http://www.hathitrust.org/community[8] for a list of partners). It is supported with base funding from these institutions, not grant or other temporary funding. It has been funded for an initial 5-year period (January 2008-December 2012) with a formal review of governance and operations to be conducted by the partnership in 2011. For institutions that have deposited content, HathiTrust is the long-term preservation strategy for that content.

Degree of deployment

The repository is located at the University of Michigan, with a full mirror site, including load balancing and failover, at Indiana University’s Indianapolis campus.

Type of open-source license

HathiTrust itself does not have an open source license, but it is built using open source technologies (e.g. Perl, MySQL, Solr, Linux). HathiTrust is configuring a development space where all partners will have access to the source code and be able to make modifications and improvements.

Licensing notes

---

Other documentation (Webliography)

Information about HathiTrust is available at http://www.hathitrust.org/about[9]. It covers the mission and goals, governance, and objectives; partnership information; papers and presentations; documentation of rights management and preservation policies and procedures; APIs; accountability considerations; and technical infrastructure.

System and software development in HathiTrust is driven by the need to solve particular problems (as opposed to implementing specific software). This has resulted in a modular architecture where discrete systems fulfilling different OAIS functions (e.g. object ingest, storage, metadata management, indexing and dissemination) communicate and interoperate as an integrated whole. Disaggregation of the functional components of the repository allows agile response to problems that arise (e.g., issues with ingest, storage, or access systems are localized and may be addressed separately) and sharing of development responsibilities across partner institutions. Although many repository systems and services sit on central servers, the modular architecture and orientation toward open standards and open systems make it possible for partner institutions to develop services and key pieces of repository functionality.

The University of Michigan Library would like to acknowledge the generous provision of a source code license by Kakadu Software[10] which is instrumental in the creation, maintenance, and delivery of JPEG2000 images in HathiTrust.

Required skills

Significant knowledge of UNIX, Perl, Apache, MySQL

Internal backup and restore functions

Backup and restore functionality is provided at a system level and consists of a) file system backup and b) database backup. Backup services are currently provided by Tivoli Storage Manager.

Scalability: Application

Applications are lightweight and served by multiple web servers; additional web servers can be added to increase application performance.

Scalability: Data

HathiTrust uses Isilon storage, which is a clustered storage system that scales to over 20TB in a single instance by adding new nodes to the storage cluster.

API: Code extensibility

HathiTrust uses a CVS repository. Modifications can be made by developers at partner institutions with requisite privileges.

API: Batch ingest

Ingest is handled by the GROOVE application (Google Object-Oriented Validation Environment). GROOVE is capable of ingesting upwards of 500,000 volumes in a single month and additional ingest servers can be added to increase throughput. Although originally created to handle ingest from Google, GROOVE is the ingest mechanism for content from other sources as well. Please see the HathiTrust Ingest to Access Workflow[11] and Notes[12], and ingest specifications and checklist[13] for contributing partners for more information.

The HathiTrust Data API is used to retrieve object packages, including image files and metadata, for individual volumes or batches of volumes from the repository. Specifications for the API are available at http://www.hathitrust.org/data_api[15].

HathiTrust makes limited metadata files for all volumes in HathiTrust available via the web for download (http://www.hathitrust.org/hathifiles[17]). This metadata can be used to retrieve full bibliographic records from OCLC or the University of Michigan via Z39.50.

Security (2006 framework)

Access control

Access to items is determined by copyright status and is handled through the HathiTrust PageTurner. A description of the PageTurner application is available at http://www.hathitrust.org/pageturner[18].

User management

HathiTrust manages lists of staff members with privileged access to repository content and lists of user IDs for personalized services (e.g., Collection Builder). Profiles of users and other information are currently managed by the University of Michigan, but will be managed locally at partner institutions when Shibboleth is implemented (see Authentication mechanisms below).

Policy management

HathiTrust adheres to the information technology security policies of the University of Michigan Library, where it is hosted. The University Library participates in a distributed organizational model in which units across the University (of which it is one) have prime responsibility for planning and managing security within their units, coordinated by campus Information Technology Security Services (ITSS).

Content is accessed via the HathiTrust PageTurner application. Public domain volumes, and works to which rights holders have opened access, are available to anyone with a web browser. In-copyright works and those with undetermined copyright status are searchable only (the search application returns location information where query terms occur in a given volume). The HathiTrust Data API is another mode of accessing content in HathiTrust (http://www.hathitrust.org/data_api[15]).
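The access rule described above can be expressed as a small decision function. This is a sketch under stated assumptions: the enum names and the `authorized_user` flag are hypothetical and do not reflect the values stored in the rights database; the flag stands in for the authorized uses mentioned elsewhere in this profile (copyright review staff and users with print disabilities).

```python
from enum import Enum


class RightsStatus(Enum):
    """Hypothetical labels for the copyright states named in the text."""
    PUBLIC_DOMAIN = "public_domain"
    OPENED_BY_RIGHTS_HOLDER = "opened_by_rights_holder"
    IN_COPYRIGHT = "in_copyright"
    UNDETERMINED = "undetermined"


class AccessLevel(Enum):
    FULL_VIEW = "full_view"
    SEARCH_ONLY = "search_only"


# Statuses that are viewable by anyone with a web browser.
OPEN_STATUSES = {RightsStatus.PUBLIC_DOMAIN, RightsStatus.OPENED_BY_RIGHTS_HOLDER}


def access_level(status: RightsStatus, authorized_user: bool = False) -> AccessLevel:
    """Map a rights status to the access a user receives.

    Authorized users receive full access regardless of copyright status;
    everyone else gets full view only for open statuses.
    """
    if authorized_user or status in OPEN_STATUSES:
        return AccessLevel.FULL_VIEW
    return AccessLevel.SEARCH_ONLY


if __name__ == "__main__":
    for status in RightsStatus:
        print(status.name, access_level(status).name)
```

Defaulting to search-only for anything not explicitly open mirrors the treatment of works with undetermined copyright status.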

Remove data

The files that compose digital objects are contained in a directory in a file system. When objects are deleted (this has happened only once on record at the wishes of the rights holder), content files are deleted and a tombstone record is made available in the user interface to indicate that the content once existed.
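A deletion that leaves a tombstone behind might look like the sketch below. Everything here is illustrative: the JSON tombstone format, the directory layout, and the function name are assumptions, as the text states only that content files are deleted and a tombstone record is shown in the user interface.

```python
import json
import shutil
from datetime import datetime, timezone
from pathlib import Path


def delete_with_tombstone(object_dir: Path, tombstone_dir: Path,
                          identifier: str, reason: str) -> Path:
    """Delete an object's directory and return the path of its tombstone record."""
    if not object_dir.is_dir():
        raise FileNotFoundError(f"no object directory at {object_dir}")

    record = {
        "identifier": identifier,
        "reason": reason,
        "deleted_at": datetime.now(timezone.utc).isoformat(),
        # Keep the list of removed files so the tombstone documents what existed.
        "removed_files": sorted(
            p.relative_to(object_dir).as_posix()
            for p in object_dir.rglob("*") if p.is_file()
        ),
    }

    # Write the tombstone before deleting, so a failure part-way through
    # never leaves content removed without a record.
    tombstone_dir.mkdir(parents=True, exist_ok=True)
    tombstone_path = tombstone_dir / f"{identifier}.json"
    tombstone_path.write_text(json.dumps(record, indent=2), encoding="utf-8")

    shutil.rmtree(object_dir)
    return tombstone_path
```

A user interface could then render the tombstone whenever a request arrives for an identifier whose object directory no longer exists.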

Manage metadata

Bibliographic metadata is managed in an Integrated Library System (Aleph). Rights information is managed in a rights database (http://www.hathitrust.org/rights_management[19]). Preservation, technical, and structural metadata are contained in a METS file for each object. Preservation metadata (PREMIS) is updated when actions occur on an object.
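Recording an action as a PREMIS event can be sketched with Python's standard XML library. The element names follow the PREMIS version 2 event structure; the use of version 2, the UUID identifier type, the example event type, and the bare container element are assumptions, since the text does not describe how PREMIS is embedded in HathiTrust's METS files.

```python
import uuid
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from typing import Optional

# Namespace URI of PREMIS version 2 (assumed version; not stated in the text).
PREMIS_NS = "info:lc/xmlns/premis-v2"
ET.register_namespace("premis", PREMIS_NS)


def _sub(parent: ET.Element, tag: str, text: Optional[str] = None) -> ET.Element:
    """Create a namespaced PREMIS child element, optionally with text content."""
    element = ET.SubElement(parent, f"{{{PREMIS_NS}}}{tag}")
    if text is not None:
        element.text = text
    return element


def append_event(container: ET.Element, event_type: str, detail: str,
                 when: Optional[datetime] = None) -> ET.Element:
    """Append a PREMIS event describing an action taken on an object."""
    when = when or datetime.now(timezone.utc)
    event = _sub(container, "event")
    identifier = _sub(event, "eventIdentifier")
    _sub(identifier, "eventIdentifierType", "UUID")
    _sub(identifier, "eventIdentifierValue", str(uuid.uuid4()))
    _sub(event, "eventType", event_type)
    _sub(event, "eventDateTime", when.isoformat())
    _sub(event, "eventDetail", detail)
    return event


if __name__ == "__main__":
    premis = ET.Element(f"{{{PREMIS_NS}}}premis")
    append_event(premis, "fixity check", "Checksums validated against stored values")
    print(ET.tostring(premis, encoding="unicode"))
```

Because events are only ever appended, the metadata accumulates a history of actions on the object even though earlier versions of the content itself are not retained.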

Aggregation (2006 framework)

Create aggregation

HathiTrust aggregates objects based on namespaces, which identify the different sources of materials (e.g., objects from Indiana University, the University of Michigan, Wisconsin, California, etc.), for management purposes.

The Collection Builder application allows users to create their own aggregations of objects, regardless of how they are structured in the repository.

Remove aggregation

Objects can be removed from the repository in aggregate (as part of, or separate from, an “aggregation”).

Personal collections of volumes created in Collection Builder can be deleted. It is also possible to delete individual volumes or groups of volumes within those collections.

Change aggregation membership

Only if an object identifier is changed will it be associated with a new aggregation in the repository.

Objects can be copied or moved from one Collection Builder collection to another.

Find aggregation members

All object tracking makes use of the object namespace, making it possible to list the identifiers within a namespace or find the namespace affiliated with an object.
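Both lookups mentioned above (identifiers within a namespace, and the namespace of an object) are straightforward if the namespace is a prefix of the identifier. The sketch below assumes identifiers of the form `namespace.local_id`; that format, the function names, and the sample identifiers are illustrative assumptions rather than details given in the text.

```python
from collections import defaultdict
from typing import Iterable


def namespace_of(identifier: str) -> str:
    """Return the namespace prefix of an identifier shaped like 'namespace.local_id'."""
    namespace, separator, local_id = identifier.partition(".")
    if not separator or not namespace or not local_id:
        raise ValueError(f"identifier has no namespace prefix: {identifier!r}")
    return namespace


def group_by_namespace(identifiers: Iterable[str]) -> dict[str, list[str]]:
    """Group identifiers by namespace, preserving input order within each group."""
    groups: dict[str, list[str]] = defaultdict(list)
    for identifier in identifiers:
        groups[namespace_of(identifier)].append(identifier)
    return dict(groups)


if __name__ == "__main__":
    # Hypothetical identifiers; 'mich' and 'ind' are placeholder namespaces.
    sample = ["mich.0001", "ind.0001", "mich.0002"]
    print(namespace_of("ind.0001"))
    print(group_by_namespace(sample))
```

Splitting on the first period only means the local part of an identifier may itself contain periods without affecting the namespace lookup.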

It is possible to facet search results in the HathiTrust catalog to limit results to objects from a particular institution, which corresponds in most cases with namespaces.

It is possible to find and search Collection Builder collections through the web interface.

Other (2006 framework)

Locking

The Integrated Library System currently employed (Aleph) locks records that are being edited.

Virtual object representation

TIFF and JPEG2000 images in the repository are dynamically converted to PNG format for viewing in the PageTurner.

Transactions

HathiTrust is configured to allow large-scale transactions on the content. Transactions carried out so far include the modification of METS and PREMIS in object packages across the repository. Objects are also routinely zipped and unzipped for ingest and for display in the HathiTrust interface.

3) Submission, peer review management, and administrative functions

Support for multiple, discrete publications

As of January 1, HathiTrust contains more than 5 million individual volumes.

Multiple administrative roles

The ability to change repository code and content is managed with Unix permissions held by a very limited group of developers and administrators.

Administrative roles configurable

Administrative roles are configurable through Unix permissions.

Submission into system initiated by authors

N/A

Editorial workflow configurable per publication

N/A

Automated email alerts to authors

N/A

Automated email alerts to editors

N/A

Automated email alerts to reviewers

N/A

Style sheets, customizable look and feel per publication

The interfaces to PageTurner and Collection Builder are created using XSL style sheets. Although HathiTrust currently maintains a consistent interface across all content, certain collections can be branded according to user preferences. The University of Michigan Press collection is an example: http://babel.hathitrust.org/cgi/mb?a=listis;c=622231186[20].

Versioning

Versions of content are not kept in the repository. When content is modified, the old object is deleted and a new object added with the same identifier. This action is recorded in the PREMIS metadata.

4) Access, formats, and electronic commerce functions

Accessibility of system

The HathiTrust system and interface are designed to provide access to all digitized materials (regardless of copyright) for users with print disabilities, including users with low or no vision and users with learning disabilities. In addition to accessible interfaces for the applications that make up HathiTrust (the temporary catalog, Collection Builder, and PageTurner), text-only interfaces exist for the PageTurner and Collection Builder that are optimized for the specific needs of users with print disabilities (including navigation keys, section markers, descriptive metadata where images or blank pages occur, and appropriate use of headings and labels). HathiTrust is additionally configured to grant full-text access to authorized users (to enhance usability with screen readers, digital Braille devices, etc.), regardless of a work's copyright status.

HathiTrust delivers content in the user interface as page-images, OCR-text, or in PDF format.

Document formats supported

TIFF ITU G4, JPEG2000, UTF-8 text

Plug-in requirements

---

Usability notes

---

Citation linking

Each volume has a permanent URL, formed using the Handle service (http://handle.net/[22]).

OpenURL resolver

HathiTrust is currently being configured as a target for SFX.

RSS feed

An RSS feed of catalog search results will be available in February 2010.

Digital rights management

HathiTrust performs an automated rights evaluation of incoming objects based on bibliographic data. These rights may be manually overwritten after copyright review has taken place, if bibliographic information is updated, or if rights holders open access to volumes. A permission agreement for opening access to volumes is available at http://www.hathitrust.org/rights_management[19]. All rights information is stored in a rights database (http://www.hathitrust.org/rights_database[23]).

HathiTrust is working with OCLC to create a production-level catalog for HathiTrust. In this catalog it will be possible to search HathiTrust collections and broaden the search to include collections held in WorldCat more generally.

Authentication mechanisms

Authentication is used for two purposes in HathiTrust: personalization services (e.g., the Collection Builder application) and uses or services requiring authorization (staff uses such as access to works for copyright review and services for authorized users with print disabilities).

Authentication is currently handled for affiliates of partner institutions via Shibboleth[26]. Non-partners may authenticate via the University of Michigan’s CoSign implementation to create permanent collections in the Collection Builder[27] application.

Subscription services

There are currently no subscription services in HathiTrust.

Electronic commerce functions

Volumes contributed by the University of Michigan Press are available for print on demand from the UM Press website. Public domain volumes digitized by the University of Michigan are available for print on demand via Amazon.com and the Espresso Book Machine.

Context-sensitive Help support

Feedback links and a contact email address are provided on the website. Users are routed to the help services as appropriate (e.g. copyright information, technical services corrections, help with metadata download, etc.).