Update on August 2008 Activities

September 12, 2008

News

Release of the HathiTrust website – HathiTrust released a website at http://www.hathitrust.org/. Its primary purpose at present is to share information with partners and prospective partners, and to bring together technical documentation and resources such as the HathiTrust API. In the future, we will work to develop the website into a means for users to explore the content in the HathiTrust repository.

Establishing Indiana mirror site – In August, we performed the first of several stages of work to create a mirror site in Indiana. Indiana University staff members configured the space and networking; University of Michigan staff installed and configured servers. Configuration of the servers is happening remotely. The second instance of storage will be shipped to Indiana in a subsequent phase of work (see “Forecast for September Development” below).

Other deployment issues – Organizing large numbers of digital objects in a file system presents scalability challenges. We have adopted the “Pairtrees for Object Storage” scheme elaborated by the California Digital Library, now a draft IETF specification (http://www.ietf.org/internet-drafts/draft-kunze-pairtree-00.txt). The ingest process is organizing newly received materials using the pairtree scheme. In August, the scripting to reorganize previously ingested materials was nearly complete and being tested.
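As an illustration of the pairtree scheme, the following is a minimal Python sketch of the identifier-to-path mapping described in the draft specification: certain characters are hex-encoded, three common characters are substituted, and the result is split into two-character directory names. The function names are hypothetical, and this is not the HathiTrust ingest code; the identifiers used in the repository and the layout beneath each pairtree path are not covered here.

```python
# Characters that the pairtree draft hex-encodes as ^hh in the first
# cleaning step, in addition to anything outside visible ASCII.
_HEX_ENCODED = set('"*+,<=>?\\^|')

# Second cleaning step: single-character substitutions.
_SUBSTITUTIONS = str.maketrans("/:.", "=+,")


def clean_identifier(identifier: str) -> str:
    """Return the file-system-safe form of an identifier (hypothetical helper)."""
    pieces = []
    for byte in identifier.encode("utf-8"):
        char = chr(byte)
        if byte < 0x21 or byte > 0x7E or char in _HEX_ENCODED:
            pieces.append("^%02x" % byte)
        else:
            pieces.append(char)
    return "".join(pieces).translate(_SUBSTITUTIONS)


def pairtree_path(identifier: str) -> str:
    """Split the cleaned identifier into two-character directory names."""
    cleaned = clean_identifier(identifier)
    return "/".join(cleaned[i:i + 2] for i in range(0, len(cleaned), 2))


# Example identifier taken from the pairtree draft.
print(pairtree_path("ark:/13030/xt12t3"))  # ar/k+/=1/30/30/=x/t1/2t/3
```

Because each directory name is at most two characters long, no single directory accumulates more entries than the file system handles comfortably, which is the scalability property the scheme is meant to provide.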

Large-scale Search – In August, significant work was devoted to testing Solr for support of large-scale search. We developed a test suite and began building indexes against which test searches can be performed. A full report on the status of this work can be found in the update on the Short- and Long-term Functional Objectives (see http://www.hathitrust.org/objectives).

Wisconsin Ingest – Through August, we continued to encounter validation issues with Google’s transmission of Wisconsin’s JPEG2000 files; however, at the end of August, Google appeared to have resolved those issues. Testing continued to ascertain whether we could move forward with ingest of this content. (As of this Update, ingest was re-initiated.) By the end of August, we had loaded 133,035 Wisconsin bibliographic records in preparation for ingest from GRIN, and 118 Wisconsin volumes were in the repository.

Growth

132,807 volumes were added in August.

As of September 1st, the repository contained a total of 1,608,562 volumes.

21,379 public domain volumes were added in August, bringing the total number of public domain volumes to 290,696 (18% of the total content).
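The growth figures above can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers reported in this section; the variable names are illustrative, and the pre-August totals are derived on the assumption that no volumes were removed during the month.

```python
# Figures reported for August 2008.
added_total = 132_807
repository_total = 1_608_562
added_public_domain = 21_379
public_domain_total = 290_696

# Share of the whole repository that is public domain.
public_domain_share = public_domain_total / repository_total

# Share of the volumes added in August that are public domain.
august_public_domain_share = added_public_domain / added_total

# Totals before the August additions (assumes no removals).
total_before_august = repository_total - added_total
public_domain_before_august = public_domain_total - added_public_domain

print(f"{public_domain_share:.1%}")         # 18.1%
print(f"{august_public_domain_share:.1%}")  # 16.1%
print(total_before_august)                  # 1475755
print(public_domain_before_august)          # 269317
```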

Forecast for September development

We will perform more intensive work on testing large-scale search scalability. By the end of September, we hope to have indexes of various sizes built and at least the preliminary benchmarking data gathered.

We plan to have data synchronized on the second instance of storage (subsequent to the pairtree reorganization) by the end of September. We hope to ship the second instance of storage to Indianapolis in October.

We are planning the first expansion of the existing storage and will negotiate configuration and pricing with the vendor in September. Adding capacity to the system is a non-disruptive process. With scanning happening at such a rapid pace, we will need to acquire nearly 25% more storage than planned and will have 180TB online at each site by the end of the year.

Outages

PLEASE NOTE: We do not yet have contact email addresses for notifying partner institutions of outages. As the service becomes more widely used, this will be an essential means of communication. Please contact Chris Butchart-Bailey (chrisbu at umich.edu) with email addresses of individuals or groups that should be added to our system outage mailing list.

We schedule system maintenance work that requires a system outage during time windows (in Eastern time) when academic user activity is generally lowest:

For major work, Friday evenings (8pm-1am) and Sunday mornings (5am-10am);

For minor work, weekdays from 6:30am-8am.
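For partners who want to account for these windows in their own monitoring, the schedule above can be expressed as a small check. This is a sketch only: the function name is hypothetical, the input is assumed to be a datetime already expressed in Eastern time, and the Friday evening window is read as running from 8pm Friday to 1am Saturday.

```python
from datetime import datetime, time
from typing import Optional


def maintenance_window(eastern: datetime) -> Optional[str]:
    """Return "major", "minor", or None for a moment given in Eastern time."""
    day = eastern.weekday()  # Monday is 0, Sunday is 6
    clock = eastern.time()

    # Major work: Friday 8pm through Saturday 1am.
    if day == 4 and clock >= time(20, 0):
        return "major"
    if day == 5 and clock < time(1, 0):
        return "major"

    # Major work: Sunday 5am to 10am.
    if day == 6 and time(5, 0) <= clock < time(10, 0):
        return "major"

    # Minor work: weekdays 6:30am to 8am.
    if day < 5 and time(6, 30) <= clock < time(8, 0):
        return "minor"

    return None


# Friday, August 29, 2008 at 9pm falls in the major-work window.
print(maintenance_window(datetime(2008, 8, 29, 21, 0)))  # major
```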

Notice of scheduled outages is given on business days, at least 24 hours in advance. Notice of unscheduled outages is given upon discovery, and additional updates are given as appropriate.

Outages in August: On Tuesday, August 26 at approximately 9:00am EDT, a database server was shut down so that it could be moved to Indianapolis. We did not update a manual failover configuration before shutting the server down, which left volumes inaccessible to some users. The problem was resolved at 11:15am EDT.