Update on April 2008 Activities

April 11, 2008

Shared Digital Repository
9 May 2008

This is the second regular update on activities in the Shared Digital Repository (SDR). These updates will be made available monthly, typically on the second Friday of the month, and will report on the general health of the repository and on the development of the SDR. Each update will be sent via e-mail to an official representative (typically the library director) of each participating institution, and will be posted on the SDR website. We plan to make an RSS feed for the updates available soon, in order to share the information as broadly as possible. Throughout this update, we cite the draft Short-Term and Long-Term Functional Objectives (being articulated by the CIC’s SDR committee) wherever a work item relates to those Objectives.

SDR Governance

A final form of an agreement was concluded with the CIC. The agreement was signed by Michigan on April 8th; we are currently awaiting a signature by the CIC.

The executive management committee of the SDR meets monthly and continues to work on a variety of issues ranging from SDR finances to development priorities. Although we are awaiting execution of the agreement with the CIC, we are working to establish a date for a preliminary meeting of the SDR’s Operational Advisory Board in anticipation of completion of the agreement. In that meeting we expect to address a variety of issues, including coordination between the SDR and the CIC.

We continue to have productive conversations with several other institutions about possible participation in the SDR and hope to report on our progress in future Updates.

Growth of the SDR

As of April 30th, the SDR contains:

1,122,007 volumes

791,460 titles

Approximately 393 million pages

213,379 individual volumes in the public domain (19% of total)
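
The 19% figure follows directly from the counts above. As a quick check, the short Python sketch below recomputes it and also derives two rough averages that are not stated in this update; these derived values are illustrative only and are no more precise than the "approximately 393 million" page count.

```python
# Figures as reported for April 30th, 2008.
volumes = 1_122_007
titles = 791_460
pages = 393_000_000        # "approximately 393 million"
public_domain = 213_379

public_domain_share = public_domain / volumes
pages_per_volume = pages / volumes      # derived here, not stated in the update
volumes_per_title = volumes / titles    # derived here, not stated in the update

print(f"Public domain share: {public_domain_share:.1%}")
print(f"Average pages per volume: {pages_per_volume:.0f}")
print(f"Average volumes per title: {volumes_per_title:.2f}")
```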

Archival certification

In a future update, we will provide a link to our draft response to the required elements in the Trustworthy Repositories Audit & Certification (TRAC): Criteria and Checklist. As mentioned in the last update, we coordinated a site visit by a team from the Digital Repository Audit Method Based on Risk Assessment (DRAMBORA) effort in the European Union. Their report, which gives an extremely favorable review of the SDR, should be released publicly soon. (CIC SDR Short-Term Functional Objectives)

Infrastructure Development

Basic hardware deployment: Work continues between Indiana University and the University of Michigan to prepare the facilities in Bloomington for the second instance of SDR storage.

Deployment issues: As mentioned in our previous update, we worked through a number of important issues to reduce the management and storage impact of maintaining nearly 1 billion small files in the archive. This work included bundling all of the page images and text files for each volume into a single archive file. The resulting strategy increases the scalability of the archive in a number of ways. At the same time, we began introducing PREMIS preservation metadata into the METS files that document the individual volumes. That work continues successfully, and as of May 1st more than 50% of the volumes have been transformed. We will report on progress in that area in future months.
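
To illustrate the bundling idea described above, here is a minimal Python sketch that packs the page images and text files of one volume into a single archive file. The update does not say which archive format or file types are used in the SDR; the ZIP format, the file extensions, the directory layout, and the function name `bundle_volume` are all assumptions made for illustration.

```python
import zipfile
from pathlib import Path


def bundle_volume(volume_dir: Path, archive_path: Path) -> int:
    """Pack one volume's page images and text files into a single ZIP file.

    Returns the number of files bundled. The extensions below are assumed
    file types, not a description of the actual SDR layout.
    """
    page_suffixes = {".tif", ".jp2", ".txt"}  # assumption: image and OCR text types
    members = sorted(
        p for p in volume_dir.iterdir()
        if p.is_file() and p.suffix.lower() in page_suffixes
    )
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for member in members:
            # Store each file under its bare name so the archive is self-contained.
            zf.write(member, arcname=member.name)
    return len(members)
```

Replacing many small files with one archive per volume means the storage system tracks one object where it previously tracked hundreds, which is the kind of reduction in management overhead the paragraph above describes.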

Service Development

Ingesting Wisconsin content: UM has begun to load records and content from Wisconsin. The first test records and digital files are online and in our development system. This process has involved: modifications to load routines to take into account non-UM identifiers; modifications to record routines to store information about the SDR and non-UM content; development of routines to identify where a copy is already present from Michigan so that the additional version is stored as an added-copy; modifications of the pageturner to deal with multiple identifier namespaces; and modifications of GRIN routines to deal with download for different institutions. Although the results are preliminary, the changes have all been successful and in May we should begin routine transfers. At the same time, loading Wisconsin’s content foregrounds issues such as labeling of page images and identification of content. We hope to release all of this in production in May and to work with Wisconsin staff to get initial feedback on the implementation.
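
The combination of multiple identifier namespaces and added-copy detection can be sketched as follows. This is a simplified Python illustration, not the actual load routines: the namespace codes (`um`, `wisc`), the identifiers, the bibliographic match key, and the class names are all hypothetical, and the sketch treats any earlier copy as the first copy rather than checking specifically for a Michigan copy.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class VolumeId:
    """A volume identifier qualified by the contributing institution."""
    namespace: str  # hypothetical institution code, e.g. "um" or "wisc"
    local_id: str   # identifier as assigned by that institution

    def __str__(self) -> str:
        return f"{self.namespace}.{self.local_id}"


@dataclass
class Holdings:
    """Tracks which volumes hold the same work, keyed by a match key."""
    by_work: Dict[str, List[VolumeId]] = field(default_factory=dict)

    def ingest(self, vol: VolumeId, match_key: str) -> str:
        copies = self.by_work.setdefault(match_key, [])
        if vol in copies:
            return "duplicate"  # the same volume was loaded twice
        status = "added-copy" if copies else "first-copy"
        copies.append(vol)
        return status


holdings = Holdings()
print(holdings.ingest(VolumeId("um", "0001"), "work-42"))    # first-copy
print(holdings.ingest(VolumeId("wisc", "0001"), "work-42"))  # added-copy
```

Qualifying each identifier with a namespace lets two institutions use the same local identifier without collision, which is why the load routines and the pageturner both had to become namespace-aware.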

Large-scale search: We have been working to deploy SOLR infrastructure and expect to have a report on that progress next month. In preparation for that, we have been developing strategies for handling more robust configurations (including selection of a servlet engine), how to scale performance, how to provide failover, and how to virtualize services to provide multiple indexes for development and testing. We hope soon to begin disseminating some numbers and approaches to benchmarking performance on large bodies of text. (CIC SDR Long-Term Functional Objectives)
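
As one illustration of the failover question, the Python sketch below tries a list of Solr endpoints in order and returns the first successful response. This update does not describe how failover will actually be provided; client-side failover is just one possible approach, and the function names and endpoint URLs are hypothetical. The `q` and `wt=json` parameters are standard Solr query parameters.

```python
import json
import urllib.parse
import urllib.request
from typing import Callable, Sequence


def http_fetch(url: str, timeout: float = 5.0) -> str:
    """Fetch a URL and return the response body as text."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def search_with_failover(
    query: str,
    base_urls: Sequence[str],
    fetch: Callable[[str], str] = http_fetch,
) -> dict:
    """Query each Solr endpoint in turn until one answers with valid JSON."""
    params = urllib.parse.urlencode({"q": query, "wt": "json"})
    last_error = None
    for base in base_urls:
        try:
            return json.loads(fetch(f"{base}/select?{params}"))
        except (OSError, ValueError) as exc:
            # OSError covers connection failures and timeouts;
            # ValueError covers a malformed (non-JSON) response.
            last_error = exc
    raise RuntimeError("all Solr endpoints failed") from last_error
```

A call such as `search_with_failover("whale", ["http://solr-a.example.org/solr", "http://solr-b.example.org/solr"])` would fall through to the second (hypothetical) host only if the first is unreachable or returns an unusable response.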

Institution-specific pageturner: We continue to work in development with an institution-specific pageturner and will soon finalize work with Wisconsin on appropriate graphics and color choices. (CIC SDR Short-Term Functional Objectives)

Services for visually-disabled users: We previously released a working, UM-specific system in production and are developing a new and separate interface for visually impaired users, optimized for use with JAWS and other screen readers. This new interface will be included in the May release. An intern from Michigan’s School of Information will join us for the summer and aid us in conducting user testing on this system. (CIC SDR Short-Term Functional Objectives)

Fedora programmer: We have posted a position for a full-time Fedora programmer who will not only add basic Fedora support to the SDR, but also help us leverage Fedora to develop an “open service definition to make it possible for CIC libraries to develop secure access mechanisms and discovery tools” (CIC SDR Long-Term Functional Objectives).

Collection Builder: We have completed substantial work on the creation of a Collection Builder, which should allow users and staff to “publish virtual collections” (CIC SDR Short-Term Functional Objectives). Work is now progressing to include large-scale search capability in the Collection Builder. We are projecting a July production release of the Collection Builder.

API development: Several institutions have requested the development of an API similar to Google’s recently released GBS API (see also CIC SDR Long-Term Functional Objectives). An early release is being tested internally. Individuals interested in testing should contact Jon Rothman (jrothman at umich.edu).

Forecasting development

May: We will begin distributing bibliographic information about the contents of the SDR to participating libraries so that they may enhance or add records to their catalogs. The mechanisms will be publicized in the Update on May Activities.

May: We will publicly release the multi-institutional pageturner. The interface will be visible to users at the University of Wisconsin.

May: We will release Wisconsin content in the SDR.

May: We will move our SOLR infrastructure to production. In the Update on May Activities, we will begin a discussion about benchmarking.

May: We will publicly release a mechanism that permits 10-page PDF chunks (rather than 1-page chunks).

July: We are hoping to release the Collection Builder.
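
The 10-page PDF chunks planned for May amount to splitting a volume's page sequence into fixed-size ranges. The following Python sketch shows that arithmetic only; the function name and the 1-based, inclusive page numbering are assumptions, and the sketch does not generate any PDF.

```python
from typing import List, Tuple


def page_chunks(total_pages: int, chunk_size: int = 10) -> List[Tuple[int, int]]:
    """Split pages 1..total_pages into inclusive (first, last) ranges."""
    if total_pages < 0 or chunk_size < 1:
        raise ValueError("total_pages must be >= 0 and chunk_size >= 1")
    return [
        (start, min(start + chunk_size - 1, total_pages))
        for start in range(1, total_pages + 1, chunk_size)
    ]


print(page_chunks(25))  # [(1, 10), (11, 20), (21, 25)]
```

For a volume of about 350 pages, 10-page chunks mean roughly 35 downloads where 1-page chunks would require 350.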

Status/availability of the SDR

We schedule system maintenance work that requires a system outage during time windows (in Eastern time) when academic user activity is generally lowest:

For major work, Friday evenings (8pm-1am) and Sunday mornings (5am-10am);

For minor work, weekdays from 6:30am-8am.
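
The windows above can be expressed as a simple check, for example for use by a monitoring script. This Python sketch is illustrative only: the function name is hypothetical, it expects a naive datetime already expressed in Eastern time, and it treats each window's end time as exclusive. Note that the Friday evening window runs past midnight into Saturday.

```python
from datetime import datetime, time
from typing import Optional


def maintenance_window(dt: datetime) -> Optional[str]:
    """Return "major", "minor", or None for a datetime in Eastern time."""
    weekday, t = dt.weekday(), dt.time()  # Monday is 0, Sunday is 6
    # Friday 8pm through Saturday 1am
    if (weekday == 4 and t >= time(20, 0)) or (weekday == 5 and t < time(1, 0)):
        return "major"
    # Sunday 5am-10am
    if weekday == 6 and time(5, 0) <= t < time(10, 0):
        return "major"
    # Monday through Friday, 6:30am-8am
    if weekday <= 4 and time(6, 30) <= t < time(8, 0):
        return "minor"
    return None


print(maintenance_window(datetime(2008, 5, 9, 21, 0)))  # Friday 9pm: major
```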

Notice of scheduled outages is given on business days, at least 24 hours in advance. Notice of unscheduled outages is given upon discovery, and additional updates are given as appropriate.

Please contact Phyllis White (pmwhite at umich.edu) with email addresses of individuals or groups that should be added to our system outage mailing list.

There were no interruptions in service in April.

At this time, the following outages are scheduled:

May: We will be scheduling a brief outage for a storage system software upgrade.