DLF evaluation of the Open Archives Initiative

The DLF is supporting the development of a small number of
Internet gateways through which users will access distributed
digital library holdings as if they were part of a single uniform
collection. The gateways will be built using a technique known as
metadata harvesting. That technique is documented in a technical
framework developed by the Open Archives Initiative (OAI). As such, the
gateways developed by the DLF will contribute to a practical
evaluation of the OAI's harvesting technique and its application
within libraries. These pages describe this development work and
provide an up-to-date account of its progress.

In May 2000, with funding from The Andrew W. Mellon
Foundation, two meetings were held at Harvard University to
explore various technical and organizational issues involved in
the development of metadata harvesting services in digital
libraries. The meetings' aims are set out in a funding proposal that also served to brief
invited participants and to focus their discussions.

Participants quickly concluded that there were numerous,
potentially very valuable harvesting applications in the digital
library. Rather than re-invent a harvesting protocol, however,
participants agreed to concentrate on desirable revisions to the
protocol that had been developed recently by the OAI and
documented in the Santa Fe Convention. The meetings produced
three outcomes: a vision statement describing how harvesting
services could be developed to the advantage of libraries and
their patrons; a set of recommended changes that were formally
put to the OAI; and a road map for the development of harvesting
services that would help the library community evaluate the OAI's
technical framework in particular and the potential value of
metadata harvesting in general.

The vision statement was produced in
printed and electronic forms, and circulated widely to a broad
cross-section of the library community. The statement reflected
on libraries' persistent concern to pool the records they had
developed in order to document their respective holdings. It
evaluated the various mechanisms that had been used to achieve
this shared aim (e.g. union catalogs and distributed search
services) and the difficulties that those mechanisms encountering
in trying to integrate records pertaining to digital as well as
non-digital information. Metadata harvesting, the statement
suggested, promised to overcome some of these limitations. In
addition, the vision statement demonstrated how harvesting could
support the construction of Internet portals or gateways; that
is, websites that organize access to a rich variety of
information resources (potentially in any format) to meet very
specific user needs. Thus, the vision statement mused about
harvesting services that organized access to information relevant
to those interested in a particular field of study (e.g. American
history, biomedical ethics), a particular kind of information
(e.g. electronic books, digital images, maps and cartography), or
information available in a certain region (e.g. the southwestern
United States). It also envisaged Internet search services,
equivalent to those offered commercially, for example, by Alta
Vista and Lycos, but focusing more exclusively on scholarly
information, including information held in databases and
therefore hidden from the commercial search engines' view. In this
regard, the statement suggested that the harvesting technique
could be used to build what members of the Association of
Research Libraries were at that time beginning to refer to as
"the scholarly commons".

The second result of the Harvard meetings was a set of three
recommendations that were taken formally to the OAI. These were
intended to help the OAI generalize the framework so that it
could be applied beyond the e-print community where it had
originated. The first recommendation was technical and urged
adoption of unqualified Dublin Core as the protocol's common
metadata element set. The OAI had originally proposed an Open
Archives Metadata Set that was smaller and more prescribed than
the Dublin Core, and more closely tailored to the needs of the
e-print community. The second recommendation called for greater
organizational stability for the OAI, including a steering
committee, an official home for the OAI web site, and a clear
locus of responsibility for maintenance of the protocol. The aim
here was to stabilize the framework long enough to encourage
institutions to invest in its practical application. The third
recommendation sought to generalize the initiative by focusing it
on technical rather than operational issues. Hitherto, the
technical framework had been developed to support a particular
application that aimed to make electronic pre-print
publications more widely accessible and available without cost to
end-users. Participants in the Harvard meetings envisaged (and
did not want to constrain development of) applications that
reflected very different organizational and business objectives.
These three recommendations were among those discussed by the
Open Archives Initiative at its second public meeting in San
Antonio, Texas, in June 2000, and helped to encourage
developments that are reported elsewhere in these pages.

The road map for developing harvesting services involved
progress on two closely related fronts: developing a pool of
harvestable metadata focusing principally on metadata available
from library systems; and building a small number of online
services with metadata harvested from the pool. The work is being
undertaken in close collaboration between the DLF (whose 25
member libraries share an interest in integrating access to their
distributed collections) and The Andrew W. Mellon Foundation. An
account of its progress is set out below.
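
The two fronts of the road map (a pool of harvestable metadata, and
services built on records harvested from it) can be illustrated with a
minimal sketch. It is not the OAI protocol and makes no network requests;
it only assumes that each provider exposes records restricted to the
fifteen elements of unqualified Dublin Core, and that a service copies
them into a local pool and searches that pool. The provider names, record
identifiers, sample records, and the functions harvest and search are all
hypothetical.

```python
from dataclasses import dataclass

# The fifteen elements of unqualified Dublin Core.
DC_ELEMENTS = frozenset({
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
})


@dataclass
class Record:
    """One metadata record; `fields` maps a Dublin Core element to its values."""
    provider: str
    identifier: str
    fields: dict


def harvest(providers):
    """Copy every conforming record from every provider into one local pool.

    `providers` maps a provider name to a list of (identifier, fields)
    pairs, standing in for whatever a real harvesting request would return.
    """
    pool = {}
    for name, records in providers.items():
        for identifier, fields in records:
            if set(fields) - DC_ELEMENTS:
                # Skip records that use elements outside the common set.
                continue
            pool[(name, identifier)] = Record(name, identifier, fields)
    return pool


def search(pool, term):
    """Return pooled records with any field value containing `term`."""
    needle = term.lower()
    return [
        rec for rec in pool.values()
        if any(needle in value.lower()
               for values in rec.fields.values()
               for value in values)
    ]


# Hypothetical sample data: two providers, three records.
PROVIDERS = {
    "library-a": [
        ("a-1", {"title": ["Map of the Rio Grande"], "type": ["image"]}),
        ("a-2", {"title": ["Treatise on algebra"], "shelfmark": ["QA 154"]}),
    ],
    "library-b": [
        ("b-1", {"title": ["Letters from Santa Fe"],
                 "coverage": ["Rio Grande valley"]}),
    ],
}

pool = harvest(PROVIDERS)
hits = search(pool, "rio grande")
print(sorted((rec.provider, rec.identifier) for rec in hits))
```

The search runs entirely against the local pool, so the service does not
need to query each provider when a user submits a search. The record
with the non-Dublin Core "shelfmark" element is skipped here only to
keep the sketch short; a real service might instead map such fields onto
the common element set.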

In June 2000 the DLF began construction of a simple database
to list its members' nearly 300 public domain online digital
collections. The database, available from http://www.hti.umich.edu/cgi/d/dlfcoll/dlfcoll-idx,
was developed in part to identify sources of harvestable metadata
upon which the prototype harvesting services might rely.

Also in June, the Andrew W. Mellon Foundation invited the DLF
to locate institutions interested in contributing to evaluation
projects either by contributing metadata or by harvesting
metadata and building services. A call for
expressions of interest issued by the DLF in July produced 13
responses. Responses described 9 or 10 potential harvesting
services. They also offered metadata from nearly 50 digital
library collections representing well over a million
unique information objects.

In October 2000, a meeting of interested project participants
was convened by The Andrew W. Mellon Foundation to explore
technical, organizational, and resource issues and to identify
possible next steps. The meeting is reported fully elsewhere. Briefly, participants
identified at least four service types of particular promise
for libraries, including:

services capable of supporting inquiry about a particular
subject ("Americana" was emphasized, in part because of the
availability of relevant digital holdings);

services developed by a single library or library consortium
and customized to meet its patrons' specific needs; and

services supporting a simple Lycos-style search across
available metadata irrespective of the subject matter, format, or
location of the information objects to which the metadata
referred.

Finally, there was interest in using harvesting services to
integrate information about digital as well as non-digital
objects and, in this way, to capitalize on the substantial
scholarly wealth represented, for example, in union bibliographic
databases and online archival finding aids.

At present, those who have offered metadata are building
OAI-conformant services that will allow them to make the metadata
available for harvesting. A list of metadata collections
provisionally to be made available for harvesting is supplied below. In the meantime, discussions
with potential service developers continue.

1,600 records for scanned historic scenery backdrops in common
use in the US, 1890-1920. Metadata format: Visual Resources
Association Core Categories and Dublin Core.

400 records for images from the history of computing. Metadata
format: Visual Resources Association Core Categories and Dublin
Core.

Cornell University

Records for digitized books and images from the Making of
America collection, including the 100,000-plus articles in
nineteenth-century serial publications and the 267 volumes of
nineteenth- and twentieth-century monographs. Metadata format:
Text Encoding Initiative Headers.

2,000 records for nineteenth- and twentieth-century
agricultural texts. Metadata format: MARC and Dublin Core
records.

Records for 571 digitized pre-1914 math books. Metadata format:
MARC records.

Records pertaining to geospatial data for New York State.
Metadata format: Federal Geographic Data Committee.

Records for c. 3,000 electronic resources available through the
Cornell Library gateway. Metadata format: Dublin Core.

Emory University

Finding aids for nearly 45 special collections and item-level
descriptions for 8,000 texts and photographs. Metadata format:
Encoded Archival Descriptions.

Records for the full SGML- and XML-encoded texts in the Emory
Women Writers Project. Metadata format: Text Encoding Initiative
Headers.

Records for the full SGML- and XML-encoded poetry. Metadata
format: Text Encoding Initiative Headers.

Web resources created by Emory faculty for research purposes.
Metadata format: NA.

Social science data produced at Emory University. Metadata
format: NA.

Harvard University

Metadata from VIA, an online index of visual resources in the
arts, architecture, material culture, and history from Harvard
libraries, archives, museums, etc., including 15,000 records
linked to digital images. Metadata format: Visual Resources
Association Core Categories and Dublin Core.

Harvard University Virtual Data Center

Thousands of metadata records for data in the Harvard-MIT Data
Center and the Virtual Data Center. Metadata format: Data
Documentation Initiative.

Indiana University

Metadata records from the nearly 185 searchable works by
nineteenth-century Victorian women writers. Metadata format:
Text Encoding Initiative Headers.

Metadata for the 2,832 titles in the Lyle Wright bibliography
of American Fiction 1851-1875 being digitized by CIC
institutions. Metadata format: Text Encoding Initiative Headers.

Hoagy Carmichael Collection: metadata for thousands of items in
libraries and archives. Metadata format: Encoded Archival
Descriptions, MARC, Text Encoding Initiative Headers.

Library of Congress

Selected records from American Memory. Metadata format: Dublin
Core.

Research Libraries Group (RLG)

Metadata records from the visual resources available from the
Cultural Materials Alliance.

In June 2001, The Andrew W. Mellon Foundation funded seven
projects that proposed construction of various online services
employing the OAI metadata harvesting protocol. The following
services were funded at DLF member institutions:

At Emory University, two grant projects, AmericanSouth.org and MetaArchive.org, have
been combined and are being carried forward in
cooperation with partner institutions SOLINET and ASERL. The services
will integrate access to digital collections dealing with the
American South, and will integrate finding aids for archives of papers of
major American political figures, records of theological
institutions, and Africana.

The University of Illinois is creating a web portal for searching
materials focusing on cultural heritage and coming from a variety
of institutions including library special collections, museums,
historical societies, and public libraries.

The University of Michigan is building a service (OAISTER) that will
integrate access to digitally reformatted materials irrespective
of their subject, whether art or zoology, and format, whether
text or image.

The University of Virginia is working to integrate access to
digital Americana in all formats.

Many other DLF members are contributing metadata to these
harvesting services.