The MetaCombine Project

Introduction

Welcome to the home page of MetaCombine, a Mellon-funded project hosted at Emory University. MetaCombine is a part of Emory's MetaScholar Initiative. The goal of MetaCombine is to experiment with methods to more meaningfully combine digital library resources and services, and, whenever possible, demonstrate the deployment of these methods. Below is an overview based on the initial project proposal to Mellon.

Overview

July 2003

Executive Summary

Emory University seeks to conduct practical experimentation with improved techniques
for organization and access to scholarly information via the Open Archives Initiative
Protocol for Metadata Harvesting (OAI-PMH) as well as the World Wide Web. Through
the proposed project, Emory will explore combinations of information and services at
various levels of abstraction: combined search of OAI and Web resources, combined
semantic clusters of information, and combined digital library components acting as
a whole. Hence, the project name: MetaCombine.

Key Points

1) The MetaCombine project will assess the effectiveness of several specific semantic
clustering techniques (see glossary for a description of this approach to information
organization) for improving organization and access to bodies of metadata exposed via
the OAI-PMH as well as Web resources. The focus will be on two prominent techniques:
the support vector machine (SVM) class of algorithms and multidimensional scaling (MDS)
visualization (see glossary for an overview of these methodologies).

3) Further, the project will experimentally develop and evaluate
a framework for coordinating
loosely coupled components of digital library services in an extensible manner,
based on a
new approach to using the OAI-PMH and Web services as underlying means of system
integration.

4) The MetaCombine project will build on the technical expertise
and working relationships
with scholars accumulated in the AmericanSouth and MetaArchive projects conducted
at Emory
University.

5) Emory University will use and develop only open source
software (OSS, see glossary for
details concerning this class of software), specifically software that can
subsequently
be used freely by other research institutions.

6) This project can potentially have broad impacts on many
other initiatives, by advancing
the current understanding of several areas of digital library technology and
scholarly
communication.

Problems Addressed

During the course of the Mellon Metadata Harvesting
Initiative projects, and generally during the
first few years of experimentation with the OAI-PMH, several problems have
come into focus.
Through the proposed project, Emory seeks to address these problems, summarized
below.

1) Problems in Categorizing and Browsing Harvested Metadata:
To date, virtually no OAI-PMH harvesting project has developed effective means
of browsing
metadata aggregations by subject, author, or other systematic categorization
scheme.
These problems stem from the fact that metadata aggregations harvested from
multiple
institutions suffer from a lack of controlled vocabulary and authority control
in the
underlying Dublin Core (DC, see glossary) metadata distributed via the OAI-PMH.
Without this consistency, there is no effective way to browse metadata across
institutional
boundaries. This problem is vexing, especially when dealing with archival collections
of
interest to humanities scholars. Such collections are topically narrow and
deep.
Archivists typically must implement their own specialized controlled vocabularies,
because generalized systems (such as the Library of Congress Subject Headings,
or LCSH)
do not provide enough specificity. This problem has been encountered by groups
ranging
from the MetaScholar Initiative at Emory University to the UC San Diego Union
Catalog of
Art Images (UCSD UCAI) project. Automated mechanisms (such as the semantic
clustering
experiments described in the next section) for categorizing multi-institutional
aggregations
of metadata could potentially be applied ex post facto to metadata aggregations
to remediate
these problems.

2) Problems in Searching Across OAI and Web Realms:
The OAI-PMH is a nearly perfect mechanism for distributing metadata from databases
and other
dynamic content management systems. While the protocol is steadily increasing
in deployment
and availability, there is an enormous and still growing realm of Web content
providers that
are unlikely to soon expose their metadata via the protocol. This produces
problems for
services attempting to provide comprehensive search capabilities for some targeted
subject
domain. Specifically, much of the information that researchers would ideally like to
search is spread across the separate OAI and Web realms.

3) Problems in Coordinating Federated Digital Library Infrastructures:
The success of lightweight protocols (see glossary) like the OAI-PMH has had two consequences: provider
services based
on such protocols have proliferated, and integrating services have struggled
to find effective
models and frameworks for attempting to amalgamate such distributed provider
services into
larger systems that work as a virtual whole. Digital libraries are still evolving
at a rapid
pace and will likely remain loose assemblies of distributed infrastructures
for some time to
come. Current consortial efforts to establish interoperable digital library
federations like
the AmericanSouth.Org system must proceed from the fundamental assumption that
such federations
will be loosely coupled. Tightly coupled systems with strong underlying programming
infrastructures are not practical given the foreseeable level of coordination
between digital
library efforts and the rate of change we are experiencing. Libraries need
more mechanisms
like the OAI-PMH that can provide abstraction interfaces (see glossary) via
lightweight
protocols. This approach offers the promise of enabling interoperability among
systems that
will remain loosely coupled.

Plan of Work

The project will produce three broad deliverables, described below. Each deliverable will include:

1) development of a working experimental infrastructure applied to either or both the
AmericanSouth.Org and MetaArchive.Org scholarly portals,

2) assessment by means of multiple techniques, and

3) reporting results to the profession, both in conference presentations and
in the professional literature.

A. Semantic Clustering Experiments

Summary: In this experiment,
open source semantic clustering software will be applied to several
collections of information aggregated from multiple institutions in order to
categorize
this information and make it browsable by researchers.

Background: Semantic
clustering techniques appear to be a promising approach to remediate the
problems of categorization described above in relation to harvested metadata.
The most
prominent and successful technique that has emerged in recent years for semantic
clustering is the support vector machine (SVM) class of algorithms. SVM is
the best
currently known technique for automatically categorizing information, and is
currently
anticipated to be a powerful tool for automated organization of metadata. Another
long-standing technique for semantic clustering is multidimensional scaling
(abbreviated MDS). MDS provides a simple technique for graphically displaying
the
similarities and relationships of clustered information, as opposed to simply
categorizing
information for related item browsing. MDS visualization therefore complements
the results of SVM categorization.
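
As an illustration only, the categorization step can be sketched with a very small linear SVM trained on titles that a subject expert has already labeled. This is not the software the project will select: the training titles, the two category labels, the function names, and the learning parameters below are all hypothetical, and the training loop is a plain sub-gradient method on the regularized hinge loss rather than a production SVM solver.

```python
from collections import Counter

def tokenize(text):
    """Lowercase a title and keep purely alphabetic tokens."""
    return [w for w in text.lower().split() if w.isalpha()]

def featurize(text, vocab):
    """Bag-of-words count vector over a fixed vocabulary."""
    counts = Counter(tokenize(text))
    return [float(counts[w]) for w in vocab]

def train_linear_svm(X, y, lam=0.01, eta=0.1, epochs=50):
    """Sub-gradient descent on the L2-regularized hinge loss; labels are +1 or -1."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            w = [(1.0 - eta * lam) * wj for wj in w]  # shrink: regularization term
            if margin < 1.0:  # example lies inside the margin, so the hinge loss is active
                w = [wj + eta * yi * xj for wj, xj in zip(w, xi)]
                b += eta * yi
    return w, b

def predict(text, vocab, w, b):
    """Return +1 or -1 according to the side of the separating hyperplane."""
    score = sum(wj * xj for wj, xj in zip(w, featurize(text, vocab))) + b
    return 1 if score >= 0 else -1

# Hypothetical expert-labeled titles: +1 = "Civil War history", -1 = "Music".
TRAINING = [
    ("civil war letters from georgia soldiers", 1),
    ("plantation records and civil war diaries", 1),
    ("blues music recordings from mississippi", -1),
    ("gospel and blues song recordings", -1),
]
VOCAB = sorted({tok for title, _ in TRAINING for tok in tokenize(title)})
X = [featurize(title, VOCAB) for title, _ in TRAINING]
Y = [label for _, label in TRAINING]
W, B = train_linear_svm(X, Y)
```

Calling predict("civil war diaries", VOCAB, W, B) assigns an unlabeled title to one of the two categories. A real deployment would need many more labeled examples per category and one classifier per category (or a multi-class scheme).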

Rationale: Applying the SVM and MDS techniques
to a series of metadata and Web information aggregations
will constitute a valuable experiment in organizing unstructured information
for purposes
of scholarly communication. The collections of information under consideration
below
are typical in that they lack effective overall classification categories,
and
therefore cannot be browsed by subject category.

Benefits: This is a practical experiment in that we hope to
use the organized information resulting
from these experiments in the AmericanSouth scholarly portal. The experiment
further has
broad applicability and therefore potential benefit to many other projects
seeking to
organize unstructured information for scholarly communication purposes. An
example of
such projects is the UCSD UCAI project mentioned previously. Emory intends
to share
information on this topic with the UCAI and similar projects for mutual benefit.

Details: This experiment will use open source semantic
clustering software. Emory will conduct
an evaluation of the many available SVM software tools (see
http://www.support-vector.net/software.html), and select one or more for this experiment. MDS is a general
statistical
technique that is supported by many open source statistical environments (an
example is the R
language environment, see http://www.r-project.org). Emory will use SVM to
categorize
various combinations of information of scholarly interest in the study of the
culture and
history of the American South and make this information browsable. Because
SVM is most
effective when subject experts provide guidance and feedback to the system,
Emory will
employ the scholarly design team of the AmericanSouth project to train the
SVM system to
produce effective interdisciplinary categories of use to humanists. A system
for MDS
visualization of clustered information will be developed and applied to the
results of
these clustering experiments for graphical display and comprehension of results.
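
The MDS step can be sketched as follows, assuming that a matrix of pairwise dissimilarities between categories (or between records) is already available. The sketch implements classical (Torgerson) MDS in plain Python, using power iteration to find the leading eigenvectors; a statistical environment such as R provides this directly. The function name and the four-category dissimilarity matrix are hypothetical.

```python
import math

def classical_mds(dist, dims=2, iters=200):
    """Embed n items in `dims` dimensions from a symmetric n x n distance matrix."""
    n = len(dist)
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(r) / n for r in sq]
    total_mean = sum(row_mean) / n
    # Double centering turns squared distances into an inner-product (Gram) matrix.
    B = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + total_mean)
          for j in range(n)] for i in range(n)]
    coords = [[0.0] * dims for _ in range(n)]
    for k in range(dims):
        v = [1.0 + i for i in range(n)]  # fixed start vector keeps the run deterministic
        for _ in range(iters):  # power iteration toward the largest eigenvalue
            w = [sum(B[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        eigval = sum(v[i] * sum(B[i][j] * v[j] for j in range(n)) for i in range(n))
        scale = math.sqrt(max(eigval, 0.0))
        for i in range(n):
            coords[i][k] = scale * v[i]
        # Deflate so the next pass finds the next-largest eigenvalue.
        for i in range(n):
            for j in range(n):
                B[i][j] -= eigval * v[i] * v[j]
    return coords

# Hypothetical dissimilarities between four derived subject categories.
DISSIMILARITY = [
    [0.0, 3.0, 4.0, 5.0],
    [3.0, 0.0, 5.0, 4.0],
    [4.0, 5.0, 0.0, 3.0],
    [5.0, 4.0, 3.0, 0.0],
]
POINTS = classical_mds(DISSIMILARITY)
```

Each row of POINTS is an (x, y) position that could be handed to any plotting tool; categories that are more dissimilar end up farther apart on the page.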

1) AmericanSouth Metadata Clustering.
The purpose of this experiment is to test whether or not SVM OSS can (with
minimal guidance from
experts) effectively categorize and make browsable all DC metadata harvested
in AmericanSouth,
for use by scholars researching the culture and history of the American South.
The resulting
body of winnowed and categorized metadata will be made browsable via the derived
categories.
Effectiveness of the categories will be gauged by feedback from scholarly consultants
(see
section on staff resources).

2) AmericanSouth Web Clustering.
This experiment will test whether SVM OSS can categorize and make browsable
the crawled Web
content represented in the Web links section of the AmericanSouth portal (Web
sites identified
by scholarly consultants as including content of scholarly research value),
tested under
conditions similar to those listed in A1 above. The process and results of
clustering Web
content as opposed to metadata will be compared to understand similarities
and differences.

3) Multi-Dimensional Visualization. A system will be developed
to test whether MDS graphical display offers an effective means
of visualizing the SVM-derived subject categories produced by
both of the above experiments. Assuming that the experiment yields
a comprehensible and worthwhile display of relationships, the
following clustering results will also be visualized.

4) AmericanSouth Combined Clustering. This experiment will
test if SVM OSS can categorize and make
browsable a union set of harvested metadata and crawled Web content, to evaluate
whether SVM
semantic clustering can effectively organize such disparate information sets.
This builds on the
findings of MetaScholar Initiative projects to date, namely that both harvested
metadata and
crawled Web content should be integrated for comprehensive scholarly information
discovery
purposes.

5) American Memory Metadata Clustering. Finally, the project
will conduct an experiment to test
whether the DC metadata available from the American Memory OAI provider can
be effectively
culled and categorized for use by scholars researching the culture and history
of the American
South (as opposed to generalized American Studies). The resulting body of winnowed
and categorized
metadata will also be made browsable via the derived categories.

B. Experiments with Combined OAI / Web Search

Summary: Open
source tools will be used to make a combination of harvested metadata and crawled
Web
content searchable in the context of the AmericanSouth portal.

Background: As mentioned, the MetaScholar Initiative has concluded
that both harvested metadata and
crawled Web content should be integrated for comprehensive scholarly information
discovery
purposes. However, this presents a challenge, as the two types of information
are
fundamentally different in nature, metadata being an abstraction of content,
and Web
pages being an instantiation of content. Both of the component tasks (harvesting/searching
metadata, and crawling/searching Web content) are now relatively well understood.
What is not well understood is the task of combined searching of these information
realms.
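
The difference between the two kinds of input can be made concrete with a small sketch. It builds an OAI-PMH ListRecords request URL, extracts the Dublin Core fields from a harvested record, and extracts the visible text of a crawled Web page, reducing both to one common document shape. The base URL, the sample response, the sample page, and the function names are invented for illustration; no network access is performed, and resumption tokens and error responses are ignored.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser
from urllib.parse import urlencode

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}
DC_TAG_PREFIX = "{http://purl.org/dc/elements/1.1/}"

def list_records_url(base_url):
    """Request URL for all records in unqualified Dublin Core."""
    return base_url + "?" + urlencode({"verb": "ListRecords", "metadataPrefix": "oai_dc"})

# Invented sample of what a data provider might return.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Civil War letters</dc:title>
          <dc:subject>Georgia history</dc:subject>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

def records_from_oai(xml_text):
    """Metadata is an abstraction of content: keep the text of every DC element."""
    docs = []
    for rec in ET.fromstring(xml_text).iterfind(".//oai:record", OAI_NS):
        ident = rec.findtext("oai:header/oai:identifier", namespaces=OAI_NS)
        fields = [el.text for el in rec.iter()
                  if el.tag.startswith(DC_TAG_PREFIX) and el.text]
        docs.append({"id": ident, "source": "oai", "text": " ".join(fields)})
    return docs

class _TextExtractor(HTMLParser):
    """Collects character data; script and style content is not filtered here."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())

def record_from_html(url, html_text):
    """A Web page is an instantiation of content: keep its visible text."""
    parser = _TextExtractor()
    parser.feed(html_text)
    parser.close()
    return {"id": url, "source": "web", "text": " ".join(parser.chunks)}
```

Once both realms are reduced to the same shape, the open question is no longer how to read each source but how to rank and present short metadata records alongside much longer full-text pages.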

Rationale: Experiments to bridge the OAI and Web information
realms are needed by the MetaScholar
Initiative, and would benefit other groups. There are two obvious ways that
the OAI and
Web realms might be bridged: subsuming OAI into the Web, or subsuming the Web
into OAI.
Each of these approaches should be evaluated.

Benefits: The findings of this
experiment will have great practical benefit for the AmericanSouth portal,
and will have potential application to virtually any other project seeking
to automatically
assemble a large amount of information for scholarly information discovery
purposes. Emory
intends to share information on this topic with other projects for mutual benefit.

Details: There are a variety of open source software tools that can usefully be tested
for this
purpose. The MetaScholar Initiative has accumulated extensive experience with
the ARC
software for OAI-PMH metadata harvesting and searching from Old Dominion University.
Old
Dominion plans to release a new, re-architected version of the software termed
ARCHON that
may include some capabilities for integrated metadata harvesting and Web crawling.
There
are a large number of open source Web search engines [Morgan, 2001] that could
be adapted for
this experiment. Finally, DP9 is a gateway service developed by Old Dominion
University that
enables indexing of an OAI data provider by an Internet search engine (see
glossary for more
information). DP9 is the logical mechanism to test the case of making the relevant
OAI metadata
searchable via the Web context. DP9 has not been tested by groups beyond Old
Dominion to date.

Specific experiments: Two experiments will be undertaken
in this area.

1) Combined Search Via Web Crawling. This experiment will
test whether or not an open source Web
search engine can be effectively applied to the union of the AmericanSouth
harvested metadata
(exposed via the DP9 gateway service) as well as the Web content represented
by the AmericanSouth
Weblinks. Focus groups of graduate researchers and scholarly consultants will
evaluate the
effectiveness of the resulting combined search system for scholarly discovery
purposes.

2) Combined Search Via OAI-PMH. In this experiment, Emory
will create an OAI provider for the
metadata resulting from the experiment in clustering Web content (A2 above),
and this metadata
will be harvested and made searchable together with the existing metadata in
AmericanSouth. Focus
groups of graduate researchers and scholarly consultants will evaluate the
effectiveness of the
resulting combined search system for scholarly discovery purposes.
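
Whichever route is taken, the end result is a single index over the union of both information sets. The following minimal sketch shows one way such a combined index might behave, assuming each harvested record and each crawled page has already been reduced to an identifier and a block of text. The documents, identifiers, and function names are hypothetical, and the index is a bare inverted index with no ranking, unlike a real search engine.

```python
from collections import defaultdict

def tokenize(text):
    """Lowercase and split on anything that is not a letter or digit."""
    cleaned = "".join(c.lower() if c.isalnum() else " " for c in text)
    return cleaned.split()

def build_index(docs):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc in docs:
        for tok in tokenize(doc["text"]):
            index[tok].add(doc["id"])
    return index

def search(index, query):
    """Return ids of documents containing every query term, in sorted order."""
    terms = tokenize(query)
    if not terms:
        return []
    hits = set.intersection(*(index.get(t, set()) for t in terms))
    return sorted(hits)

# Hypothetical union set: two harvested records and two crawled pages.
DOCS = [
    {"id": "oai:example.org:1", "source": "oai", "text": "Civil War letters, Georgia"},
    {"id": "oai:example.org:2", "source": "oai", "text": "Delta blues recordings"},
    {"id": "http://example.org/war", "source": "web",
     "text": "An essay on the Civil War in the American South"},
    {"id": "http://example.org/food", "source": "web",
     "text": "Southern foodways and cooking traditions"},
]
INDEX = build_index(DOCS)
```

A query such as search(INDEX, "civil war") returns one harvested record and one crawled page together, which is the behavior the focus groups would be asked to evaluate.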

C. Federated Digital Library Framework Experiments

Summary: A framework for loosely-coupled federations of digital libraries will be iteratively
developed as an improved mechanism for coordinating components of such an infrastructure.

Background: The success of lightweight protocols like the OAI-PMH has had two consequences: provider
services based on such protocols have proliferated, and integrating services
have struggled to
find effective models and frameworks for attempting to amalgamate such distributed
provider
services into larger systems that work as a virtual whole. There has been increasing
attention
to this problem in the research community. [Fox, 2002 and Castelli, 2002]

Rationale: The MetaScholar Initiative and other distributed/federated
digital library infrastructures need
better organizing frameworks for coordinating the operations of loosely coupled
constituent
systems, and enabling an extensible scheme for specifying proposed additions
to such
infrastructures. Experiments to utilize emerging industry standards such as
the Web services
framework (see glossary) and research standards such as the OAI-PMH would address
this need.

Benefits: This experimental work will contribute to Emory’s efforts to increase the usefulness of the
Internet for scholars and, if successful, could have broader impacts on humanities scholarship.
If the framework developed is flexible enough that various digital library
services can interact modularly and share information, then a large number of initiatives
might stand to benefit. As a hypothetical example, if the Perseus Digital Library
and
AmericanSouth.Org could collaboratively build up interoperable Web services,
both digital
libraries would benefit.

Details: There are a number of promising new standards that
will be utilized in this experiment. The
OAI-PMH will be used as the underlying mechanism for distributing configuration
and status
information of virtual digital library systems. Web Services Description Language
(WSDL,
see glossary) expressions will be disseminated via this OAI-PMH mechanism,
representing the
metadata for the digital library services of modular federations. The master
configuration
specifications for this framework will be expressed using the 5SL standard
developed by
Virginia Tech. [Fox, 2002]
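
One possible reading of this design is sketched below: each digital library service is described by a WSDL fragment, the fragment is carried as the metadata payload of an ordinary OAI-PMH record, and a coordinating service harvests those records to build a registry of available services. This is an assumption about how the pieces might fit together, not the framework itself. The identifiers, service name, endpoint, and function names are hypothetical, the WSDL fragment is deliberately incomplete (types, messages, port types, and bindings are omitted), and 5SL is not modeled.

```python
import xml.etree.ElementTree as ET

OAI_NS = "http://www.openarchives.org/OAI/2.0/"
WSDL_NS = "http://schemas.xmlsoap.org/wsdl/"        # WSDL 1.1
SOAP_NS = "http://schemas.xmlsoap.org/wsdl/soap/"   # WSDL 1.1 SOAP binding
ET.register_namespace("oai", OAI_NS)
ET.register_namespace("wsdl", WSDL_NS)
ET.register_namespace("soap", SOAP_NS)

def service_record(identifier, datestamp, service_name, endpoint):
    """Wrap a minimal WSDL service description in an OAI-PMH record element."""
    record = ET.Element(f"{{{OAI_NS}}}record")
    header = ET.SubElement(record, f"{{{OAI_NS}}}header")
    ET.SubElement(header, f"{{{OAI_NS}}}identifier").text = identifier
    ET.SubElement(header, f"{{{OAI_NS}}}datestamp").text = datestamp
    metadata = ET.SubElement(record, f"{{{OAI_NS}}}metadata")
    definitions = ET.SubElement(metadata, f"{{{WSDL_NS}}}definitions",
                                {"name": service_name})
    service = ET.SubElement(definitions, f"{{{WSDL_NS}}}service",
                            {"name": service_name})
    port = ET.SubElement(service, f"{{{WSDL_NS}}}port",
                         {"name": service_name + "Port"})
    ET.SubElement(port, f"{{{SOAP_NS}}}address", {"location": endpoint})
    return record

def registry_from_records(records):
    """Coordinator side: map each advertised service name to its endpoint."""
    registry = {}
    for rec in records:
        service = rec.find(f".//{{{WSDL_NS}}}service")
        address = rec.find(f".//{{{SOAP_NS}}}address")
        if service is not None and address is not None:
            registry[service.get("name")] = address.get("location")
    return registry

# Hypothetical clustering service advertised by one member of a federation.
RECORD = service_record("oai:example.org:svc-1", "2003-07-01",
                        "ClusteringService", "http://example.org/cluster")
RECORD_XML = ET.tostring(RECORD, encoding="unicode")
```

Because the description travels as a normal OAI-PMH record, an existing harvester could collect service descriptions with the same requests it already uses for bibliographic metadata, which is the sense in which the approach stays lightweight.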

Specific experiments: Emory will undertake two experiments:

1) Initial Federated Framework. An initial framework will
be designed and implemented in the
MetaArchive portal as a means of dynamically configuring virtual digital library
federations.
The only services that this initial framework will necessarily provide as modules
are a
federation coordinating service, an interface to a semantic clustering service,
and a combined
OAI/Web search service. The test to experimentally apply to this framework
is whether or not it
effectively enables the rapid and flexible creation of new federated digital
libraries. Through
this work, Emory seeks to develop a simple framework based on OAI that is both
lightweight enough
to be easily added to existing services and an effective means of configuring
recombinant
federations of digital library services.

2) Revised Federated Framework. Emory will design and implement
a revised framework that will attempt
to include targeted connection modules for a selection of other digital library
services, such as the
CLiMB toolkit from Columbia and the NITLE semantic indexing toolkit. The experimental
test here is
simply whether or not a working system can be devised incorporating these other
tools in addition to
the previous tool set. Through this work, Emory will explore the feasibility
of a lightweight
strategy for federating digital library services more generally, in the same
way that the OAI-PMH
enables simple federation of metadata resources. If this can be demonstrated,
it will constitute a
powerful approach for integrating digital library services across institutions.