The First Instance of a CORDRA Registry

Abstract

The Advanced Distributed Learning Registry (ADL-R) is a newly operational registration system for distributed e-learning content in the U.S. military. It is the first instance of a registry-based approach to repository federation resulting from the Content Object Repository Discovery and Registration/Resolution Architecture (CORDRA) project. This article will provide a brief overview of CORDRA and detailed information on ADL-R. A subsequent article in this month's issue of D-Lib will describe FeDCOR, which uses the same approach to federate DSpace repositories.

Introduction

Discovery of and access to distributed, heterogeneous collections of information has long been a challenge across many areas of endeavor. The growth of digital information, high speed computing, and ubiquitous networks has given us the tools to tackle this problem, but a great deal of work remains to be done. This challenge has been taken up by the participants of the CORDRA project. CORDRA is a collaborative activity led by the Advanced Distributed Learning (ADL) [1] initiative of the U.S. Department of Defense, the Corporation for National Research Initiatives (CNRI) [2], and the Learning Systems Architecture Laboratory (LSAL) [3] [Note 1]. The goal of the project is to create a global infrastructure for the federation of content repositories. While the project began in the e-learning space, it immediately encountered the requirement to provide access to any type of digital collection needed in support of distributed learning, which effectively includes all types of content. Groups of repositories will form federations by registering their content in a central registry and those federations will themselves register in higher level Registries of Registries (RofRs) thus forming federations of federations, culminating in a root level Master Registry of Registries. There is no notion of control at that root level, only the start of a path to any of the registered collections in any of the federations. The individual federations will vary in the specifics of metadata standards, access policies, organizational principles, and so on, but CORDRA will define an abstract model, some working code, and a set of standards for federating the federations. ADL-R is the project and operational registry representing the first of these federations.

CORDRA

The CORDRA project was announced by ADL early in 2004 at the first ADL International Plugfest in Zurich. ADL is an important player in the world of distributed learning standards, primarily through its creation and support of SCORM (Sharable Content Object Reference Model) [4], a suite of standards defining an interoperability framework for e-learning content and associated instructional systems. SCORM has been highly successful and has helped move e-learning from an era in which proprietary platforms required proprietary content formats and packaging to one in which it is possible to create content that will work across multiple systems and to create systems that can deliver content from many different sources. Having thus created an environment that enables the re-use of content, or even the real time aggregation of fine-grained distributed content, ADL faced the next logical problem: finding and accessing the distributed content. And unlike the library and publishing world, there was no established tradition of bibliographic control to at least provide a starting point for identifying and describing abstract works, manifestations, copies, and so on.

ADL launched the CORDRA project to address this problem: the discovery of and access to distributed learning content. LSAL, which had made significant contributions to SCORM and other aspects of distributed e-learning systems, was added as a collaborative partner. CNRI was asked to join as a third partner, based largely on our work in network architecture, especially identifier systems.

The basic CORDRA approach is to create federations of repositories by registering, in a central registry, the metadata for each content object from a set of independent repositories. That set of repositories and the associated registry are then said to comprise a federation. The model assumes multiple federations and any given federation is assumed to represent a community of practice. Such a community could have its own set of metadata standards, access policies, collection policies, and so on, and the metadata registry that is the focal point for the federation should reflect those specific practices. Common practices and relatively homogeneous content allow for consistent and detailed description of content objects, which in turn allows for optimal searching and organization of the resultant metadata. That is, the best discovery and access information is naturally available across collections of metadata that are both detailed and internally consistent, and that situation is most likely to pertain in a set of relatively homogeneous content collections within a given community. In the case of ADL-R, for example, the focus is e-learning material within the military. While there will be a great deal of diversity within that collection, much of the metadata required to usefully federate a set of repositories containing that content, e.g., SCORM sequencing data or military classification, simply will not be useful or even make sense in many other environments. This initial stage of repository federation is illustrated in Figure 1.

Figure 1. Initial Stage of Repository Federation

How does one then federate the federations and so provide search and retrieval services across a disparate set of collections? The CORDRA model calls for another registration process in which the first level registries, those collecting data directly from content repositories, provide data to a Registry of Registries (RofR). The initial formulation assumed a single RofR to which all federations would contribute. Subsequent analysis and initial experimentation have shown the need for intermediate level RofRs, culminating in a Master RofR, which would serve as the CORDRA root. This is the approach illustrated in Figure 2.

Figure 2. Community Federation

Many technical and organizational issues remain to be worked out and are still the subject of research and prototyping, including the details of federation level metadata. Precisely what data gets pushed or pulled up from one level of registry to the next? The answer to that question will determine the methods that could be used to provide services across federations. The goal, however, is clear. Starting from the Master Registry of Registries, an application should be able to discover and navigate to any individually identified content item anywhere within the complete set of repositories federated according to the CORDRA model.
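The navigation goal can be illustrated with a minimal Python sketch. It assumes the simplest possible model, in which each Registry of Registries knows only its children and a lookup walks the whole tree; the class names, the handle strings, and the exhaustive walk are all hypothetical, since the data actually exchanged between registry levels is still an open question.

```python
from dataclasses import dataclass, field


@dataclass
class Registry:
    """A first level registry: maps content object handles to metadata."""
    name: str
    entries: dict = field(default_factory=dict)  # handle -> metadata


@dataclass
class RegistryOfRegistries:
    """An RofR: knows only its child registries or child RofRs."""
    name: str
    children: list = field(default_factory=list)


def find_path(node, handle, path=()):
    """Depth-first walk from a root RofR to the registry holding `handle`.

    Returns the tuple of node names along the path, or None if not found.
    """
    path = path + (node.name,)
    if isinstance(node, Registry):
        return path if handle in node.entries else None
    for child in node.children:
        found = find_path(child, handle, path)
        if found is not None:
            return found
    return None


# Hypothetical federation layout and handles, for illustration only.
adl_r = Registry("ADL-R", {"example.prefix/course-101": {"title": "Sample course"}})
other = Registry("Other registry", {"example.other/report-7": {"title": "Sample report"}})
elearning = RegistryOfRegistries("e-learning RofR", [adl_r])
master = RegistryOfRegistries("Master RofR", [elearning, other])

print(find_path(master, "example.prefix/course-101"))
```

A production design would presumably push summary data upward so that the root can route a request without visiting every registry; the sketch only shows the path from the root to one registered item.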

The rest of this article is devoted to ADL-R, which is the first CORDRA federation. It has been operational for several months in an advanced beta period, and has just recently become fully operational. Other CORDRA federations, based on the software developed in the ADL-R project, are currently being developed both inside and outside of the e-learning community.

ADL-R

ADL-R is the first publicly available CORDRA registry and was developed as a partnership between ADL, CNRI and LSAL, with CNRI responsible for much of its development and implementation. The project went live in December 2005. ADL-R provides a registry of learning content for the U.S. Department of Defense. It enables the military community to register SCORM content objects and encourages their discovery and reuse by other members of the community and, in some cases, the general public. The Defense Technical Information Center (DTIC) [5] will be the future home of ADL-R and will be the DoD agency responsible for maintaining and running the registry and associated services.

The project allows the various military service components and their associated contractors to submit metadata instances about content objects that they author or acquire. The submission of a metadata instance constitutes the registration of the content object being described by the metadata. Multiple metadata instances may be submitted by multiple parties for the same uniquely identified content object. These metadata instances are expressed as LOM [6] (Learning Object Metadata) encoded as XML and are submitted manually or automatically to the registry, which relates them to other metadata instances for the same content object through the identifier for that content object. ADL-R uses the Handle System [7] for identification. ADL-R accepts the registration of content objects already identified by handles, e.g., the entire DOI [8] world, in which case those handles are managed externally to the ADL-R project. It also allows, however, for the registration of content objects not currently identified by handles. In that case, the submitting organizations are assigned handle prefixes and the resultant handles are administered by the registry, thus providing a handle service to those organizations, eliminating the need for them to install and maintain handle servers, and so keeping their environments unchanged.

As a CORDRA registry designed from the ground up for a particular community while respecting the general CORDRA guiding principles, ADL-R provides the necessary interoperability and flexibility to fit smoothly into current organizations without the requirement of major content parsing or re-structuring.

The goal of the registry is to allow community members to make their content metadata instances available and searchable. These metadata instances allow the discovery of particular content object handles that can then be resolved to provide access, where allowed, to the full registered content. The content, and the access conditions for that content, remain with the submitting organizations within whatever repository or other access methods they employ. Updates on both the metadata instances and the content objects remain a responsibility of the registrants, but the registry provides well-defined interfaces and tools for the users or automatic agents to keep this information current.

Conceptual design of a flexible, robust and secure registry implementation

In the context of the CORDRA community, the registry is intended to serve as a primary node for registration and indexing of metadata related to one of three possible components of a CORDRA federation:

A set of described content objects stored in a group of repositories

A set of repositories that are part of a particular community

A set of registries forming a global multi-registry federation

Each level of the federation may use the same code base to deal with registration and queries on different types of metadata. This translates into the submission, validation, and subsequent indexing of the different metadata instances being configurable rather than hard-coded.
At the same time, the different CORDRA communities that use an instance of the registry for their own implementation will most likely want to implement and enforce their own authentication mechanisms, as well as their particular software packages, to store and index the contents of their submissions.

This approach calls for a registry that implements a series of well-defined APIs and protocols to communicate with its particular operational components and allow their exchange or extension in the framework of such protocols. Our answer to this requirement is a configurable piece of software that is capable of implementing a series of operations and streamlining them by means of protocols and APIs that use modular software components to implement the various registry operations.

The registry is also capable of integrating global content and uniquely identifying it across multiple repositories, registries, and even federations. Such universally unique identifiers allow delegated or strict management. Finally, the registry is intended to group itself with other registries to provide federation level cataloging services.

Registry entities and model

Our architecture identifies two types of first level entities:

A described object: a content object, a repository, or a registry about which metadata might be generated.

A metadata instance that represents a metadata statement about a particular object.

Each of these entities is uniquely and globally identified, and that identification constitutes essential data in the system.

We also identify the following second level entities:

A transaction request instantiated in the form of an XML document that conveys the information necessary to generate a series of basic registry operations in the registry. These include:

Registration

Activation

Deactivation

Mirroring

Withdrawal

Each valid transaction request is stored permanently in the registry and is given a transaction identifier local to each particular registry instance. This data is only valid in the context of a particular registry and is only made available in that context.

A transaction log, used primarily to track the specific operations performed inside the registry, for both archival and temporary feedback purposes. This log records the results of the atomic operations performed inside the registry according to a predefined schema.

A query request expressed in the form of an XML data stream used to generate query operations inside the registry. This entity is essentially ephemeral, and it could be stored or not stored in the system according to the preferences of the individual registry manager.

A query response expressed as an XML document conformant to a predefined schema. Once again, this entity is essentially ephemeral and could be retained or not retained in the system.

A set of configuration files expressed according to well-defined schemas that are used to allow the registry to customize its generic parsing, validation, and indexing operations to accommodate the particular needs of each community. Such configurations are globally unique and securely administered with handles.
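The transaction request and transaction log entities listed above can be sketched in a few lines of Python. The XML element and attribute names, the log record layout, and the class name are assumptions made for illustration; they are not taken from the actual registry schemas.

```python
import itertools
import xml.etree.ElementTree as ET

# The five basic registry operations named in the text.
OPERATIONS = {"registration", "activation", "deactivation", "mirroring", "withdrawal"}


class TransactionStore:
    """Keeps every valid request and a log of what happened to it."""

    def __init__(self):
        self._ids = itertools.count(1)  # identifiers are local to this registry
        self.requests = {}              # transaction id -> original XML text
        self.log = []                   # one record per attempted operation

    def accept(self, xml_text):
        root = ET.fromstring(xml_text)
        operation = root.get("operation")  # hypothetical attribute name
        if operation not in OPERATIONS:
            self.log.append({"transaction": None, "operation": operation,
                             "result": "rejected"})
            raise ValueError(f"unsupported operation: {operation!r}")
        transaction_id = next(self._ids)
        self.requests[transaction_id] = xml_text
        self.log.append({"transaction": transaction_id, "operation": operation,
                         "result": "stored"})
        return transaction_id


store = TransactionStore()
request = ('<transaction operation="registration">'
           '<object handle="example.prefix/course-101"/></transaction>')
print(store.accept(request))
```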

Handles are used to identify most of the files in the registry, including the schemas, and are used inside of XML documents by means of a handle proxy server.

A third set of data, not directly exposed, is used to perform multiple registry and CORDRA-related operations. This category may evolve over time, as federation evolves, but is currently represented primarily by the Indexing Data, which includes index catalogs and supporting data created and stored inside every registry. This data is an extended subset of the first level entities and is primarily used to perform advanced query and federation operations.

The registry makes only the first level entities described above accessible to the general CORDRA infrastructure and to external users.

In order to maintain data independence and flexibility, the registry provides three layers of isolation for its configuration. This differentiation enables CORDRA aggregation and distributed querying, while maintaining the local independence and flexibility of each CORDRA community.

Figure 3: Configuration Isolation Layers.

As shown in Figure 3:

The authentication layer deals with the community level authentication and authorization mechanisms.

The federation level implements CORDRA community or federation operations and deals with a predefined set of data for the particular community or federation. This data is required at a Registry level to guarantee that a minimum set of operations can be implemented.

A local metadata implementation deals with the particular needs of the community and relates to their particular metadata set or sets, allowing for more extensive indexing and more specific queries to be performed based on data relevant to the particular community.

We envision the top two layers to have strict validation, while the bottom layer might accommodate more lax implementations.

Based on this architecture, any submission will be subjected to a strict validation of the top two layers, an evaluation that will most likely be synchronous or pseudo synchronous, and an asynchronous validation and parsing of the bottom data layer.
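The difference between strict and lax layers can be shown with a short Python sketch: the authentication and federation checks fail immediately, while the local metadata is queued and reviewed later, producing warnings rather than rejections. The field names and the single "title" check are hypothetical placeholders for community rules.

```python
from collections import deque

# Hypothetical minimum data set required at the federation level.
REQUIRED_FEDERATION_FIELDS = ("object_handle", "registry_handle")


def accept(submission, known_users, pending):
    """Strict, synchronous checks on the top two layers."""
    if submission.get("user") not in known_users:
        raise PermissionError("authentication layer rejected the submission")
    missing = [name for name in REQUIRED_FEDERATION_FIELDS if not submission.get(name)]
    if missing:
        raise ValueError(f"federation layer is missing: {missing}")
    pending.append(submission)  # the local layer is handled later


def review_local_metadata(pending):
    """Lax, deferred review of the bottom layer: warn, do not reject."""
    reports = []
    while pending:
        submission = pending.popleft()
        local = submission.get("local_metadata", {})
        warnings = [] if local.get("title") else ["local metadata has no title"]
        reports.append((submission["object_handle"], warnings))
    return reports


pending = deque()
accept({"user": "alice", "object_handle": "example.prefix/course-101",
        "registry_handle": "example.registry/adl-r", "local_metadata": {}},
       known_users={"alice"}, pending=pending)
print(review_local_metadata(pending))
```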

Content Object and Metadata Instances

Internally, the ADL-R acknowledges the existence and independence of content objects outside its control as well as the multiple metadata assertions about them. ADL-R uses an encapsulated internal digital object representation and storage in order to represent this structure and allow multiple unrelated parties to make assertions about a particular object previously registered by its owner or responsible party.

In this internal representation, a globally known object identified by a handle is called a Content Object (CO) and its handle is referred to as a Content Object Identifier. The internal abstraction of this object inside the ADL-R is called CORE (Content Object Representation Entity) and is associated with a CORE identifier which is a direct calculation based on the CO's handle, in order to expedite the ID resolution.

It is important to note that the CO Identifiers can be part of any handle prefix [10] while the CORE IDs are exclusively contained and administered by the Registry and fall under the specific Registry prefix. All content object abstractions and their respective metadata assertions or instances inside a particular registry share the same registry prefix.

Metadata assertions, or COMIs (Content Object Metadata Instances), are stored inside COREs. Each metadata instance is exclusive to the specific registry in which an assertion has been made. The process of making a metadata assertion by submitting a metadata instance about a particular content object is called metadata submission. When the first metadata instance is submitted to a specific registry for a specific content object, a content object registration is said to have taken place. The content object registration is nothing but the creation of the internal CORE within the Registry.
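The relationship between content objects, COREs, and COMIs can be sketched as follows. The article does not specify how a CORE identifier is calculated from the content object handle, so the SHA-1 digest used here is purely an assumption, as are the class and method names and the example handles.

```python
import hashlib


class Registry:
    """Holds one CORE per content object, each containing its COMIs."""

    def __init__(self, prefix):
        self.prefix = prefix  # all CORE identifiers fall under this prefix
        self.cores = {}       # CORE id -> list of metadata instances (COMIs)

    def core_id(self, co_handle):
        # Assumed derivation: a deterministic digest of the CO handle.
        digest = hashlib.sha1(co_handle.encode("utf-8")).hexdigest()
        return f"{self.prefix}/{digest}"

    def submit_metadata(self, co_handle, submitter, metadata):
        """Store a COMI; the first one for a content object creates its CORE."""
        cid = self.core_id(co_handle)
        registered_now = cid not in self.cores
        self.cores.setdefault(cid, []).append(
            {"submitter": submitter, "metadata": metadata})
        return cid, registered_now


registry = Registry("example.registry")
print(registry.submit_metadata("10.9999/abc", "owner", {"title": "Sample"}))
print(registry.submit_metadata("10.9999/abc", "reviewer", {"rating": "useful"}))
```

The first call reports a content object registration; the second adds an assertion from an unrelated party to the same CORE without creating a new one.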

Figure 4 shows the internal representation of a content object and its associated metadata instances.

Registry Metadata Instance layers

Registry submissions consist of metadata assertions about particular content objects. These metadata instances are a combination of local community metadata and global federation or CORDRA level metadata. Local communities submitting to a particular registry will have a defined metadata format and metadata fields that they agree upon. This metadata set is meaningful for their particular uses and procedures. A smaller portion of this metadata is intended to be completely indexed by the registry in order to enable discovery of the described object, but the complete set is always stored in full. On top of this metadata, some additional metadata is needed by the registry to perform registry operations. Additionally, some information either extracted from or added to the local metadata will be required by the overall CORDRA community. Therefore, each registry submission incorporates both community metadata and registry level metadata. This is illustrated in Figure 5 below.

Figure 5: Metadata Instances

The registry must distinguish these metadata layers, therefore the main XML submission has two main components following two different XML schemas:

The Registry Submission Schema, specifically known in ADL as the ADL-Reg-T Submission

The Community Metadata Schema, which implements the LOM approach in ADL-R

Additional characteristics of each metadata layer are captured by means of business logic modules at both the Registry/CORDRA level and the Local/Community level.
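A submission carrying both layers might be separated as in the following Python sketch. The element names are invented for illustration and the community part is a heavily simplified, un-namespaced stand-in for a LOM record; neither reflects the actual ADL-R schemas.

```python
import xml.etree.ElementTree as ET

# Hypothetical submission: one registry level part, one community level part.
SUBMISSION = """
<submission>
  <registryMetadata>
    <objectHandle>example.prefix/course-101</objectHandle>
    <operation>registration</operation>
  </registryMetadata>
  <communityMetadata>
    <lom><general><title>Sample course</title></general></lom>
  </communityMetadata>
</submission>
"""


def split_layers(xml_text):
    """Return (registry fields as a dict, community metadata as XML text)."""
    root = ET.fromstring(xml_text)
    registry = root.find("registryMetadata")
    community = root.find("communityMetadata")
    if registry is None or community is None or len(community) == 0:
        raise ValueError("submission must carry both metadata layers")
    registry_fields = {child.tag: (child.text or "").strip() for child in registry}
    # The community record is kept whole so that it can be stored in full.
    community_xml = ET.tostring(community[0], encoding="unicode").strip()
    return registry_fields, community_xml


fields, lom_xml = split_layers(SUBMISSION)
print(fields)
print(lom_xml)
```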

ADL-R authentication and authorization

ADL-R includes a basic set of authorization and authentication tools that can be replicated in other CORDRA registries and federations. Basic authentication is expressed in terms of the Handle System with users and groups identified by handles. Repositories and content objects as well as registries are identified by handles, and the relationship and rights of particular groups and the users that are part of those groups are registered in handle values that correspond to special registry types [11].

These values express different rights of groups in relation to particular registries, content objects, and repositories, and also reflect the rights of each registry over certain sets of handles for those cases in which the registry also provides handle registration and administration services for content registrants.

CORDRA registries will respect the individual authentication rules and implementations of each community. Thus, the registry allows for community-specific authentication components that plug in to the authorization tools and use generic registry administration utility libraries (RAUL). In the case of ADL-R, an LDAP authentication mechanism is plugged in to the registry and allows authenticated control to both the registry and its administration tools.
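The separation between a pluggable authentication component and handle-based authorization can be sketched as below. The in-memory authenticator is a stand-in for an LDAP back end and makes no LDAP calls; the handle strings, the right names, and the way groups and rights are stored are assumptions for illustration.

```python
from typing import Protocol


class Authenticator(Protocol):
    """Any community-specific authentication component."""

    def authenticate(self, user_id: str, password: str) -> bool: ...


class InMemoryAuthenticator:
    """Stub back end; a real deployment would plug in LDAP or similar."""

    def __init__(self, credentials):
        self._credentials = dict(credentials)

    def authenticate(self, user_id, password):
        return self._credentials.get(user_id) == password


class AccessControl:
    """Authorization expressed as group rights over handle prefixes."""

    def __init__(self, authenticator, user_handles, group_members, group_rights):
        self.authenticator = authenticator
        self.user_handles = user_handles    # login id -> user handle
        self.group_members = group_members  # group handle -> set of user handles
        self.group_rights = group_rights    # group handle -> set of (right, prefix)

    def is_allowed(self, user_id, password, right, prefix):
        if not self.authenticator.authenticate(user_id, password):
            return False
        user_handle = self.user_handles.get(user_id)
        return any(
            user_handle in members
            and (right, prefix) in self.group_rights.get(group, set())
            for group, members in self.group_members.items()
        )


acl = AccessControl(
    InMemoryAuthenticator({"alice": "secret"}),
    user_handles={"alice": "example.users/alice"},
    group_members={"example.groups/authors": {"example.users/alice"}},
    group_rights={"example.groups/authors": {("register", "example.prefix")}},
)
print(acl.is_allowed("alice", "secret", "register", "example.prefix"))
```

Swapping the back end means passing a different object with the same `authenticate` method; the authorization logic is untouched.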

ADL-R functionality and components

The ADL-R architecture ensures modularity and scalability by dividing its operations among several interoperable modules that perform very specific tasks and present well-defined APIs, and communicate in standard fashion using XML schema enforced messages.

Figure 6: The ADL-R Internal Architecture

As shown in Figure 6, the ADL-R System has the following major components:

1. Main Registry Module

This module contains the main Registry Engine and an implementation of the web interface to the system. The web interface is provided by means of the HTTP module labeled CORDRAWEB, which deals with the reception of all the requests and the generation of HTTP-based responses. The submissions and status requests come in through an HTTP POST interface. The adlregistry-transaction-status schema is used for its responses.

The Registry Engine module is responsible for implementing, enforcing and keeping track of all operations inside the registry and is, in fact, the central registry coordination module. This module is also responsible for all authentication and authorization tasks for the system. The Registry Engine main library is called RegistryLib.

2. Validator Module

The Validator Module relies on a series of external libraries, tools, and software packages to implement some of its basic operation validation tasks. This module takes into consideration "well-formedness" and business rule adherence for all operations. It is composed of a basic XML Validation module that enforces adherence of the transactions to the adl-reg-T schema as well as the local community XML schema, which for ADL-R is the ADL LOM metadata schema.

Once the validator has determined that the transaction is well formed, the submitted data is validated against a Registry Business rules validator and a Community Business Rules validator.

The Registry Business rules validator enforces registry level rules expressed in the participation rules configuration file. These business rules are intended to guarantee that the Registry is capable of performing its own operations and participate in a larger CORDRA community.

The local Business Rules validator enforces the local community rules that were not expressed in the schema. These rules are purely associated with a particular community and its practices.
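The validation order described above (well-formedness first, then registry rules, then community rules) can be sketched in Python. Schema validation is omitted because the standard library does not provide it, and the two rule functions and all element names are hypothetical examples of business rules.

```python
import xml.etree.ElementTree as ET


def registry_rules(root):
    """Hypothetical registry level rule: an object handle must be present."""
    errors = []
    if not root.findtext("registryMetadata/objectHandle"):
        errors.append("registry rule: objectHandle is required")
    return errors


def community_rules(root):
    """Hypothetical community level rule: the record must carry a title."""
    errors = []
    if not root.findtext("communityMetadata/lom/general/title"):
        errors.append("community rule: LOM title is required")
    return errors


def validate(xml_text, rule_sets=(registry_rules, community_rules)):
    """Return a list of errors; an empty list means the submission passed."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        # Business rules are not evaluated on a document that cannot be parsed.
        return [f"not well-formed: {exc}"]
    errors = []
    for rules in rule_sets:
        errors.extend(rules(root))
    return errors


print(validate("<submission><registryMetadata>"
               "<objectHandle>example.prefix/course-101</objectHandle>"
               "</registryMetadata></submission>"))
```

Because the rule sets are passed in as plain functions, a different community can supply its own rules without changing the validator itself.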

3. Registry Repository

The Registry is designed to interact with a RAP-based digital object repository and implements a set of storage and management rules based on the notion that a digital object represents a content object and that each digital object container is to contain zero or more metadata instances describing its respective content object. Each digital object, and the metadata instances it contains, are uniquely identified and locatable using the same handles as those provided in the transaction request.

4. Indexer Module

The flexible indexing module is responsible for the contextual and full text indexing of information relevant to the particular CORDRA community and the global CORDRA infrastructure as a whole.

The indexing is performed according to business rules which are specific to a community and a registry implementation. These rules are read from an object in the registry and all of this is independent of the specific implementation of the search engine, as long as it is capable of reading its key mapping from a schema rules configuration file that lists each key's XML path for extraction. The communication is implemented using HTTP, and the operations are performed following the basic operations for a Lucene search engine [13].
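The idea of a key mapping that lists an XML path for each index key can be illustrated with a small Python sketch. The mapping, the simplified metadata record, and the toy inverted index are all assumptions; the index class merely stands in for a search engine and does not use or imitate the Lucene API.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical key mapping: index key -> path inside the metadata record.
KEY_PATHS = {
    "title": "general/title",
    "keyword": "general/keyword",
}


def extract_keys(lom_xml, key_paths=KEY_PATHS):
    """Pull the configured keys out of one metadata instance."""
    root = ET.fromstring(lom_xml)
    fields = {}
    for key, path in key_paths.items():
        values = [el.text.strip() for el in root.findall(path)
                  if el.text and el.text.strip()]
        if values:
            fields[key] = values
    return fields


class TinyIndex:
    """Toy inverted index keyed by (field, lower-cased token)."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, handle, fields):
        for key, values in fields.items():
            for value in values:
                for token in value.lower().split():
                    self._postings[(key, token)].add(handle)

    def search(self, key, token):
        return sorted(self._postings.get((key, token.lower()), set()))


record = ("<lom><general><title>Map Reading Basics</title>"
          "<keyword>navigation</keyword><keyword>terrain</keyword></general></lom>")
index = TinyIndex()
index.add("example.prefix/course-101", extract_keys(record))
print(index.search("keyword", "terrain"))
```

Changing what gets indexed is then a matter of editing the mapping rather than the extraction code, which is the property the configurable indexer relies on.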

The Indexer module is fitted with an advanced index interface that has pre- and post- processing business rules modules that expose the indexes directly to registry of registries and advanced service interfaces.

5. Handle Service

The registry architecture includes a handle service responsible for the administration of internal handles as well as delegated content object handles for some of the registered content objects and their metadata instances. ADL-R expects to be able to at least resolve content object and instance metadata handles and, depending on the specifics of ownership, administer them. ADL-R is also dependent on the Handle System for identifying and locating the registries with which it is directed to interact.

The handle service is crucial for the correct operation of ADL-R; it is used to configure the registry, locate and connect to the appropriate registry, and identify content objects and metadata instances, and it is also used for authenticating and authorizing users.

System workflow

The registration process is illustrated in Figure 7.

Figure 7: System Workflow Diagram

The registration process is as follows:

A transaction request is generated either by means of a web-based form provided by the registry or by a software client that understands the submission format.

The request is received by the main registry module (Fig. 7, module 1), which extracts the relevant data and sends it to the Registry Engine. The Registry Engine then resolves the user and password as well as the group information. It does that by using the handle library module. Any errors are registered in the status log.

The Registry Engine then starts a validation process. The validator (Fig. 7, module 2) performs a quick scan of the submission and, depending on its size, decides whether to validate it synchronously or asynchronously. A validation ID is returned if it is determined that an asynchronous procedure is needed. The validator then starts validating the XML file and parsing the batch XML document using the given business rules. Upon successful validation, a transaction ID is generated and returned to the httpd interface along with the information available in the status log about this transaction. The httpd module then produces a response by querying the status log. This marks the end of the validation sequence.

The Registry Engine then proceeds to use the XML parsing module to extract each metadata instance. It extracts the particular content identifier expressed as a handle.

The Registry Engine checks for the existence of the content object and instance metadata handles to determine whether they already exist.

The Indexer (Fig. 7, module 4) is then used to index the metadata instance according to the respective indexing rules.

As the last step, the handles for the different components are created or updated (Fig. 7, module 5).

Once this operation has finished, the Registry Engine updates the status log and sends out an email if the batch processing has finished.
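The choice between synchronous and asynchronous validation in the workflow above can be sketched as follows. The size threshold, the identifier formats, and the class and method names are hypothetical, and the deferred queue is drained by an explicit call rather than a background worker to keep the example simple.

```python
import itertools


class RegistryEngine:
    """Decides by size whether a submission is validated now or later."""

    def __init__(self, validate, sync_limit_bytes=64 * 1024):
        self.validate = validate                  # callable returning a list of errors
        self.sync_limit_bytes = sync_limit_bytes  # assumed threshold
        self.status_log = []
        self.deferred = []
        self._validation_ids = itertools.count(1)
        self._transaction_ids = itertools.count(1)

    def receive(self, xml_text):
        if len(xml_text.encode("utf-8")) > self.sync_limit_bytes:
            # Large batch: hand back a validation ID and validate later.
            validation_id = f"val-{next(self._validation_ids)}"
            self.deferred.append((validation_id, xml_text))
            self.status_log.append((validation_id, "queued"))
            return validation_id
        return self._validate_now(xml_text)

    def _validate_now(self, xml_text):
        errors = self.validate(xml_text)
        if errors:
            self.status_log.append((None, f"rejected: {errors}"))
            return None
        transaction_id = f"tx-{next(self._transaction_ids)}"
        self.status_log.append((transaction_id, "validated"))
        return transaction_id

    def run_deferred(self):
        """Validate queued submissions; maps validation ID -> transaction ID."""
        results = {}
        while self.deferred:
            validation_id, xml_text = self.deferred.pop(0)
            results[validation_id] = self._validate_now(xml_text)
        return results


engine = RegistryEngine(validate=lambda text: [], sync_limit_bytes=20)
print(engine.receive("<small/>"))
print(engine.receive("<batch>" + "<item/>" * 10 + "</batch>"))
print(engine.run_deferred())
```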

Conclusion and Future Work

The ADL-R has implemented a successful first instance of the CORDRA model that is not only flexible and scalable but modular enough to be reused in multiple new CORDRA communities and scenarios. An example of this potential is the FeDCOR project [14] that uses the same code base to form a DSpace institutional repository community.

Our current focus is on the formalization of the current Registry APIs and modules to allow for even more flexible registry implementations. It is the intention of the CORDRA steering committee to release the final source code for the registry as an open source project.

Extensive work is underway to build the first registry of registries instance and consolidate the first CORDRA aggregation community. This work should yield some interesting results in the near future.

The authors would like to acknowledge the Advanced Distributed Learning Co-Labs who funded most of this work, as well as the collaboration of the Learning Systems Architecture Laboratory, Concurrent Technologies Corporation, Defense Technical Information Center, and all of the members of the ADL-R pilot group who helped in the design and development of the ADL-R.

Note

[1] The three CORDRA project principals at these organizations are Philip Dodds (ADL), Dan Rehak (LSAL), and Laurence Lannom (CNRI). These three also constitute the ad hoc Steering Committee for the project, which will likely move to a more permanent organizational format over time.

Appendix I - Glossary

Note: the following definitions apply to the ADL-R project and are proposed, but not settled, terminology for CORDRA.

ADL-R: Advanced Distributed Learning Registry. The first registry created by ADL to register SCORM-based learning modules for the U.S. Department of Defense (DoD).

API: Application Programming Interface. A collection of well-defined libraries, programs, and routines that provide generic access to the functionality of a particular programming module.

Authentication: Process of identity validation. Inside the ADL-R this is the process by which the direct relationship between a particular LDAP ID and a user identification handle is established and tested.

Authorization: Process of rights assessment for a particular user associated with a respective group. In ADL-R all rights are expressed as a function of the group and its rights over a particular prefix. Typical CORDRA level rights are

Community: Short for a CORDRA Community, which may be any of the following: a Local Metadata Community associated with a particular Registry; a Registry Community associated with a particular Registry of Registries; or the CORDRA Community in general as reflected in the Master Registry of Registries.

Content Object: Resource about which metadata assertions are made in the context of CORDRA; it could be a SCORM module, a technical report, a book, or any structured form of data about which a metadata assertion can be made.

CORDRA: Content Object Repository Discovery and Registration/Resolution Architecture. Architecture for the discovery and registration of content objects stored in multiple repositories across different local communities.

CORDRA Community: The aggregated global community of CORDRA compatible registries that may integrate and share their information into a Master Registry implementation.

CORDRA Federation: The successful aggregation of multiple registries and registries of registries to provide a set of discovery, registration and resolution services.