Primer

Introduction

Contents

This document aims to provide an overview of the tasks to be carried out by content providers who wish to contribute their organisation’s content to CARARE and to the Europeana service.

The topics covered in this document may change as technologies develop and the CARARE service evolves.

Metadata

Metadata is arguably one of the most important parts of a repository. A very large number of metadata schemas exist, created for different purposes and covering a diverse set of needs. A repository system usually represents a metadata schema internally in a relational database and provides different representations to its users (e.g. HTML, XML).

Not all repository systems can represent all metadata schemas. The reason is twofold: a) limitations of the repository's internal representation and b) the complexities and particularities of the metadata schema itself.

There are two main types of schema representation in repositories: a) those that follow a MARC-based logic and b) those that follow an XML-based logic. A modern and flexible repository should follow the latter approach. Even then, one should be careful to select a repository that is flexible enough to define a metadata schema more complex than a simple, flat, Dublin Core based one. A flat metadata schema consists of single-level elements only, each of which may carry attributes (e.g. a language qualifier). Simple Dublin Core is a characteristic example. A more complex metadata schema can have nested elements with many attributes per element. A characteristic example is the MODS metadata schema (http://www.loc.gov/standards/mods/).
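The difference between a flat and a nested schema can be illustrated with a short Python sketch. It is only an illustration: the record contents and the helper names (build_flat_record, build_nested_record, depth) are assumptions made for this example, and the nested record uses just a handful of MODS elements rather than a complete, validated MODS description.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
MODS_NS = "http://www.loc.gov/mods/v3"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def build_flat_record() -> ET.Element:
    """Flat record: every element sits directly under the root."""
    record = ET.Element("record")  # hypothetical wrapper element
    title = ET.SubElement(record, f"{{{DC_NS}}}title")
    title.set(XML_LANG, "en")  # attribute acting as a language qualifier
    title.text = "Roman amphitheatre, site survey"
    creator = ET.SubElement(record, f"{{{DC_NS}}}creator")
    creator.text = "Example Heritage Agency"
    return record


def build_nested_record() -> ET.Element:
    """Nested record: elements contain further elements, MODS style."""
    mods = ET.Element(f"{{{MODS_NS}}}mods")
    title_info = ET.SubElement(mods, f"{{{MODS_NS}}}titleInfo", {"lang": "eng"})
    title = ET.SubElement(title_info, f"{{{MODS_NS}}}title")
    title.text = "Roman amphitheatre, site survey"
    name = ET.SubElement(mods, f"{{{MODS_NS}}}name", {"type": "corporate"})
    name_part = ET.SubElement(name, f"{{{MODS_NS}}}namePart")
    name_part.text = "Example Heritage Agency"
    role = ET.SubElement(name, f"{{{MODS_NS}}}role")
    role_term = ET.SubElement(role, f"{{{MODS_NS}}}roleTerm", {"type": "text"})
    role_term.text = "creator"
    return mods


def depth(element: ET.Element) -> int:
    """Number of element levels, counting the root as level 1."""
    return 1 + max((depth(child) for child in element), default=0)


if __name__ == "__main__":
    for record in (build_flat_record(), build_nested_record()):
        print(depth(record), ET.tostring(record, encoding="unicode"))
```

A repository whose internal model only stores name and value pairs can hold the first record but has to flatten, and so lose structure from, the second.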

How a repository represents a metadata schema internally does not really matter, but it is important that it can represent complex metadata schemas. For reasons of interoperability, it is also very important that the repository can provide an XML representation of its metadata.

Another factor to take into account is RDF (Resource Description Framework), a powerful semantic representation mechanism that is very common in modern repositories. A repository should be able to handle RDF information.
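RDF describes resources as statements of the form subject, predicate, object. The following Python sketch shows the idea with plain tuples and writes the statements in the N-Triples syntax. The Triple class, the to_ntriples function and the example.org identifiers are assumptions made for this illustration; a real repository would normally rely on a dedicated RDF library or triple store.

```python
from typing import Iterable, NamedTuple


class Triple(NamedTuple):
    subject: str                  # IRI of the resource being described
    predicate: str                # IRI of the property
    obj: str                      # IRI or literal value
    obj_is_literal: bool = False  # True when obj is a plain text value


def to_ntriples(triples: Iterable[Triple]) -> str:
    """Serialise triples as N-Triples, one statement per line.

    Simplified: only backslashes, double quotes and newlines are escaped,
    and literals carry no language tag or datatype.
    """
    lines = []
    for triple in triples:
        if triple.obj_is_literal:
            escaped = (
                triple.obj.replace("\\", "\\\\")
                .replace('"', '\\"')
                .replace("\n", "\\n")
            )
            obj = f'"{escaped}"'
        else:
            obj = f"<{triple.obj}>"
        lines.append(f"<{triple.subject}> <{triple.predicate}> {obj} .")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical digital object and collection identifiers.
    statements = [
        Triple(
            "http://example.org/object/42",
            "http://purl.org/dc/elements/1.1/title",
            "Roman amphitheatre, site survey",
            obj_is_literal=True,
        ),
        Triple(
            "http://example.org/object/42",
            "http://purl.org/dc/terms/isPartOf",
            "http://example.org/collection/7",
        ),
    ]
    print(to_ntriples(statements))
```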

Technical infrastructure

Web based vs desktop client

As far as the technical infrastructure is concerned, there are many open source and proprietary systems that can be used. It is important that the repository provides a web-based management and administration interface (although that is not always the case): once the repository is installed centrally on a server, users and curators can access it through a web browser. This approach greatly reduces maintenance and support costs. The alternative is desktop client software that communicates with the server over a network connection. Such a client usually has specific operating system requirements, and distributing software updates is difficult (these problems do not exist in the web-based approach).

Unique identifiers

Information inside a repository is organized in small information packages usually called digital objects. It is important for a repository to ensure that every object is uniquely addressed by an identifier. Usually these identifiers are integers (integers are preferred to ensure maximum compatibility across repositories).

Simple vs Complex objects

Information inside a repository is organized in small information packages. In the simplest case (simple objects), a digital object consists of some metadata (following a metadata schema) and a binary file (e.g. a PDF document or an image). Although this is sufficient in many cases, in real-world settings (e.g. an institutional repository) digital objects need to comprise more elements (usually called datastreams). These elements could be: several coexisting metadata records in different schemas, images, a PDF document and a license the user signed when depositing the digital object. It is prudent to plan ahead and select a repository that covers future needs, even if they seem unnecessary at present, because migrating from one repository to another is costly and may involve information loss.
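The distinction between simple and complex objects can be sketched as a small data structure. The class names, the field names and the three roles ("metadata", "content", "license") are assumptions made for this illustration and do not correspond to the data model of any particular repository system.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Datastream:
    stream_id: str  # unique within one digital object, e.g. "DC" or "PDF"
    role: str       # hypothetical roles: "metadata", "content", "license"
    mime_type: str
    content: bytes


@dataclass
class DigitalObject:
    object_id: int  # unique identifier of the object in the repository
    datastreams: Dict[str, Datastream] = field(default_factory=dict)

    def add(self, stream: Datastream) -> None:
        """Attach a datastream, refusing duplicate stream identifiers."""
        if stream.stream_id in self.datastreams:
            raise ValueError(f"duplicate datastream id: {stream.stream_id}")
        self.datastreams[stream.stream_id] = stream

    def by_role(self, role: str) -> List[Datastream]:
        return [s for s in self.datastreams.values() if s.role == role]

    def is_simple(self) -> bool:
        """A simple object is exactly one metadata record plus one file."""
        return (
            len(self.datastreams) == 2
            and len(self.by_role("metadata")) == 1
            and len(self.by_role("content")) == 1
        )


if __name__ == "__main__":
    obj = DigitalObject(object_id=42)
    obj.add(Datastream("DC", "metadata", "text/xml", b"<record/>"))
    obj.add(Datastream("PDF", "content", "application/pdf", b"%PDF-"))
    print(obj.is_simple())  # simple object: metadata plus one file
    obj.add(Datastream("MODS", "metadata", "text/xml", b"<mods/>"))
    obj.add(Datastream("LICENSE", "license", "text/plain", b"signed"))
    print(obj.is_simple())  # now a complex object with four datastreams
```

A model of this kind lets an object start out simple and gain datastreams later without any migration.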

Long term preservation

When a real-world repository has been running for some time, modifications are common: new information is added to existing digital objects, existing information is corrected or updated, and so on. It is important for the repository to provide mechanisms for the long term preservation of the information it contains. Long term preservation is a complicated problem that cannot be analysed in this primer. However, simple mechanisms can go a long way towards it, such as: a) keeping a version for every modification or deletion that takes place inside the repository, and b) keeping a detailed log of every action that takes place (metadata schemas such as PREMIS can support such logging mechanisms).
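The two mechanisms mentioned above, versioning and logging, can be sketched in a few lines of Python. This is a minimal in-memory illustration: the VersionedStore class and its method names are assumptions made for this example, and the log entries are plain dictionaries rather than PREMIS records.

```python
import copy
from datetime import datetime, timezone
from typing import Dict, List, Optional


class VersionedStore:
    """Keeps every version of an object's metadata plus a log of actions."""

    def __init__(self) -> None:
        # object_id -> list of snapshots; None marks a deletion
        self._versions: Dict[int, List[Optional[dict]]] = {}
        self.log: List[dict] = []

    def _record(self, action: str, object_id: int) -> None:
        self.log.append({
            "when": datetime.now(timezone.utc).isoformat(),
            "action": action,
            "object_id": object_id,
            "version": len(self._versions[object_id]),
        })

    def save(self, object_id: int, metadata: dict) -> None:
        """Store a new version; earlier versions are never overwritten."""
        history = self._versions.setdefault(object_id, [])
        history.append(copy.deepcopy(metadata))
        self._record("create" if len(history) == 1 else "update", object_id)

    def delete(self, object_id: int) -> None:
        """Mark the object as deleted while keeping its history."""
        self._versions[object_id].append(None)  # KeyError if unknown id
        self._record("delete", object_id)

    def current(self, object_id: int) -> Optional[dict]:
        return self._versions[object_id][-1]

    def version(self, object_id: int, number: int) -> Optional[dict]:
        """Return version `number`, counting from 1."""
        if number < 1:
            raise ValueError("version numbers start at 1")
        return self._versions[object_id][number - 1]


if __name__ == "__main__":
    store = VersionedStore()
    store.save(42, {"title": "Roman amphiteatre"})   # typo on purpose
    store.save(42, {"title": "Roman amphitheatre"})  # correction
    store.delete(42)
    print(store.version(42, 1), store.current(42))
    print([entry["action"] for entry in store.log])
```

Because a deletion is stored as one more version, the earlier states of the object remain retrievable and the log shows when each change happened.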

Interoperability

Interoperability involves the exchange of information between different repositories. The battle for information exchange has to be fought on two fronts: a) metadata schemas and b) interoperability protocols. The first involves selecting a rich and semantically robust metadata schema, whereas the second involves using common, widely adopted protocols for information exchange. Such protocols are usually called PMHs (Protocols for Metadata Harvesting).

Currently, the best-known such protocol is OAI-PMH (http://www.openarchives.org/OAI/openarchivesprotocol.html), which is implemented by all well-known repositories. OAI-PMH is an asynchronous metadata exchange protocol with two parties: a provider and a harvester. The provider exposes the repository's metadata through a web-based service that implements the XML-based OAI-PMH protocol. The harvester periodically polls the provider (or providers) for metadata, which it then harvests and ingests into its own database.
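The harvester side of this exchange can be sketched in Python as follows. The sketch builds a ListRecords request and parses a response without contacting any server. The base URL, the record contents and the function names are assumptions made for this illustration; the verb, the argument names and the XML namespaces are those defined by the OAI-PMH specification.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional, Tuple
from urllib.parse import urlencode

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url: str, metadata_prefix: str = "oai_dc",
                     resumption_token: Optional[str] = None) -> str:
    """Build a ListRecords request URL for a provider."""
    if resumption_token:
        # When continuing a partial list, resumptionToken is the only
        # argument sent besides the verb.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return f"{base_url}?{urlencode(params)}"


def parse_list_records(xml_text: str) -> Tuple[List[dict], Optional[str]]:
    """Extract identifiers and titles, plus the token for the next page."""
    root = ET.fromstring(xml_text)
    records = []
    for record in root.iterfind(".//oai:record", NS):
        identifier = record.findtext("oai:header/oai:identifier", namespaces=NS)
        titles = [t.text for t in record.iterfind(".//dc:title", NS)]
        records.append({"identifier": identifier, "titles": titles})
    token = root.findtext(".//oai:resumptionToken", namespaces=NS)
    return records, (token or None)  # empty or missing token: list complete


# Hand-written, abbreviated response used instead of a live provider.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:42</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Roman amphitheatre</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>
""".strip()

if __name__ == "__main__":
    print(list_records_url("http://example.org/oai"))
    harvested, next_token = parse_list_records(SAMPLE_RESPONSE)
    print(harvested, next_token)
    if next_token:
        print(list_records_url("http://example.org/oai",
                               resumption_token=next_token))
```

A real harvester would fetch each URL over HTTP, repeat the request with the returned resumption token until none is returned, and then ingest the collected records into its own database.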