Owner

Provenance Dimensions

Background and Current Practice

University libraries need to handle metadata from diverse sources, usually encoded in incompatible metadata formats and of varying quality. To offer a unified search interface over this heterogeneous collection, the metadata formats need to be aligned. Typically, a format that forms a common denominator of all formats involved is chosen, and the metadata is converted into this target format using crosswalks.

These crosswalks are usually hand-crafted by metadata experts and then translated into program logic or transformation stylesheets. When errors appear in the resulting metadata, the crosswalk has to be improved. Identifying the erroneous part of the crosswalk can be tedious, and after the crosswalk has been changed, the whole set of resulting metadata has to be recreated, because it cannot be determined which parts of it are affected by the change.

Goal

The goal is to support the maintenance of crosswalks. To this end, additional provenance information that enables efficient debugging is provided for each resulting metadata record.

Use Case Scenario

The program logic that is derived from the mappings is extended to write not only the resulting metadata elements but also, for every element, the following information:

the version of the crosswalk used

the number of the mapping rule used

the source fields used
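The extension described above could be sketched as follows. This is a minimal illustration, not the actual program logic: the rule set, the version string, the source field names (`title_main`, `title_sub`, `author`) and the target element names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Rule:
    number: int                     # number of the mapping rule in the crosswalk
    source_fields: tuple[str, ...]  # source fields the rule reads
    target: str                     # target element the rule writes
    convert: Callable[..., str]     # combines the source values into one value


@dataclass(frozen=True)
class Element:
    name: str
    value: str
    # Provenance recorded for every element:
    crosswalk_version: str
    rule_number: int
    source_fields: tuple[str, ...]


CROSSWALK_VERSION = "1.0"  # hypothetical version identifier

# Hypothetical crosswalk with two mapping rules.
RULES = [
    Rule(1, ("title_main", "title_sub"), "title",
         lambda main, sub: f"{main} {sub}".strip()),
    Rule(2, ("author",), "creator", lambda author: author),
]


def apply_crosswalk(record: dict[str, str]) -> list[Element]:
    """Convert one source record and attach provenance to each element."""
    elements = []
    for rule in RULES:
        values = [record.get(field, "") for field in rule.source_fields]
        if not any(values):
            continue  # none of the rule's source fields is present
        elements.append(Element(
            name=rule.target,
            value=rule.convert(*values),
            crosswalk_version=CROSSWALK_VERSION,
            rule_number=rule.number,
            source_fields=rule.source_fields,
        ))
    return elements
```

Calling `apply_crosswalk({"title_main": "Faust", "author": "Goethe"})` would yield a `title` element attributed to rule 1 and a `creator` element attributed to rule 2, both tagged with the crosswalk version.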

With this information, at least the following maintenance steps can be supported:

Crosswalk updates: After a change in the crosswalk, only the records that are affected by the change need to be recreated.

Fixing mapping errors: If an error in the metadata is found, the responsible rule in the crosswalk can be identified directly.
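Both maintenance steps reduce to simple lookups over the recorded provenance. The sketch below assumes a hypothetical in-memory store that maps record identifiers to their elements and each element to its provenance; the identifiers and rule numbers are made up for illustration.

```python
from typing import NamedTuple


class Provenance(NamedTuple):
    crosswalk_version: str
    rule_number: int
    source_fields: tuple


# Hypothetical store: record id -> element name -> provenance of that element.
STORE = {
    "rec1": {"title": Provenance("1.0", 1, ("title_main", "title_sub")),
             "creator": Provenance("1.0", 2, ("author",))},
    "rec2": {"title": Provenance("1.0", 1, ("title_main", "title_sub"))},
    "rec3": {"creator": Provenance("1.0", 2, ("author",))},
}


def affected_records(store, changed_rules):
    """Crosswalk update: ids of records with an element written by a changed rule."""
    return sorted(
        record_id
        for record_id, elements in store.items()
        if any(p.rule_number in changed_rules for p in elements.values())
    )


def responsible_rule(store, record_id, element_name):
    """Fixing a mapping error: crosswalk version and rule that wrote the element."""
    p = store[record_id][element_name]
    return p.crosswalk_version, p.rule_number
```

Here `affected_records(STORE, {2})` returns `["rec1", "rec3"]`, so `rec2` would not need to be recreated. This lookup only covers changes to existing rules. A newly added rule leaves no trace in the stored provenance, so finding the records it applies to would require checking the source records instead.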

Problems and Limitations

This use case requires provenance at the statement level, which has to be supported by the underlying infrastructure. In RDF, however, two mechanisms exist that support this: reification and named graphs.
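To make the two mechanisms concrete, the following sketch models triples and quads as plain Python tuples rather than using an RDF library. The `rdf:` terms are the standard reification vocabulary and the title property is the Dublin Core Terms one. The `http://example.org/` namespace and its provenance properties (`crosswalkVersion`, `ruleNumber`, `sourceField`) are assumptions made for illustration, as is the choice of one named graph per statement.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
EX = "http://example.org/"  # hypothetical namespace for this sketch

# One metadata statement produced by the crosswalk.
statement = (EX + "rec1", "http://purl.org/dc/terms/title", '"Faust"')

# Provenance to attach to it; the property names are hypothetical.
provenance = {
    EX + "crosswalkVersion": '"1.0"',
    EX + "ruleNumber": '"1"',
    EX + "sourceField": '"title_main"',
}


def with_reification(stmt, prov, stmt_id):
    """Describe the statement as an rdf:Statement resource and annotate that."""
    s, p, o = stmt
    triples = [
        stmt,  # the statement itself
        (stmt_id, RDF + "type", RDF + "Statement"),
        (stmt_id, RDF + "subject", s),
        (stmt_id, RDF + "predicate", p),
        (stmt_id, RDF + "object", o),
    ]
    triples += [(stmt_id, prop, value) for prop, value in prov.items()]
    return triples


def with_named_graph(stmt, prov, graph_id):
    """Put the statement into its own named graph and annotate the graph."""
    s, p, o = stmt
    quads = [(s, p, o, graph_id)]
    # None stands for the default graph, which holds the provenance here.
    quads += [(graph_id, prop, value, None) for prop, value in prov.items()]
    return quads


reified = with_reification(statement, provenance, EX + "stmt1")
graphed = with_named_graph(statement, provenance, EX + "graph1")
```

Under these assumptions, the reified form needs eight triples for a single statement (the statement, four reification triples and three provenance triples), while the named-graph form needs four quads.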

A drawback is the overhead produced by the additional information. Because this information has to be stored for every statement, the required storage space may grow to a multiple of the original size.