Tuesday, 28 July 2009

Provenance metadata: what to record and how?

To effectively curate the research data produced by the two research groups participating in the EIDCSR project, it is crucial to capture provenance metadata that explains how the data was generated in the first place. This information enables validation and increases the value of the data.

In our case, the research groups collaborate as part of a BBSRC-funded project and generate MRI and histology data in laboratories using a variety of instruments and techniques. These datasets are then manipulated through processes such as segmentation to create 3D meshes, or volumetric elements, that will be used to run computational simulations.

So what provenance metadata should be recorded, and are there any subject-specific metadata standards appropriate for these datasets?

Interviews with the researchers involved in generating the data have shown that they are well versed in recording information about their experiments in their lab notebooks. When writing research articles, they go back to these notebooks to document their methodologies. Therefore, I believe it is fair to assume that researchers know what information needs to be recorded about their experiments and simulations.

One metadata standard for experimental data that is widely used internationally is the Scientific Metadata Model developed by CCLRC (now STFC). At the top level, the model includes information describing the study and the set of investigations (experiments, measurements, simulations, etc.) involved in that study. Then, for each investigation, it records specific information about the data:

Data holding - A logical hierarchy of the Data Collections and Atomic Data Objects and their directory-style grouping. The Data Holding can be considered as the ‘root’ of the data file/object system.

- Data description - A description of the data kept in this data holding from the data archive perspective, including information like name, type, status, quality and software.

  - Logical description - Reference to a set of logical description fields such as parameter [Name, id, class, units, value, facilities used, range], time period or facility used.

- Data collection - Data Collections in the hierarchy of data organisation used in this Investigation; they are much like directories in a file system and can be nested.

- Related reference - Other Studies/Investigations related to this Data Holding and their type or relationship, e.g. derived from or used by.

- Data holding locator - A locator for addressing the overall Data Holding (URI of top-level directory or data).
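To make the structure above a little more concrete, here is a minimal sketch in Python of how a data holding might be represented. It is only an illustration: the class names, field names, identifiers and file paths are my own assumptions, not the official schema of the Scientific Metadata Model, and the example values are invented.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Parameter:
    """One logical-description field (hypothetical, simplified)."""
    name: str
    value: str
    units: Optional[str] = None


@dataclass
class DataDescription:
    """Archive-level description of the data in a holding."""
    name: str
    data_type: str
    status: str
    software: Optional[str] = None
    parameters: List[Parameter] = field(default_factory=list)


@dataclass
class RelatedReference:
    """Link to another study or investigation."""
    target: str        # identifier of the related study/investigation
    relationship: str  # e.g. "derived from" or "used by"


@dataclass
class DataCollection:
    """Directory-like grouping of atomic data objects; can be nested."""
    name: str
    objects: List[str] = field(default_factory=list)
    children: List["DataCollection"] = field(default_factory=list)

    def count_objects(self) -> int:
        # Count objects in this collection and in all nested collections.
        return len(self.objects) + sum(c.count_objects() for c in self.children)


@dataclass
class DataHolding:
    """Root of the data file/object system for one investigation."""
    locator: str
    description: DataDescription
    collections: List[DataCollection] = field(default_factory=list)
    related: List[RelatedReference] = field(default_factory=list)

    def derived_from(self) -> List[str]:
        # Identifiers of the investigations this holding was derived from.
        return [r.target for r in self.related if r.relationship == "derived from"]


# Invented example: a mesh produced by segmenting an MRI dataset.
mesh = DataHolding(
    locator="file:///data/example/mesh/",  # hypothetical location
    description=DataDescription(
        name="Example 3D mesh",
        data_type="3D mesh",
        status="draft",
        software="hypothetical segmentation tool",
        parameters=[Parameter("element size", "0.1", "mm")],
    ),
    collections=[
        DataCollection(
            "mesh",
            objects=["mesh.vtk"],
            children=[DataCollection("logs", objects=["run1.log", "run2.log"])],
        )
    ],
    related=[RelatedReference("mri-investigation-01", "derived from")],
)

print(mesh.derived_from())
print(mesh.collections[0].count_objects())
```

The related references are what would carry the provenance chain in a sketch like this: following the "derived from" links from a simulation back through the mesh to the original MRI or histology investigation would show how a given result was produced.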

How can this complex workflow, which involves several research groups with specialist skills and a variety of tools and techniques, be recorded?

An answer to this question may be found in the work of our colleagues in Southampton. Some weeks ago Simon Coles and Jeremy Frey visited the OeRC to tell us about their work on electronic lab notebooks. They have been involved in projects such as Smart Tea and CombeChem that deal with the management of laboratory information. Initially they explored the idea of replicating printed lab notebooks using tablet interfaces that capture structured information. This approach has the benefit of producing good semantic information. In addition, they have experimented with laboratory blogs, which allow the process followed to be recorded step by step, allow the data to be discussed, and offer the flexibility and power of web 2.0 technologies.