Sharing and reuse of data is just an intermediate step on the way to reproducible computational science. The next step, sharing and reuse of processes that transform data, is enabled by process management plans, which benefit multiple stakeholders at all stages of research.

Nowadays almost every research domain depends on data that is accessed and processed using computers. Data processing may range from simple calculations made in a spreadsheet editor to distributed processes that transform data using dedicated software and hardware tools. The focus is on the data because it underlies new scientific breakthroughs and in many cases cannot be reproduced, e.g. climate data. Funding institutions have recognized the value of data, and as a result data management plans (DMPs) have become obligatory for many scientists who receive public funding. Data management plans are initiated before the start of a project and evolve during its course. They aim to ensure not only that the data is managed properly during the project, e.g. by performing backups and using file naming conventions, but also that it is preserved and available in the future.

In order to understand data, as well as research results, data acquisition and manipulation processes must also be curated. Unfortunately, the underlying processes are not included in DMPs. As a consequence, information needed to document, verify, preserve or re-execute the experiment is lost. For this reason, we extend DMPs to “process management plans” (PMPs) [1], which complement the description of scientific data with a process-centric view: data is treated as the result of underlying processes such as capture, (pre-)processing, transformation, integration and analysis. A PMP is a living document, which is created at the time of process design and is maintained and updated during the lifetime of the experiment by various stakeholders. Its structure is necessarily rather abstract, because PMPs must accommodate a wide range of requirements and different practices across the scientific domains in which they should be used. The proposed structure of a PMP is depicted in Figure 1.

Figure 1: Structure of a Process Management Plan.

The implementation of PMPs is to some extent domain dependent, because it has to incorporate existing best practices. During the course of the EU-funded FP7 project TIMBUS, our team at SBA Research in Vienna investigated both well-structured Taverna workflows and unstructured processes from the civil-engineering domain. In all cases, the PMPs can be implemented by integrating existing tools. For example, the automatically generated TIMBUS Context Model [2] can be used to describe the software and hardware dependencies. Verification and validation can be performed using the VFramework. Research Objects [3] can be used to aggregate the high-level information on the process, and the existing data management plan templates and tools can be refined to incorporate information on processes.

Figure 2 depicts the stakeholders impacted by PMPs. Project applicants will benefit by being able to better identify and plan the resources needed for the research. For example, if the process of transforming the experimental data relies on proprietary software with an expensive license, this information will be revealed at an early stage and can be taken into account when applying for a grant.

Figure 2: Stakeholders impacted by a Process Management Plan.

Researchers will benefit by working with better documented processes. This facilitates sharing of results and eases reuse of existing research. Moreover, time will be saved when a researcher joins a new project, because the PMP provides useful information in a structured way. This is especially important within research infrastructures, where research challenges are addressed through the cooperation of specialists from different areas, each contributing to only a specific part of the experiment.

From the point of view of funding bodies, PMPs safeguard the investment made in research by ensuring that research results are trustworthy and verifiable, and can be reused at later points in time. Furthermore, PMPs facilitate cooperation between projects, because they make it easier to reuse processes and exploit results from other funded projects. Thus, PMPs lead to sustainable research and can save funding that can then be allocated to new research projects.

Repositories that keep the deposited processes and data can better estimate the costs of curation and plan the actions needed to maintain the deposited resources. PMPs also support long-term preservation (keeping processes usable over time) and provide information on possible events that trigger necessary digital preservation activities.

PMPs also benefit reusing parties, because their research can be accelerated by reusable processes. The reusing parties also have greater confidence that they can build on previous work, because reproducibility makes its quality easier to verify. Furthermore, scientists whose processes are reused in other experiments gain recognition and credit.

We are currently working on automating PMP creation and verification by extracting the characteristics of a process automatically from its environment. A specific focus is on tool support to automate many of the documentation steps. We are also evaluating the PMP with stakeholders from different scientific communities.
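
To give a flavour of what automatic extraction of process characteristics could look like, the following minimal Python sketch records a snapshot of the software and hardware environment in which a process runs and compares two such snapshots. It is an illustration only: it is not the TIMBUS Context Model or any other TIMBUS tool, and the function names `capture_environment` and `changed_dependencies`, as well as the fields of the snapshot, are assumptions made for this example.

```python
import json
import platform
from importlib import metadata


def capture_environment():
    """Collect a small snapshot of the software and hardware context.

    Hypothetical helper: the chosen fields are an assumption for this
    sketch, not a prescribed part of a process management plan.
    """
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip distributions whose metadata has no name
            packages[name] = dist.version
    return {
        "os": platform.system(),
        "os_release": platform.release(),
        "machine": platform.machine(),
        "python_version": platform.python_version(),
        "packages": dict(sorted(packages.items())),
    }


def changed_dependencies(recorded, current):
    """Return packages whose version differs between two snapshots.

    Each entry maps a package name to (recorded_version, current_version);
    None means the package is absent from that snapshot.
    """
    old, new = recorded["packages"], current["packages"]
    return {
        name: (old.get(name), new.get(name))
        for name in sorted(set(old) | set(new))
        if old.get(name) != new.get(name)
    }


if __name__ == "__main__":
    snapshot = capture_environment()
    # The snapshot could be stored alongside the process description.
    print(json.dumps(snapshot, indent=2))
```

A snapshot recorded when the process is designed could later be compared with a fresh one. A non-empty result from `changed_dependencies` would be one possible signal that the documented process needs to be re-verified, in the spirit of the events triggering preservation activities mentioned above.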