Reproducibility of Script-Based Workflows: A Case Study and Demonstration

Duc Vu

Bio:

Duc is a first-year PhD student in Electrical and Computer Engineering at the University of Illinois at Chicago. Before starting his PhD, he received an M.S. degree in Electrical and Computer Engineering from the University of Illinois at Chicago. His research interests include machine learning, data mining, algorithms, and signal processing. In his free time, Duc enjoys cooking and collecting stamps.

Project Description:

What does it take to reproduce a script-based scientific workflow?
For example, if the Python or R scripts implementing a workflow are available through an open source repository such as GitHub, are we all set? Not so fast! A user might fail to run the scripts or replicate the results for any number of reasons. For starters, the installation may fail due to complex software and version dependencies, or the user may be unable to properly run, adapt, or understand the scripts due to a lack of documentation.
In this project we will experiment with a number of technologies and tools that can improve the reproducibility of script-based workflows. For example, the YesWorkflow (YW) toolkit allows authors to annotate scripts in order to model and export prospective provenance, i.e., the workflow structure otherwise latent in the script. YW can also be used to reconstruct retrospective provenance or to query other sources of provenance information, e.g., runtime provenance logged directly by the script author or recorded by the DataONE MATLAB tool, the NCEAS recordr, or the noWorkflow system (for capturing Python execution provenance). To manage the platform and software dependencies of script-based workflows, Docker containers can be used. Last but not least, active elements can be embedded in PDF files to support interactive exploration of published results.
Using one or more example scripts, we will apply these different technologies and study their benefits and limitations. The overall goal is to deploy a prototypical example of a “highly reproducible” script-based workflow using a combination of the above-mentioned technologies.
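To give a sense of what annotating a script for prospective provenance can look like, here is a minimal sketch in Python. The `@begin`, `@in`, `@out`, and `@end` comment keywords follow the YesWorkflow annotation style; the toy script, the names `clean_temperatures`, `raw_readings`, and `cleaned_readings`, and the `extract_annotations` helper are all hypothetical and serve only as illustration. The helper is not part of YW. It merely lists the annotations to show that the workflow structure can be recovered from comments alone, whereas the actual YW toolkit does considerably more with them.

```python
import re

# Hypothetical toy script carrying YesWorkflow-style annotations in comments.
# The annotations declare one workflow step with one input and one output.
ANNOTATED_SCRIPT = '''
# @begin clean_temperatures
# @in raw_readings
# @out cleaned_readings
def clean_temperatures(raw_readings):
    return [r for r in raw_readings if r is not None]
# @end clean_temperatures
'''

# Matches comment lines such as "# @in raw_readings".
_ANNOTATION = re.compile(
    r'^[ \t]*#[ \t]*@(begin|end|in|out)[ \t]+(\S+)', re.MULTILINE
)


def extract_annotations(source):
    """Return (keyword, name) pairs for each annotation comment, in order.

    Illustrative stand-in only; this is not how the YW toolkit is implemented.
    """
    return _ANNOTATION.findall(source)


if __name__ == "__main__":
    for keyword, name in extract_annotations(ANNOTATED_SCRIPT):
        print(f"@{keyword} {name}")
```

Because the annotations live in comments, the script itself runs unchanged whether or not any provenance tool is present.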

DataONE is a collaboration among many partner organizations, and is funded by the US National Science Foundation (NSF) under a Cooperative Agreement. Acknowledgement: This material is based upon work supported by the National Science Foundation under Grant Numbers 0830944 and 1430508. Disclaimer: Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.