New to workflows? Read more about workflows in old blog posts on the topic, here and here. Basically, a workflow is a formalization of “process metadata”. Process metadata is information about the process used to get to your final figures, tables, and other representations of your results. Think of it as a precise description of the scientific procedures you follow.

After sitting through demos and presentations on the different tools folks have created, my head was spinning, in a good way. A few of my takeaways are below. For my next Data Pub post I will provide list of the tools we discussed.

Takeaway #1: Reuse is different from reproducibility.

The end-goal of documenting and archiving a workflow may be different for different people/systems. Reuse of a workflow, for instance, is potentially much easier than exactly reproducing the results . Any researcher will tell you: reproducibility is virtually impossible. Of course, this differs a bit depending on discipline: anything involving a living thing is much more unpredictable (i.e., biology), while engineering experiments are more likely to be spot-on when reproduced. The level of detail needed to reproduce results is likely to dwarf details and information needed for reuse of workflows.

Takeaway #2: Think of reproducibility as archiving.

This was something Josh Greenberg said, and it struck a chord with me. It was said in the context of considering exactly how much stuff should be captured for reproducibility. Josh pointed out that there is a whole body of work out there addressing this very question: archival science.

Example: an archivist at a library gets boxes of stuff from a famous author who recently passed away. How does s/he decide what is important? What should be kept, and what should be thrown out? How should the items be arranged to ensure that they are useful? What metadata, context, or other information (like a finding aid) should be provided?

The situation with archiving workflows is similar: how much information is needed? What are the likely uses for the workflow? How much detail is too much? Too little? I like considering the issues around capturing the scientific process as similar to archival science scenarios– it makes the problem seem a bit more manageable.

Takeaway #3: High-quality APIs are critical for any tool developed.

We talked about MANY different tools. The one thing we could all agree on was that they should play nice with other tools. In the software world, this means having a nice, user-friendly Application Program Interface (API) that basically tells two pieces of software how to talk to one another.

The software we discussed is very nifty. That said, many of these tools are geared towards researchers with some impressive tech chops. The tools focus on helping capture code-based work, and integrate with things like LaTeX, Git/Github, the command line. Did I lose you there? You aren’t alone… many of the researchers I interact with are not familiar with these tools, and would therefore not be able to effectively use the software we discussed.

Takeaway #5: Closing the gap between the tools and the researchers that should use them is hard. But not impossible.

The reality is that it’s likely to be some combination of these three. Those at the workshop recognized the need for better user interfaces, and some projects here at the CDL are focusing on extensive usability testing prior to release. Funders are beginning to see the value of funding new positions for “human bridges” to help sync up researcher skill sets with available tools. And finally, researchers are slowly recognizing the need to learn basic coding– note the massive uptake of R in the Ecology community as an example.

[…] awareness about the importance of open science, open access to publications, data sharing, and reproducibility. Most of these concepts were easily accomplished in the olden days of pen-and-paper: you simply […]