Facilitating the USGS Scientific Data Management Foundation by integrating the process into current scientific workflow systems

Increasing attention is being paid to proper scientific data management and to processes that ensure released products are properly documented. USGS policies now require documentation not only of publications but also of the related data and software. This relatively recent expansion of documentation requirements may present a daunting challenge for many USGS scientists whose primary focus is their physical science and who have less expertise in information science. As a proof of concept, this project created a software solution that facilitates the documentation process through a user-friendly but comprehensive interface embedded in an existing scientific workflow system used in the USGS for species distribution modeling. The software produced by this project has gone through initial testing, and the project team is now using the application to document its first USGS data release. Lessons learned from this initial use case will be used to update and tune the current functionality, which will ultimately be merged into the next released version of the Software for Assisted Habitat Modeling (SAHM). This project could provide an exemplar that the USGS can point to as it initiates new standards for producing repeatable science, and it could also help ensure compliance with the new requirements.

Accomplishments

The accomplishments for this project are described below.

The explicit data management steps required to archive and document a workflow created with VisTrails/SAHM were identified and documented.

The existing VisTrails/SAHM source code was extended to include tools to automate the steps in this process (fig. 13), such as

making clean copies of the relevant data files needed to reproduce the workflow, as well as the core outputs to create a self-contained archive bundle of the workflow’s outputs;

using the MetadataWizard, now integrated directly into VisTrails/SAHM, to create FGDC-compliant metadata for the archive bundle (fig. 14); and

moving the completed archive bundle to ScienceBase and using the archive bundle’s metadata to populate the item (fig. 15).
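The first of these steps, building a self-contained archive bundle, can be illustrated with a minimal Python sketch. This is not the VisTrails/SAHM implementation. The function names, the `inputs`/`outputs` folder layout, and the `manifest.json` file are assumptions made for illustration only. The metadata step (MetadataWizard) and the ScienceBase upload are deliberately left out.

```python
import hashlib
import json
import shutil
from pathlib import Path


def build_archive_bundle(input_files, output_files, bundle_dir):
    """Copy workflow inputs and core outputs into one bundle directory.

    Hypothetical helper: writes a manifest.json listing each copied file
    with a SHA-256 checksum and returns the path to that manifest.
    Files sharing a name within the same role would overwrite each other.
    """
    bundle_dir = Path(bundle_dir)
    manifest = []
    for role, files in (("inputs", input_files), ("outputs", output_files)):
        target = bundle_dir / role
        target.mkdir(parents=True, exist_ok=True)
        for src in map(Path, files):
            dest = target / src.name
            shutil.copy2(src, dest)  # clean copy; the original stays untouched
            digest = hashlib.sha256(dest.read_bytes()).hexdigest()
            manifest.append(
                {"role": role, "file": f"{role}/{src.name}", "sha256": digest}
            )
    manifest_path = bundle_dir / "manifest.json"
    manifest_path.write_text(json.dumps(manifest, indent=2))
    return manifest_path


def zip_bundle(bundle_dir):
    """Zip the bundle directory into a single file placed next to it."""
    bundle_dir = Path(bundle_dir)
    return Path(shutil.make_archive(str(bundle_dir), "zip", root_dir=bundle_dir))
```

Recording a checksum for each file is one simple way to let a later user confirm that the archived copies still match what the workflow used. The actual tool may track provenance differently.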

The software and data management workflow were tested by a data manager at the USGS Fort Collins Science Center (FORT).

The software is currently being used by FORT scientists to document a data release associated with a new publication. Contingent on the results of this test case, the team hopes to release the tool for wider, more comprehensive testing in the fall of 2016.