As one of the largest and oldest science organizations in the world, USGS has produced more than a century of earth science data, much of which is currently unavailable to the greater scientific community due to inaccessible or obsolescent media, formats, and technology. Tapping this vast wealth of “dark data” requires 1) a complete inventory of legacy data and 2) methods and tools to effectively evaluate, prioritize, and preserve the data with the greatest potential impact on society. Recognizing this need and the potential value of legacy data, USGS has been investigating legacy data management and preservation since 2006, including the 2016 “Data at Risk” (DaR) project, which developed legacy data inventory and evaluation methods and then tested them while preserving and releasing 5 at-risk USGS legacy datasets. This FY17 project builds on those FY16 successes.

The methods and tools developed through this project will enable USGS Mission Areas, Programs and science centers to efficiently evaluate their legacy data inventories and cost-effectively preserve their highest-priority legacy data products.

Scope

As one of the largest and oldest science organizations in the world, USGS has produced more than a century of earth science data, much of which is currently unavailable to the greater scientific community due to inaccessible or obsolescent media, formats or technology. These “legacy data” are invaluable for extending our historical understanding of the world’s natural resources, landscapes and hazards but lie unused because ultimately they are undiscovered and potentially unknown. Tapping this vast wealth of “dark data” requires 1) a complete inventory of legacy data and 2) methods and tools to effectively evaluate, prioritize, and preserve the data with the greatest potential impact to society.

In particular, the FY16 DaR project represents a convergence of earlier USGS legacy data projects, new open data policies, and modern information technology to provide USGS Mission Areas and science centers with legacy data preservation support, tools, and methods. The primary objectives and results of the FY16 DaR project were:

Results: We used the Legacy Data Inventory and Reporting System (LDIRS) to conduct a USGS-wide “Request for Legacy Data” (RFD) in May 2016. We received 43 submissions from 20 USGS science centers with potential impacts across all USGS Missions. This formed the pool of submissions that we evaluated and prioritized in Objective 2 (below) and selected from in Objective 3 (below). Since the RFD, the Fort Collins Science Center and EROS Center have continued to contribute legacy data to the inventory. The current inventory is available at: https://www.fort.usgs.gov/ldi/legacy-products

Results: We developed and tested a method to evaluate the risk and significance factors associated with a legacy data product and a second, algorithm-based method to prioritize legacy data based on its evaluation scores.
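As a rough illustration of how evaluation scores could feed an algorithm-based ranking, the sketch below combines weighted risk and significance scores into a single priority value. The factor names, weights, 0 to 5 scale, and the choice to multiply risk by significance are all assumptions made for this example; the actual LDIRS factors and scoring algorithm are not described in this text.

```python
from dataclasses import dataclass

# Hypothetical factors and weights, for illustration only.
# They are NOT the factors or weights used by LDIRS.
RISK_WEIGHTS = {"media_obsolescence": 0.5, "format_obsolescence": 0.3, "single_copy": 0.2}
SIGNIFICANCE_WEIGHTS = {"uniqueness": 0.6, "reuse_potential": 0.4}


@dataclass
class LegacyProduct:
    title: str
    risk: dict          # factor name -> evaluation score (assumed 0..5)
    significance: dict  # factor name -> evaluation score (assumed 0..5)


def weighted_score(scores, weights):
    """Weighted sum of factor scores; missing factors count as 0."""
    return sum(weight * scores.get(name, 0) for name, weight in weights.items())


def priority(product):
    """Assumed rule: priority = risk score x significance score.

    Multiplying means a product must be both at risk and significant
    to rank highly.
    """
    return (weighted_score(product.risk, RISK_WEIGHTS)
            * weighted_score(product.significance, SIGNIFICANCE_WEIGHTS))


def prioritize(products):
    """Return products ordered from highest to lowest priority."""
    return sorted(products, key=priority, reverse=True)
```

Under these assumptions, a product that is highly at risk but of low significance ranks below one with moderate scores on both axes, which is one plausible way to separate evaluation (scoring) from prioritization (ranking).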

Applying the methods we developed in FY16 Objective 2 (above), we selected the top 5 legacy data products and partnered with the data owners to preserve and publish them as official USGS data releases. All legacy data products have started the IPDS review and approval process with official USGS data releases beginning in January 2017.

Develop time and resource estimates to preserve and release legacy data.

For each of the 5 selected preservation projects, we collected data on the time and resources required to complete each stage of the data management plan (e.g., plan, acquire, process, analyze, preserve, and publish/share). These operational data will better inform future legacy data preservation and release estimates and will be published as case studies.

Technical Approach

Based on FY16 DaR project data and LDIRS user feedback, we have identified 3 significant improvements to the legacy data inventory, evaluation, and prioritization processes for USGS staff:

Expand the library of risk and significance factors and refine risk and significance scores and scoring algorithms;

Aggregate the legacy data submission, the risk and significance evaluation, and the inventory prioritization processes into a single process; and

Create Mission Area, Program and science center inventory dashboards that display multiple LDIRS reports in a single user display.

The FY16 DaR project focused on developing and validating legacy data inventory, evaluation and reporting methods. This work also resulted in engaging, productive community discussions that validated the utility and need for a USGS legacy data inventory. With those positive results to build on, Objective 2 of this project will expand the current USGS legacy data inventory.

To do this we will:

Provide in-person legacy data inventory training and support to two USGS science centers that will conduct inventories of their legacy data collections. Results from these inventories and a third, previous CDI-partnered inventory (Fort Collins Science Center, 2015) will be used to develop the case studies and training materials described below.

Develop legacy data inventory case studies that describe the real-world experiences of the three USGS science centers that conducted legacy data inventories. Case studies will be publicly available from the LDIRS web site, as well as presented at the “Legacy Data: Challenges and Solutions” session of the 2017 CDI Workshop.

Create short instructional training videos for USGS data managers, explaining the submission, evaluation, and prioritization processes for legacy data inventories. Training videos and documentation will be available to USGS staff via the LDIRS web site.

Develop a quarterly, opt-in USGS legacy data inventory report that provides USGS managers and data stewards a broad overview of the current USGS inventory from the perspective of a Mission Area, Program and/or science center.

Preserving and publishing at-risk USGS legacy data was the most visible and powerful aspect of the FY16 DaR project. Case in point: the strongest feedback we received on this proposal’s FY17 statement of interest was a set of specific requests to maximize the amount of funding for at-risk data preservation, which we have done. In addition, through our study of the time and resources required to preserve and publish legacy data, we identified patterns and efficiencies that led to FY17 improvements for users (see “Objective 1” above). Therefore, project Objective 3 is designed to:

Test the USGS Exit Survey process as a method of identifying at-risk USGS legacy data by conducting exit interviews with two career USGS staff members and inventorying and evaluating their legacy data sets.

Use the legacy data inventory tools and methods developed in FY16 to select up to four more mission-critical, at-risk USGS data sets to preserve and publish in 2017.

The FORT legacy data steward will ensure that all legacy data releases from this project will:

have complete, compliant FGDC-CSDGM metadata

address the OSTP memorandum (Increasing Access to the Results of Federally Funded Scientific Research), the OMB memorandum (M-13-13, Open Data Policy – Managing Information as an Asset), and Executive Order 13642 (Making Open and Machine Readable the New Default for Government Information).

Golden Eagle International Radio Tracking Data, North America, 1992-1999 (status: Peer Review)

Chironomid Specimen Data, The Great Lakes (USA), 1957-2017 (status: Preservation)

Project Report

We refined the LDIRS prioritization algorithms to better assess temporal, geographic, and taxonomic extents, resulting in clearer prioritization scores with better intra-record differentiation. In addition, we incorporated the data assessment scoring into the data entry workflow, resulting in real-time prioritization.
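One way to picture scoring that happens inside the data entry workflow is an inventory that stays ranked as each record is saved, so no separate batch prioritization step is needed. The sketch below is only an illustration of that idea: the `RankedInventory` class, its methods, and the example scores are hypothetical and do not describe how LDIRS is implemented.

```python
import bisect


class RankedInventory:
    """Hypothetical inventory that keeps entries ordered by priority score."""

    def __init__(self):
        # Sorted list of (-score, title); negating the score puts the
        # highest-priority entry first.
        self._entries = []

    def add(self, title, score):
        """Insert a newly scored record and return its current rank (1 = top)."""
        bisect.insort(self._entries, (-score, title))
        return self.rank_of(title)

    def rank_of(self, title):
        """Return the 1-based rank of a record, or raise KeyError if absent."""
        for position, (_, name) in enumerate(self._entries, start=1):
            if name == title:
                return position
        raise KeyError(title)

    def ranked_titles(self):
        """Return all titles from highest to lowest priority."""
        return [title for _, title in self._entries]
```

In this sketch, a submitter would see the rank of a record the moment it is entered, and the ranks of existing records shift automatically as higher-scoring submissions arrive.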

We used several methods to continue to promote and expand the USGS legacy data inventory. First, we worked with two career scientists (Susan Skagen; Kathryn Thomas) and two science centers (GLSC and UMESC) to inventory their scientific records as a means of identifying legacy data. Second, in September we conducted a second USGS-wide “request for legacy data” to further expand the total LDIRS inventory. Third, we continued to communicate the DaR project accomplishments and methods through USGS groups such as CDI, the FSPAC Data Preservation Subcommittee, the Data at Risk Working Group, the National Geospatial and Geophysical Data Preservation Program (NGGDPP) and the USGS Step-Up Program. In particular, the USGS Step-Up Program used the LDIRS prioritization reports to select the North American Bat Banding Program data, an unfinished CDI-funded preservation project from 2014, for its FY18 preservation work.

During the FY16 and FY17 funding periods, the DaR project selected 13 high-priority preservation projects to validate best practices for preserving and publishing USGS legacy data and software. To date, 6 have been published, 3 are in peer review, and 3 are completing data processing. Upon completion, each project is summarized as a case study that describes the methods validated and lessons learned.

Legacy data (n) - Information stored in an old or obsolete format or computer system that is, therefore, difficult to access or process. (Business Dictionary, 2016) For over 135 years, the U.S. Geological Survey has collected diverse information about the natural world and how it interacts with society. Much of this legacy information is one-of-a-kind and in danger of being lost forever...

The scientific legacy of the USGS is the data and the scientific knowledge derived from it gathered over 130 years of research. However, it is widely assumed, and in some cases known, that high quality data, particularly legacy data critical for large time-scale analyses such as climate change and habitat change, is hidden away in case files, file cabinets, and hard drives housed in USGS...

The purpose of this project was to integrate the Bat Banding Program data (1932-1972) and the U.S. and Canada diagnostic data for white-nose syndrome with the USGS Bat Population Data (BPD) Project and provide the bat research community with secure, role-based access to these previously unavailable datasets. The objectives of this project were to: 1) integrate WNS diagnostic data into the BPD...

The Community for Data Integration (CDI) is a group that helps members grow their expertise on all aspects of working with scientific data. The CDI’s activities advance data and information integration capabilities in the U.S. Geological Survey and in the wider Earth and biological sciences. This annual report describes the presentations,...