Thoughts on Digital Data

December 14, 2010

Talk of the Society

James Crowley

Tsunamis of data, data-intensive computing, data-enabled and data-intensive science. These and related terms were heard often at the fall meeting of the SIAM Committee on Science Policy (CSP), November 10 and 11, where distinguished guests from federal funding agencies (the Department of Energy's Office of Science, the National Science Foundation, the National Institute of Standards and Technology, and the National Institutes of Health) spoke about their priorities and programs. The meeting marked Sastry Pantula's first visit to the CSP as director of NSF's Division of Mathematical Sciences.

Data-enabled science was seen as part of a larger picture that also involves computing. We are likely to hear much more about computational and data-enabled sciences over the next several years from organizations like NSF, as well as from the Office of Science and Technology Policy, the office of the President's science adviser. Indeed, one of the last actions of former NSF director Arden Bement was to take initial steps that could lead to the creation of a program in this area.

The Sloan Digital Sky Survey generated 40 terabytes of data during a period of about ten years. The Large Synoptic Survey Telescope will generate a similar amount every night. Another example of a physical experiment that generates enormous amounts of data is the Large Hadron Collider. Mining such data and analyzing it for understanding are major scientific challenges.

The exascale computing systems now on the horizon also make data an issue, for two reasons. First, current and future supercomputers make it possible to generate enormous amounts of data in the form of output from simulations. Second, as processing costs decrease, it is the cost of moving data, rather than the cost of arithmetic, that drives computing. Computational methods must therefore take data movement into account, replacing the longstanding emphasis on floating-point operations. These changes may lead to new methods and algorithms on the very-large-scale machines of the future.

Data arose in the CSP discussions in another thought-provoking manner. NSF has issued new rules stipulating that every grant proposal must include a plan for managing data. While there are no requirements on preserving and sharing data at this time, the intent is clearly to move in that direction. And because data management is a reviewable part of proposals, the new rules will put pressure on proposers to begin to think about data issues.

The requirement for a data management plan stems from a 2009 interagency report, Harnessing the Power of Digital Data for Science and Society. The NSF rules are a first step toward possible stronger requirements in the future. Lest readers think that the mathematical and computational sciences aren't affected by this development, consider the fact that many in our community generate a lot of data from computer simulations. Such data, if it is a key part of the research, is covered by the new guidelines for data management, as is the code that generates it. Anyone who does scientific computing is therefore clearly affected, whether the goal is scientific discovery or the development of new computing methods and algorithms.

Closely related to this is a SIAM project now under way regarding supplemental materials for journal papers. Supplemental materials include video output from computer simulations, computer code, data, and other forms of information that typically do not appear within refereed papers themselves. The SIAM Journals Committee, led by Randy LeVeque, is tackling the task of devising mechanisms for the sharing of such materials.

Part of the motivation, as described in the report on digital data mentioned earlier, is to make available to other researchers additional information that will help advance the science. Another motivating factor, one especially espoused by Randy LeVeque, is to ensure that results are reproducible. When we demonstrate a new computational method, describe the algorithm, and present sample outputs in tables, it is sometimes difficult for others to verify and reproduce the results in the way that might be possible with a physical experiment. Hence our interest in giving authors the ability to deposit supplemental materials, like the output of the simulations and the code that generated it, in repositories associated with their papers. Such supplemental materials would be neither required nor refereed, but they might enhance the ability to test and reproduce the results obtained in the papers.

SIAM has experimented with supplemental material for several years, with SIAM Journal on Applied Dynamical Systems taking the lead. SIADS encourages authors to submit multimedia material (images, video, and so on) when it enhances the understanding of the papers. The associated website DSWeb, which is not peer-reviewed, has a section for the storage and sharing of code used in SIADS papers.

The SIAM Journals Committee continues to discuss this issue and will present its findings to the SIAM Board of Trustees at a future date. Readers with ideas on the topic are encouraged to contact Randy LeVeque or me.

***

A note on valuable multimedia material of another kind: Readers interested in revisiting, or perhaps experiencing for the first time, some of the excellent talks given at SIAM Annual Meetings (and a few conferences) are reminded that videotaped versions of selected talks are available on our website. Go to siam.org, select "Conferences," and then "Invited and Prize Lectures." And let us know what you think!