Data-intensive computing to large science discoveries

Advances in computing technology have enabled scientists to collect massive amounts of data, taking us a step closer to solving complex problems such as global climate change and uncovering the secrets hidden in genes. The exponential growth in the amount of data collected in research, however, has created an urgent technical challenge.

"While technology has made massive data storage relatively cheap and bandwidth abundant, the ability to extract valuable knowledge from multiple types of data obtained from multiple sources and scales, and to achieve this in real time remains a daunting challenge," said Steve Elbert, manager of the Computer Sciences Initiative at the Pacific Northwest National Laboratory.

Elbert and a talented group of computational scientists are leading the effort at PNNL to tackle the challenge through a major initiative on high-performance and data-intensive computing. PNNL's data-intensive computing initiative will attempt to accelerate the creation of computational solutions to support the study of problems of national scope involving large amounts of data from very complex systems.

Problems of scale and complexity. Many government agencies, including the Department of Energy, the National Institutes of Health, the Department of Homeland Security, the Department of Defense, and the intelligence community, need computational capabilities well beyond the current state of the art to solve problems involving large, complex data sets.

Data-intensive computing capabilities are fundamental for advancing data-intensive sciences, including systems biology, and for analyzing huge volumes of complicated data related to national security.

The need is great for computational capabilities to address the essential complexities in biological, environmental, and chemical sciences; monitor and control large and intricate systems such as the nation's power grid; and rapidly discover, characterize, and respond to threats to the nation.

"To meet these challenges, we must be able to increase our sampling capability by two to three orders of magnitude, aggregate the data, and perform timely analysis," Elbert said. "A common theme in many of these challenges is being able to model the information across a wide range of scales. We will use computational science and mathematics to formulate solutions to these problems that are as broad and crosscutting as possible."

Data-centric computing. "We're developing a fundamentally new approach to high-performance computing based on informatics rather than physics that will address the need to produce, collect, store, explore, analyze and quickly share huge amounts of scientific information," Elbert said. PNNL's approach will be data-centered, with predictions coming from inference rather than from solving equations, in contrast to the traditional approach based on modeling dynamic relationships.

Much of the effort is centered on creating algorithms, software, operating systems, and new computational and storage systems to solve a broad set of problems involving large, complex, heterogeneous data. The approach is to develop a novel parallel-computing architecture that operates on a shared, massive memory platform optimized for petascale memory references.

"We also are developing a test bed to evaluate a large assortment of special-purpose, pattern-matching, and flow-solving hardware," Elbert said. A new Advanced Computing Technology Lab was established in January 2005 to investigate and evaluate key technology components and emerging high-end systems for data-intensive computing.

On the software side, PNNL researchers are creating new scalable data-analysis tools, as well as new tools for discovering patterns in large heterogeneous databases and for integrating data across different spatial scales, time scales, and sources.

A national center for innovative computing. The data-intensive computing initiative provides the basis for PNNL's long-term goal of becoming the national center for data-intensive computing, focused on large science discoveries and solutions to complex global problems.

"PNNL's unique combination of resources in areas of data management, information analytics, mathematical and computer sciences, bioinformatics and access to primary data resources provides a strong foundation for our vision," Elbert said. Moving forward, data-intensive computing will be vital to furthering the research in these areas.

"Whether it's in national security, energy systems, biology or climate modeling, next-generation, high-performance computing platforms and large-scale optical networks--together with collaboration, simulation, and modeling software--are essential for enabling research to achieve a whole new understanding of our world," Elbert said.

The Department of Energy's Office of Science is the single largest supporter of basic research in the physical sciences in the United States and is working to address some of the most pressing challenges of our time.