Retaining and retrieving data more effectively

Randy Burris examines the archive of tape drives and disks of a High-Performance Storage System (HPSS) at ORNL. The HPSS, which was developed by ORNL and several partners, is storage-system software that leads the computer industry in data capacity and transfer speeds. At ORNL the HPSS is used for DOE’s Atmospheric Radiation Measurement, or ARM, data archive. This archive contains more than 4 million files representing more than 25 terabytes of data. (Top photo by Tom Cerniglio; bottom photo by Curtis Boles)

A scientist needs data about how different types of clouds reflect, absorb, and transmit the
energy of sunlight. The data, based on measurements taken by instruments on the ground
and aboard airplanes and satellites, will help the scientist improve the accuracy of a
computer model in predicting the influence of industrial emissions of greenhouse gases on
global warming.

The scientist accesses a Web-based interface and requests 100 files of data from the
Department of Energy’s Atmospheric Radiation Measurement (ARM) data archive,
located at ORNL. In this archive are tape drives (for slower-speed but higher-capacity
storage) and disks (for high-speed access). They contain more than 4 million files
representing more than 25 terabytes of data. Three robots retrieve the tapes on which the
requested files are stored and load them for copying on the disk drive of the ARM Web
site server. Within an hour, the scientist can access the requested files.

For the past four years, the ARM data archive has been
using the High-Performance Storage System (HPSS),
storage-system software that leads the computer industry
in data capacity and transfer speeds and is standard for
storage systems in the high-performance computing
community. The ARM project is one of two large
customers for HPSS at DOE’s Center for Computational
Sciences (CCS) at ORNL, where Laboratory researchers
provide and support the data archive. HPSS manages the
hierarchy of storage devices, which holds more than 3.5
billion measurements. It can place 12,000 new files a
day into storage. Eventually, it should routinely find
and retrieve up to 5,000 files an hour to meet the
growing requests for information related to global
change.

The other large customer is the group of climate
prediction modelers using ORNL’s supercomputers.
They can produce a run of calculations generating 1
terabyte of data that needs to be stored. These results
may also be sent from ORNL to the data archives at
DOE’s National Energy Research Scientific Computing
Center (NERSC) in California in chunks of 250
megabytes.
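
The figures quoted in this article allow a rough, back-of-the-envelope estimate of what such a transfer involves. The sketch below is illustrative only: it assumes decimal units (1 terabyte = 10^12 bytes), treats the "more than 12 megabytes per second" reported for the CCS-to-NERSC link as a steady rate, and ignores protocol overhead and tape staging time. The function names are hypothetical and are not part of HPSS.

```python
# Back-of-the-envelope transfer estimate using figures quoted in the article.
# Assumptions (not from the article): decimal units, a steady link rate,
# and no protocol or tape-staging overhead.

RUN_BYTES = 1 * 10**12           # one climate-model run: 1 terabyte
CHUNK_BYTES = 250 * 10**6        # transfer chunk size: 250 megabytes
RATE_BYTES_PER_S = 12 * 10**6    # reported throughput: "more than 12" MB/s


def chunk_count(total_bytes: int, chunk_bytes: int) -> int:
    """Number of chunks needed to cover total_bytes (ceiling division)."""
    return -(-total_bytes // chunk_bytes)


def transfer_hours(total_bytes: int, rate_bytes_per_s: int) -> float:
    """Hours needed to move total_bytes at a constant rate."""
    return total_bytes / rate_bytes_per_s / 3600


if __name__ == "__main__":
    chunks = chunk_count(RUN_BYTES, CHUNK_BYTES)
    hours = transfer_hours(RUN_BYTES, RATE_BYTES_PER_S)
    seconds_per_chunk = CHUNK_BYTES / RATE_BYTES_PER_S
    print(f"chunks per run:    {chunks}")
    print(f"seconds per chunk: {seconds_per_chunk:.1f}")
    print(f"hours per run:     {hours:.1f}")
```

Under these assumptions, a 1-terabyte run splits into 4000 chunks and takes roughly 23 hours to send. Because the reported rate is a lower bound, the actual time could be shorter.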

HPSS was developed by a consortium of DOE national
laboratories and IBM. The DOE participants are ORNL,
Sandia, Lawrence Berkeley (LBNL), Los Alamos, and
Lawrence Livermore national laboratories. HPSS, which
received an R&D 100 Award in 1997, is marketed by IBM. ORNL researchers Deryl
Steinert, Vicky White, and Mark Arnold have been developing the graphical user interface
between the operator and the HPSS for running, monitoring, and otherwise managing the
system. More than 70 terabytes are now stored in ORNL’s production HPSS installation,
managed by Stan White, Nina Hathaway, and Tim Jones.

The ORNL mass-storage program also includes the Probe Storage Research Facility,
operated by Dan Million. In one Probe project, researchers Nagiza Samatova and George
Ostrouchov investigate the use of data mining to extract meaningful information from
massive scientific datasets.

Probe resources are also used for developing new software to send larger chunks of data
more rapidly over the network to such facilities as the CAVE virtual reality theater at
ORNL (see Visualization Tools).

“Our Probe staff recently accomplished one of our goals,” says Randy Burris, manager of
data storage systems for CCS. “Thanks in part to work by ORNL researchers Tom
Dunigan and Florence Fowler on network protocols, we are now using the bandwidth
between CCS and NERSC more effectively. We are now transmitting more than 12
megabytes per second over ESnet, DOE’s semiprivate portion of the Internet.”

Probe researchers also have a role in several projects funded by DOE’s Scientific
Discovery through Advanced Computing (SciDAC) program. For the Scientific Data
Management Integrated Software Infrastructure Center, a SciDAC project led by Arie
Shoshani of LBNL, Probe resources will be used to develop ways to improve data access
and transfer and to test and implement other concepts. Probe resources are also being
used in the DOE Science Grid and the Earth System Grid II projects. The SciDAC
project on climate prediction, led by John Drake, will be using the Probe facility to
determine how to transfer bulk amounts of data over the wide-area network. In work for
the SciDAC project on astrophysics modeling, led by Tony Mezzacappa, Ross Toedte
will be using Probe resources as he develops an effective visualization of the details of a
stellar explosion. Finally, Net100 project researchers will use Probe resources as they
seek to improve computer operating systems so that excellent network throughput can be
achieved easily, without extensive application-specific tuning.

The production and research elements of ORNL’s mass-storage program are providing
and promising valuable services to computational scientists throughout the Laboratory.
