Preserving digital data for the future of eScience

From the August 30, 2008 issue of Science News

Libraries and other archives of physical culture have been struggling for
decades to preserve diverse media — from paper to eight-track tape recordings —
for future generations. Scientists are falling behind the curve in protecting
digital data, threatening the ability to mine new findings from existing data
or validate research analyses. Johns Hopkins University cosmologist Alex Szalay
and Jim Gray of Microsoft, who was lost at sea in 2007, spent much of the past
decade discussing challenges posed by data files that will soon approach the
petabyte (10¹⁵, or quadrillion, byte)
scale. Szalay commented on those challenges in Pittsburgh
during an address at this summer’s Joint Conference on Digital Libraries and in
a follow-up interview with senior editor Janet Raloff.

Scientific data
approximately double every year, due to the availability of successive new
generations of inexpensive sensors and exponentially faster computing. It’s
essentially an “industrial revolution” in the collecting of digital data for
science.

But every year it
takes longer to analyze a week’s worth of data: even though computing speed
and data collection roughly double annually, the ability to perform software
analyses doesn’t. So analyses bog down.

It also becomes
increasingly hard to extract knowledge. At some point you need new indexes to
help you search through these accumulating mountains of data, performing
parallel data searches and analyses.

Like a factory
with automation, we need to process and calibrate data, transform them,
reorganize them, analyze them and then publish our findings. To cope, we need
laboratory information-management systems for these data and to automate more,
creating work-flow tools to manage our pipelines of incoming data.

In many fields,
data are growing so fast that there is no time to push them into some central
repository. Increasingly, then, data will be distributed in a pretty anarchic
system. We’ll have to have librarians organize these data, or our data systems
will have to do it themselves.

And because there
can be too much data to move around, we need to take our analyses to the data.

We can put
digital data onto a protected system and then interconnect it via computer
networks to a space in which users can operate remotely from anywhere in the
world. Users get read-only privileges, so they cannot make any changes to the
main database.

For the Sloan
Digital Sky Survey data, we have been giving an account to anyone with an
e-mail address. People with accounts can extract, customize and modify the data
they use, but they have to store it in their own data space. We give them each
a few gigabytes.

We currently have
1,600 users who use [Sloan data] on a daily basis. Those data become a
new tool. Instead of pointing telescopes at the sky, users can “point” at the
data collected from some portion of the sky and analyze what they “see” in this
virtual universe.

This is leading
to a new type of eScience, where people work with data, not physical tools.
Once huge data sets are created, you can expect that people will find ways to
mine them in ways we never could have imagined.

But the key to
its success is a new paradigm in publishing, in which people team up to
publish raw data, perhaps in an overlay journal or as supplements to research
papers. Users would be able to tag the data with annotations, giving these data
added value....

The Sloan Digital
Sky Survey was to be the most detailed map of the northern sky. We thought it
would take five years. It took 16. Now we have to figure out how to publish the
final data — around 100 terabytes [0.1 petabyte].

The final
archiving of the data is in progress. There are going to be paper and digital
archives, managed by the University of Chicago and Johns Hopkins libraries.

Today, you can
scan one gigabyte of data or download it with a good computer system in a
minute. But with current technologies, storing a petabyte would require about
1,500 hard disks, each holding 750 gigabytes. That means it would take almost
three years to copy a petabyte database — and cost about $1 million.
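Those figures can be checked with back-of-envelope arithmetic. The sketch below assumes the rates quoted above (roughly one gigabyte per minute and 750-gigabyte disks); the quoted "almost three years" presumably includes overhead beyond raw sequential transfer:

```python
# Back-of-envelope check on the petabyte figures quoted above.
PETABYTE_GB = 1_000_000   # 1 petabyte = 10^15 bytes = one million gigabytes
DISK_GB = 750             # capacity of one hard disk, per the text
GB_PER_MINUTE = 1         # quoted rate: ~1 GB scanned or downloaded per minute

# ~1,333 disks at raw capacity; the text's ~1,500 allows for overhead/redundancy
disks_needed = PETABYTE_GB / DISK_GB

# One million minutes of transfer, i.e. ~1.9 years of *continuous* copying
minutes_to_copy = PETABYTE_GB / GB_PER_MINUTE
years_to_copy = minutes_to_copy / (60 * 24 * 365)

print(round(disks_needed), round(years_to_copy, 1))
```

Even under these optimistic assumptions (no downtime, sustained transfer rate), a full copy takes on the order of years, which is why the interview argues for taking the analysis to the data rather than moving the data.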

We generally try
to geoplex, which means keeping multiple copies at remote geographic locations.
That way, if there is a fire here or a meltdown there, backup copies are
unlikely to be affected. We’re also trying to store data on different media.
Eventually, I think we’ll probably load data on DVDs or something, which can go
into cold storage. We’ll still have to recopy them periodically if we want
digital data to survive a century or more.

This is something
that we have not had to deal with so far. But it’s coming — the need to
consider and plan for curation as data are collected. And it’s something that
the National Science Foundation is looking at: standards for long-term
digital-data curation.