CT Visionary :: Michael Keller

Stanford University (CA) Librarian
Michael Keller was among the leading
digital archiving experts who
headed to Paris this past November for the
inaugural meeting of the Sun Preservation
and Archiving Special Interest
Group (Sun PASIG),
a Sun Microsystems-sponsored
community dedicated to working
on the unique problems of storage and data
management, workflow, and architecture for
very large digital repositories.
Sun PASIG brings together a large group of organizations
for an ongoing global discussion of their research
and to share best practices for preservation and archiving.
Here, CT asks Keller for his perspectives on the
effort, and on Sun PASIG's overall goals.

What sparked your professional interest in the work
of Sun PASIG? More than 10 years ago, we in the library
profession began to realize that we had to take responsibility
for preserving (both for the long term and for
access) the digital objects that were coming to us in
increasing waves and numbers of flows, from varying
sources. Over those 10 years, a lot of developments took
place and a lot of projects launched, but none of them
were particularly large-scale, at least the ones that anybody
can talk about. We know that the government and
secret agencies are doing a lot of big-scale gathering,
but we don't know whether they are preserving anything.
So, we need both software and hardware technology
that can [work well] across very complex hardware
arrays, but can also ingest across a very wide variety of
data formats and what we might call "digital genres."

What is your own institution's perspective on that
need? At Stanford, we recognized about five or six years
ago that the university was producing various kinds of
digital information on the order of 40 terabytes per
year, as well as consuming information on the order of
40 terabytes per year. And, of course, that number has
only increased in the intervening five years. Within those
years, Stanford also signed on for the Google Book
Search project, which, if our original
ambitions are realized, will initially yield something on the
order of a petabyte-and-a-half of digital information, for
an initial database at Stanford of the books sent forward
for Google to digitize. And that would be the first copy of
the material, before we do anything to it. So, with those
kinds of numbers floating around, we realized that we
had to have a comprehensive solution to the problem of
preservation of bits and bytes, the problem of access to
copies of those files [for redundancy], and the problem of
ingesting at a very, very high level in order to get the digital
goods into the digital repository.

Starting four years ago, we acquired a big tape robot
and some spinning disk technology that was intended to
help us understand how to manage the huge flow of digital
objects that we need to ingest and preserve. We found
that what was missing was a very effective spinning disk
technology. We'd experimented with a few of them, and
frankly, there were points of failure with most of them that
revealed themselves in operation. But Honeycomb [a storage
technology recently introduced by Sun Microsystems],
which we've tested very extensively, subjecting it
to all the same stress tests and the same experiences as
other technologies, has proven to be quite robust. In fact,
it was when Sun came up with the Honeycomb technology,
which we started beta testing a little more than a year-and-a-half
ago, that we realized we had the last piece in
what I hope will be the first generation of hardware architecture
that will handle very, very large digital archives.

Is it true that the Sun Honeycomb technology is combining
the storage disk array with compute functions? Yes, Honeycomb has CPUs to run programs in the array.
From my perspective, that's the beginning of creating interoperable
information objects: so the storage array can
compute on the objects that are in it. And I'm not sure how
far that goes with this version of Honeycomb, but it seems
clear to me that's where it's heading.

And what about Honeycomb's approach to redundancy? That's an important matter. Think of Honeycomb as an array
of 32 spinning disks. The firmware that runs those disks
takes the files in, and it distributes a few copies of those files
onto several of those spinning disks. It distributes them so
that should one of those disks fail, various others will contain
the digital object that you put in. It's an instantaneous redundant
storage solution that manages itself. So, it handles the
redundancy and it does so without us having to manage it.
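
As a rough illustration of the idea Keller describes, the sketch below places a few copies of each object on distinct disks in a 32-disk array and checks that the object stays readable when any one disk fails. It is a toy model and not Sun's firmware: the copy count of three, the hash-based placement, and the names `place_copies` and `readable_after_failure` are all assumptions made for the example.

```python
import hashlib

NUM_DISKS = 32   # from the interview: "an array of 32 spinning disks"
COPIES = 3       # assumption: "a few copies" is taken to mean three


def place_copies(object_id: str, num_disks: int = NUM_DISKS,
                 copies: int = COPIES) -> list[int]:
    """Pick `copies` distinct disks for an object, derived from its id."""
    if copies > num_disks:
        raise ValueError("cannot place more copies than there are disks")
    digest = hashlib.sha256(object_id.encode("utf-8")).digest()
    start = int.from_bytes(digest[:8], "big") % num_disks
    # Consecutive disks (wrapping around) guarantee the copies are distinct.
    return [(start + i) % num_disks for i in range(copies)]


def readable_after_failure(object_id: str, failed_disks: set[int]) -> bool:
    """True if at least one copy sits on a disk that has not failed."""
    return any(d not in failed_disks for d in place_copies(object_id))


if __name__ == "__main__":
    disks = place_copies("scanned-book-0001")  # hypothetical object id
    print("copies stored on disks:", disks)
    print("survives loss of first disk:",
          readable_after_failure("scanned-book-0001", {disks[0]}))
```

Because placement is computed from the object's identifier, no one has to track by hand where the copies live, which is the "manages itself" property in miniature.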

Even so, do you need more than one approach to protect
such vast amounts of important data? We know that we
have to have a combination of magnetic disk technology,
near-line tape storage, and offline tape storage, which we
will have to carefully manage for the very long term, until we
see different technologies becoming available to us or different
technologies becoming more robust and more appropriate
to the various missions we have set for ourselves. At
Stanford, we're doing both near-line and online storage, and
indeed, offline as well, to protect us in the long haul. We're
also looking for one or two partners in storage away from
North America, who would support one another against catastrophic
failure in a kind of warm failover site situation.

Finally, from a broader perspective, what has the Sun
PASIG set out to accomplish? This past June, at an initial
meeting [to lay the groundwork for] the preservation and
archive special interest group, we had a dozen institutions
in attendance, such as the British Library, Oxford University (UK),
the Bibliothèque nationale de France, The Johns
Hopkins University (MD), and the National Library of Sweden.
The goal for this gathering was to serve as a kind of
instant peer review group. That meeting was focused mainly
on business drivers and architecture. But the [full, inaugural
meeting this past November] was much larger in terms of
numbers of institutions and people, and its concerns were
expanded to workflows, policy, and use cases. We still
wanted to spend time on architecture specifications,
design specifications, and software and hardware choices,
but our intent was to broaden the conversation because
there are serious issues around what will certainly become
very, very large digital archives.

Previous developments like DSpace and Fedora are providing us with
evidence that if the initiating institutions work hard and produce
some experiences to discuss in our professional publications
and meetings, then in the next generation, we may
end up at a place where institutions without the same kind of
IT prowess (without the kind of great IT support in the form
of programmers, database analysts, systems administrators,
and managers) may be able to run equally large digital
archives without needing that initial big investment.