Case Study

Researchers at the Georgia Tech Research Institute (GTRI) are
sharing results of advanced file-format recognition research with The National
Archives of the United Kingdom. The
effort could enhance worldwide capability to manage the vast array of file formats
created since the computer age began.

Improving archivists’ ability to categorize
and access hundreds of different computer file formats is critical in the
digital age. Increasingly, archives
receive large quantities of government and other records in a wide variety of
digital formats.

“The ultimate problem we’re
addressing here is technical obsolescence,” said William Underwood, a principal
research scientist leading the file-recognition
effort for GTRI. “As software programs have been superseded over the years, it’s
become critical to automate the enormous task of categorizing, verifying and
viewing hundreds of past and present file formats.”

One major facilitator of that task is
the PRONOM service, developed by The National Archives of the UK. This file-format registry, which archivists and others worldwide can use
online, employs a database containing
details of more than 750 different digital file formats. That format
information, in turn, is used by a file-format identification tool called DROID.

Underwood explained that archivists
face the task of distinguishing among data files in hundreds of different
formats. At the most basic level, categorizing these data formats requires software
tools that examine file extensions, which are the identifying characters such
as “doc” or “pdf” found at the end of filenames.

Yet a file extension, an external identifier
that is easily modified or deleted, can be inaccurate. More critical is the ability to correctly identify
the distinctive internal signature that characterizes a file’s format.
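
The difference between the two approaches can be illustrated with a minimal Python sketch. It is an assumption-laden simplification, not the method DROID uses or the way PRONOM stores signatures: the small signature table holds only a few widely documented leading byte sequences, and the function names (format_from_extension, format_from_signature, identify) are hypothetical.

```python
from pathlib import Path
from typing import Optional

# Illustrative table only: a few widely documented leading byte sequences
# ("magic numbers"). This is NOT PRONOM signature data.
SIGNATURES = [
    (b"%PDF-", "pdf"),
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"GIF87a", "gif"),
    (b"GIF89a", "gif"),
    (b"PK\x03\x04", "zip"),
]


def format_from_extension(filename: str) -> Optional[str]:
    """Guess the format from the external identifier (the file extension)."""
    suffix = Path(filename).suffix.lower().lstrip(".")
    return suffix or None


def format_from_signature(data: bytes) -> Optional[str]:
    """Guess the format from the leading bytes of the file's content."""
    for magic, name in SIGNATURES:
        if data.startswith(magic):
            return name
    return None


def identify(filename: str, data: bytes) -> dict:
    """Report both guesses and flag the case where they disagree."""
    by_ext = format_from_extension(filename)
    by_sig = format_from_signature(data)
    return {
        "extension": by_ext,
        "signature": by_sig,
        # Only a disagreement between two known answers counts as a mismatch.
        "mismatch": by_ext is not None and by_sig is not None and by_ext != by_sig,
    }


if __name__ == "__main__":
    # A PDF that was renamed with a misleading ".doc" extension.
    print(identify("report.doc", b"%PDF-1.4\n"))
```

In this sketch a renamed PDF is still recognized from its content, while the extension alone would have pointed to the wrong format. A real identification tool has to go further. Many formats share a container signature (for example, several document formats are ZIP archives internally), so a leading-bytes check like this one would report them all as "zip".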

GTRI, in cooperation with the U.S. National
Archives and Records Administration (NARA), is helping the United Kingdom expand
the roster of internal signatures in the PRONOM database. GTRI has added more
than 50 such signatures to PRONOM in recent months, increasing the number of
signatures in the database by almost a quarter, with more additions expected
next year. This work is being performed at the request of the National Archives
Center for Advanced Systems and Technologies (NCAST), a NARA unit.

Currently, about a third of PRONOM’s
750 file formats have internal signatures. Increasing the number of internal signatures
is important, Underwood said, because it helps the DROID tool identify files
more accurately. In turn, increased accuracy enables digital archivists to better
identify older, obsolete file formats and develop appropriate migration
strategies and preservation tools.

“We are
grateful to NARA and the Georgia Tech Research Institute for the work they have
recently undertaken on file-format research,” said David Thomas, director of
technology at The National Archives of the UK.
“The decision to share their work … has significantly improved the
PRONOM database and will be of enormous benefit to the wider digital
preservation community.”

The technology contributed to The
National Archives of the UK is derived from GTRI’s research into Advanced
Language Processing Technology Applied to Digital Records, a project sponsored by the U.S. Army Research Laboratory and by NCAST. This work applies computational
linguistics technology to summarizing, accessing, reviewing and preserving
electronic records of the Department of Defense, federal agencies and presidential
administrations.

“In
PRONOM/DROID, The National Archives of the UK has responded to an essential
need for preserving and providing sustained access to valuable digital
information,” said Kenneth Thibodeau, director of NCAST. “We are happy to be able to contribute to
enhancing a tool that we use in NARA’s Electronic Records Archives system. This
helps us and also benefits anyone who needs to preserve digital assets.”

The first
version of PRONOM was developed by The National Archives’ Digital Preservation
Department for internal use in March 2002 and was launched as a free online service
to the public in February 2004. In 2007 The National Archives won the Digital
Preservation Award for its development of the PRONOM and DROID tools.

In 2011, PRONOM
data will be released in a linked, open format. This move will make it easier
for others to reuse the data, and will provide a means to extend and develop
the dataset. More information is available at http://labs.nationalarchives.gov.uk/wordpress/.

“The GTRI computational-linguistics
team will certainly continue to contribute to PRONOM,” Underwood said. “We’re eager to use our experience in
language-processing technology to support the evolution of this internationally
important file-format database.”