Description Peddlers and Data.gov: Two Peas In a Pod

As you may have heard, the National Archives issued a press release today announcing the release of three data sets on Data.gov:

The first milestone of the Open Government Directive was met on January 22 with the release of new datasets on Data.gov. Each major government agency has uploaded at least three datasets in this initial action. The National Archives released the 2007–2009 Code of Federal Regulations and two datasets from its Archival Research Catalog. This is the first time this material is available as raw data in XML format.

The Archival Research Catalog, or ARC, is NARA's primary access system for archival description, representing 68% of NARA's entire holdings. This breaks down to the following:

2,720,765 cubic feet

520 record groups

2,365 collections

102,598 series

3,265,988 file units

292,887 items

In addition, there are 6,354,765,793 logical data records and 465,050 artifacts described in ARC.

NARA's decision to share this data is a breakthrough for archives and people who love data. The size of the data provided by NARA in ARC is also immense; the combined descriptions plus contextual information on represented organizations total approximately 21 gigabytes when uncompressed.

Obviously, transferring this much data is difficult, and when I first decided to get my grubby paws on it, I was quite shocked to discover that NARA hadn't bothered to compress it in the first place. Undeterred, I corresponded over Twitter with a few people who were just as interested in the data, specifically Simon Spero at the UNC School of Information and Library Science and Richard Urban at UIUC's Graduate School of Library and Information Science. The three of us made a concerted effort to grab the data from NARA's web server and make a compressed version available.
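For anyone who wants to do the same with their own copy, here is a minimal sketch of compressing a large XML file in Python. It is only an illustration: gzip is an assumption made for the example and not necessarily the format of the compressed version, and the function names and the file name in the usage note are hypothetical. The file is streamed in chunks so that a multi-gigabyte download never has to fit in memory.

```python
import gzip
import shutil
from pathlib import Path


def compress_file(source: Path, chunk_size: int = 1024 * 1024) -> Path:
    """Gzip `source` into a sibling file with a .gz suffix, streaming in chunks."""
    target = source.with_name(source.name + ".gz")
    with source.open("rb") as raw, gzip.open(target, "wb") as packed:
        # copyfileobj reads and writes chunk_size bytes at a time,
        # so memory use stays flat regardless of the file size.
        shutil.copyfileobj(raw, packed, chunk_size)
    return target


def compression_ratio(source: Path, target: Path) -> float:
    """Return compressed size divided by original size (source must be non-empty)."""
    return target.stat().st_size / source.stat().st_size
```

Calling `compress_file(Path("arc_descriptions.xml"))` (a made-up file name) would write `arc_descriptions.xml.gz` alongside the original. Repetitive XML markup tends to compress well, which is why shipping it uncompressed is such a waste of bandwidth.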

We've talked about posting a torrent, but between the compression and the high bandwidth available from ibiblio, it doesn't seem to be quite as pressing a need. However, if you'd like one, it could be arranged. More detail on the datasets, including information about the tags and structure of the data within, can be found on Data.gov.