Transferring “Libraries of Congress” of Data

The following is a guest post by Nicholas Taylor, Information Technology Specialist for the Web Archiving Team.

If science reporters, IT industry pundits and digital storage and network infrastructure purveyors are to be believed, devices are being lab-tested even now that can store all of the data in the Library of Congress or transmit it over a network in mere moments. To this list of improbable claims, I’d like to add another: by the most conservative estimates, I transfer more than a Library of Congress’ worth of data to the Library of Congress every month.

By Flickr user MysteryBee (Henrik Bennetsen) under CC BY-SA 2.0

Clearly, that doesn’t make any sense, but allow me to explain. You may have noticed that the “data stored by the Library of Congress” has become a popular, if unusual, unit of measurement for capacity (and the subject of a previous Library of Congress blog post, to boot). More cautious commentators instead employ the “data represented by the digitized print collections of the Library of Congress.” My non-exhaustive research (nonetheless, corroborated by Wikipedia) suggests that in instances where a specific number is quoted, that number is most frequently 10 terabytes (and, in a curious bit of self-referentiality, the Library of Congress Web Archiving program is referenced in Wikipedia to help illustrate what a “Terabyte” is). Whence the 10 terabytes?

The earliest authoritative reference to the 10 terabytes number comes from an ambitious 2000 study by UC Berkeley iSchool professors Peter Lyman and Hal Varian which attempted to measure how much information was produced in the world that year. In it, they note with little fanfare that 10 terabytes is the size of the Library of Congress print collections. They subsequently elaborate their assumptions in an appendix: the average book has 300 pages, is scanned as a 600 DPI TIFF, and, finally, compressed, resulting in an estimated size of 8 megabytes per book. At the time of the study’s publication, they supposed that the Library of Congress print collections consisted of 26 million books. Even taking these assumptions for granted, the math yields a number much closer to 200 terabytes. Sure enough, the authors note parenthetically elsewhere in the study that the size of the Library of Congress print collections is 208 terabytes. No explanation is offered for the discrepancy with the other quoted number.
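As a quick check of that arithmetic, here is a minimal sketch in Python. The book count and per-book size are the figures quoted from the study; the decimal unit conversion (1 terabyte = 1,000,000 megabytes) and the function name are assumptions made for illustration, not something the study specifies.

```python
# Figures quoted from the study's appendix.
BOOKS_IN_COLLECTION = 26_000_000
MEGABYTES_PER_BOOK = 8

# Assumption: decimal units, i.e. 1 TB = 1,000,000 MB.
MEGABYTES_PER_TERABYTE = 1_000_000


def collection_size_tb(books, mb_per_book):
    """Return the estimated collection size in terabytes."""
    return books * mb_per_book / MEGABYTES_PER_TERABYTE


estimate_tb = collection_size_tb(BOOKS_IN_COLLECTION, MEGABYTES_PER_BOOK)
print(f"Estimated print collection size: {estimate_tb:.0f} TB")
```

Under those assumptions the result is 208 terabytes, which matches the parenthetical figure in the study rather than the widely quoted 10.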

By Flickr user mandiberg under CC BY-SA 2.0

For whatever reason, though, it’s the 10 terabyte figure that took hold in the public’s imagination. To be sure, 10 terabytes is an impressive amount of data, but it’s far less impressive than the amount of data that the Library of Congress actually contains (and, I suspect, even just counting the print collections). While I’m neither clever nor naïve enough to propose what a more realistic number might be, returning to my original provocation, I did wish to further discuss a digital collection I know quite well: the Library of Congress Web Archives.

As explained previously in The Signal, we currently contract with the Internet Archive to perform our large-scale web crawling. One ancillary task that arises from this arrangement is that the generated web archive data (roughly 10 terabytes per month) must be transferred from the West Coast to the Library of Congress. This turns out to be non-trivial; it may take the better part of a month with near-constant transfers over an Internet2 connection to move 10 terabytes of data. For all the optimism about transmitting “Libraries of Congress” of data over networks, putting data on physical storage media and then shipping that media around remains a surprisingly competitive alternative. Case in point: for all of the ethereality and technological sophistication implied by so-called cloud services, at least one of the major providers lets users upload their data in the comparatively mundane manner of mailing a hard drive.
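To put the transfer in perspective, the following Python sketch estimates the average sustained rate implied by moving 10 terabytes in about a month. The 30-day window, the decimal definition of a terabyte, and the function name are all assumptions for illustration; "the better part of a month" is not a precise figure, so treat the result as a rough order of magnitude rather than a measured throughput.

```python
SECONDS_PER_DAY = 86_400
BYTES_PER_TERABYTE = 10**12  # assumption: decimal terabytes
BITS_PER_BYTE = 8


def average_megabits_per_second(terabytes, days):
    """Average sustained rate needed to move `terabytes` in `days`."""
    total_bits = terabytes * BYTES_PER_TERABYTE * BITS_PER_BYTE
    total_seconds = days * SECONDS_PER_DAY
    return total_bits / total_seconds / 1_000_000


# Assumption: a full 30-day month of near-constant transfer.
rate = average_megabits_per_second(terabytes=10, days=30)
print(f"Average sustained rate: {rate:.1f} Mbit/s")
```

With these inputs the average works out to roughly 31 megabits per second, sustained around the clock, which helps explain why shipping physical media stays competitive.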

Of course, transfer is just the initial stage in our management of the web archive data; the infrastructure demands compound when you consider the requirements for redundant storage on tape and/or spinning disk, internal network bandwidth, and processor cycles for copying, indexing, validation, and so forth. In summary, I doubt that we have spare capacity to store and process many more “Libraries of Congress” of data than we are currently (though perhaps that’s self-evident).

Suffice it to say, I look forward to a day when IT hardware manufacturers can legitimately claim to handle magnitudes of data commensurate with what is actually stored within the Library of Congress (whatever that amount may be). In the meantime, however, I suppose I’d settle for the popular adoption of fractional “Library of Congress” units of capacity (e.g., “.000001% of the data stored at the Library of Congress”) – likely no more or less realistic than what the actual number might be, but at least it’d more appropriately aggrandize just how much data the Library of Congress has.

6 Comments

I’ve always had a problem with the 10 TB number. It’s one of those numbers you see tossed around with too great a frequency lacking any reference (kind of like the “we only use 10% of our brains”, which I’m inclined to actually believe in the case of certain reality TV stars) to take seriously. The Library of Congress is, in effect, a mystical unknown to most Americans, and represents a romantic ideal of preservation and data, and as such is an evocative measure of data.

The text alone, without formatting, of the 2002 edition of the Encyclopedia Britannica adds up to about 264 GB, so it’s preposterous to assume the LoC amounts to a mere 40 times that number.

Even dealing with raw text, the encoding can matter quite a bit. Unicode encodings such as UTF-16 can increase the size of a document considerably. For most English documents and other languages which use a Latin script, UTF-16 roughly doubles the file size relative to ASCII, while UTF-8 leaves pure ASCII text unchanged and grows only where non-ASCII characters appear. Admittedly, this is only necessary when the source document uses characters not present in ASCII, but it certainly is worth considering.

I’m using this number as a trivia note in a project where, if I’m wrong, it won’t negatively affect anyone in any important way, and I’m going to take the liberty of citing the 208 TB number as a “Very conservative estimate circa 2000”, which I think is a lot more reflective of reality than the 10 TB number.

I like the LOC, but I also view it as a tentacle of state power with a stranglehold on information. So I’m interested in its size and the corresponding size of the internet, the same way as I’d be interested in the size of two fighters when placing bets before a match. Too bad the admins at the LOC have apparently decided it’s too difficult to weigh their fighter.
