Tuesday, January 29, 2013

I blogged three weeks ago about the Library of Congress ingesting the Twitter feed, noting that the tweets were ending up on tape. The collection is over 130TB and growing at 190GB/day. The Library is still trying to work out how to provide access to it; for example, they cannot afford the infrastructure that would allow readers to perform keyword searches. This leaves the 400-odd researchers who have already expressed a need for access to the collection stymied. The British Library is also running into problems providing access to large collections, although not as large as Twitter. They are reduced to delivering 30TB NAS boxes to researchers, the same approach that Amazon and other services have taken to moving large amounts of data.

I mentioned this problem in passing in my earlier post, but I have come to understand that this observation has major implications for the future of digital preservation. Follow me below the fold as I discuss them.

Tuesday, January 22, 2013

At IDCC2013 in Amsterdam I presented the paper Distributed Digital Preservation in the Cloud in which Daniel Vargas and I described an experiment in which we ran a LOCKSS box in Amazon's cloud. Or rather, I gave a talk that briefly motivated and summarized the paper and then focused on subsequent developments in cloud storage services, such as Glacier. Below the fold is an edited text of the talk with links to the resources. I believe that video of the talk (and, I hope, the interesting question-and-answer session that followed) will be made available eventually.

Friday, January 18, 2013

Following on from my talk at the 2012 Fall CNI meeting on 11th December, Gerry Bayne interviewed me about the economics of using cloud services for preservation. The edited 12-minute MP3 has been posted on the Educause website. I think I did a pretty good job of explaining the fundamental business reasons why institutions are going to continue to waste large amounts of money buying over-priced storage from the commercial cloud providers.

Tuesday, January 15, 2013

At the suggestion of my long-time friend Frankie, I've been reading Trillions, a book by Peter Lucas, Joe Ballay and Mickey McManus. They are principals of MAYA Design, a design firm that emerged from the Design and CS schools at Carnegie-Mellon in 1989. Among its founders was Jim Morris, who ran the Andrew Project at C-MU on which I worked from 1983-85. The ideas in the book draw not just from the Andrew Project's vision of a networked campus with a single, uniform file name-space, as partially implemented in the Andrew File System, but also from Mark Weiser's vision of ubiquitous computing at Xerox PARC. Mark's 1991 Scientific American article "The Computer of the 21st Century" introduced the concept to the general public, and although the authors cite it, they seem strangely unaware of work going on at PARC and elsewhere for at least the last 6 years to implement the infrastructure that would make their ideas achievable. Follow me below the fold for the details.

Monday, January 7, 2013

MIT's Technology Review has a nice article about Scott Ainsworth et al.'s important paper How Much Of The Web Is Archived? (readable summary here). The paper reports an important initial step in measuring the effectiveness of Web archiving, and Scott and his co-authors deserve much credit for it. Below the fold I summarize the paper and raise some caveats as to the interpretation of the results. Tip of the hat to the authors for comments on a draft of this post.

getting to the point where they have caught up with ingesting the past, even though some still remains to be processed into its final archival form,

and having an automated process in place capable of ingesting the current tweets in near-real-time.

The numbers are impressive:

On February 28, 2012, the Library received the 2006-2010 archive through Gnip in three compressed files totaling 2.3 terabytes. When uncompressed the files total 20 terabytes. The files contained approximately 21 billion tweets, each with more than 50 accompanying metadata fields, such as place and description.

As of December 1, 2012, the Library has received more than 150 billion additional tweets and corresponding metadata, for a total including the 2006-2010 archive of approximately 170 billion tweets totaling 133.2 terabytes for two compressed copies.

Notice the roughly 10-to-1 compression ratio. Together, the two copies of the archive would be in the region of 1.3PB uncompressed. The average compressed tweet takes up about 130×10^12/(2×170×10^9) ≈ 380 bytes, so the metadata is far bigger than the 140 or fewer characters of the tweet itself. The Library is ingesting about 0.5×10^9 tweets/day at 380 bytes/tweet, or 190GB/day, or about 2.2MB/s of bandwidth (ignoring overhead). These numbers will grow as the flow of tweets increases. The data ends up on tape:

Tape archives are the Library’s standard for preservation and long-term storage. Files are copied to two tape archives in geographically different locations as a preservation and security measure.
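
The back-of-the-envelope figures for bytes per tweet and ingest rate can be reproduced with a short Python sketch. The inputs are the numbers quoted from the Library's report, with the 133.2TB total rounded to 130TB as in the estimate; the variable names are purely illustrative, and the 0.5 billion tweets/day rate is the assumption used in the estimate, not a measured value.

```python
# Rough check of the storage and bandwidth estimates.
# All inputs are the rounded figures used in the text; names are illustrative.
TB = 10**12
GB = 10**9
SECONDS_PER_DAY = 86_400

compressed_bytes_two_copies = 130 * TB   # 133.2TB in the report, rounded down
copies = 2
total_tweets = 170e9                     # approximately 170 billion tweets
tweets_per_day = 0.5e9                   # assumed current ingest rate

# Average compressed size of one tweet (text plus metadata) in a single copy.
bytes_per_tweet = compressed_bytes_two_copies / (copies * total_tweets)

# Daily growth of one compressed copy, and the sustained bandwidth it implies.
daily_bytes = tweets_per_day * bytes_per_tweet
bytes_per_second = daily_bytes / SECONDS_PER_DAY

# Compression ratio of the 2006-2010 archive: 20TB uncompressed, 2.3TB compressed.
compression_ratio = 20 / 2.3

print(f"bytes per compressed tweet: {bytes_per_tweet:.0f}")
print(f"ingest per day: {daily_bytes / GB:.0f} GB")
print(f"bandwidth: {bytes_per_second / 1e6:.1f} MB/s "
      f"({bytes_per_second * 8 / 1e6:.1f} Mb/s)")
print(f"compression ratio: {compression_ratio:.1f} to 1")
```

Note that the bandwidth comes out at about 2.2 megabytes per second, which is nearly 18 megabits per second, and that the measured compression ratio of the first delivery is somewhat under 10-to-1.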

The scale and growth rate of this collection explain the difficulties the Library has in satisfying the 400-odd requests it has already received from scholars to access it for research purposes:

The Library has assessed existing software and hardware solutions that divide and simultaneously search large data sets to reduce search time, so-called “distributed and parallel computing”. To achieve a significant reduction of search time, however, would require an extensive infrastructure of hundreds if not thousands of servers. This is cost-prohibitive and impractical for a public institution.

This is a huge and important effort. Best wishes to the Library as they struggle with providing access and keeping up with the flow of tweets.