Q&A: Twitter Goes to the Library of Congress

Soon enough, all public tweets will be permanently enshrined in the annals of the world’s largest library. That includes everything from seemingly inane good-morning tweets to those documenting the bloody aftermath of the #iranelection.

Matt Raymond, communications director at the Library of Congress, tweeted a 106-character announcement Wednesday about the partnership with Twitter.

Library to acquire ENTIRE Twitter archive — ALL public tweets, ever, since March 2006! Details to follow.

That would date back to the first tweet ever by co-founder Jack Dorsey (“just setting up my twttr”). Within four years, Twitter has come a long way, now touting 105 million registered users who send out 55 million tweets a day. Mr. Raymond believes the library’s vast soon-to-be-acquired collection of tweets can be as valuable a tool as its collection of letters and photographs. In the following Q&A, Mr. Raymond walks us through how well equipped the library is to handle this new collection and the potential implications of the project.

What’s the idea behind this initiative between Twitter and the Library of Congress? It’s an extension of the type of work the library has done for 210 years, really, and in terms of digital material for at least a couple of decades. In the last 10 years, we had a congressional mandate to identify and acquire materials that are “born digital,” so a lot of that has been Web sites, but we’re also leading a nationwide partnership with 130 institutions to identify and gather additional types of born-digital information: state government records, data sets, geospatial information, that type of thing.

Not only is it the kind of born digital record we’re looking to keep, but it is directly related to the cultural and historical heritage of the country. It’s just a new type of technology about people documenting our society and what life is like at this time in history just as letters and journals and photographs and maps were in previous centuries.

What are the costs involved? To date, in the last 10 years, we’ve taken in about 167 terabytes’ worth of Web sites, and Twitter actually adds just a little less than 5 terabytes to that total.

Keep in mind consumer hard drives right now at the 1 terabyte level are a couple hundred dollars. So this is really less than 5 terabytes, and in the grand scheme of storage, this is something we can handle. We’ve got 167 terabytes, and we have a facility in Culpeper, Va. that is digitizing video-audio collections at the rate of 3,000 to 5,000 terabytes per year. We have the practice, we have the capabilities to acquire and manage digital information.
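For readers who want to check the scale Mr. Raymond describes, the arithmetic can be sketched in a few lines of Python. This is only a back-of-the-envelope illustration, not the library’s own accounting: the $200-per-terabyte price is an assumption standing in for “a couple hundred dollars,” the function names are made up for this sketch, and raw drive prices ignore the real costs of redundancy, staffing and long-term preservation.

```python
# Figures quoted in the interview.
TWITTER_TB = 5                 # "a little less than 5 terabytes" (upper bound)
WEB_ARCHIVE_TB = 167           # Web sites collected over the last 10 years
CULPEPER_TB_PER_YEAR = 3000    # low end of the 3,000 to 5,000 TB/year range

# Assumption for illustration: "a couple hundred dollars" per 1 TB drive.
ASSUMED_USD_PER_TB = 200


def raw_drive_cost(terabytes, usd_per_tb):
    """Cost of consumer drives needed to hold the data, in dollars."""
    return terabytes * usd_per_tb


def relative_size(new_tb, existing_tb):
    """Size of the new collection as a fraction of the existing one."""
    return new_tb / existing_tb


def days_at_rate(terabytes, tb_per_year):
    """Days needed to process this much data at a yearly throughput."""
    return terabytes / tb_per_year * 365


if __name__ == "__main__":
    cost = raw_drive_cost(TWITTER_TB, ASSUMED_USD_PER_TB)
    share = relative_size(TWITTER_TB, WEB_ARCHIVE_TB)
    days = days_at_rate(TWITTER_TB, CULPEPER_TB_PER_YEAR)
    print(f"Raw drive cost: about ${cost:,}")
    print(f"Relative to the Web archive: about {share:.1%}")
    print(f"Days of Culpeper throughput at the low end: about {days:.1f}")
```

Under those assumptions, the Twitter archive is roughly 3 percent the size of the existing Web collection, and it is less than a day’s worth of data at the Culpeper facility’s slower quoted rate.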

Does the Library do similar things with blog postings or Facebook updates? Facebook I think would be much more complicated. Twitter is characters, that’s all it is. Facebook is characters and videos and pictures and different levels of privacy settings. It would be vastly more complex and from a user perspective also would be much more complex. We aren’t contemplating, we’re not talking with any other social media sites right now.

This is the first time we’ve worked with a social media site and acquired an archive. Curators who have done some of the Web capture in the past have always crawled the Web to archive selected content. So for instance, our law library did a project during the hearings for Sonia Sotomayor to the Supreme Court, and they gathered a lot of the legal blogs and tweeters and people who were commenting on them all in one place. Clever researchers, mathematicians, people who are designing algorithms to explore this vast trove of information: what they can come up with is limited only by the imagination.

So, besides it being lightweight, why Twitter? Twitter came to us. They asked us and said, “Is this the type of thing that might have value?” and we said yes. And there were discussions and an agreement was reached, and we’re now in the process of working out how the handoff [of data] will be made, handled, categorized, made accessible and all those types of details.

Is user-generated content going to become a major mission for the Library? To the extent that content itself is somewhat being redefined. We have always collected user-generated content but generally that has been in different formats, whether that was a manuscript that someone wrote or a letter that someone wrote to the president. We have the papers of 23 presidents. We have the papers of 40 Supreme Court justices. This is just another way of generating material. It’s being facilitated in a much more democratized fashion now, but the library has always had an interest in “common people” or ordinary citizens. We’ve done oral history projects to get people’s reactions after the Pearl Harbor attack or after September 11. We have a veterans history project that has been going on 10 years and collected first-person accounts from almost 70,000 people who are either veterans or people who are in wartime or affected by wartime, and it’s proven to be a valuable historical resource.

What do you think the impact will be of this archive when looking back on history? I think this is a situation where we might not know what the impact is going to be. I think as we sit here in the year 2010, we can say, well, there have been many historical events, whether it’s been natural disasters, the protests in Iran or elections, where there’s obviously some value in the observations of not just people on the outside but the participants themselves at that time. I think you can paint a very interesting picture from that. But more than that, we’re getting a sense of how people saw the world, how they saw themselves, their society, what was important to them, even things sometimes people tend to trivialize, pop culture for instance. But that all helps make up the mosaic of us as Americans and people around the world. It’s not just an American phenomenon. I think you talk to historians, you talk to archaeologists, they seem to have the most interest in finding out at the microlevel what people were like, what they did, how they acted and behaved that was similar to or different from our own today.

It’s not a real-time type of thing like if you were to create a hashtag right now or if you were to look at what was a trending topic. This will be the mint-edition copy of Twitter for posterity, for researchers, so people will be able to go back and slice and dice the data however they want it, by year, by keyword. Again I think there are some very brilliant people who will devise some tools that will allow people to make these types of discoveries. I’m excited to see what they come up with.

How long will it take for the Library of Congress to archive all present tweets? I believe when they do the handoff of the data, it’s going to be everything from the beginning of Twitter up until at least yesterday when the agreement was formally signed. I don’t know if they can do that in real time. Like say if they’re ready to give it to us in a month, we’ll do everything up until that day. The important thing is it’s an ongoing basis, so it’s not like they’re giving us a big chunk of data and that’s it. It’ll be an ongoing partnership and we have to put these details, procedures, policies in place between now and roughly the fall when it will be made available to researchers. And then there will be a six-month window before it goes into the research archive.

Are you a tweeter yourself? I did it very idly before I came to the library in the very early days of Twitter and didn’t do much with it. And then after I got here, I’ve been working with a lot of colleagues to try to move the library into more social media realms. We launched a blog and YouTube and Facebook and Twitter and iTunes U, all kinds of stuff. We didn’t step into these areas until we saw a lot of worth for them. When we decided to do that with Twitter and all these other areas I was taking more responsibility for, I just sort of phased all of those aspects from my personal life and said, “Yeah, if I’m going to tweet I’m going to do it for the Library of Congress.”