Why the Library of Congress cares about archiving our tweets

Ars speaks to the Library of Congress and gets the backstory behind the …

"Happiness is knowing yourself, loving yourself, and being yourself, F**** anyone who doesn't get you"

"Surprising. Obama did not claim Beck, Hannity, Limbaugh as dependents even though their income totally dependent on him."

"If queue numbers in the Store are correct, Blizzard is making over half a million dollars an hour on the Celestial Steed."

The US government is paying good money to archive tweets such as the above for posterity, but why?

Those "top tweets" from the past week will join billions of others—every tweet since Twitter launched in 2006, in fact—in a new archive at the 210-year-old Library of Congress. There, they will reside in air-conditioned comfort on servers that also hold the Library's current 167TB archive of Web data.

This was big news—so big that when the Library of Congress blogged about it last week, traffic to the site brought down the entire loc.gov server.

Oh, I'm going to be in the Library of Congress because I tweet! Awesome!

Um, I'm going to be in the Library of Congress because I tweet? Creepy.

Wow, what a terrific resource for research.

What a complete waste of taxpayer money.

Anderson herself takes the pragmatic view. "We don't think the whole thing is the most precious treasure" on earth when you drill down to individual tweets, she says, but looked at in the aggregate—yes, it will be a terrific resource.

The Library sees Twitter as a "technology change" in the way we communicate. Those don't happen often, and the Library has an interest in documenting each major new communications platform.

For instance, "one of the things we see emerging is that Twitter is a news distribution medium now," she says. Issue a press release, especially on paper, and response is minimal. But when the Library tweeted about its plans, feedback was immediate and overwhelming.

Not that the Library went after this archive, though; the initiative came from Twitter, which approached the Library about taking a custom data dump of its complete public archive (private tweets will not be included).

After "long discussions with Twitter over this," Anderson and other LoC officials agreed to take on the data with a few conditions: it would not be released as a single public file or exposed through a search engine, but offered as a set only to approved researchers.

Details on how the data will be provided are still being worked out. Anderson expects that it will eventually arrive as XML markup, but the two groups have yet to settle the formatting.

Challenges exist, of course; for instance, how should the Library handle shortened URLs? Decades from now, such shortening services may be gone, and the URL will leave no trace of what was originally linked in the tweet. Those links are a major part of Twitter's importance, because they show what is being shared. One can imagine the terrific charts that could be drafted based on which Twitter communities linked to which sites with what frequency.

And pictures—how should they be handled? Twitter itself currently does not host photos, so Anderson doesn't envision the Library trying to crawl beyond Twitter in order to get copies of the pictures. "We don't normally go beyond the domain" that's being archived, she says; the same approach will probably be taken here.

As far as "format rot" goes, the Twitter data set should be easy enough to use, even decades from now. It's not locked up in some custom and ancient video or audio codec; this is XML-structured text, so preservation should be a straightforward matter. (The bigger problem may be parsing the sometimes cryptic slang used to cram tweets into 140 characters; while some academics now study the medieval handwriting of scriptorium monks, one can imagine a future in which people specialize in interpreting early-21st-century TXTSPK.)

Twitter has just announced a major new feature: arbitrary metadata can be associated with tweets in the future, opening up new opportunities for location awareness, etc. The LoC will also gain access to this data.

Who wants to read this? Your grandkids

For researchers, this could prove to be a tremendous archive. Imagine delving into Elizabethan England and wanting to know more about daily life—what did people eat and drink, what jokes did they tell, how mobile were they? This data exists in limited forms, like personal journals or letters, but it's fragmentary and limited largely to the elites.

Now imagine how much we would know with a Twitter archive from the period:

This sort of thing is a gold mine of local information, rather than just "the documents of courts or the documents of legislators," in Anderson's words.

As geo-tagging of tweets becomes common, the possibilities multiply. Anderson recently saw one researcher generating digital maps of 17th-century Spanish smuggling networks to analyze how they worked; similar maps could be applied to tweets. One could plot the use of the word "hoagie" on a US map, for instance, or track "soda" vs. "pop" across time. Political and social trends could be mapped, as could the mobility of Twitter users.

"We're not really looking at these as individual tweets," says Anderson.

But there could be a future use even for the individual tweets: genealogy. Imagine reading the complete archive of your grandmother's Twitter stream and what a window into her life it might provide. (This being Twitter, that window can sometimes be cracked a little too wide for comfort, of course.)

As one of our chief national archives, the Library can find it hard to draw lines around what content should be maintained and what should be left free to blow away in the digital breeze. Given the new arrangement with Twitter, I ask Anderson if Facebook might be next.

There's a laugh. "If Facebook became something that was really useful or needed [by us], then we might, but there are no plans," she says.