The battle to store the world's tweets for the future

I TWEET. Quite a bit. A quick scroll through my feed will show messages that run the gamut: a link to an article about Wi-Fi on Mount Fuji (see “Web posts help to spot drugs’ side effects”), a joke with my colleagues about the Victorian origins of the word “sext”, a GIF of an ostrich chasing a giraffe.

All those thousands of tweets, and billions more by others, will be saved for all time in the hallowed halls of the Library of Congress in Washington DC. One of the largest libraries in the world, it promised in 2010 that it would organise and preserve every tweet ever sent.

Five years later, the archive still isn’t available. What’s taking so long? And, some might wonder, why even bother in the first place?

Twitter is an important piece of our cultural legacy, says Jessamyn West, a library technologist in Vermont. “A library doesn’t just tell the one story you read in history books, but all of the interwoven multicultural threads and stories that are going on,” she says.

Twitter helps to tell the story of some of the biggest political and social movements of the last decade, including the Arab Spring, Occupy Wall Street, and ISIS. “It keeps track of who we are and what we’ve been,” West says.

Some tweets that seem boring today could be much more interesting in the future. Someone currently tweeting in obscurity could one day become prime minister. Even anonymous mundane musings about the weather or what someone ate for lunch might be valuable to researchers. Aggregated, the messages capture how people use language and communicate with one another.

“Tweets that seem boring today will be interesting in the future if that person becomes prime minister”

But saving all of Twitter poses problems of daunting magnitude. Michael Zimmer, a privacy and internet ethics specialist at the University of Wisconsin-Milwaukee, has tracked the library’s progress, and last week outlined some of its greatest obstacles (First Monday, doi.org/56k).

One problem, of course, is the sheer number of tweets. When the library started the project, there was already a four-year backlog of about 21 billion tweets. Now, half a billion tweets are sent daily.

Another problem is that a tweet isn’t just a string of 140 characters. Each comes with a packet of metadata, including when and where the tweet was sent, who sent it, and how many people marked it as a favourite, shared it or responded. There are over 100 information fields in all.
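To give a sense of what this means in practice, here is a minimal sketch of a tweet record and its metadata. The field names below are simplified, illustrative stand-ins, not the actual Twitter API schema, and only a handful of the 100-plus real fields are shown.

```python
import json

# Illustrative only: simplified stand-in fields, not Twitter's real schema.
raw = json.dumps({
    "text": "Just saw an ostrich chase a giraffe",   # the 140-character message
    "created_at": "2015-01-01T12:00:00Z",            # when it was sent
    "user": {"screen_name": "example_user"},          # who sent it
    "coordinates": None,                              # where (if shared)
    "favorite_count": 3,                              # how many favourited it
    "retweet_count": 1,                               # how many shared it
})

tweet = json.loads(raw)
# The visible message is just one field among many in the record.
message = tweet["text"]
metadata_fields = [k for k in tweet if k != "text"]
```

The point is that archiving a tweet means archiving the whole record, so the storage and indexing burden is many times larger than the 140 characters a reader sees.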

For help managing all this data, the Library of Congress turned to Gnip, a social media aggregation company in Boulder, Colorado, that has since been bought by Twitter. It slices the Twitter stream into hour-long segments for the library to download. The library checks each chunk for corruption, notes down some statistics and then, as a security measure, saves copies of the file in two different locations.

But just storing tweets isn’t enough. In the library’s latest update on the project, it said that a single search of that initial batch sent before 2010 could take 24 hours – far too slow to be usable.

Figuring out how to make the search function work needs to be a priority, says West. “People don’t just want to know that the data is being handled. They want access to the content, tools to interact with it.”

Meanwhile, many people don’t realise that their tweets, even public ones, are being recorded, says Zimmer, who is concerned that the library has not announced plans to give people control over their own archives. He thinks the site should consider giving people an opt-out, or even the ability to delete tweets that they regret.

“For a lot of people, it was never in their frame of understanding that their tweet would be kept on a historical record for all eternity,” he says.