DCW Volume 1 Issue 16 – Archiving Twitter

This week the Digital Scholarship Commons (DiSC) at Emory University released “Tweeting #OWS,” a project that archives and maps over 10,000,000 tweets associated with the Occupy Wall Street movement. The project matters because it shows how archiving and mapping tweets can help preserve history. Natasha Lennard of Salon.com recently reported that a massive amount of data associated with the Arab Spring has already disappeared from the web. Citing a new report by two researchers at Dominican University, Lennard notes that over 27% of linked content is already gone. She recounts several reasons for the loss: content providers might move the servers that host the information, paywalls may restrict access, or Twitter’s own practice of making tweets more than a week old difficult to find might keep them out of reach.

ProfHacker covered the demise of apps like TwapperKeeper a couple of years ago, and reported that Twitter’s API policies make it difficult to keep track of tweets for extended periods of time. Scott Turnbull, head developer at Emory Library, designed a new program he calls “Twap” to harvest the tweets used in the DiSC #OWS project. As explained on Turnbull’s GitHub site, Twap “uses the tweetstream library and django to query the twitter streaming api for tweets filtered by terms in a searchlist.” Tweets are archived as JSON data that includes information like geolocation and the time the tweet was published. Turnbull began harvesting the tweets last October, after conversations with Andy Famiglietti of UT-Dallas, myself, and the rest of the DiSC team. More recently, graduate fellows Sarita Alami, Moya Bailey, and Katie Rawson worked with Brian Croxall, Stewart Varner, and the Digital Scholarship Solutions Analyst Jay Varner to create heatmaps, graphs, and a web interface that help users navigate the data. They also used topic modeling and distant reading methods to determine the frequencies of specific words associated with certain topics. The project even envisions future applications for a wider sample of data: “Do people who have decided to locate themselves have different discourse strategies or topics than people who don’t? What would happen if we added in other locating factors (origin of user, names of places)? What would this map look like if we charted different places over time?”
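For readers curious about what this kind of processing involves, here is a minimal Python sketch of the three steps described above: filtering tweets against a search list, counting word frequencies, and binning geolocated tweets into grid cells for a heatmap. It is not Twap's actual code and does not call the Twitter API. The sample tweets, the search terms, the function names, and the field names (`text`, `created_at`, and a `coordinates` object holding a longitude/latitude pair) are all assumptions made for illustration.

```python
import json
import math
from collections import Counter

# Hypothetical search list; the terms the project actually used are not given here.
SEARCH_TERMS = ["#ows", "occupy"]

# Invented sample records, one JSON document per line. Field names are assumed.
SAMPLE_LINES = [
    '{"text": "Marching with #OWS today", "created_at": "2011-10-15T14:02:00Z",'
    ' "coordinates": {"type": "Point", "coordinates": [-74.01, 40.71]}}',
    '{"text": "Occupy Atlanta meets tonight", "created_at": "2011-10-16T22:30:00Z",'
    ' "coordinates": null}',
    '{"text": "Lunch was great", "created_at": "2011-10-16T12:00:00Z",'
    ' "coordinates": null}',
]


def load_tweets(lines):
    """Parse one JSON tweet per non-empty line."""
    return [json.loads(line) for line in lines if line.strip()]


def matches(tweet, terms=SEARCH_TERMS):
    """Return True if the tweet text contains any search term (case-insensitive)."""
    text = tweet.get("text", "").lower()
    return any(term in text for term in terms)


def word_frequencies(tweets):
    """Count lowercased, whitespace-separated tokens across all tweet texts."""
    counts = Counter()
    for tweet in tweets:
        counts.update(tweet.get("text", "").lower().split())
    return counts


def heatmap_bins(tweets, cell=1.0):
    """Count geolocated tweets per grid cell of `cell` degrees.

    Keys are the (latitude, longitude) of each cell's south-west corner.
    Tweets without coordinates are skipped.
    """
    bins = Counter()
    for tweet in tweets:
        point = (tweet.get("coordinates") or {}).get("coordinates")
        if point:
            lon, lat = point  # assumed order: longitude first, then latitude
            key = (math.floor(lat / cell) * cell, math.floor(lon / cell) * cell)
            bins[key] += 1
    return bins


harvested = [t for t in load_tweets(SAMPLE_LINES) if matches(t)]
frequencies = word_frequencies(harvested)
bins = heatmap_bins(harvested)
```

Only one of the two matching sample tweets carries coordinates, so only one lands in a heatmap cell. That gap is the same one behind the project's question about whether people who choose to locate themselves tweet differently from those who don't.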

Twitter has become an indispensable part of our contemporary historical archive, especially when communication about historical events happens largely through social media outlets. By limiting access to its archive, Twitter is shirking its duty to the public. But “Tweeting #OWS” also shows us a new way to engage in public history: by harvesting, archiving, and distant reading millions of tweets, we can get large-scale maps of how historical events develop. The project also shows quite powerfully how the digital humanities points the way forward for historical research. If archives are not openly available for research, sometimes it’s important to program your way to the required resources.