UTR

August 15, 2017

I’ve always intended to use this blog as more of a place for rough working notes as well as somewhat more fully formed writing. So in that spirit here are some rough notes for some digging into a collection of tweets that used the #unitetheright hashtag. Specifically I’ll describe a way of determining what tweets have been deleted.

Note: be careful with deleted Twitter data. Specifically be careful about how you publish it on the web. Users delete content for lots of reasons. Republishing deleted data on the web could be seen as a form of Doxing. I’m documenting this procedure for identifying deleted tweets because it can provide insight into how particularly toxic information is traveling on the web. Please use discretion in how you put this data to use.

So I started by building a dataset of #unitetheright data using twarc:

twarc search '#unitetheright' > tweets.json

I waited two days and then was able to gather some information about the tweets that were deleted. I was also interested in what content and websites people were linking to in their tweets because of the implications this has for web archives. Here are some basic stats about the dataset:

Deletes

So how do you get a sense of what has been deleted from your data? While it might make sense to write a program to do this eventually, I find it can be useful to work in a more a more exploratory way on the command line first and then when I’ve got a good workflow I can put that into a program. I guess if I were a real data scientist I would be doing this in R or a Jupyter notebook at least. But I still enjoy working at the command line, so here are the steps I took to identify tweets that had been deleted from the original dataset:

First I extracted and sorted the tweet identifiers into a separate file using jq:

jq -r '.id_str' tweets.json | sort -n > ids.csv

Then I hydrated those ids with twarc. If the tweet has been deleted since it was first collected it cannot be hydrated:

twarc hydrate ids.csv > hydrated.json

I extracted these hydrated ids:

jq -r .id_str hydrated.json | sort -n > ids-hydrated.csv

Then I used diff to compare the pre and post hydration ids, and used a little bit of Perl to strip of the diff formatting, which results in a file of tweet ids that have been deleted.

Since we have the data that was deleted we can now build a file of just deleted tweets. Maybe there’s a fancy way to do this on the command line but I found it easiest to write a little bit of Python to do it:

After you run it you should have a file delete.json. You might want to convert it to CSV with something like twarc’s json2csv.py utility to inspect in a spreadsheet program.

Calling these tweets deleted is a bit of a misnomer. A user could have deleted their tweet, deleted their account, protected their account or Twitter could have decided to suspend the users account. Also, the user could have done none of these things and simply retweeted a user who had done one of these things. Untangling what happened is left for another blog post. To be continued…