What sucks, who sucks and you suck

Parsing a Twitter Archive

2015-03-11

I just signed up to Pinboard because I wanted a permanent resource to
capture all the links I’ve posted and retweeted on Twitter. While Pinboard
integrates well in terms of capturing links from your ongoing feed, it
will only work backwards to the previous 3200 tweets due to Twitter’s API
limit. So the first thing I wanted to do was process my long-term
Twitter archive to get everything from the previous four years. You’d
think other people would want this too, so something must exist to do it,
right? Wrong.

In the end, I had to write a small Ruby hack
to parse the data from the archive, grep the links and post them to
Pinboard. It was actually fairly easy once I’d found a suitable example
via Google and identified which variant of each Ruby gem was currently
maintained and working. I added a few comfort touches like
expansion of shortened links and decoding of HTML entities (Twitter uses
them; Pinboard doesn’t). What took the most time was understanding the
Twitter archive data format, since there doesn’t appear to be any formal
documentation for it. But it’s basically JSON, so it’s fairly readable once
you’ve perused a few example tweets. [N.B. I’m not a JSON expert or a
regular coder and have apparently forgotten the little Ruby I ever
learned, so treat all this as the desperate grasps for comprehension of a
total naif.]

Your tweets are all stored in datestamped files within the
data/js/tweets directory of the archive. Each file is formatted in JSON
except for the first line (beginning ‘Grailbird…’), which needs to be
discarded:
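Something like the following sketch works for me (`load_month` is just a name I’ve made up; rather than counting lines, I strip everything before the opening bracket, which amounts to the same thing and copes with the prefix spilling onto more than one line):

```ruby
require 'json'

# Load one month's tweets from a data/js/tweets file.
# The file is JSON preceded by a JavaScript assignment beginning
# 'Grailbird...', which we discard by stripping everything up to
# the opening '[' of the tweet array.
def load_month(path)
  raw = File.read(path)
  json = raw.sub(/\A[^\[]*/, '')
  JSON.parse(json)
end
```

Each call returns a plain Ruby array of tweet hashes, ready to iterate over.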

What you’re left with is an array of individual tweets for that month,
each element containing hashes of the various components of the tweet,
beginning with the source (i.e. the app or website used to post the
tweet). The key parts required for Pinboard posts are any hashtags and
URLs from the entities hash, the text and the created_at field. Note
that the urls entity doesn’t appear to be populated in older tweets
(circa 2010), so you need to grep the text with a suitable regex to locate
any links if this hash is empty (I used the URI.extract method for this).
If the entity is populated then take the expanded_url field. (The urls
entity is actually an array of URLs but as a Pinboard post can only show
one link, I only take the first element each time. However, there’s a
fallback method to view any others, as discussed further below.) I used the
LongURL module to try to expand each link to the final destination target,
bypassing any URL shortening used (itself often shortened again using
Twitter’s t.co shortcut) and generate a meaningful link.
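The link-finding logic sketched above looks roughly like this (`link_for` is a hypothetical helper name; I’ve left out the LongURL expansion step, so this just returns whatever the archive or the tweet text yields):

```ruby
require 'uri'

# Find the link (if any) for a tweet hash from the archive.
# Older tweets may have an empty urls entity, in which case we fall
# back to scraping the tweet text with URI.extract.
def link_for(tweet)
  urls = tweet.dig('entities', 'urls') || []
  if urls.empty?
    # Scrape the raw text for anything http(s)-ish.
    URI.extract(tweet['text'].to_s, %w[http https]).first
  else
    # A Pinboard post can only show one link, so take the first
    # element and prefer its expanded form over the t.co wrapper.
    urls.first['expanded_url'] || urls.first['url']
  end
end
```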

Similarly, the hashtags entity is an array of hashtags in the tweet, so I
iterate over that and gather the text item from each entry for
Pinboard’s tagtext array parameter.
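That iteration is a one-liner (again, `tags_for` is just an illustrative name):

```ruby
# Gather the text of each hashtag from a tweet's entities hash,
# ready to pass to Pinboard as tags.
def tags_for(tweet)
  (tweet.dig('entities', 'hashtags') || []).map { |h| h['text'] }
end
```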

The text part becomes the Pinboard description field; since this
contains the original text including all the links as posted, it acts as a
backup of the original URI(s) in the event that the link expansion doesn’t
work correctly. One thing I’ve learnt from this: obsolete URL shorteners
are destructive to the Internet’s memory,
since you’re left with no easy way to recover the original link
destination. (Principal offender here is The Browser’s apparently defunct b.rw app,
which means that their older posted links are now all invalid. Bit of a
drawback for a curation site, that.) Also, many sites replace obsolete
page links with redirects to their top level home page (or the page of the
company that bought them out), which is no help at all. I guess that’s the
drawback of relying on an ‘ephemeral’ medium like Twitter for archiving.

The only tricky part concerns (native) retweets: the tweet contains
details of your retweet, including the ‘RT’ header with the retweeted
user’s name and abridged text, while the original tweet is nested in a
retweeted_status field within that tweet. This means that when you find
such a field, you need to pull the relevant details from the surrounding
tweet (I use the text for Pinboard’s description field as it shows the
attribution, and retain the datestamp of the retweet rather than the
original) and then extract the retweet for the actual link (you can
treat this as a normal element; i.e. unwrap the parent element and proceed
as before). I put the original, unabridged tweet text in Pinboard’s
extended description field.
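Putting that retweet handling together, the shape of it is something like this sketch (`retweet_parts` is a made-up name; the field names are as they appear in my archive):

```ruby
# Separate the pieces needed for a Pinboard post. For a native
# retweet, links and hashtags come from the nested original tweet,
# while the outer tweet supplies the 'RT @user' attribution text
# and the datestamp of the retweet itself. For an ordinary tweet
# the outer and inner parts are simply the same.
def retweet_parts(tweet)
  original = tweet['retweeted_status'] || tweet
  {
    'source_tweet' => original,           # pull links/hashtags from here
    'description'  => tweet['text'],      # outer text keeps the attribution
    'extended'     => original['text'],   # unabridged original text
    'date'         => tweet['created_at'] # datestamp of the (re)tweet
  }
end
```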

Unfortunately, I haven’t been able to process my entire archive as the one
I’d previously downloaded only extends to February 2013 and, when I try to
request an updated one from Twitter, it first wants to verify my email
address and then fails to send a confirmation email for this purpose,
despite continuing to send notifications successfully to the same address.

I’m surprised that there appears to be little other work undertaken in
mining and analysing Twitter archives, as there are probably a number of
vaguely useful stats and summaries that could be generated from them. But
then I guess most of that isn’t readily monetisable, particularly as the
data isn’t considered current.