HOWTO use Hive to SQLize your own Tweets – Part One: ETL and Schema Discovery

Your Twitter Archive

Twitter has a new feature, Your Twitter Archive, that enables any user to download their tweets as an archive. To request it, scroll to the bottom of your account settings page, where you'll find an option labelled 'Your Twitter archive.' Clicking it generates your tweets as a JSON/JavaScript web application and emails them to you as a zip file.

Be patient: this process can take several days, particularly if you have a lot of tweets (I personally have 24K tweets, and it took 4-5 days to receive mine).

After a few hours or days, you’ll receive an email with a download link. Download your tweets, and unzip them to reveal their contents.

Digging In: ETL

There is a file called tweets.csv, but it contains very little detail, so it is not the file we are interested in. The files we want, which contain the full tweet data, live in the data/js/tweets directory, one file per month. Each file begins with a single line of JavaScript (a variable assignment), followed by a JSON array of tweet objects.

The first thing we'll need to do is remove that first line of JavaScript. Once we do, each file is one large JSON array, which we can easily convert to the format Hive expects: one JSON object per line.

I've created a Python script that removes the first line of text and prints the tweets in one-JSON-object-per-line format here: convert.py.
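This isn't the convert.py script itself, but a minimal sketch of the same approach: drop everything up to the first newline, parse the remainder as a JSON array, and emit one JSON object per line. The `Grailbird.data.tweets_2013_01 =` prefix in the sample is my assumption about what the archive's JavaScript assignment looks like.

```python
import json

def to_json_lines(js_text):
    """Convert one monthly archive .js file body into one-JSON-object-per-line."""
    # Drop everything up to the first newline: the JavaScript variable
    # assignment that makes the file invalid JSON.
    _, _, body = js_text.partition("\n")
    tweets = json.loads(body)  # the remainder is a plain JSON array
    return [json.dumps(tweet) for tweet in tweets]

# Hypothetical sample mimicking a monthly tweet file.
sample = (
    'Grailbird.data.tweets_2013_01 =\n'
    '[ {"id": 1, "text": "hi"}, {"id": 2, "text": "yo"} ]'
)
lines = to_json_lines(sample)
for line in lines:
    print(line)
```

Each emitted line is a complete JSON document, which is what a Hive JSON SerDe expects when reading a table backed by these files.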

Schema Discovery

To load our tweets into Hive, we need a schema. JSON data carries no explicit schema, so we must infer one. Along these lines, Hortonworks co-founder Owen O'Malley created a tool, available on github as hive-json, that extracts a Hive schema from a collection of JSON documents. The output schema is the superset of the fields across all the documents, which makes a reasonable SQL schema: fields missing from a given document simply come out null.
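To make the superset idea concrete, here is a toy Python sketch of that kind of inference (not hive-json's actual implementation): infer a per-document structure, then merge the structures so the result contains every field seen in any document.

```python
import json

def merge_schema(a, b):
    """Merge two inferred schemas, keeping the superset of fields."""
    if a is None:
        return b
    if b is None:
        return a
    if isinstance(a, dict) and isinstance(b, dict):
        keys = set(a) | set(b)
        return {k: merge_schema(a.get(k), b.get(k)) for k in keys}
    return a  # toy assumption: primitive types agree across documents

def infer(value):
    """Infer a toy schema (Python type names) from a parsed JSON value."""
    if isinstance(value, dict):
        return {k: infer(v) for k, v in value.items()}
    if isinstance(value, list):
        item_schema = None
        for item in value:
            item_schema = merge_schema(item_schema, infer(item))
        return [item_schema]
    return type(value).__name__

# Two hypothetical tweet-like documents; only the second has a "geo" field.
docs = [
    {"id": 1, "text": "hello"},
    {"id": 2, "text": "world", "geo": {"lat": 1.0, "lon": 2.0}},
]
schema = None
for doc in docs:
    schema = merge_schema(schema, infer(doc))
print(schema)
```

The merged schema includes `geo` even though the first document lacks it; in the resulting Hive table, rows from that document would hold null there, which is exactly the "optional fields come out null" behavior described above.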