When working with Twitter data, one of the most interesting questions is always what URLs tweets are linking to. As Twitter users discuss any given topic or issue, the URLs they share provide us with an indication of the online media they’re drawing on for information and/or entertainment – and by counting which sites appear most frequently, we’re also able to measure the relative visibility or relevance of such sites.

But of course, there’s a complication: the vast majority of URLs in tweets have been shortened using a variety of URL shorteners, and multiple short URLs may point to the same eventual target; additionally, it’s even possible – and not too uncommon – for shortener nesting to occur: for example, a bit.ly short URL might subsequently be shortened by ow.ly, and finally by t.co, in the course of retweeting. Working with the short URLs themselves is less than useful, therefore – and we must find ways to resolve them to their eventual target.

I’ve never been entirely happy with my previous approach to resolving short URLs – a somewhat clunky but generally functional solution built on Gawk (of course) and Wget. With each URL to be resolved triggering a separate Wget call, and generating temporary files along the way, the results were neither elegant nor particularly fast. Speed can only be improved up to a point, of course – the script will still have to ping each short URL at least once to see where it points, and that process is largely dependent on Internet access and server response times – but nonetheless there was plenty of room for optimisation in my previous attempts.

So, here’s a new approach to this problem. Instead of Wget, we’ll be using the command-line tool cURL, which is able to work through batch lists of URLs and can be made to send all of its resulting output to a single file which we can then process further. The one downside of cURL is that – unlike Wget – it only follows a single URL resolution hop; this means that where URL shorteners are nested, our script will need to be run at least twice to resolve the final remaining short URLs (a small price to pay for the greater convenience).

Update: It’s worth noting here that Twitter has recently introduced its own URL shortener, t.co, as a mandatory shortener – even links which have already been shortened using bit.ly or other tools are now shortened again to t.co URLs. Recent yourTwapperkeeper archives will contain only t.co links, therefore. This also means that at least two passes of urlresolve.awk will be required to unshorten those URLs to their eventual destinations: one pass to remove the t.co shortening, and another to resolve any remaining short URLs. You might even want to run a third pass for good measure (later passes should run considerably faster as they’ll find far fewer URLs still to resolve).

Installing cURL

So, the first step in switching over is to install cURL itself, which is available for a wide variety of platforms here. If you’re expecting your Twitter data to contain https://… (secure http) URLs in addition to standard http://… links, make sure you install an SSL-capable version of cURL. On Windows XP, I’ve found the ‘Win32 – Generic’ cURL version by Dirk Paehl to work well (for https support, use the version ‘with SSL’, and you’ll also need to install the openssl library available from the same site). On Windows Vista or 7, the ‘Win64 – Generic’ version by Don Luchini is fine; it looks like you’ll also need to install the Microsoft Visual C++ Redistributable package, though (details here). If you’re using a Mac, see Jean’s comment below for installation instructions – a list of package options is here.

cURL (and the openssl libraries, if you need them) need to be placed in the command path. Since – if you’ve been using any of our Gawk scripts at all so far – you’ve already installed Gawk, the easiest solution is to place curl.exe (and any openssl .dll files you may also need) in the same directory as Gawk itself. Most likely, this is C:\Program Files\GnuWin32\bin (on Windows XP) or C:\Program Files (x86)\GnuWin32\bin (on Windows Vista/7). To test whether cURL is installed and working, open a command window and try something like

curl --head --insecure https://google.com/

(change the https to http if you’ve installed a version of cURL which doesn’t do secure http). If your shell finds cURL, and cURL itself doesn’t complain that it’s missing a library somewhere, you’re ready to roll.

Resolving URLs

The first step in resolving URLs is to extract them from a Twapperkeeper/yourTwapperkeeper CSV/TSV dataset. The process for this remains exactly as before – the existing urlextract.awk script from our scripts package does this for us (and generates multiple lines in the resulting file if a tweet happens to contain multiple URLs). Simply run the script as follows:

gawk -F , -f urlextract.awk input.csv >output.csv

(I don’t have to remind you to use \t instead of , as the separator if you’re working with tab-separated files, do I?)

Now, then, it’s time to unveil our new urlresolve.awk script, which replaces the previous solution:

This script takes the output from urlextract.awk and resolves all short URLs in the original dataset; it adds the resolved URLs in a new column 'longurl' which is inserted before the existing data. The script is called as follows:

gawk -F , -f urlresolve.awk [maxlength=x] input.csv >output.csv

The (optional) maxlength argument specifies what we consider to be a short URL, and can further speed up the processing time: since the very point of short URLs is that they're, well, short, we can assume relatively safely that comparatively long URLs already point to the final destination URL, and don't need resolving. If maxlength isn't specified, it defaults to a relatively conservative value of 30 characters; to save time, you could drop that value down to 25 or less. A typical bit.ly URL, including the 'http://' part, is 20 characters long, a URL using Facebook's shortener fb.me is 22 characters, and youtu.be URLs clock in at 27 characters, so you'll need to work out your own comfort zone here.

It's also important to note that (again also to save time) the script will automatically skip over any URLs pointing to the image hosting services YFrog, Twitpic, Imgur, Twitgoo, and Instagram. The URLs used by such sites are short, but - to the best of my knowledge - don't resolve any further, so there's no need to process them here. (If you're aware of any other widely used non-resolving short URL services, or if any of the services listed above do occasionally resolve to different URLs, please let me know!)

The urlresolve.awk script creates two temporary files in the working directory: filename_urllist_temp and filename_urllist_temp_resolved. These can be safely deleted once the script has finished, but may also be handy for spot-checking whether any problems have occurred during URL resolution.

As I've mentioned, cURL will only take one step in the URL resolution process. A multiply shortened URL won't have arrived at its final destination in one pass, therefore. It's useful to inspect the output file from the first pass visually (e.g. in Excel) to check whether there still are many shortened URLs remaining in the new 'longurl' column which urlresolve.awk has added. If so, simply run the script again, using the output file from the first pass as your input:

gawk -F , -f urlresolve.awk [maxlength=x] output.csv >output2.csv

Since the first pass will have resolved many short URLs already (so that they will now be above the maxlength threshold), this second pass should conclude considerably more quickly. It will add yet another new 'longurl' column to the left of the existing data table. Repeat the process as often as necessary if any particularly obstinate cases remain.

Further Steps

Once you're happy with the URL resolution outcomes, it's easy to use the resulting dataset to find the most cited URLs or examine other relevant patterns in the data. In particular, it may also be useful to examine citation patterns not on the basis of fully qualified URLs, but by looking for the most cited domains only - this provides a better overview of which news or information sources overall were most widely used.

To truncate URLs to their domain name, use our existing urltruncate.awk script (also from our Gawk scripts package). It adds yet another new column, 'domain', to the left of the existing dataset, and is run as follows: