I’ve just put up a small side project I whipped together over the last day or so called xkcd 2. It might not make much sense why I built this thing, so bear with me while I try to explain.

Why

I’ve been reading xkcd for years, but it wasn’t until fairly recently that I discovered most of the comics actually have a caption.1 It’s tucked away in the title attribute on the img tag, and in most browsers you really only see it if you hover your mouse over the image for a few seconds.

Discovering this was a total accident. I read most xkcd comics via the RSS feed in Reeder. If you zoom-pinch on an image in Reeder, it opens up a view like this:

An xkcd.com comic in Reeder

That panel at the bottom displays the title attribute, but it only shows up if you tap once on the image. Anyway, after coming across this in Reeder, I started looking at all the captions every time a new comic would come out. It turns out that in many cases the caption is funnier than the comic itself. Case in point: comic 1049, the one featured in the picture above, is my most recent favorite. The caption perfectly mirrors my own experience with Ayn Rand, and in my opinion makes the comic that much more interesting as a whole.

Anyway, I find it frustrating that the comic captions are not a more visible part of the xkcd site itself, so I decided to put together a separate site that loads the comics from xkcd and simply displays them in a slightly different way – and includes the comic captions directly. The same comic shown above looks like this on xkcd2.com:

An xkcd comic as shown on xkcd2.com

How

Now that you know the why of xkcd2.com, let’s talk about the how.

xkcd 2 is an exceedingly simple Flask app. It’s ~100 lines of Python (not including empty lines of course). A majority of the time was spent messing with the HTML and CSS, not on the Python itself.

Each time a comic is requested, I use the excellent httplib2 library to retrieve the corresponding page directly from xkcd.com. httplib2 behaves more like a browser than a typical HTTP library; it handles caching the page content and everything for me, so if the same comic is requested many times httplib2 will return the page content from its local file cache instead of bouncing the request to xkcd.com every time.2

Once I get the response and page content from xkcd.com, I parse the page into a DOM object using html5lib. Then I use some rather inefficient list comprehensions to find the right elements in the DOM and extract the relevant information. There are probably more efficient means of doing this, but the page structure is simple enough that it’s not too bad performance-wise. I did find myself desperately wanting server-side jQuery…3

Once I have all the comic details from the page, I store it all in a pickle file so I can load it from the local disk rather than parsing the HTML on each request. Thus, if httplib2 reports that it used its cached version rather than retrieving a fresh page (via the response.fromcache property), I discard httplib2’s response entirely and simply load the comic metadata from the pickle file.

Once I have the metadata, I just pass it to the page template and render the page.

But… but…

I considered a couple of different high-level options before settling on the ‘request the page, parse the metadata’ approach. First, since the set of comics is finite (1056 as of this writing), I considered writing a simple scraper that would ‘preload’ all of the comics and just load from that cache. However, in the end this isn’t much different than what I’m doing. I still would have had to retrieve the page, parse out the relevant data, and store it. The current approach does this lazily, which makes a lot more sense since I doubt most people are looking through really old comics (though it is interesting to see how Munroe’s style has changed over the years). I still get the benefits without having this separate ‘build the cache’ step.

I also considered using the RSS feed rather than scraping HTML. This was appealing because the RSS will be less likely to change format. If xkcd.com undergoes a rewrite, my current HTML parsing code will probably break.4 However, the feed is a snapshot of the most recent comics, not all of them, and I really wanted to support all of them. How broken would it feel if someone sent you a link to a comic on xkcd2.com and it didn’t load because there was a new comic published that morning and the old comic was no longer in the RSS feed? Spoiler: very broken.

Anyway, hopefully some folks will find this useful and/or interesting. Code is up on github and the site is live at xkcd2.com.

There are no doubt some subtleties in the httplib2 cache behavior that I did not dive into. My hunch is that a request still gets made to xkcd.com – even just a HEAD request – because httplib2 needs to somehow determine whether its cache is still valid. I’m not an expert on its behavior. Regardless, using its cache is almost certainly better than naively proxying the request to xkcd.com – and waiting for/parsing the response – each time.
[return]

This is probably actually possible using node.js, but it seemed like overkill for this project.
[return]

I do use IDs and classes and avoid relying on actual page structure as much as possible, but sometimes it’s unavoidable. HTML scraping is always brittle no matter what precautions you take or how smart you try to be.
[return]