I spent the last few months studying and implementing some routines that take a raw HTML document (or documents) and do stuff with it (them). Subotai is a library that consolidates some of these routines. In this blog post I will describe what is currently implemented and what the roadmap is.

In manipulating HTML documents for features, I find myself needing to use some operations all the time - removing script tags, comments and the like. This feature-set is available in HtmlCleaner and I thus merged the two libraries to produce enlive-helper.

I wrote this post to teach myself consistent hashing - a simple hash family that Akamai’s founders came up with. This was originally done to prepare for a talk in my grad algorithms class (I made a horlicks of the talk but whatever). I am going to provide intuition, analysis and a clojure implementation.

Java’s string trim routine tests for whitespace using Character.isWhitespace and so does Clojure’s clojure.string/trim.

While processing a dataset off the web containing unicode space characters, the trim routine failed to do anything useful. Luckily, a StackOverflow thread suggested using a routine from Google’s guava library. So in Clojure, you can do this:

Back when Blekko had an API, I threw this implementation of their API together for a small project I was working on. You can only query blekko with this (the API allows slashtag manipulation and all that which I didn’t bother with).

I recently got my hands on the common-crawl URL dataset from here and wanted to compute some stats like the number of unique domains contained in the corpus. This was the perfect opportunity to whip up a script that implemented Durande et. al’s log-log paper.

In 2010, I purchased my first Kindle and since then apart from GEB [1], I haven’t bothered with physical copies. The Kindle store satisfies most of my needs (I find situations where the paperback costs less than the digital copy and refuse to buy the book on principle).

The books can be read on any platform (OS X, iOS for iPad and iPhone in my case and I do remember a rather unpleasant Kindle app on WP7)

One of the benefits of a digital book is that it should be straightforward for me to collect a list of highlights I’ve made about the book. Amazon (in their infinite wisdom) have not provided an API in the 3 or so years I’ve used the Kindle ecosystem and manually transcribing the quotes is not something I am interested in doing. Scraping remains the only alternative. I decided to use clojure for this task.