Exploring data science with cheap servers and cheap tricks

I’d like to start this blog by discussing my first Kaggle data science competition – specifically, the “GigaOM WordPress Challenge”. This was a competition to design a recommendation engine for WordPress blog users; that is, to predict which blog posts a WordPress user would ‘like’ based on prior user activity and blog content. This post will focus on how my engine used the WordPress social graph to find candidate blogs that were not in the user’s direct ‘like history’ but were central in their ‘like network.’

My Approach

My general approach – consistent with my overkill analytics philosophy – was to abandon any notions of elegance and instead blindly throw multiple tactics at the problem. In practical terms, this means I hastily wrote ugly Python scripts to create data features, and I used oversized RAM and CPU from an Amazon EC2 spot instance to avoid any memory or performance issues from inefficient code. I then tossed all of the resulting features into a GLM and a random forest, averaged the two models’ predictions, and hoped for the best. It wasn’t subtle, but it was effective. (Full code can be found here if interested.)
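To make the "average the results" step concrete, here is a minimal sketch in plain Python. It assumes each model has already produced a like-probability per candidate post; the function name blend_and_rank, the post IDs, and the scores are all made up for illustration and are not taken from the competition code.

```python
def blend_and_rank(glm_scores, forest_scores, top_n=3):
    """Average two models' scores per post and return the top_n post IDs.

    Both arguments are dicts mapping a post ID to a predicted probability.
    Only posts scored by both models are blended.
    """
    blended = {
        post: (glm_scores[post] + forest_scores[post]) / 2
        for post in glm_scores.keys() & forest_scores.keys()
    }
    # Highest blended score first; ties are broken by post ID for stable output.
    return sorted(blended, key=lambda post: (-blended[post], post))[:top_n]


# Hypothetical scores for three candidate posts.
glm_scores = {"post_1": 0.75, "post_2": 0.25, "post_3": 0.125}
forest_scores = {"post_1": 0.5, "post_2": 0.75, "post_3": 0.25}

recommended = blend_and_rank(glm_scores, forest_scores, top_n=2)
print(recommended)
```

A straight average like this is about the simplest blend available: it needs no tuning, and it tends to help when the two models make different kinds of mistakes.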

The WordPress Social Graph

From my brief review of other winning entries, I believe one unique quality of my submission was its heavy use of the WordPress social graph. (Fair warning: I may abuse the term ‘social graph,’ as it is not something I have worked with previously.) Specifically, a user ‘liking’ a blog post creates a link (or edge) between that user’s node and the blog’s node, and these links construct a graph connecting users to blogs outside their current reading list:
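As a rough illustration of that graph, the sketch below builds the user-to-blog edges from a list of likes and then walks two hops out from a user (user, liked blog, fellow liker, their other blogs) to find blogs outside the user's own like history. The users, blogs, and function names are hypothetical, and the path count is only a simple stand-in for a centrality score, not necessarily the measure my engine used.

```python
from collections import Counter, defaultdict


def build_like_graph(likes):
    """Index (user, blog) like pairs in both directions."""
    blogs_by_user = defaultdict(set)
    users_by_blog = defaultdict(set)
    for user, blog in likes:
        blogs_by_user[user].add(blog)
        users_by_blog[blog].add(user)
    return blogs_by_user, users_by_blog


def candidate_blogs(user, blogs_by_user, users_by_blog):
    """Count paths user -> liked blog -> fellow liker -> unseen blog."""
    seen = blogs_by_user.get(user, set())
    counts = Counter()
    for blog in seen:
        for neighbour in users_by_blog[blog]:
            if neighbour == user:
                continue
            for other_blog in blogs_by_user[neighbour]:
                if other_blog not in seen:
                    counts[other_blog] += 1
    return counts


# Made-up likes: each pair is (user, blog).
likes = [
    ("alice", "blog_a"), ("alice", "blog_b"),
    ("bob", "blog_a"), ("bob", "blog_c"),
    ("carol", "blog_b"), ("carol", "blog_c"), ("carol", "blog_d"),
    ("dave", "blog_e"),
]

blogs_by_user, users_by_blog = build_like_graph(likes)
candidates = candidate_blogs("alice", blogs_by_user, users_by_blog)
print(candidates.most_common())
```

Here blog_c is reachable from alice by two different paths (through bob and through carol), so it ranks above blog_d, while blog_e never appears because nothing connects dave to alice.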