Exploring data science with cheap servers and cheap tricks

This website’s goal is to develop and explain a data science philosophy – overkill analytics – that leverages computing scale and rapid development technologies to produce faster, better, and cheaper solutions to predictive modeling problems. To achieve this goal, one core question must be answered: when attacking data science problems, how can we use CPU as a substitute for IQ? This post will discuss the fundamental ‘overkill’ weapon for addressing this question – ensemble learning.

Ensembles are nothing new, of course; they underlie many of the most popular machine learning algorithms (e.g., random forests and generalized boosted models) . The theory is that consensus opinions from diverse modeling techniques are more reliable than potentially biased or idiosyncratic predictions from a single source. More broadly, this principle is as basic as “two heads are better than one.” It’s why cancer patients get second opinions, why the Supreme Court upheld affirmative action, why news organizations like MSNBC and Fox hire journalists with a wide variety of political leanings…

Well, maybe the principle isn’t universally applied. Still, it is fundamental to many disciplines and holds enormous value for the data scientist. Below, I will explain why, by addressing the following:

Ensembles add value: Using ‘real world’ evidence from Kaggle and GigaOM‘s recent WordPress Challenge, I’ll show how very basic ensembles – both within my own solution and, more interestingly, between my solution and the second-place finisher – added significant improvements to the result.

How ensembles add value: Also for the non-expert, I’ll run through a ‘toy problem’ depiction of how a very simple ensemble works, with some nice graphs and R code if you want to follow along.

Most of this very long post may be extremely rudimentary for seasoned statisticians, but rudimentary is the name of the game at Overkill Analytics. If this is old hat for you, I’d advise just reading the first section – which has some interesting real world results – and glancing at the graphs in the final section (which are very pretty and are useful teaching aids on the power of ensemble methods).

Share this:

If you’re here, you’ve probably already read this, but GigaOM‘s Derrick Harris wrote this great article discussing their WordPress Challenge and the fledgling overkill analytics approach I used to create the winning entry.

Forgive me for the self-indulgence, but here’s a quick excerpt:

And therein lies the beauty of overkill analytics, a term that Carter might have coined, but that appears to be catching on — especially in the world of web companies and big data. Carter says he doesn’t want to spend a lot of time fine-tuning models, writing complex algorithms or pre-analyzing data to make it work for his purposes. Rather, he wants to utilize some simple models, reduce things to numbers and process the heck out of the data set on as much hardware as is possible.

I’m pleased GigaOM found the overkill analytics characterization worth discussing. It’s brought a host of visitors to the site (at least compared to the readership I thought I’d have).

With the additional visits, though, I feel I should put a little more meat on the bones about my development philosophy. I’m sure this will be familiar territory for seasoned data scientists, but below are four principles that guide my approach:

Spend most of your time engineering mass quantities of features. Synthesizing raw data about the subject into pertinent, lower-dimensional metrics always brings the biggest bang for your buck. More features with more diversity are always better, even if some are simplistic or unsophisticated.

Spend very little of your time comparing, selecting, and fine-tuning models. A simple ensemble of many crude models is usually better than a perfectly-calibrated ensemble of a more precise models.

Spend no time making your algorithm elegant, optimized, or theoretically sound. Use cheap servers and cheap tricks instead.

Get results first; explain them later. The statistical algorithms available are powerful and unbiased – they will find the key elements before you do. Your job is to feed them mass quantities of features and then explain and interpret what they find, not guide them to a preconceived intuition about the answer.

To an extent, this is just a restatement of some best practices in predictive modeling. However, where overkill analytics comes in is to take these principles to new and hopefully productive extremes by leveraging cheap cloud computing power and rapid development principles. It is in this aspect where I hope to offer some innovation and new approaches.

Anyway, thanks for the visits. I will publish a couple more posts next week on the WordPress Challenge, one describing and ranking the full set of features I used, and one on the power of simple ensembles to improve results on this type of real-world problem.

As always, thanks for reading.

Share this:

I’d like to start this blog by discussing my first Kaggle data science competition – specifically, the “GigaOM WordPress Challenge”. This was a competition to design a recommendation engine for WordPress blog users; i.e. predict which blog posts a WordPress user would ‘like’ based on prior user activity and blog content. This post will focus on how my engine used the WordPress social graph to find candidate blogs that were not in the user’s direct ‘like history’ but were central in their ‘like network.’

My Approach

My general approach – consistent with my overkill analytics philosophy – was to abandon any notions of elegance and instead blindly throw multiple tactics at the problem. In practical terms, this means I hastily wrote ugly Python scripts to create data features, and I used oversized RAM and CPU from an Amazon EC2 spot instance to avoid any memory or performance issues from inefficient code. I then tossed all of the resulting features into a glm and a random forest, averaged the results, and hoped for the best. It wasn’t subtle, but it was effective. (Full code can be found here if interested.)

The WordPress Social Graph

From my brief review of other winning entries, I believe one unique quality of my submission was its limited use of the WordPress social graph. (Fair warning: I may abuse the term ‘social graph,’ as it is not something I have worked with previously.) Specifically, a user ‘liking’ a blog post creates a link (or edge) between user nodes and blog nodes, and these links construct a graph connecting users to blogs outside their current reading list:

Share this:

Hi. I’m Carter, and this is my new data science site – Overkill Analytics.

If you look around, you’ll see pretty quickly this is my first blog. There are lots of WordPress default settings and content, and everything looks pretty bland. While I love learning new technologies, I’m honestly not that into web design – my style of choice is still white (or better green) text on a black terminal. So please, bear with me as I learn to make this site at least marginally professional.