Thursday, November 08, 2007

I was just looking at my favorite economics blog, The Skeptical Optimist, and saw a post on randomness based on two books the blog author, Steve Conover, is reading: The Black Swan and Fooled by Randomness. This caught my eye--a quote from one of the two books (it was unclear to me which):

Here's an example of his point about randomness: How many times have you heard about mutual fund X's "superlative performance over the last five years"? Our typical reaction to that message is that mutual fund X must have better managers than other funds. Reason: Our minds are built to assign cause-and-effect whenever possible, in spite of the strong possibility that random chance played a big role in the outcome.

He then gives an example of two stock pickers, one of whom gets it "right" about 1/2 the time, and a second who gets it right 12 consecutive times. The punch line is this:

Taleb's point: Randomness plays a much larger role in social outcomes than we are willing to admit—to ourselves, or in our textbooks. Our minds, uncomfortable with randomness, are programmed to employ hindsight bias to provide retroactive explanations for just about everything. Nonetheless, randomness is frequently the only "reason" for many events.
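The stock-picker story is easy to check with a quick simulation (my own sketch, not from the post or the books): give a large population of pickers no skill at all--each call is a coin flip--and see how many still go 12-for-12. The counts here are illustrative, not anyone's real data.

```python
import random

random.seed(42)

n_pickers = 10_000   # pure coin-flippers: zero skill by construction
n_calls = 12         # consecutive market calls to get "right"

# Count the pickers whose every call comes up correct by chance alone.
perfect = sum(
    all(random.random() < 0.5 for _ in range(n_calls))
    for _ in range(n_pickers)
)

print(f"{perfect} of {n_pickers} skill-free pickers went 12-for-12")
```

On average about 10,000 / 2^12, or roughly 2 to 3, flawless "geniuses" emerge from nothing but luck--and hindsight bias stands ready to explain their brilliance.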

I personally don't agree philosophically with giving randomness a causal role (I would rather say that many outcomes are unexplained than say randomness is the "reason" or "cause"--randomness does nothing itself; it is our way of saying "I don't know why" or "it is too hard to figure out why").

But that said, this is an extremely important principle for data miners. We have all seen predictive models that apparently do well on one data set and then do poorly on another. Usually this is attributed to overfit, but it doesn't have to be solely an overfit problem. David Jensen of UMass described the phenomenon of oversearching for models in the paper Multiple Comparisons in Induction Algorithms: search enough candidate models and you can happen upon one that works well but is just a happenstance find.
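The oversearching effect can be demonstrated on pure noise (again my own sketch, with made-up sizes): generate a target with no relationship to any feature, score every candidate "model" (here, just a single binary feature), and keep the best. The winner looks predictive on the data it was selected on, then drifts back toward chance on fresh data.

```python
import random

random.seed(0)

n_rows, n_candidates = 200, 500

# Pure-noise data: the binary target is unrelated to every candidate.
target = [random.choice([0, 1]) for _ in range(n_rows)]
candidates = [[random.choice([0, 1]) for _ in range(n_rows)]
              for _ in range(n_candidates)]

def accuracy(feature, target):
    """Fraction of rows where the candidate matches the target."""
    return sum(f == t for f, t in zip(feature, target)) / len(target)

# "Oversearching": try every candidate and keep the best scorer.
best = max(candidates, key=lambda f: accuracy(f, target))
print(f"best of {n_candidates} noise candidates: {accuracy(best, target):.2f}")

# The same winner scored on a fresh sample falls back toward 0.50.
fresh_target = [random.choice([0, 1]) for _ in range(n_rows)]
print(f"same candidate on fresh data: {accuracy(best, fresh_target):.2f}")
```

The selection step alone manufactures apparent skill: with 500 chances, the maximum score sits well above 50% even though every candidate is noise--exactly the multiple-comparisons trap Jensen describes.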

The solution? One great help in overcoming these problems is sampling--the train/test/validate subset method, or resampling methods (like the bootstrap). But having a mindset of skepticism about models helps tremendously in digging to ensure the models truly are predictive and not just a random matching of the patterns of interest.
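As a small illustration of the resampling idea (a sketch on toy numbers, not a full validation workflow): bootstrap a headline statistic by resampling the data with replacement many times. If the resampled values spread widely, a single lucky sample could easily have produced the impressive-looking number.

```python
import random

random.seed(1)

# Hypothetical per-case model scores; the statistic of interest is the mean.
data = [random.gauss(0.55, 0.10) for _ in range(100)]
point_estimate = sum(data) / len(data)

# Bootstrap: resample with replacement, recompute the statistic each time.
boot = []
for _ in range(2000):
    sample = [random.choice(data) for _ in data]
    boot.append(sum(sample) / len(sample))

boot.sort()
lo, hi = boot[50], boot[1949]   # roughly a 95% percentile interval
print(f"point estimate {point_estimate:.3f}, "
      f"95% bootstrap interval ({lo:.3f}, {hi:.3f})")
```

A wide interval is the quantitative version of the skepticism urged above: it says the "superlative performance" figure is compatible with a broad range of underlying realities.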