
After reading this New York Times article on the Netflix Prize, I thought it might be cool to register and play with some movie data.

Netflix is offering a chunk of their movie rating database (2 GB worth) for download. It contains ratings for 17,770 movies and TV shows from about 300,000 customers. With it, contestants in the Netflix competition have to write an algorithm that can predict how customers will rate movies they haven’t seen before, thus enabling Netflix to better recommend movies. This algorithm has to do 10% better than Netflix’s existing recommendation engine. And the prize is a million dollars.

And people are close. Really close. If you check out the leaderboard, the top team, PragmaticTheory, is at 9.65%!

Given that my hard drive only has 2.5 GB free (after the data download from Netflix), it was pretty obvious I had neither the processing power nor the storage to handle all the data at once. I decided to start with a very, very small subset (11 movies) to get a little taste of what it would take:

Pirates of the Caribbean: The Curse of the Black Pearl, Rushmore, Miss Congeniality, Pretty Woman, Forrest Gump, Twister, The Patriot, Independence Day, The Day After Tomorrow, Con Air, The Green Mile

The process was to take each person on file, look at how they had rated all these movies, and try to classify them into a group. Completely unexpectedly, the program created a very simple model that revolved around… Forrest Gump.

Forrest Gump turns out to be a strong predictor of whether people will like other movies (i.e. people who love Forrest Gump also love The Green Mile, and surprisingly also really like The Patriot). But it doesn’t work in reverse: other movies are not as good a predictor of whether someone will like Forrest Gump. I wonder if there are a handful of movies that act as strong predictors, from which preferences for other movies can be surmised.
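To make the "it doesn’t work in reverse" point concrete, here is a minimal sketch of how that asymmetry could be measured. It is not the program I actually ran: the `likes` table is invented toy data (not Netflix ratings), and `conditional_like_rate` is a hypothetical helper that treats "likes" as a simple yes/no.

```python
def conditional_like_rate(likes, given, target):
    """Among users who like `given`, return the fraction who also like `target`."""
    fans = [movies for movies in likes.values() if given in movies]
    if not fans:
        return 0.0
    return sum(target in movies for movies in fans) / len(fans)


# Invented toy data: user id -> set of movies that user likes.
likes = {
    "u1": {"Forrest Gump", "The Green Mile", "The Patriot"},
    "u2": {"Forrest Gump", "The Green Mile"},
    "u3": {"The Green Mile"},
    "u4": {"The Green Mile", "Twister"},
    "u5": {"Twister"},
}

# How well does liking Forrest Gump predict liking The Green Mile?
forward = conditional_like_rate(likes, "Forrest Gump", "The Green Mile")
# And the other way around?
reverse = conditional_like_rate(likes, "The Green Mile", "Forrest Gump")
print(forward, reverse)
```

In this made-up table every Forrest Gump fan also likes The Green Mile, but only half of The Green Mile fans like Forrest Gump, which is the kind of one-way relationship described above.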
Forrest Gump is effectively at the top of a giant movie decision tree, the thickest of trunks leading to more subtle differences in movie tastes.

A good analogy is the game of 20 questions, where a decision tree can narrow things down to one exact answer by asking the questions that make the biggest difference first.

And decision trees are set to become a hot topic again, with the creators of flickr soon to launch a new web service all about decision trees, called hunch.com. Only they’re using humans instead of computers to build ’em. Hunch lists any decision you could make (including ‘What movies should I watch?’), building giant decision trees based on user-contributed questions. Then (I’m guessing) it uses statistical analysis to figure out which questions sit at the root, the trunk, and the branches.
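For the curious, one common way to decide which question "makes the biggest difference" and so belongs at the root is information gain. The sketch below assumes that criterion; I don’t know what Hunch actually uses, and it is not necessarily what my own program did. The `rows` data is invented, and `entropy`, `information_gain` and `best_root` are illustrative names.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(rows, question, target):
    """How much knowing the yes/no answer to `question` reduces uncertainty about `target`."""
    base = entropy([row[target] for row in rows])
    remainder = 0.0
    for answer in (True, False):
        subset = [row[target] for row in rows if row[question] == answer]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder


def best_root(rows, questions, target):
    """Pick the question with the highest information gain as the tree's root."""
    return max(questions, key=lambda q: information_gain(rows, q, target))


# Invented toy data: one row per viewer, True means "liked it".
rows = [
    {"Forrest Gump": True, "Twister": True, "The Green Mile": True},
    {"Forrest Gump": True, "Twister": False, "The Green Mile": True},
    {"Forrest Gump": False, "Twister": True, "The Green Mile": False},
    {"Forrest Gump": False, "Twister": False, "The Green Mile": False},
]

root = best_root(rows, ["Forrest Gump", "Twister"], "The Green Mile")
print(root)
```

In this toy table, "Did you like Forrest Gump?" splits the viewers perfectly on The Green Mile while "Did you like Twister?" tells you nothing, so Forrest Gump ends up at the root. A full tree would repeat the same choice on each branch.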