We've developed our world-class movie recommendation system: Cinematch. Its job is to predict whether someone will enjoy a movie based on how much they liked or disliked other movies. We use those predictions to make personal movie recommendations based on each customer's unique tastes. And while Cinematch is doing pretty well, it can always be made better.

Now there are a lot of interesting alternative approaches to how Cinematch works that we haven't tried. Some are described in the literature, some aren't. We're curious whether any of these can beat Cinematch by making better predictions. Because, frankly, if there is a much better approach it could make a big difference to our customers and our business.

So, we thought we'd make a contest out of finding the answer. It's "easy" really. We provide you with a lot of anonymous rating data, and a prediction accuracy bar that is 10% better than what Cinematch can do on the same training data set. (Accuracy is a measurement of how closely predicted ratings of movies match subsequent actual ratings.)
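The accuracy metric in question (named RMSE later in the comments) is root mean squared error between predicted and actual ratings. A minimal sketch, assuming predictions and actuals as parallel lists of star ratings:

```python
import math

def rmse(predicted, actual):
    """Root mean squared error between parallel lists of ratings."""
    assert len(predicted) == len(actual)
    return math.sqrt(
        sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)
    )

# Being off by exactly 1 star on every rating gives an RMSE of 1.0.
print(rmse([4, 3, 5], [3, 4, 4]))  # 1.0
```

A 10% improvement means driving this number down 10% relative to Cinematch's score on the same test set.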

If you develop a system that we judge most beats that bar on the qualifying test set we provide, you get serious money and the bragging rights. But (and you knew there would be a catch, right?) only if you share your method with us and describe to the world how you did it and why it works.

Serious money demands a serious bar. We suspect the 10% improvement is pretty tough, but we also think there is a good chance it can be achieved. It may take months; it might take years.

There is no cost to enter, no purchase required, and you need not be a Netflix subscriber. So if you know (or want to learn) something about machine learning and recommendation systems, give it a shot. We could make it really worth your while.

Sounds like fun. I wonder how much time I'll end up wasting on this one.

If you are thinking of entering the contest, you might be interested to know that much of the Internet Movie Database (IMDb) is available for download. Another good source for movie content is Amazon Web Services.

Update: I should explicitly point out that this Netflix data is by far the largest ratings data set available to the research community. Most work on recommender systems outside of companies like Amazon or Netflix has had to make do with the relatively small 1M rating MovieLens data or the 3M EachMovie data set. This Netflix data set is 100M ratings. It will be enormously useful for recommender system research.

Update: The comments on this post are starting to get pretty interesting.

Update: On the idea of using external movie data, Ilya Grigorik published data linking the Netflix movie ids to features extracted from IMDb data.

I've spent a couple of hours working up some brain-dead approaches, and it's becoming clear to me that you can get darn close to the Netflix RMSE score without even trying.

I suspect that the complexity-to-payoff curve turns hyperbolic right around that score. I mean, they are claiming just under 1 star of average deviation. Doing nothing but predicting 3.5 stars for everybody gets you pretty close to that.

And to anonymous: I have to assume that you can use outside data; otherwise this is probably completely unattainable. Chances are that if you can get your hands on the data, Netflix can too (through you if necessary).

Additionally, if you step back and look at the whole process there is the element of capturing data (such as the 5 star issue, but also what movies you ask for them to rate, etc.) and then what you do after you have a prediction set. I would wager that the system is improved more successfully on the ends of this process, not in the middle.

Thanks for sharing some of the results of your initial analysis, Aron.

I suspect it is pretty easy to get close to 1 on the predictions. For example, if the ratings are uniformly random, just predicting 3 all the time would get a RMSE of sqrt((4 + 1 + 0 + 1 + 4)/5) or sqrt(2), which is approximately 1.4. The distribution is almost certainly skewed, so getting to around 1.3 should be trivial.
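That back-of-the-envelope calculation is easy to verify: under a uniform distribution over 1–5 stars, always predicting 3 gives squared errors of 4, 1, 0, 1, 4.

```python
import math

# RMSE of always predicting 3 when true ratings are uniform over 1..5:
# squared errors are 4, 1, 0, 1, 4, so MSE = 10/5 = 2.
mse = sum((r - 3) ** 2 for r in range(1, 6)) / 5
print(math.sqrt(mse))  # ~1.414
```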

The lower you go, though, the more difficult things get. I wouldn't use the difficulty of getting from 1.4 to 1.3 as an indicator of the difficulty of getting from .95 to .85.

I've been able to get down to .9577 thus far with an approach only mildly more complicated than the ones you just described. I am using the training file as my probe file, which of course introduces a favorable bias for me. How much, I don't know.
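One way to avoid that bias is to hold out a random slice of the ratings as a probe set before training. A quick sketch, with a made-up list of (user, movie, rating) triples standing in for the real data:

```python
import random

def split_probe(ratings, probe_frac=0.02, seed=0):
    """Hold out a random fraction of (user, movie, rating) triples
    as a probe set, so evaluation isn't done on training data."""
    rng = random.Random(seed)
    train, probe = [], []
    for row in ratings:
        (probe if rng.random() < probe_frac else train).append(row)
    return train, probe

# Synthetic stand-in for the real ratings file.
ratings = [(u, m, (u + m) % 5 + 1) for u in range(100) for m in range(50)]
train, probe = split_probe(ratings)
print(len(train), len(probe))
```

Scoring only on the held-out probe triples gives an RMSE estimate that isn't inflated by having already seen the answers.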

Performance is already getting a bit thorny. I am curious what the odds of missing the boat entirely are if one were to focus on say, 1M of the ratings rather than the full 100M. I'm kind of anticipating that the extra 2 orders of magnitude is a 1-2% difference at most.

Some other stats of the training set:

Mean rating: 3.60429
1's: 4.6%
2's: 10.1%
3's: 28.7%
4's: 33.6%
5's: 23.1%
Total ratings: 100,480,507
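From that distribution you can back out the RMSE of the simplest baseline of all, always predicting the global mean; it is just the standard deviation of the ratings. A quick sketch using the percentages above:

```python
import math

# Rating distribution from the training set stats above (rounded percentages).
dist = {1: 0.046, 2: 0.101, 3: 0.287, 4: 0.336, 5: 0.231}

mean = sum(r * p for r, p in dist.items())                 # ~3.61
variance = sum(p * (r - mean) ** 2 for r, p in dist.items())
print(mean, math.sqrt(variance))                           # std dev ~1.09
```

So always guessing the mean lands around 1.09, consistent with the earlier point that getting near 1 takes almost no effort.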

Seriously, though, this is a beautifully large data set. Take advantage of it. If you do some sampling or thresholding for performance, I'd recommend you be fairly selective about where and how you do that.

Thanks for sharing your numbers on the distribution. I got similar numbers. Sorry, I should have shared them earlier.