Netflix launched its million dollar contest—improve their movie rating prediction system by 10%—nearly three years ago, and like most computer science graduate students, I downloaded the data and got to work. The 10% mark proved a magic number for Netflix. 6% and the competition would have been over in a few months; 11% and it might never have ended. Next week (July 26, 2009), the winners will be announced.

As this contest has been documented at some length (New York Times Magazine, Wired Magazine), I’ll skip the setup and instead describe what has been learned about predicting the ratings people give to movies. To whet the appetite, first consider these two curious graphs. I’ll get to them soon.

The simplest model for predicting a new rating is to just guess a single number each time. What number to guess? The one that minimizes the prediction error. This happens to be the mean (3.7 stars I think). But while we have a very robust estimate of the rating mean, derived from 100 million examples, our model is too simple, and the error rate is around 1.05 stars (the Netflix “Cinematch” system is off by 0.95 stars; the winning system is off by 0.85 stars).

Let’s make the model a little fancier. We’ll add an offset for each movie and for each user to acknowledge that "The Shawshank Redemption" is better than average and I am more critical than average. The predicted rating according to the new model is the overall mean plus a movie offset plus a user offset. My predicted rating for Shawshank Redemption might be 3.7 (overall mean) + 0.8 (Shawshank offset) – 0.3 (my offset) = 4.2. This new model, which adds 500,000 user parameters and 20,000 movie parameters, improves the error rate to 0.97 stars. Again the parameters are estimated from the training data to minimize the error rate.

We’ll call the estimate from this model bui, the baseline predicted rating for user u and item i (I’ll use item instead of movie because these methods are generally applicable). Next, we’d like to improve our predictions by including information about similarities between movies—if I rated Rushmore highly, I’m likely to rate "The Royal Tennenbaums" highly too. So lets add to our baseline another offset, attributed to similarities with a movie’s neighbors. Say we’re trying to predict my rating for "The Royal Tennenbaums" (i).

For every other movie j I’ve rated, we compute (ruj – buj)wij, where wij is the weight we want to place on movie j—large if i and j are similar. Then, if the rating I gave to movie j (ruj) is higher than expected (buj), and movie j is very similar to movie i (wij is large), we boost our new prediction for movie i. If the rating I gave to movie j is just about what the baseline expected, then (ruj – buj) will be close to zero, and we won’t adjust our new prediction for i regardless of the similarity.

Our final new prediction for movie i is estimated as the baseline prediction plus the sum of all the values (ruj – buj)wij, one for each movie j that I’ve rated.

This all seems very sensible, but where do we get wij, the similarity between movie i and j? As with the simpler models, we estimate these parameters from the training data to minimize the prediction error rate. This estimation problem is no longer a matter of calculating averages. We want the set of wijs (there are 20,000 * 20,000 = 400,000,000 of them) such that the sum of the differences between real ratings and predicted ratings over the 100,000,000 examples in the training data is as small as possible. How do we go about finding this set, a tiny needle in an giant haystack?

Here is the key. Consider two sets of values for wij, A and B. Suppose set A gives a lower error rate predicting the ratings in the training data than set B. Then set A is closer to the best possible set of wijs than set B. Put another way, if the wijs are at a local minimum—small changes in the values give worse predictions—then this is the best solution, globally. The minimization function is convex. This means that we can go about finding the best set of wijs simply by comparing our current set with some other one, and keeping whichever works better.

Practically, we choose the direction to move according to the derivative of the minimization function, but that’s starting to get too technical. Read about stochastic gradient descent if you want to know more.

So to learn the best set of wijs, we start with some initial guess (maybe every similarity is 0.5). Then try predicting the rating that user u gave to movie i. Compare the prediction to the real rating, and adjust the relevant parameters: wi1, wi2, wi3, and so on. Then move on to the next rating. One pass through the entire dataset, 100 million examples, is called an iteration. It takes around 20 iterations to find the best estimates of wij. Each iteration takes around an hour on a reasonable computer. This method, with a few adjustments, can reduce the error rate to around 0.90 stars.

So how do you get down to 0.85 stars? One of the most important insights is that ratings change over time, as the graphs above suggest. For each movie, ratings increase over time—on average nearly 10% over 5 years. The best explanation for this phenomenon is that people are less discriminating when they rent new releases. To rent an older movie, they need more encouragement—some indication that they will like it—so these end up better matched to their tastes.

There is something more insidious going on in the first plot: 1500 days after Netflix began collecting rating data, the average rating jumped strikingly, from 3.4 to 3.6 all at once. One hypothesis is that Netflix changed their recommendation system in 2004, improving the match between users and movies. A second, somewhat more appealing hypothesis is that the text accompanying the star ratings changed from an objective scale (excellent, good, fair, …) to a subjective scale (loved it, liked it, …), and people are less accommodating when they think they need to be objective.

Regardless of the underlying causes, modeling temporal dynamics is clearly valuable. Scaling the ratings based on absolute date and the time since their release allows for more meaningful comparison between ratings—like adjusting for inflation. Such adjustments can reduce the error rate to 0.89 stars.

What’s left? Dimensionality reduction methods attempt to represent each movie and user as a short list of factors (violence, comedy, etc…, but because the factors are derived purely statistically, they have no such simple meanings). Including such a technique, like Singular Value Decomposition, in the baseline estimate (above) is good for an error rate of 0.88.

The rest of gap seems to be covered by combining a wide variety of methods, all variants of what I’ve just described. Remember how we computed item-item similarities wij? Well we could have used user-user similarites instead. In the end, four teams all combined results (probably estimating a final rating by taking a weighted average of all the different systems’ predicted ratings). I imagine each system as a blurry lens; stacking up a whole bunch of blurry lenses at slightly different angles, we manage to reduce the blur just a little bit.

References:- Yehuda Koren, a member of the winning team, has written some very nice papers about his work on the contest, on which I based this article.- Bellkor’s Pragmatic Chaos, the webpage for the winning team, has a number of related links.- An article in IEEE Spectrum written by members of the winning team.- The official Netflix Prize leaderboard

Maybe I'm missing something, but there has to be some minimum amount of inherent error due to the fact users can only rate movies in whole stars. When Netflix suggests something at 3.5 stars there is a built in minimum error of 0.5 stars if it can only be rated in whole integers by the user. A very simple fix would be to allow people to rate movies at half star intervals or round of the prediction to the nearest integer (although the latter could possibly make things worse).

I tend to agree. I am often torn between rating a move 3 or 4 stars. In fact, people might give more consistent ratings if they weren't forced to rate in integer values.

However, from the prediction side, it turns out the best strategy (for this particular contest setup) is to hedge -- don't round to the nearest integer. Why not? Because Netflix calculates the root mean squared error. Take the difference between actual rating and prediction, and square that value. If they measured absolute error, you could expect rounding to be no worse or better than not rounding, assuming your predictions differ randomly from the true ratings. But by squaring the error, bigger errors are magnified: you predict 3.5 for a movie rated 4; rounding gives you 0 error where no-rounding gives you 0.25 error. But if you predict 3.5 for a movie rated 3, rounding gives you 1 error where no-rounding gives you 0.25 again. So, on average your error is 0.25 without rounding and 0.5 with rounding!

Note, however, that computing squared errors is somewhat arbitrary. Perhaps absolute error makes more sense. It turns out that the math for computing the parameters, which involves taking derivatives, works out much more conveniently with squared errors...

That's right. The Netflix scale only ranges from 1 star to 5 stars. Ideally, a rating of 1 star means that you "hated" it, though in practice I bet that often people neglect to rate movies they hate. There is no way to figure out when somebody watched but did not rate in the contest, though Netflix has access to this information, and might be able to exploit it.