Sunday, May 10, 2009

An article was posted in the NY Times about the Netflix prize some time ago. It fired some new ideas that I'm investigating right now. It's related to association rule mining.

A bit of lore in data mining is the beer-diapers story. Fathers buying diapers on Thursday or Friday night also bought beer for over the weekend. This apparently caused some supermarkets to put diapers and beer closer together. So far for the lore. In every basket, there are regularities with a certain probability. If there's salad in the basket, there may be tomatoes too, or baskets with milk have a high probability that there are cereals for example.

The association rules for basket research are based on nominal properties: It's there or it's not. The problem of Netflix is more of a continuous problem... It's the question about whether there is a strong correlation, or implication with a level of confidence between ratings of different movies and each movie has five different possibilities. Also, there may be a strong correlation between 1.0f on one movie and 1.0f on the other, whilst there may be little evidence on A@5.0 -> B@4.0f. So, per rating, the support and confidence may differ. What we're trying to find for each movie/rating combination (which is 17,770*5*5*17,770*size(datatype)).

movieA@ratingX -> movieB@ratingY

The idea is also to filter out those correlations that don't have strong correlations. The biggest problem in association rule mining is memory. Because the matrix of 17,770 movies x 17,770 movies x 5 ratings (and sometimes x5 again) is a very large one, you can't really use data types that consume 4 bytes of memory, because that requires 8 GB of memory. So in my current implementation, I squeezed in a float into a single byte, thereby losing lots of precision, and then popping it back in the prediction phase. Since I got 4GB, I can just about run this.

Association rules are highly dependent on frequencies of observation. And it's necessary to take into account the frequency that movies are seen together versus the frequency that a movie is rated in isolation. The higher the ratio that movies are seen together and the stronger the observed rating combination between A and B, the better the predictive strength of the rule.

So, let's say user A rated movie 1 at a 1.0. If he also rated movie 2 at a 1.0 and there are many more users that did this, it means that the rating of movie 2 can be strongly derived from this association rule. It's also possible that many users rated 5.0 for movie 1, but a 4.0 for movie 2. Or rated 4.0 for 1 and 5.0 for 2. These correlations all suggest slightly different things. What I really wanted to do was store the confidence of these rules, but memory doesn't play nicely here and I had to think of something else.

In a sense, association rules are similar to clustering. The differences are mostly in the derivation of strong correlations between some movie A and movie B, whereas in clustering that is never determined. knn and clustering at the moment do produce better results. I'm getting 1.47 for example vs. reported results of 0.95-ish for knn and clustering approaches. Those approaches go down as far as 0.920 with improvements.

The imprecise-ness is very likely caused by incorrect filtering or imprecisions due to the memory space shortage and float compression. Pseudo-code for now looks as below. I've basically loaded all ratings twice. Once grouped by movieId (which customers rated this movie at what?) and the whole set again grouped by customerId (which movies did a customer rate at what?). The most important and wasteful array is the one that maintains compressed floats into unsigned chars (so losing loads of precision there). It's dimensioned by MAX_MOVIES, MAX_MOVIES and NUM_STARS. This means that given movie 1 in the first dimension, find out if there's a strong correlation factor for movie 2 in the second dimension, given that movie 1 was rated at x stars, where x is then another index into this huge array. The retrieved byte is then divided by 63 and 1.0f is added. Thus, each predictor is compressed into a single byte, reducing precision, but allowing everything inside a 1.5GB area of memory or so.

Having the training data twice allows one to access data very quickly. A problematic issue is how to store calculated confidence for each movie.