Wednesday, July 20, 2011

In recommendation systems, a predictor utilizing only the overall average rating, the average rating per user, and the average rating per item forms a powerful baseline which is surprisingly difficult to beat (in terms of raw prediction accuracy). When I asked Charles Elkan about this, he mentioned there are published papers whose reported results failed to exceed this baseline. Although it was good fun, he pointed out (paraphrasing) ``it's not just about accuracy; nobody likes the linear baseline because it makes the same recommendations to everyone.''

I decided to explore this a bit using the MovieLens 10M dataset. First, for each user I randomly withheld two of their ratings and added the rest to the training set. I optimized for rating MAE using the analytical importance-aware update for quantile loss, then computed rating MAE on the test set. I also computed the AUC per user, defined as 1 if the model correctly ordered the two withheld ratings and 0 otherwise; averaging this quantity across users yields something I've seen referred to as the ``user AUC''. Of course, if I were optimizing for AUC, I should reduce to pairwise classification and use hinge loss rather than quantile loss; however, I was also interested in whether the modest gains I was seeing in MAE might result in larger gains in AUC. Here are the results: \[
\begin{array}{|c|c|c|}
\mbox{Model } &\mbox{ MAE (test set) } &\mbox{ User AUC (test set) } \\ \hline
\mbox{Best Constant } &\mbox{ 0.420 } &\mbox{ 0.5 } \\
\mbox{Linear } &\mbox{ 0.356 } &\mbox{ 0.680 } \\
\mbox{Dyadic $k=1$ } &\mbox{ 0.349 } &\mbox{ 0.692 } \\
\mbox{Dyadic $k=5$ } &\mbox{ 0.338 } &\mbox{ 0.706 } \\
\mbox{Dyadic $k=10$ } &\mbox{ 0.335 } &\mbox{ 0.709 }
\end{array}
\] As noted previously, the lion's share of predictive lift (for both metrics) comes from the linear baseline, i.e., a model of the form $\alpha_{user} + \beta_{movie} + c$. Adding a dyadic term of the form $a_{user}^\top b_{movie}$ with latent dimensionality $k$ slightly improves accuracy. (I didn't do cross-validation, but treating the User AUC as a binomial implies a standard error of $0.004$, which suggests the dyadic lift in User AUC is barely significant.)
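For concreteness, here is a minimal Python sketch of the model form and the two test metrics described above. It is not the code used for the experiment: the function names are hypothetical, the handling of ties (skipping users whose two withheld ratings are equal, and giving half credit when the two predictions are equal, which is consistent with the constant predictor scoring 0.5) is an assumption, and the sample size in the usage lines is a made-up round number rather than the actual number of test users.

```python
import math

def predict(alpha_user, beta_movie, c, a_user=None, b_movie=None):
    """Linear baseline alpha_user + beta_movie + c, plus an optional
    dyadic term a_user . b_movie (both latent vectors of length k)."""
    score = alpha_user + beta_movie + c
    if a_user is not None and b_movie is not None:
        score += sum(x * y for x, y in zip(a_user, b_movie))
    return score

def mae(pairs):
    """Mean absolute error over (true_rating, predicted_rating) pairs."""
    return sum(abs(t - p) for t, p in pairs) / len(pairs)

def user_auc(heldout):
    """heldout holds one ((true1, pred1), (true2, pred2)) tuple per user."""
    total, count = 0.0, 0
    for (t1, p1), (t2, p2) in heldout:
        if t1 == t2:
            continue          # assumption: no correct order exists, skip user
        if p1 == p2:
            total += 0.5      # assumption: tied predictions get half credit
        elif (p1 > p2) == (t1 > t2):
            total += 1.0      # the model ordered the pair correctly
        count += 1
    return total / count

def binomial_standard_error(p, n):
    """Standard error of a proportion p estimated from n independent users."""
    return math.sqrt(p * (1.0 - p) / n)

# Usage with a hypothetical sample of 10,000 users (not the real test-set size).
se = binomial_standard_error(0.709, 10_000)
print(round(se, 4))
```

With a proportion near 0.7, a standard error of about 0.004 corresponds to a sample on the order of ten thousand users.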

Next I looked at recommendation diversity. I took a random subsample of the users (which turned out to be of size 11597), exhaustively estimated their rankings for every movie in the data set, and computed the top three movies for each user. At each size $n \in \{1, 2, 3\}$ I looked at how often the most popular set of movies of that size was suggested ($\max p_n$), and I also counted the number of unique sets of movies recommended (sets, not lists: if one user has recommendations $a, b, c$ and another user has recommendations $c, b, a$, they are considered the same). \[
\begin{array}{|c|c|c|c|c|c|c|}
\mbox{Model } & \max p_1 &\mbox{ unique 1-sets } & \max p_2 &\mbox{ unique 2-sets } & \max p_3 &\mbox{ unique 3-sets } \\ \hline
\mbox{Linear } &\mbox{ 1 } &\mbox{ 1 } &\mbox{ 1 } &\mbox{ 1 } &\mbox{ 1 } &\mbox{ 1 } \\
\mbox{Dyadic $k=1$ } &\mbox{ 0.478 } &\mbox{ 6 } &\mbox{ 0.389 } &\mbox{ 10 } &\mbox{ 0.218 } &\mbox{ 18 } \\
\mbox{Dyadic $k=5$ } &\mbox{ 0.170 } &\mbox{ 112 } &\mbox{ 0.120 } &\mbox{ 614 } &\mbox{ 0.069 } &\mbox{ 1458 } \\
\mbox{Dyadic $k=10$ } &\mbox{ 0.193 } &\mbox{ 220 } &\mbox{ 0.102 } &\mbox{ 1409 } &\mbox{ 0.035 } &\mbox{ 3390}
\end{array}
\] As anticipated, there is no diversity of recommendation in the linear model: for a fixed user, $\alpha_{user}$ and $c$ are constants, so the ranking is determined by $\beta_{movie}$ alone and is the same for everyone. Interestingly, however, the diversity explodes with increasing latent dimensionality, even though the accuracy metrics do not improve dramatically. For $k=10$ the diversity in top-3 suggestion sets is substantial: the largest group of users that share the same top 3 suggestions is only 3.5% of the user base. (The 0.193 result for $k=10$ and $\max p_1$ is not a typo; even though more unique movies are recommended as the top movie for some user, the most popular #1 movie for $k=10$ is chosen more frequently than for $k=5$. Not sure what's going on there.)
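The diversity statistics in the table above can be computed along these lines. This is a sketch under my own assumptions rather than the original analysis code: `top_k` and `diversity` are hypothetical helper names, and the scores and user offsets in the usage lines are toy values chosen only to show that a purely linear model yields a single recommendation set.

```python
from collections import Counter

def top_k(scores, k):
    """Item ids of the k highest-scoring items, best first.
    scores maps item id -> estimated rating for one user."""
    return sorted(scores, key=scores.get, reverse=True)[:k]

def diversity(top_lists, n):
    """Return (max p_n, number of unique n-sets).

    top_lists holds one ranked list of item ids per user, best first.
    The first n items are compared as sets, so order within the top n
    does not matter."""
    counts = Counter(frozenset(items[:n]) for items in top_lists)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(top_lists), len(counts)

# Toy linear model: every user gets alpha + beta_movie + c.
beta = {"m1": 0.3, "m2": -0.1, "m3": 0.8, "m4": 0.5}   # made-up movie offsets
c = 3.5                                                # made-up global constant
alphas = [-0.4, 0.0, 0.6]                              # made-up user offsets

linear_lists = [
    top_k({movie: alpha + b + c for movie, b in beta.items()}, 3)
    for alpha in alphas
]
for n in (1, 2, 3):
    print(n, diversity(linear_lists, n))
```

Because the user offset shifts every movie's score by the same amount, each toy user ends up with the same top three, which is the row of ones in the table.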

If you were to walk over to a product manager and say, "I have two recommendation models which have the same accuracy, but one makes the same recommendations to everybody and the other one makes different recommendations to different people, which do you want?" you can bet that product manager is going to say the second one. In fact, it would probably be acceptable to sacrifice some accuracy in order to achieve recommendation diversity. Given that dyadic models can both improve accuracy and improve diversity, they are a win-win.

1 comment:

"""Chris Dixon, the co-founder of personalization site Hunch, calls this "the Chipotle problem." As it turns out, if you are designing a where-to-eat recommendation algorithm, it's hard to avoid sending most people to Chipotle most of the time. People like Chipotle, there are lots of them around, and while it never blows anyone's mind, it's a consistent three-to-four-star experience. Because of the way many personalization and recommendation algorithms are designed, they'll tend to be conservative in this way — those five-star experiences are harder to predict, and they sometimes end up ones. Yet, of course, they're the experiences we remember."""http://blogs.hbr.org/cs/2011/05/seven_things_human_editors_do.html

Maybe the user should be given the option to pick how risky (high variance) they want their recommendations to be.