MinHash and PLSI are both clustering methods; a user is matched to a cluster of similar users, then they look at the aggregate behavior of users in that cluster to find recommendations. Covisitation is an item-based method that computes which articles people tend to look at if they looked at a given article (i.e. "Customers who visited X also visited...").
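To make the user-clustering side concrete, here is a toy sketch of how MinHash can turn a user's click history into cluster keys; the hash construction and signature length are illustrative assumptions on my part, not the paper's exact scheme:

```python
import hashlib

def minhash_signature(click_history, num_hashes=4, seed_salt="gn"):
    """Compute a small MinHash signature for a user's set of clicked
    article ids. Users with overlapping histories tend to agree on
    signature positions, so signature values can serve as cluster keys.
    Illustrative sketch only; not the paper's exact hashing scheme."""
    signature = []
    for i in range(num_hashes):
        # For each simulated hash function, keep the minimum hash
        # value over the whole click set.
        min_val = min(
            int(hashlib.md5(f"{seed_salt}{i}:{item}".encode()).hexdigest(), 16)
            for item in click_history
        )
        signature.append(min_val)
    return tuple(signature)

# Two users with mostly overlapping histories will usually agree on
# some signature positions, and so land in some of the same clusters.
u1 = {"article_a", "article_b", "article_c", "article_d"}
u2 = {"article_a", "article_b", "article_c", "article_e"}
sig1, sig2 = minhash_signature(u1), minhash_signature(u2)
```

The useful property is that the fraction of matching signature positions approximates the Jaccard similarity of the two click histories, which is what lets equal hash values act as cheap cluster assignments.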

The paper does a nice job motivating the use of recommendations for news:

The Internet has no dearth of content. The challenge is in finding the right content for yourself: something that will answer your current information needs or something that you would love to read, listen or watch.

Search engines help solve the former problem; particularly if you are looking for something specific that can be formulated as a keyword query.

However, in many cases, a user may not even know what to look for ... Users ... end up ... looking around ... with the attitude: Show me something interesting.

In such cases, we would like to present recommendations to a user based on her interests as demonstrated by her past activity on the relevant site .... [and] the click history of the community.

The authors then explain that what makes this problem so difficult is doing it at scale in real-time over rapidly changing data.

Google News (http://news.google.com) is visited by several million unique visitors ... [and] the number of ... news stories ... is also of the order of several million.

[On] Google News, the underlying item-set undergoes churn (insertions and deletions) every few minutes and at any given time the stories of interest are the ones that appeared in last couple of hours. Therefore any model older than a few hours may no longer be of interest.

The Google News website strives to maintain a strict response time requirement for any page views ... [Within] a few hundred milliseconds ... the recommendation engine ... [must] generate recommendations.

The authors note that, while Amazon.com may be doing recommendations at a similar scale and speed, the item churn rate of news is much faster than that of Amazon's product catalog, making the problem more difficult and requiring different methods.

The Googlers did several evaluations of their system, but the key result is that two versions combining all three approaches in different ways generated 38% more clickthroughs than simply showing the most popular news articles. That seems surprisingly low compared to our results for product recommendations at Amazon.com -- we found recommendations generated a couple orders of magnitude more sales than just showing top sellers -- but Google's results are still a good lift.

As for the reason for the lower lift, I finished the paper with a couple of questions about parts of their work.

First, it appears that the cluster membership is not updated in real-time. When discussing PLSI, the authors say they "update the counts associated with that story for all the clusters to which the user belongs" but that this is an approximation (since cluster membership cannot change in this step) and does not work for new users (who have no cluster memberships yet).

This is a typical problem with clustering approaches -- building the clusters usually is an expensive offline computation -- but it seems like a cause for concern if their goal is to offer "instant gratification" to users by changing the recommendations as soon as they click on new articles.

Second, it appears to me that the item-based covisitation technique will be biased toward popular items. In describing that algorithm, the authors say, "Given an item s, its near neighbors are effectively the set of items that have been covisited with it, weighted by the age discounted count of how often they were visited."
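The age-discounted counting described in that quote could be sketched like this (a toy version; the exponential decay, half-life constant, and data structures are my assumptions, not details from the paper):

```python
from collections import defaultdict

# covisit_counts[a][b] holds the age-discounted count of times article b
# was covisited with article a.
covisit_counts = defaultdict(lambda: defaultdict(float))

HALF_LIFE_SECONDS = 3600.0  # assumed decay constant, not from the paper

def age_discount(event_age_seconds):
    """Exponentially decay a covisit event by its age, so stale pairs
    fade out as the news cycle moves on."""
    return 0.5 ** (event_age_seconds / HALF_LIFE_SECONDS)

def record_covisit(item_a, item_b, event_age_seconds=0.0):
    """Record that some user visited both items, weighted by event age."""
    w = age_discount(event_age_seconds)
    covisit_counts[item_a][item_b] += w
    covisit_counts[item_b][item_a] += w

def neighbors(item, k=3):
    """Near neighbors of an item: the items covisited with it, ranked
    by their age-discounted counts."""
    ranked = sorted(covisit_counts[item].items(), key=lambda kv: -kv[1])
    return ranked[:k]

# A fresh covisit counts fully; one from two hours ago counts 0.25.
record_covisit("story_x", "story_y")
record_covisit("story_x", "story_z", event_age_seconds=7200)
```

With a one-hour half-life, `neighbors("story_x")` would rank `story_y` (weight 1.0) above `story_z` (weight 0.25), which is the behavior the quote describes: recent covisits dominate.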

If that is right, this calculation would seem to suffer from what we used to call the "Harry Potter problem", so-called because everyone who buys any book, even books like Applied Cryptography, probably also has bought Harry Potter. Not compensating for that issue almost certainly would reduce the effectiveness of the recommendations, especially since the recommendations from the two clustering methods likely also would have a tendency toward popular items.
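One standard correction for that bias -- my suggestion here, not something the paper describes -- is to normalize the covisitation count by the popularity of each item, producing a lift score that measures how much more often two items are covisited than independence would predict:

```python
def lift_score(covisit_count, count_a, count_b, total_visits):
    """Popularity-corrected covisitation: divide the observed covisit
    rate by the rate expected if visits to a and b were independent.
    A score near 1.0 means "everyone visits b anyway" (the Harry
    Potter problem); well above 1.0 means a genuine association.
    Illustrative correction, not from the paper."""
    expected = (count_a / total_visits) * (count_b / total_visits)
    observed = covisit_count / total_visits
    return observed / expected if expected > 0 else 0.0

# A blockbuster covisited with everything scores near 1.0...
blockbuster = lift_score(covisit_count=90, count_a=100, count_b=9000,
                         total_visits=10000)
# ...while a niche pair with the same raw covisit count scores far higher.
niche = lift_score(covisit_count=90, count_a=100, count_b=150,
                   total_visits=10000)
```

The two calls have identical raw counts, so a raw-count ranking would treat them the same; the lift score instead discounts the blockbuster, which is exactly the compensation the Harry Potter problem calls for.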

I have to say, having worked on this problem with Findory for the last four years, I feel like I know it well. Findory needs to generate recommendations for hundreds of thousands of users (not yet millions) in real-time. New articles enter and leave the system rapidly; newer stories tend to be the stories of interest. The recommendations must change immediately when someone reads something new.

This work at Google certainly is impressive. I would have loved to have access to the kind of resources -- the cluster, MapReduce, BigTable, the news crawl, the traffic and user behavior data, and hordes of smart people everywhere around -- that these Googlers had. Good fun, and an excellent example of the power of Google scale for trying to solve these kinds of problems.

10 comments:

jeremy said...

First, it appears that the cluster membership is not updated in real-time.

Why, if I may so densely ask, would you want to update cluster membership in real time? Isn't one of the big ideas behind personalization the notion that personalization works because of the stability of longer-term user interests? We discussed this just the other day.

So if user profiles are more stable over the longer term, why would you need to update cluster membership in real time? A weekly or even monthly rebuild of the clusters should suffice, non?

You can rely on covisitation heuristics for daily news churn ... and basically be led by where the members of your cluster go every day. But why would you need to update the membership of your cluster on a constant basis? I'm not saying you don't need to; I just don't quite see why you do.

Hey, Jeremy. Sorry, I wasn't clear. I meant the clusters of which a reader is a member, not the clusters themselves.

The issue there is that, if you express a strong new interest by reading several articles on a topic, the recommendations will not change quickly unless the system recognizes that you now should be a member of different clusters.

On the point you raise, there also may be an issue with the clusters drifting and becoming inaccurate over time, but that likely is less of an issue if, as you say, the clusters really represent strong and stable areas of interest. Even so, I have to wonder if a weekly or monthly full rebuild would be sufficient given that news articles appear rapidly and lose most of their value after just a couple days.

Greg, I was present for this talk today, and it was a good presentation. After the talk, I asked the speaker whether or not they had considered tournament-style Naive Bayes instead of PLSI, and he said that they hadn't, but that having to build the corpus would have defeated the unsupervised nature they were going for. I asked because in some cases Naive Bayes has been shown to outperform LSI (and obviously vice versa), so I thought it was something they'd want to have looked at. But the speaker obviously knows his problem area better than I do ;-)

That is a nice description of the paper, Greg. I was wondering how long it takes to build these clusters using methods like LSI. Aren't they too expensive, especially at 'Google' scale? And, for applications like this one, I wonder how many clusters there would be? On the order of thousands? I am just amazed that one can apply these costly methods to such a large repository.

The issue there is that, if you express a strong new interest by reading several articles on a topic, the recommendations will not change quickly unless the system recognizes that you now should be a member of different clusters.

Hmm. I guess I still have doubts about whether or not the recommendations should change quickly. Sometimes I'll get an email from a friend with a few interesting or funny links. Or I'll find something on Digg. I'll read intensely about those things for about twenty minutes, and then I'll never go back. In those cases, I would actually not want my cluster memberships to change at all.

Then again, if the updates really are real-time, maybe what you are saying is that, after my 20 minutes of intense activity are finished, the system would update itself again, and I would go back to my pre-diversionary personalization cluster membership profile. So my concerns are really non-issues. Right?

Hi, Jeremy. No, I am arguing that the system should change immediately and permanently when the user shows any new behavior.

I guess I do not entirely understand the counterargument. Are you saying that, if we took a system and delayed its learning for hours or days, that it would have no negative impact? At the very minimum, new users would be badly impacted, wouldn't they?

Or are you saying that knowledge of fine-grained behavior (e.g. clicking on a specific book) is less important than knowledge of general trends (e.g. an interest in computer books)? So, at some point, more data doesn't matter, because I already have a complete and reasonably accurate list of your high-level subject interests?

Or are you saying that what I am doing right now should be ignored if it appears to divert from what I have done in the past? The system should ignore tangents and focus on what I consistently appear to like?

Greg: I wasn't actually trying to make any counterargument. I was trying to restate what I thought you were saying, to make sure I understood it. In fact, I knew you were not saying that it should change permanently ... I was trying to agree with you on that.

But to quickly answer some of your other questions, I think Anonymous, above, pretty much sums it up. Some users might want rapid cluster re-membershipping, because they want help with everything they are doing right that moment. Other users might have a different web information browsing style, and go off on a lot of never-to-be-repeated tangents. Without seeing empirical data, I could not tell you which type of person is in the majority or what the best thing to do would be.

I only note that there probably do exist these two contradictory types of users: one with long-term stability and short-term diversions, and one with short-term exploration needs and less-focused long-term behavior.

If I may be abstract for a moment, I can make an analogy to the classic graph / AI search strategies of breadth first search vs. depth first search. If I am exploring a topic in a DFS manner, I probably do want rapid, real-time re-evaluation of cluster membership. The personalization needs to keep up with what I am doing right now, because I am going depth-first, headlong into new topic areas. On the other hand, if I am exploring a topic in a BFS manner, then I probably don't want my cluster membership to constantly be changing. I would want to remain more tied to a single "parent" node, while I explore all the "children" nodes (BFS) rather than have every new web page I visit become my new "parent" node (DFS).

But again, are most people BFS or DFS searchers and browsers? I really have no clue.

From the Google paper: "We use three test datasets for our comparative study. The first dataset, MovieLens dataset, consists of movie rating data collected using a web-based research recommender system. The dataset, after some pruning to make sure that each user has at least a certain number of ratings, contains 943 users, 1670 movies, and about 54,000 ratings, on a scale from 1 to 5. The second dataset consists..."

Cool, right? Nope, because the MovieLens data set they are talking about does not have 54,000 ratings. It has 100,000 ratings! And not 1670 movies, but 1682 movies. MovieLens has only two data sets, this one and another one with one million ratings.

Here is the info from the MovieLens site: "We currently have two datasets available. The first one consists of 100,000 ratings for 1682 movies by 943 users. The second one consists of approximately 1 million ratings for 3900 movies by 6040 users. Before using these datasets, please review the included readme files for the usage license."

Link: http://www.grouplens.org/taxonomy/term/14

And the MovieLens website is correct; I have been playing with this data set.

The story is: Google has removed 50% of the ratings from the MovieLens set, along with 12 movies. The MovieLens data set is already cleaned up (no user had fewer than 20 ratings, for example, and those who did not provide demographic info were taken out too).