Comments on Geeking with Greg: Google News Personalization paper
Greg Linden
Updated 2015-03-16T10:12:42.472-07:00

2008-01-24T05:26:00.000-08:00

I'm late to the party, but as far as I understand, the covisitation-based algorithm is exactly what this paper calls ItemRank (http://ijcai.org/papers07/Papers/IJCAI07-444.pdf). Or is there a difference?

Posted by Boo

2007-05-12T22:01:00.000-07:00

Google "Cooking Results"!?

From the Google paper: "We use three test datasets for our comparative study. The first dataset, MovieLens dataset, consists of movie rating data collected using a web-based research recommender system.
The dataset, after some pruning to make sure that each user has at least a certain number of ratings, contains 943 users, 1670 movies, and about 54,000 ratings, on a scale from 1 to 5. The second dataset consists..."

Cool, right?
Nope, because the MovieLens data set they are talking about does not have 54,000 ratings. It has 100,000 ratings! And it has 1682 movies, not 1670.
MovieLens has only two data sets, this one and another one with one million ratings.
Here is the info from the MovieLens site:
"We currently have two datasets available. The first one consists of 100,000 ratings for 1682 movies by 943 users. The second one consists of approximately 1 million ratings for 3900 movies by 6040 users.
Before using these datasets, please review the included readme files for the usage license."
Link: http://www.grouplens.org/taxonomy/term/14

And the MovieLens website is correct; I have been playing with this data set.

THE STORY IS: Google has removed about 50% of the ratings and 12 movies from the movie set. The MovieLens data set is already cleaned up (no user had fewer than 20 ratings, for example, and those who have not provided demographic info have been taken out too).

Dr. S.

Posted by Anonymous

2007-05-12T10:26:00.000-07:00

Greg: I wasn't actually trying to make any counterargument. I was trying to restate what I thought you were saying, to make sure I understood it. In fact, I knew you were not saying that it should change permanently; I was trying to agree with you on that.

But to quickly answer some of your other questions, I think Anonymous, above, pretty much sums it up. Some users might want rapid cluster re-membershipping, because they want help with everything they are doing right that moment. Other users might have a different web information browsing style, and go off on a lot of never-to-be-repeated tangents. Without seeing empirical data, I could not tell you which type of person is in the majority or what the best thing to do would be.

I only note that there probably do exist these two contradictory types of users: one with long-term stability and short-term diversions, and one with short-term exploration needs and less-focused long-term behavior.

If I may be abstract for a moment, I can make an analogy to the classic graph / AI search strategies of breadth-first search vs. depth-first search. If I am exploring a topic in a DFS manner, I probably do want rapid, real-time re-evaluation of cluster membership.
The personalization needs to keep up with what I am doing right now, because I am going depth-first, headlong into new topic areas. On the other hand, if I am exploring a topic in a BFS manner, then I probably don't want my cluster membership to be constantly changing. I would want to remain more tied to a single "parent" node, while I explore all the "children" nodes (BFS) rather than have every new web page I visit become my new "parent" node (DFS).

But again, are most people BFS or DFS searchers and browsers? I really have no clue.

Posted by jeremy

2007-05-11T23:51:00.000-07:00

"The system should ignore tangents and focus on what I consistently appear to like?"

Wouldn't that also depend on the user? That would also have to be part of the cluster. Some users like tangents more often than other users.

Posted by Anonymous

2007-05-11T18:34:00.000-07:00

Hi, Jeremy. No, I am arguing that the system should change immediately and permanently when the user shows any new behavior.

I guess I do not entirely understand the counterargument. Are you saying that, if we took a system and delayed its learning for hours or days, it would have no negative impact? At the very minimum, new users would be badly impacted, wouldn't they?

Or are you saying that knowledge of fine-grained behavior (e.g. clicking on a specific book) is less important than knowledge of general trends (e.g. an interest in computer books)?
So, at some point, more data doesn't matter, because I already have a complete and reasonably accurate list of your high-level subject interests?

Or are you saying that what I am doing right now should be ignored if it appears to divert from what I have done in the past? The system should ignore tangents and focus on what I consistently appear to like?

Posted by Greg Linden (http://www.blogger.com/profile/09216403000599463072)

2007-05-11T18:07:00.000-07:00

"The issue there is that, if you express a strong new interest by reading several articles on a topic, the recommendations will not change quickly unless the system recognizes that you now should be a member of different clusters."

Hmm. I guess I still have doubts about whether or not the recommendations should change quickly. Sometimes I'll get an email from a friend with a few interesting or funny links. Or I'll find something on Digg. I'll read intensely about those things for about twenty minutes, and then I'll never go back. In those cases, I would actually not want my cluster memberships to change at all.

Then again, if the updates really are real-time, maybe what you are saying is that, after my 20 minutes of intense activity are finished, the system would update itself again, and I would go back to my pre-diversionary personalization cluster membership profile. So my concerns are really non-issues. Right?

Posted by jeremy

2007-05-10T21:32:00.000-07:00

That is a nice description of the paper, Greg.
I was wondering how long it takes to build these clusters using methods like LSI. Aren't they too expensive, especially at 'Google' scale?
And, for applications like this one, I wonder how many clusters there would be. On the order of thousands?
I am just amazed at how one can apply these costly methods to such a large repository.

Posted by Shirish (http://www.blogger.com/profile/12091474325051553432)

2007-05-10T20:09:00.000-07:00

Greg, I was present for this talk today and it was a good presentation. After the talk, I asked the speaker whether or not they had considered tournament-style Naive Bayes instead of PLSI and he said that they hadn't, but that having to build the corpus would have defeated the unsupervised nature they were going for. I asked because in some cases Naive Bayes has been shown to outperform LSI (and obviously vice versa) so I thought it was something they'd want to have looked at, but the speaker obviously knows his problem area better than I do ;-)

Posted by codeslinger (http://www.blogger.com/profile/18275795191015170220)

2007-05-10T19:45:00.000-07:00

Hey, Jeremy. Sorry, I wasn't clear. I meant the clusters of which a reader is a member, not the clusters themselves.

The issue there is that, if you express a strong new interest by reading several articles on a topic, the recommendations will not change quickly unless the system recognizes that you now should be a member of different clusters.

On the point you raise, there also may be an issue with the clusters drifting and becoming inaccurate over time, but that likely is less of an issue if, as you say, the clusters really represent strong and stable areas of interest.
Even so, I have to wonder if a weekly or monthly full rebuild would be sufficient given that news articles appear rapidly and lose most of their value after just a couple of days.

Posted by Greg Linden (http://www.blogger.com/profile/09216403000599463072)

2007-05-10T18:19:00.000-07:00

"First, it appears that the cluster membership is not updated in real-time."

Why, if I may so densely ask, would you want to update cluster membership in real time? Isn't one of the big ideas behind personalization the notion that personalization works because of the stability of longer-term user interests? We discussed this just the other day.

So if user profiles are more stable over the longer term, why would you need to update cluster membership in real time? A weekly or even monthly rebuild of the clusters should suffice, non?

You can rely on covisitation heuristics for daily news churn, and basically be led by where the members of your cluster go every day. But why would you need to update the members of your cluster on a constant basis? I'm not saying you don't need to; I just don't quite see why you do.

Posted by jeremy