Monday, November 23, 2009

I mentioned in a previous posting that I recommend smoothing rate estimates using offsets on numerator (number of events) and denominator (time period). While it is possible to build a principled prior distribution and do very detailed rate estimation based on this, I find that for "top-100" lists and for "hot-100" lists, it is more useful to adjust the prior with larger strategic goals in mind.

In particular, what I like to do is set the ratio of the offsets to position an item with no data whatsoever (event count = 0, time = 0) to be somewhere in the top half of all items, but generally not in the top 10%. This expresses a general optimism about new items that allows them to achieve high rankings with a very modest burst of enthusiasm from the audience and forces them to provide some proof that they are dogs.

Once this ratio is set, it remains to set the actual magnitudes. This is done by deciding how many events you want an item to have, or how long you want it to languish, before the data overcome the prior. If you want the first 10 events to have equal weight to the prior, set the numerator offset to 10. Alternatively, if you want the first day of data to have equal weight to the prior, set the denominator offset to 1 day. The other offset then follows from the ratio you already fixed.

Done. Works. Simple.
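
As a rough illustration, the recipe might be sketched in Python like this. The function names, the 70th-percentile target, and the use of existing items' observed rates as the reference population are all assumptions made for the example, not a description of any particular production system.

```python
import statistics


def choose_offsets(observed_rates, percentile=70, event_offset=10.0):
    """Pick (event_offset, time_offset) for the smoothed rate estimate.

    The ratio event_offset / time_offset is the rate assigned to an item
    with no data. Here it is set to the given percentile of the rates of
    existing items (a hypothetical choice: top half, but not top 10%).
    event_offset says how many events carry the same weight as the prior.
    """
    # quantiles(n=100) returns 99 cut points; index p - 1 is the p-th percentile.
    cuts = statistics.quantiles(observed_rates, n=100, method="inclusive")
    prior_rate = cuts[percentile - 1]
    time_offset = event_offset / prior_rate
    return event_offset, time_offset


def smoothed_rate(events, elapsed, event_offset, time_offset):
    """Rate estimate with offsets on the numerator and the denominator."""
    return (events + event_offset) / (elapsed + time_offset)
```

With these offsets, `smoothed_rate(0, 0, event_offset, time_offset)` returns the chosen prior rate, and an item's own data start to outweigh the prior once it has accumulated more than `event_offset` events.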

I have done the much more complex effort of building detailed prior models and actually estimating rates in a completely principled fashion but I have found two things:

a) business people have other agendas than pure estimation and they want a vote

b) the early estimates are pretty unreliable until the data dominate the prior (duh!). Thus, you may as well set up the prior to make (a) work.

Sunday, November 22, 2009

"Transactional Data Mining" is being tweeted more than any other document on SlideShare right now. So we've put it on the homepage of SlideShare.net (in the "Hot on Twitter" section).

That sounds sooo exciting. People magazine, here I come!

Of course, when you look into the matter, I really don't think that I am going to have to worry about dodging paparazzi any time soon. What they mean by "tweeted more" seems to be 3 times in a day. Sounds like they should have read my paper about surprise and coincidence!

In fact, for applications like popularity ranking it is often really important to have a robust idea about what is hot and what is not. There are two very common problems to solve: one is to find out which items are rapidly rising, and the other is to put a reasonable rank onto things.

The problem of what is rising is surprisingly well solved by ranking items by the log of the ratio of the number of hits/views/tweets in a recent period of time to the number of hits in a somewhat older (and typically longer) period of time. Well, not quite the ratio. In practice, it is better to offset both numerator and denominator by constants. These are best chosen by looking at the results, but they are actually well-founded in terms of MAP estimators for rates. The reason that this works so well is that popularities are typically distributed in a roughly Zipfian way, so the log of the rate is linearly related to the log of the rank.

A large change in the log-rank of an item is a great indicator of dramatically rising popularity, but rank itself is a pretty fuzzy measurement with small counts (the guys at SlideShare need to hear this). Log of the rate, however, is linearly related to the log of rank, so change in log-rank is proportional to change in log-rate. Moreover, the offset trick when computing the rate ratio largely deals with the small count problems that you have when something goes from no hits to one hit in a time period.
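
A minimal sketch of such a score in Python follows. The name `hotness` and the offset values used in the example are hypothetical; in practice the offsets would be tuned by looking at the resulting lists.

```python
import math


def hotness(recent_hits, older_hits, num_offset, den_offset):
    """Log of the offset ratio of recent hits to older hits.

    num_offset and den_offset are the smoothing constants; they keep the
    score finite and calm when either count is zero or tiny.
    """
    return math.log((recent_hits + num_offset) / (older_hits + den_offset))


# Illustrative offsets only (an assumption, not a recommendation).
one_lonely_hit = hotness(recent_hits=1, older_hits=0, num_offset=10.0, den_offset=10.0)
real_surge = hotness(recent_hits=200, older_hits=100, num_offset=10.0, den_offset=10.0)
```

Without the offsets, the item going from no hits to one hit would get an infinite score. With them, `one_lonely_hit` stays close to zero and `real_surge` ranks well above it.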

The second problem is computing what the rank of an item really is given bursty, nasty data like hit counts. The problem is that you only have a nice tight estimate of the hit rate for the most popular items. Since you want recent rankings, you want to use a shorter period of time for your estimates, so the problem of small counts is even worse. Conceptually, a very principled way to deal with the problem is to embrace the uncertainty: sample the rates associated with each item given the data you have, then rank the items by the sampled rates. Do this a bunch of times and you have the posterior distribution of the ranks for each item. Moreover, you also have a lookup table that tells you what any particular observed rate means relative to what rank the item might have.
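
One way to sketch this sampling in Python is shown below. It assumes hit counts are Poisson with a Gamma prior on each rate, so that the posterior for a rate is Gamma with shape `count + event_offset` and scale `1 / (elapsed + time_offset)`. That model, the function names, and the default offsets are assumptions made for the illustration.

```python
import random
from collections import Counter


def sample_ranking(counts, elapsed, event_offset=1.0, time_offset=1.0, rng=random):
    """Draw one plausible ranking: sample a rate per item, sort by it."""
    scale = 1.0 / (elapsed + time_offset)
    draws = {
        item: rng.gammavariate(count + event_offset, scale)
        for item, count in counts.items()
    }
    return sorted(draws, key=draws.get, reverse=True)


def rank_distribution(counts, elapsed, n_samples=1000, **kwargs):
    """Tally how often each item lands at each rank (1 is the top)."""
    tally = {item: Counter() for item in counts}
    for _ in range(n_samples):
        ranking = sample_ranking(counts, elapsed, **kwargs)
        for rank, item in enumerate(ranking, start=1):
            tally[item][rank] += 1
    return tally
```

For an item with hundreds of hits, nearly all of the tally piles up on a single rank. For items with a handful of hits, the tally spreads across many ranks, which is exactly the fuzziness that a single point estimate hides.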

This sampling can be done in a few seconds in R. I think it is cooler to add dithering to the ranked list of items so that each time you look at the list, you get another sample from this distribution. This means that items that might plausibly be in the top 100 or 1000 actually get to be there some of the time, and items that are definitely in the top few places are always where they should be. It also adds dynamism to your top-n list, which is always good on the web.
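
A dithered top-n list could be sketched in Python along the following lines, drawing a fresh sample on every page view. As before, the Gamma posterior for each rate, the function name `dithered_top_n`, and the default offsets are assumptions for illustration only.

```python
import random


def dithered_top_n(counts, elapsed, n, event_offset=1.0, time_offset=1.0, rng=random):
    """Return one freshly sampled top-n list; call once per page view."""
    scale = 1.0 / (elapsed + time_offset)
    sampled = {
        item: rng.gammavariate(count + event_offset, scale)
        for item, count in counts.items()
    }
    return sorted(sampled, key=sampled.get, reverse=True)[:n]
```

Called repeatedly on something like `{"hit": 1000, "maybe1": 4, "maybe2": 3}`, the clear winner should hold the top slot essentially every time while the two marginal items trade places below it.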