Friday, February 15, 2008

Ranking using Indiana University's user traffic

Mark Meiss gave an great talk at WSDM 2008 on a fascinating project where they tapped and anonymized all traffic into and out of Indiana University, then analyzed the behavior of search users. Their paper, "Ranking Web Sites with Real User Traffic" (PDF), covers most of the talk and includes additional details.

The work has several surprising conclusions. First, and most importantly, they authors argue that "PageRank is quite a poor predictor of traffic ranks for the most popular portion of the Web." They say this is because the basic assumptions of PageRank simply are not true.

Specifically, PageRank assumes every link from a page is followed with equal probability, but their data shows that "a few [links] carry a disproportionate amount of traffic while most carry very little traffic." When they attempted to compensate for this with a version of PageRank they called Weighted PageRank (where the links are weighted based on click traffic), they found it helped only a little.

This lead the authors to conclude that the other two assumptions of page rank -- that people have the same chance of jumping from (ending their session) any particular page and equal probability of jumping to (starting a new session) on any particular page -- are false and problematic. From the paper:

People are much more likely to jump to a few very popular sites than to the great majority of other sites.

People follow many more links from a few very popular hubs than from the great majority of less popular sites.

Some sites are much more likely to be the starting or ending points of surfing sessions.

Finally, the authors suggest that relevance rank should be informed by click data, but note that "such steps are likely to amplify the search bias toward already popular sites." In the talk, an audience member also noted that such steps may be susceptible to click spam, which is even easier to do than link spam for those wanting to manipulate search results.

It is worth pointing out however, as I have done before, that naive PageRank has been under assault by spammers for many years and almost certainly is no longer used by any of the search engines in the original form, not without layers upon layers of efforts to eliminate link spam and ferret out any meaning remaining in the chaos that the link graph has become. As compelling as this paper's conclusions are, it could be the case that their version of PageRank naively followed links so manipulated by link spammers as to completely confuse the relevance rank, producing the poor correlation between PageRank and highly trafficked sites that they saw.

In addition to the thoughts on PageRank, the paper had several other very interesting results as well. They noted that only 5% of traffic originated from search hosts, "a surprisingly small fraction." They noted that 54% of traffic "does not have a referrer page, meaning that users type the URL directly, click on a bookmark, or click on a link in e-mail" at a much higher rate than one might expect. Finally, they noted strong recency and 24 hour trends in traffic data, saying that "47% of the clicks at any given time are predicted by the clicks from the previous day at the same time" and that, though the clicks from the previous three hours are a strong predictor of clicks for the current hour, after four hours, "the requests from the previous day yield higher precision and recall."

In all, an excellent paper, probably my favorite of the conference. Do not miss it. It is well worth reading.

For more on the thoughts in the paper on using click data for relevance rank, please see also my earlier post, "Actively learning to rank", that discusses an excellent KDD 2007 paper by Filip Radlinski and Thorsten Joachims.

This was one of my favorite papers to. A very impressive experiment at a HUGE scale.

Although their results are very interesting, I'm not sure they're comparing apples to apples. PageRank is not intended to be a predictor of traffic or popularity. Its intended to measure authority -- not the same thing. I'm not sure what this implies as far as the utility of PageRank in web ranking.

By the way, it was good meeting you at WSDM. Best of luck at LiveLabs.

Very interesting analysis and summarization of the paper. The problem that I see with a lot of the PageRank vs. Traffic ranking vs. any other metric is that they assume all users are the same. This really needs to be broken into various types of users to determine what each segment acts like. For example, a highly technical crowd tends to use Firefox, uses rss or some other feed as a referral source and types urls directly. Something like that could be very useful.

Thanks for the nice post and also all the comments make good points. Just a note in response to Mike Dierken and dp-maxime: IU (where the traffic data is collected) does not have a firewall or proxy server in place. Also, although as you (and a questioner at the conference) point out, users can turn the referer off, we estimate this to be a small fraction of the about 100,000 users in our sample. Cheers!

Greg, It is often overlooked that PageRank only really claims to be good for very general search queries - certainly not for the long tail (I believe the original paper states this explicitly). I tend to think of PageRank now more as a brand than an elegant algorithmic innovation.

matthewhurst: But didn't Google buy the Stanford startup company Kaltix in 2003? From the papers I've read, I think they were working on a personalized version of pagerank. That is, they were moving away from generalized web search into personalized web search. (This is what rob diana was alluding to, I believe). So now they've had 5 years to make PageRank personal, right?

Hi, Matt. On the long tail and PageRank not doing well there, this paper seems to indicate that PageRank performs much better on the long tail, at least if you measure performance by the correlation between PageRank and traffic to the pages. Figure 9 and the discussion in Section 6.1 is what I am referring to.

But, on your broader point, I agree that PageRank is more of a brand at this point than the key algorithm for Google's relevance rank. Saul Hansell at the NYT reported that Google has "many thousands of interlocking equations" that they have manually tested and added to their relevance rank. If PageRank still exists somewhere under all that, its influence almost certainly is limited.