Description

Anti-spamming has become one of the most important challenges for web search engines and has recently attracted increasing attention in both industry and academia. Since most search engines now use link-based ranking algorithms, link-based spamming has become a major threat. In this paper, we show that the popular link-based ranking algorithm PageRank, while used successfully in the Google search engine, has a "zero-one gap" flaw that can potentially be exploited to spam PageRank results easily. The "zero-one gap" problem arises from the current ad hoc way of computing the transition probabilities in the random surfing model. We propose a novel algorithm, DirichletRank, which computes these probabilities in a more principled way, based on Bayesian estimation with a Dirichlet prior. DirichletRank is a variant of PageRank, but it does not have the "zero-one gap" problem and is analytically shown to be substantially more resistant to link farm spam than PageRank. Simulation experiments using real web data show that, compared with the original PageRank, DirichletRank is significantly more robust against several typical link spams and is, in general, more stable under link perturbations. Moreover, experimental results also show that DirichletRank is more effective than PageRank because of its more reasonable allocation of transition probabilities. Since DirichletRank can be computed as efficiently as PageRank, it scales to large web applications.
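The abstract does not state the transition probabilities themselves, so the following is only a minimal sketch of what the "zero-one gap" could look like, under stated assumptions. It assumes the common PageRank convention in which a page with no out-links always jumps at random (probability 1) while any page with at least one out-link jumps with a fixed probability of 1 minus the damping factor. For the Dirichlet-smoothed alternative it assumes a jump probability of the form mu / (n + mu), where n is the number of out-links. The function names, the damping value 0.85, and the prior weight mu = 20 are hypothetical choices for illustration, not values taken from the report.

```python
def pagerank_jump_prob(out_degree, damping=0.85):
    """Random-jump probability under the assumed standard PageRank convention.

    A dangling page (no out-links) always jumps; any other page jumps with
    probability 1 - damping, regardless of how many out-links it has.
    """
    if out_degree == 0:
        return 1.0
    return 1.0 - damping


def dirichlet_jump_prob(out_degree, mu=20.0):
    """Random-jump probability under an assumed Dirichlet-smoothed form.

    mu is a hypothetical prior weight; the probability decays smoothly
    from 1 as the number of out-links grows.
    """
    return mu / (out_degree + mu)


if __name__ == "__main__":
    # Compare how the jump probability changes with the number of out-links.
    for n in (0, 1, 2, 10, 100):
        print(n, pagerank_jump_prob(n), round(dirichlet_jump_prob(n), 3))
```

Under these assumptions, adding a single out-link to a dangling page drops the PageRank jump probability abruptly from 1 to 1 minus the damping factor, whereas the smoothed form changes only slightly, which is one way to read the gap that the abstract describes.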

You are granted permission for the non-commercial reproduction, distribution, display, and performance of this technical report in any format, BUT this permission is only for a period of 45 (forty-five) days from the most recent time that you verified that this technical report is still available from the University of Illinois at Urbana-Champaign Computer Science Department under terms that include this permission. All other rights are reserved by the author(s).