{papapetrou, siberski}@L3S.de, norbert.fuhr@uni-due.de

Abstract:
Text clustering is an established technique for improving quality in information retrieval,
in both centralized and distributed environments. However, in highly distributed
environments, such as peer-to-peer networks, current clustering
algorithms fail to scale. Our algorithm for peer-to-peer clustering achieves
high scalability by using a probabilistic approach for assigning
documents to clusters. It enables a peer to compare each of its documents
only with very few selected clusters, without significant loss of
clustering quality. The algorithm offers probabilistic guarantees
for the correctness of each document assignment to a cluster.
Extensive experimental evaluation with up to 100,000 peers
and 1 million documents demonstrates the scalability and
effectiveness of the algorithm.