Detection of near-duplicate documents

Similarity

Identifying duplicate web pages is a nice practical problem.
Intuitively, two pages are duplicates ('the same') if they carry the same content.
The difficulty is that from the computer's point of view, pages are the same only if they match exactly, byte by byte.
From the user's point of view, however, pages are the same if they carry the same relevant content; irrelevant parts such as advertisements and embedded RSS feeds are ignored.

Since we are not interested in how the documents differ (as in plagiarism detection, for example),
it is beneficial to work only with hash values of the page contents - fingerprints.
For exact-duplicate matching, a simple comparison of fingerprints suffices, so a common hash function is suitable.
For near-duplicate matching, in contrast, the hash function must map similar documents close together
while still distributing documents evenly.
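With such a hash, deciding near-duplication reduces to comparing fingerprints by Hamming distance. A minimal Python sketch follows; the default threshold k = 3 for 64-bit fingerprints is the value used in [2], but it should be treated as a tunable parameter:

```python
def hamming_distance(a: int, b: int) -> int:
    # number of bit positions in which two fingerprints differ
    return bin(a ^ b).count("1")

def is_near_duplicate(fp_a: int, fp_b: int, k: int = 3) -> bool:
    # k = 3 is the threshold reported in [2] for 64-bit fingerprints;
    # the right value depends on the corpus and the tokenization
    return hamming_distance(fp_a, fp_b) <= k
```

Exact duplicates have distance 0; the closer two fingerprints are, the more similar the documents.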

Charikar's hash

One such function is Charikar's hash [1] (my favourite), also referred to as simhash [2].
It works well even for small (64-bit) fingerprints: see the evaluations in [2,3].

The hash is calculated in the following way:

The document is split into tokens (words, for example) or super-tokens (word tuples)

Each token is represented by its hash value; a traditional hash function is used

A weight is associated with each token

A vector V of integers is initialized to 0; the length of the vector corresponds to the desired hash size in bits

In a cycle over all token hash values (h), the vector V is updated:

the ith element is decreased by the token's weight if the ith bit of the hash h is 0, otherwise

the ith element is increased by the token's weight if the ith bit of the hash h is 1

Finally, the signs of the elements of V correspond to the bits of the final fingerprint
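The steps above can be sketched in Python. MD5 (truncated to 64 bits) stands in here for whatever traditional per-token hash function is actually used, and uniform weights of 1 are an illustrative default:

```python
import hashlib

def simhash(tokens, weights=None, bits=64):
    """Charikar-style fingerprint of an iterable of string tokens."""
    v = [0] * bits                # vector V, initialized to 0
    weights = weights or {}
    for token in tokens:
        # a traditional hash of the token; truncated MD5 is an
        # illustrative stand-in for the hash function actually used
        h = int.from_bytes(hashlib.md5(token.encode("utf-8")).digest()[:bits // 8], "big")
        w = weights.get(token, 1)  # uniform weight 1 unless specified
        for i in range(bits):
            if (h >> i) & 1:
                v[i] += w          # ith bit of h is 1: increase
            else:
                v[i] -= w          # ith bit of h is 0: decrease
    # the signs of the elements of V give the bits of the fingerprint
    return sum(1 << i for i in range(bits) if v[i] > 0)
```

Note that documents sharing most of their (weighted) tokens end up with mostly identical vector signs, and therefore with fingerprints that differ in only a few bits.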

Download

My experimental implementation is available here; you are welcome to use it for your own experiments ;).
It uses the Jenkins hash function.

References

[1] Charikar: Similarity Estimation Techniques from Rounding Algorithms, in Proceedings of the 34th Annual ACM Symposium on Theory of Computing, ACM Press, 2002

[2] Manku, Jain, Sarma: Detecting Near-Duplicates for Web Crawling, in Proceedings of the 16th International Conference on World Wide Web, ACM Press, 2007

[3] Henzinger: Finding Near-Duplicate Web Pages: A Large-Scale Evaluation of Algorithms, in Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM Press, 2006