Posts Tagged ‘Locality Sensitive Hashing’

We’ve shown how Jaccard Similarity Coefficient can be used to measure similarity between documents when these documents are represented by a bag of words. MinHash can be used to create document fingerprints which represent the document in a small number of bytes, and can also be used to estimate document similarity using Jaccard Similarity Coefficient.

The next problem is how to find similar documents in a large collection of documents. Imagine you have 12,000 documents and you want to find pairs of most similar documents. The brute-force approach is to compare each document with each other document, using MinHash and Jaccard Similarity Coefficient. Assuming each comparison takes 10 milliseconds this operation will take around 16 days, since each document is compared with each other document.

A faster solution is to use Locality Sensitive Hashing (LSH). This takes the MinHash values for documents and hashes the MinHash values so they hash into the same bucket if they are similar. Buckets are organized into bands, with a number of documents per band (20 is a commonly used value) Documents in the same bucket in a band are then candidates for being similar documents. The algorithm is described fully in [1].

This code loads MinHash values for documents from a database (the parameters don’t matter). “mhfs.hashes” is an int[,] array with columns for the hash values (100 in this case) and a row for each document (about 12,000). It then finds similar documents for all 12,000 documents.

How long does this take to execute? The Calc() stage took 265 Milliseconds, and the searching for similar documents for all 12,000 individual documents took 948 Milliseconds when single threaded. Somewhat shorter than 16 days 🙂

Important: Both MinHash similarity testing and finding similar documents with LSH are estimations – they will result in Type I errors (false positive, documents that are reported similar, but are not similar) and Type II Errors (false negatives, documents that are similar but are not reported as similar). Type I errors are easy to eliminate – just go back to the MinHash values or the bag of words associated with the document and calculate the similarity. Type II errors cannot be easily rectified without resorting to the brute-force approach.

Note that the time to compute LSH with MinHash is not dependent on the size of the documents, only on the number of documents and number of MinHash functions used.