Pages

Thursday, June 08, 2006

SoCG 2006: Locality Sensitive Hashing.

After many years of lugging a laptop around with me at conferences, I decided to leave it in the room this time. It helps immensely with my focus during talks, and even helps during the breaks, when I can either talk to people or read the proceedings, rather than check email ... and the web ... and some blogs ... and oops ! one session's blown away.

Locality-sensitive hashing is a suprisingly effective "geometric" analogue of hashing first developed by Piotr Indyk and Rajeev Motwani. In standard hashing, the goal is to maintain a set of elements S drawn from a large universe U so that you can answer the question "is i in S" really quickly. Because U is typically much larger than S, what you'd like is a storage structure that uses space proportional to the size of S, rather than the size of U (which would be easy to do).

Hashing is one of the most elegant, profound and practical ideas to come out of the study of algorithms and data structures. It's almost impossible to find any serious program that doesn't need some kind of hash data structure. A detailed post on hashing will have to wait for another day though..

Locality-sensitive hashing is a way to take hashing to a geometric arena (you knew geometry had to show up sooner or later). Suppose that instead of merely having a set of elements from a universe, I had a set of points in a metric space. For example, maybe I have points situated in 1o-dimensional space with the normal Euclidean distance.

My goal is the same: I want to store points from the space in a data structure so I can ask the question: "is i in S ?". But now that points have distances between them, I can ask the related question, "is i near some point of S ?". Locality-sensitive hashing gives you a data structure that answers such questions approximately. More precisely, given parameters c, r, p, q, the structure guarantees that if two elements are within distance r of each other, then they are hashed to the same bucket with probability p, and if they are further than c * r, then this happens with probability q. In general, you want p to be much larger than q of course, and you'd like c to be as close to 1 as possible.

The actual construction is quite simple (you create a collection of hash functions and map each point to a vector of hash values), and has had numerous applications. What controls the efficacy of the scheme is the parameter r = log(1/p)/log(1/q), which affects both the space and time bounds of the algorithm; the smaller r is, the better. Roughly speaking, the space required by an algorithm is n^(1+r) and the query time is n^r.

Since c controls the quality of the hash, and r controls its efficiency, these two parameters vary inversely. In fact, for l1, it was shown earlier that r <= 1/c. What this paper shows is that for any lp norm, r >= 0.462/cp. Specifically, this means that the bound for l1 is almost tight. Together with new work by Andoni and Indyk showing that LSH for l2 can be done with r <= 1/c2, this gives almost matching upper and lower bounds for LSH for a range of norms.

It's a very nice result, and quite simple. In fact, for those who grumble that papers with few pages are not viewed kindly by PCs, this paper is only 4 pages long.

Update (6/11/06): As Piotr points out in the comments, the bounds are not quite matched. There is a constant factor gap between the upper and lower bounds, and thus some more work to do. Damn you, big-O notation !!!