Piotr Indyk - New Algorithms for Similarity Search in High Dimensions

Piotr Indyk is a Professor of Electrical Engineering and Computer Science at MIT. He joined MIT in 2000, after earning a PhD from Stanford University. Earlier, he received a Magister degree from Uniwersytet Warszawski in 1995. Piotr's research interests lie in the design and analysis of efficient algorithms. His specific interests include high-dimensional computational geometry, sketching and streaming algorithms, sparse recovery and compressive sensing. He has received the Sloan Fellowship (2003), the Packard Fellowship (2003) and the Simons Investigator Award (2013). His work on sparse Fourier sampling was named to Technology Review's "TR10" in 2012, and his work on locality-sensitive hashing received the 2012 Kanellakis Theory and Practice Award.

Similarity search is a fundamental computational task that involves searching for similar items in a large collection of objects. This task is often formulated as the nearest neighbor problem: given a database of n points in a d-dimensional space, devise a data structure that, given any query point, quickly finds its nearest neighbor in the database. The problem has a remarkable number of applications in a variety of fields, such as machine learning, databases, natural language processing and computer vision. Many of those applications involve data sets that are very high dimensional. Unfortunately, all known algorithms for this problem require query time or data structure size that is exponential in the dimension, which makes them inefficient when the dimension is high enough. As a result, over the last two decades there has been a considerable effort focused on developing approximate algorithms that can overcome this "curse of dimensionality".

A popular framework for designing such algorithms is Locality-Sensitive Hashing (LSH). It relies on the existence of efficiently computable random mappings (LSH functions) with the property that the probability of collision between two points is related to the distance between them: nearby points collide more often than distant ones. The framework is applicable to a wide range of distances and similarity functions, including the Euclidean distance. For the latter metric, it is known that the "basic" application of the LSH function yields, for example, a 2-approximate algorithm with a query time of roughly dn^(1/4), for a set of n points in a d-dimensional space.
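
To make the collision property concrete, here is a minimal sketch of one well-known LSH family for the Euclidean distance: a random Gaussian projection followed by quantization into buckets. It is an illustration only, not the data structure presented in the talk. The function names, the bucket width of 4.0, the dimension of 16 and the trial count are all assumptions chosen for the example.

```python
import math
import random


def make_hash(dim, width, rng):
    """Draw one random hash function h(x) = floor((a . x + b) / width).

    a has independent standard Gaussian coordinates and b is uniform in
    [0, width). The bucket width is a tuning parameter (assumed here).
    """
    a = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    b = rng.uniform(0.0, width)

    def h(x):
        projection = sum(ai * xi for ai, xi in zip(a, x))
        return math.floor((projection + b) / width)

    return h


def collision_rate(p, q, dim, width, trials, rng):
    """Estimate Pr[h(p) == h(q)] over independently drawn hash functions."""
    hits = 0
    for _ in range(trials):
        h = make_hash(dim, width, rng)
        if h(p) == h(q):
            hits += 1
    return hits / trials


rng = random.Random(0)  # fixed seed so the experiment is repeatable
dim = 16
origin = [0.0] * dim
near = [1.0] + [0.0] * (dim - 1)  # Euclidean distance 1 from origin
far = [2.0] + [0.0] * (dim - 1)   # Euclidean distance 2 from origin

near_rate = collision_rate(origin, near, dim, 4.0, 2000, rng)
far_rate = collision_rate(origin, far, dim, 4.0, 2000, rng)
print("collision rate at distance 1:", near_rate)
print("collision rate at distance 2:", far_rate)
```

The estimated collision rate for the closer pair should come out noticeably higher than for the farther pair. An LSH index exploits this gap by concatenating several such functions into a bucket key and repeating the construction over multiple tables.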

In this talk I will describe recent data structures that offer significant improvements over the aforementioned bounds. The improvement is achieved by performing *data-dependent* hashing, which constructs the random hash functions in a way that depends on the data distribution. I will also describe a recent implementation of the "core" component in the aforementioned algorithms, which empirically outperforms widely used variants of LSH. The implementation is available at http://falconn-lib.org/