A popular approach in data analysis is to represent a dataset in a high-dimensional feature space, and reduce a given task to a geometric computational problem. However, most of the classic geometric algorithms scale poorly as the dimension grows and are typically not applicable to the high-dimensional regime. This necessitates the development of new algorithmic approaches that overcome this “curse of dimensionality”. In this talk I will give an overview of my work in this area.

* I will describe new algorithms for the high-dimensional Nearest Neighbor Search (NNS) problem, where the goal is to preprocess a dataset so that, given a query object, one can quickly retrieve one or several closest objects from the dataset. Our algorithms improve, for the first time, upon the popular Locality-Sensitive Hashing (LSH) approach.

* Next, I will show how to make several algorithmic ideas underlying the above theoretical results practical. This yields an implementation which is competitive with the state of the art heuristics for the NNS problem. The implementation is a part of FALCONN: a new open-source library for similarity search.

* Finally, I will talk about limits of distance-preserving sketching, i.e., compression methods that approximately preserve the distances between high-dimensional vectors. In particular, we will show that for the well-studied Earth Mover Distance (EMD) such efficient compression is impossible.

The common theme that unifies these and other algorithms for high-dimensional data is the use of “efficient data representations”: randomized hashing, sketching, dimensionality reduction, metric embeddings, and others. One goal of the talk is to describe a holistic view of all of these techniques.