A student blog of MIT CSAIL Theory of Computation Group

Menu

Category Archives: geometry

Summary

In this post I will show that any normed space that allows good sketches is necessarily embeddable into an space with close to . This provides a partial converse to a result of Piotr Indyk, who showed how to sketch metrics that embed into for . A cool bonus of this result is that it gives a new technique for obtaining sketching lower bounds.

Sketching

One of the exciting relatively recent paradigms in algorithms is that of sketching. The high-level idea is as follows: if we are interested in working with a massive object , let us start with compressing it to a short sketch that preserves properties of we care about. One great example of sketching is the Johnson-Lindenstrauss lemma: if we work with high-dimensional vectors and are interested in Euclidean distances between them, we can project the vectors on a random -dimensional subspace, and this will preserve with high probability all the pairwise distances up to a factor of .

It would be great to understand, for which computational problems sketching is possible, and how efficient it can be made. There are quite a few nice results (both upper and lower bounds) along these lines (see, e.g., graph sketching or a recent book about sketching for numerical linear algebra), but the general understanding has yet to emerge.

Sketching for metrics

One of the main motivations to study sketching is fast computation and indexing of similarity measures between two objects and . Often times similarity between objects is modeled by some metric (but not always! think KL divergence): for instance the above example of the Euclidean distance falls into this category. Thus, instantiating the above general question one can ask: for which metric spaces there exist good sketches? That is, when is it possible to compute a short sketch of a point such that, given two sketches and , one is able to estimate the distance ?

The following communication game captures the question of sketching metrics. Alice and Bob each have a point from a metric space (say, and , respectively). Suppose, in addition, that either or (where and are the parameters known from the beginning). Both Alice and Bob send messages and that are bits long to Charlie, who is supposed to distinguish two cases (whether is small or large) with probability at least . We assume that all three parties are allowed to use shared randomness. Our main goal is to understand the trade-off between (approximation) and (sketch size).

Arguably, the most important metric spaces are spaces. Formally, for we define to be a -dimensional space equipped with distance

(when this expression should be understood as ). One can similarly define spaces for ; even if the triangle inequality does not hold for this case, it is nevertheless a meaningful notion of distance.

It turns out that spaces exhibit very interesting behavior, when it comes to sketching. Indyk showed that for one can achieve approximation and sketch size for every (for this was established before by Kushilevitz, Ostrovsky and Rabani). It is quite remarkable that these bounds do not depend on the dimension of a space. On the other hand, for with the dependence on the dimension is necessary. It turns out that for constant approximation the optimal sketch size is .

Are there any other examples of metrics that admit efficient sketches (say, with constant and )? One simple observation is that if a metric embeds well into for , then one can sketch this metric well. Formally, we say that a map between metric spaces is an embedding with distortion, if

for every and for some . It is immediate to see that if a metric space embeds into for with distortion , then one can sketch with and . Thus, we know that any metric that embeds well into with is efficiently sketchable. Are there any other examples? The amazing answer is that we don’t know!

Our results

Our result shows that for a very important class of metrics—normed spaces—embedding into is the only possible way to obtain good sketches. Formally, if a normed space allows sketches of size for approximation , then for every the space embeds into with distortion . This result together with the above upper bound by Indyk provides a complete characterization of normed spaces that admit good sketches.

Taking the above result in the contrapositive, we see that non-embeddability implies lower bounds for sketches. This is great, since it potentially allows us to employ many sophisticated non-embeddability results proved by geometers and functional analysts. Specifically, we prove two new lower bounds for sketches: for the planar Earth Mover’s Distance (building on a non-embeddability theorem by Naor and Schechtman) and for the trace norm (non-embeddability was proved by Pisier). In addition to it, we are able to unify certain known results: for instance, classify spaces and the cascaded norms in terms of “sketchability”.

Overview of the proof

Let me outline the main steps of the proof of the implication “good sketches imply good embeddings”. The following definition is central to the proof. Let us call a map between two metric spaces -threshold, if for every :

implies ,

implies .

One should think of threshold maps as very weak embeddings that merely
preserve certain distance scales.

The proof can be divided into two parts. First, we prove that for a normed space that allows sketches of size and approximation there exists a -threshold map to a Hilbert space. Then, we prove that the existence of such a map implies the existence of an embedding into with distortion .

The first half goes roughly as follows. Assume that there is no -threshold map from to a Hilbert space. Then, by convex duality, this implies certain Poincaré-type inequalities on . This, in turn, implies sketching lower bounds for (the direct sum of copies of , where the norm is definied as the maximum of norms of the components) by a result of Andoni, Jayram and Pătrașcu (which is based on a very important notion of information complexity). Then, crucially using the fact that is a normed space, we conclude that itself does not have good sketches (this step follows from the fact that every normed space is of type and is of cotype ).

The second half uses tools from nonlinear functional analysis. First, building on an argument of Johnson and Randrianarivony, we show that for normed spaces -threshold map into a Hilbert space implies a uniform embedding into a Hilbert space—that is, a map , where is a Hilbert space such that

where are non-decreasing functions such that for every and as . Both and are allowed to depend only on and . This step uses a certain Lipschitz extension-type theorem and averaging via bounded invariant means. Finally, we conclude the proof by applying theorems of Aharoni-Maurey-Mityagin and Nikishin and obtain a desired (linear) embedding of into .

Open problems

Let me finally state several open problems.

The first obvious open problem is to extend our result to as large class of general metric spaces as possible. Two notable examples one should keep in mind are the Khot-Vishnoi space and the Heisenberg group. In both cases, a space admits good sketches (since both spaces are embeddable into -squared), but neither of them is embeddable into . I do not know, if these spaces are embeddable into , but I am inclined to suspect so.

The second open problem deals with linear sketches. For a normed space, one can require that a sketch is of the form , where is a random matrix generated using shared randomness. Our result then can be interpreted as follows: any normed space that allows sketches of size and approximation allows a linear sketch with one linear measurement and approximation (this follows from the fact that for there are good linear sketches). But can we always construct a linear sketch of size and approximation , where and are some (ideally, not too quickly growing) functions?

Finally, the third open problem is about spaces that allow essentially no non-trivial sketches. Can one characterize -dimensional normed spaces, where any sketch for approximation must have size ? The only example I can think of is a space that contains a subspace that is close to . Is this the only case?

One problem that one encounters a lot in machine learning, databases and other areas is the near neighbor search problem (NN). Given a set of points in a -dimensional space and a threshold the goal is to build a data structure that given a query reports any point from within distance at most from .

Unfortunately, all known data structures for NN suffer from the so-called “curse of dimensionality”: if the query time is (hereinafter we denote the number of points in our dataset ), then either space or query time is .

To overcome this obstacle one can consider the approximate near neighbor search problem (ANN). Now in addition to and we are also given an approximation parameter . The goal is given a query report a point from within distance from , provided that the neighborhood of radius is not empty.

It turns out that one can overcome the curse of dimensionality for ANN (see, for example, this paper and its references). If one insists on having near-linear (in ) memory and being subexponential in the dimension, then the only known technique for ANN is locality-sensitive hashing. Let us give some definitions. Say a hash family on a metric space is -sensitive, if for every two points

if , then ;

if , then .

Of course, for to be meaningful, we should have . Informally speaking, the closer two points are, the larger probability of their collision is.

Let us construct a simple LSH family for hypercube , equipped with Hamming distance. We set , where . It is easy to check that this family is -sensitive.

In 1998 Piotr Indyk and Rajeev Motwaniproved the following theorem. Suppose we have a -sensitive hash family for the metric we want to solve ANN for. Moreover, assume that we can sample and evaluate a function from relatively quickly, store it efficiently, and that . Then, one can solve ANN for this metric with space roughly and query time , where . Plugging the family from the previous paragraph, we are able to solve ANN for Hamming distance in space around and query time . More generally, in the same paper it was proved that one can achieve for the case of norms for (via an embedding by William Johnson and Gideon Schechtman). In 2006 Alexandr Andoni and Piotr Indyk proved that one can achieve for the norm.

Thus, the natural question arises: how optimal are the abovementioned bounds on (provided that is not too tiny)? This question was resolved in 2011 by Ryan O’Donnell, Yi Wu and Yuan Zhou: they showed a lower bound for and for matching the upper bounds. Thus, the above simple LSH family for the hypercube is in fact, optimal!

Is it the end of the story? Not quite. The catch is that the definition of LSH families is actually too strong. The real property that is used in the ANN data structure is the following: for every pair of points we have

if , then ;

if , then .

The difference with the definition of -sensitive family is that we now restrict one of the points to be in a prescribed set . And it turns out that one can indeed exploit this dependency on data to get a slightly improved LSH family. Namely, we are able to achieve for , which by a simple embedding of into -squared gives for (in particular, Hamming distance over the hypercube). This is nice for two reasons. First, we are able to overcome the natural LSH barrier. Second, this result shows that what “practitioners” have been doing for some time (namely, data-dependent space partitioning) can give advantage in theory, too.

In the remaining text let me briefly sketch the main ideas of the result. From now on, assume that our metric is . The first ingredient is an LSH family that simplifies and improves upon for the case, when all data points and queries lie in a ball of radius . This scheme has strong parallels with an SDP rounding scheme of David Karger, Rajeev Motwani and Madhu Sudan.

The second (and the main) ingredient is a two-level hashing scheme that leverages the abovementioned better LSH family. First, let us recall, how the standard LSH data structure works. We start from a -sensitive family and then consider the following simple “tensoring” operation: we sample functions from independently and then we hash a point into a tuple . It is easy to see that the new family is -sensitive. Let us denote this family by . Now we choose to have the following collision probabilities:

at distance ;

at distance

(actually, we can not set to achieve these probabilities exactly, since must be integer, that’s exactly why we need the condition ). Now we hash all the points from the dataset using a random function from , and to answer a query we hash and enumerate all the points in the corresponding bucket, until we find anything within distance . To analyze this simple data structure, we observe that the average number of “outliers” (points at distance more than ) we encounter is at most one due to the choice of . On the other hand, for any near neighbor (within distance at most ) we find it with probability at least , so, to boost it to constant, we build independent hash tables. As a result, we get a data structure with space and query time .

Now let us show how to build a similar two-level data structure, which achieves somewhat better parameters. First, we apply the LSH family for with , but only partially. Namely, we choose a constant parameter and such that the collision probabilities are as follows:

at distance ;

at distance ;

at distance .

Now we hash all the data points with and argue that with high probability every bucket has diameter . But now given this “bounded buckets” condition, we can utilize the better family designed above! Namely, we hash every bucket using our new family to achieve the following probabilities:

at distance ;

at distance .

Overall, the data structure consists of an outer hash table that uses the LSH family of Andoni and Indyk, and then every bucket is hashed using the new family. Due to independence, the collision probabilities multiply, and we get

at distance ;

at distance .

Then we argue as before and conclude that we can achieve .

After carefully optimizing all the parameters, we achieve, in fact, . Then we go further, and consider a multi-level scheme with several distance scales. Choosing these scales carefully, we achieve .