A Research Blog

High frequency moments via max-stability

In this post, I will present what I think is a simple(r) algorithm for estimating the frequency moment, or simply the norm of a vector, in the sketching/streaming model. In fact, the algorithm is just a weak embedding of -dimensional into of dimension (this viewpoint spares me from describing the streaming model precisely).

What do I mean by a weak embedding? We will get a randomized linear map such that, for any , we have that the image is a constant approximation to with some 90% probability. Since , the embedding is actually dimensionality-reducing.

The algorithm. Let be the input vector of dimension . The algorithm just multiplies entry-wise by some scalars, and then folds the vector into a smaller dimension using standard hashing. Formally, in step one, we compute as

where random variable is drawn from an exponential distribution . In step two, we compute from using a random hash function as follows:

where are just random .

is the output, that’s it. In matrix notation, , where is a diagonal matrix and is a sparse “projection” matrix describing the hash function .

Of course the fun part is to show that this works (simple algorithm is not necessarily simple analysis). I’ll give essentially the entire analysis below, which shouldn’t be bad.

Analysis. We’ll first see that the infinity norm of is correct, namely approximates , and then that the infinity norm of is correct as well (both with constant approximation with constant probability of success). In other words, step one is an embedding into , and step two is dimensionality reduction.

The claim about the infinity norm of follows from the “stability” property of the exponential distribution: if are exponentially distributed, and are scalars, then is distributed as where is also an exponential and .

Now, applying this stability property for we get that is distributed as .

Note that we already obtain a weak embedding of into (i.e., no dimensionality reduction). We proceed to show that the dimension-reducing projection does not hurt.

We will now analyze the max-norm of . The main idea is that the biggest entries of will go into separate buckets, while the rest of the “stuff” (small entries) will give a minor contribution only to each bucket. Hence, the biggest entry of will stick out in as well (and nothing bigger will stick out), preserving the max-norm. For simplicity of notation, let , and note that the largest entry of is about (as we argued above).

What is big? Let’s say that “big” is an entry of such that . With good enough probability, there are only at most such big items (because of exponential distribution), so they will all go into different buckets.

Now let’s look at the extra “stuff”, the contribution of the small entries. Let be the set of small entries. Fix some bucket index . We would like to show that the contribution of entries from that fall into bucket is small, namely, less than (the max entry of ).

Let’s look at . A straight-forward calculation shows that the expectation of is zero, and the variance is . If only we now could show that , we would be done by Chebyshev’s inequality…

How do we relate to ? Here comes the exponential distribution at rescue again. Note that since the expectation for an exponentially distributed is constant. Together with standard inter-norm inequality that , we have that .

Combining, we have that , which is enough to apply Chebyshev’s and conclude that the “extra stuff” in bucket is with constant probability.

We are almost done except for one issue: we would need the above for all buckets , and, in particular, we would like to have for a fixed with high probability (not just constant). To achieve this, we use a stronger concentration inequality, for example the Bernstein inequality, which allows us exploit the fact that for to achieve a high-probability statement.

Discussion. The use of “stability” of exponential distribution is similar to Piotr Indyk’s use of -stable distributions for sketching/streaming , , norms. Note that -stable distributions do not exist for ; so here the notion of “stability” is slightly different. In the former, one uses the fact that for random variables , which are -stable, we have that is also -stable with well-controlled variance. In the latter case, we use the property that the max of several “stable” distributions is another one: is distributed as (i.e., is “max-stable”). Note that this is useful for embedding into . As it turns out, such a transformation does not increase the norm of the vector much, allowing us to do the dimensionality reduction in .

For streaming applications, the important aspects are the small final dimension and the linearity of . This dimension of is optimal for this algorithm.