
Thursday, June 30, 2011

Bob Morris and stream algorithms

Bob Morris, one of the early contributors to UNIX and an NSA cryptographer, died on Sunday. The New York Times has a remembrance that talks about his contribution to password security and cryptography in general. He also left a mark on algorithms: the abstract of his 1978 CACM paper on approximate counting reads as follows.

It is possible to use a small counter to keep approximate counts of large numbers. The resulting expected error can be rather precisely controlled. An example is given in which 8-bit counters (bytes) are used to keep track of as many as 130,000 events with a relative error which is substantially independent of the number n of events. This relative error can be expected to be 24 percent or less 95 percent of the time (i.e. σ = n/8). The techniques could be used to advantage in multichannel counting hardware or software used for the monitoring of experiments or processes.

In modern language, this can be re-rendered as:

It is possible to approximately count the number of elements in a stream using $O(\log \log n)$ bits, where $n$ is the length of the stream.

The idea is very simple. The trick is to count $C = \log n$ instead of $n$. Clearly, doing so requires only $\log\log n$ bits. Then any additive approximation to $C$ yields a multiplicative approximation to $n$.

So how do we count $\log n$? Assume for now that we're using base-2 logarithms. If the current counter value is $C$, our estimate of the stream length is about $2^C$, so the counter should next advance only after roughly $2^C$ more elements arrive. Of course, we don't know the true count, so we make the increment probabilistic: on each new element, toss a coin that comes up heads with probability $2^{-C}$, and increment the counter only if it lands heads.
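This update rule fits in a few lines of Python (a toy sketch for illustration, not Morris's original code; the function and variable names are mine). The estimator $2^C - 1$ is unbiased, since $\mathbb{E}[2^C] = n + 1$ after $n$ updates:

```python
import random

def morris_counter(stream_length, rng=random):
    """Approximately count stream_length events in a tiny counter.

    The counter C tracks roughly log2(n) rather than n itself,
    so it needs only about log2(log2(n)) bits.
    """
    c = 0
    for _ in range(stream_length):
        # Increment with probability 2^-C.
        if rng.random() < 2.0 ** -c:
            c += 1
    return c

def estimate(c):
    # 2^C - 1 is an unbiased estimate of n, since E[2^C] = n + 1.
    return 2 ** c - 1

# A single counter is noisy; averaging independent counters tightens it.
rng = random.Random(0)
n = 10_000
avg = sum(estimate(morris_counter(n, rng)) for _ in range(300)) / 300
print(f"true n = {n}, averaged estimate ~ {avg:.0f}")
```

Note how little state is kept: for $n = 10{,}000$ the counter itself hovers around $\log_2 n \approx 13$, comfortably inside a single byte.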

This basic scheme already yields a constant relative error in the resulting count. The next trick is to change the base of the logarithm from 2 to a parameter $a$ closer to 1 (incrementing with probability $a^{-C}$), and then reanalyze the method. Much work later, it can be shown that with roughly $\log \log n + \log(1/\epsilon)$ bits, you can get a multiplicative approximation of $1+\epsilon$.
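The base-$a$ variant is essentially a one-line change to the base-2 sketch (again a toy illustration, with names of my own choosing). The unbiased estimator becomes $(a^C - 1)/(a - 1)$, since $\mathbb{E}[a^C] = 1 + n(a-1)$; taking $a$ closer to 1 trades a slightly larger counter for a smaller variance:

```python
import random

def morris_counter_base(stream_length, a, rng=random):
    """Morris counter with base a: increment with probability a^-C."""
    c = 0
    for _ in range(stream_length):
        if rng.random() < a ** -c:
            c += 1
    return c

def estimate_base(c, a):
    # (a^C - 1)/(a - 1) is unbiased, since E[a^C] = 1 + n(a - 1).
    return (a ** c - 1) / (a - 1)

rng = random.Random(1)
n = 10_000
a = 1.1
avg = sum(estimate_base(morris_counter_base(n, a, rng), a)
          for _ in range(300)) / 300
print(f"true n = {n}, base-{a} averaged estimate ~ {avg:.0f}")
```

With $a = 1.1$ the counter reaches only about $\log_a n \approx 70$ for $n = 10{,}000$, so it still fits in 7 bits, while the per-run standard deviation drops from roughly $n/\sqrt{2}$ in the base-2 case to roughly $n\sqrt{(a-1)/2} \approx 0.22\,n$.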

Postscript

Stream algorithms are a good icebreaker in an algorithms class: they're unusual, seem to solve an impossible problem, and do it really well. I also like to point out to students that three of the more important streaming algorithms were designed well before people cared about "big data" or streaming. It's a valuable lesson in the importance of understanding "old" tools.

In case you're wondering if anyone still uses Morris counters, here's a paper from IJCAI 2009 that plugs them into a larger system to maintain counts of $n$-grams when building a statistical language model.