Tuesday, February 2, 2010

Last time we showed that chaining and linear probing enjoy O(1) expected running times per operation. However, this guarantee is fairly weak, and certainly does not explain the popularity of the schemes. If you had a computation that was expected to run for 4 hours, would you be happy to hear, say, that it might take 10 hours with 10% probability?

Worst case w.h.p. So, what bound can be put on the running time that hold with high probability? For chaining, simply apply the Chernoff bound. Given that a bin contains μ=O(1) elements in expectation, the probability that it contain Z elements is at most eZ-μ / (μ/Z)Z = (O(1)/Z)Z. To make this be n-c, set Z=Θ(lg n/lglg n).

Observe that this bound is tight. There are (n choose Z) ≈ nZ/Z! combinations of keys that can land in a designated bin, and some Z keys land there with probability b-Z=(2n)-Z. Thus, in expectation, at least one bin will see Z=Ω(lg n/lglg n) elements.

For linear probing, we can also reduce this to a balls-in-bins analysis, by defining a bin as L=O(lg n) consecutive locations in the array. In the running example b=2n, such a bin is full when Z=2μ=L. By Chernoff, this happens with probability eZ-μ / (μ/Z)Z = (e/4)L/2. Thus, the maximal run in O(lg n) w.h.p. Again, this is tight by a direct calculation.

Amortized bounds. Unfortunately, these worst-case1 bounds are not too satisfactory either, since O(lg n) is trivial to get by binary trees. If you grant me some lenience for my modeling, I will prove an O(1) amortized bound w.h.p., which I believe explains the power of the algorithms much better.

1 There is an unfortunate tendency of some TCS papers to use "worst case" when they really mean deterministic. I am much happier with the convention that worst case is opposite of amortized, at least when your paper has any connection to data structures.

Formally, I will prove the following, rather trivial statement:

Let T≥clg n, for a large enough constant c. Any T operations in chaining hashtables only touch O(T) memory w.h.p.

Fix the hash codes of the T elements, and define our "bin" to consist of these ≤T distinct hash codes. All other elements are still totally random, and we expect μ≤Tn/b=Θ(lg n) to fall into the bin. If μ≤Z/(2e), the Chernoff bound is eZ-μ/(μ/Z)Z ≤ 2-μ = high probability.

But does this actually mean that chaining has O(1) amortized running time? Formally, no: if I repeat a single query T times, the running time will be T times the running time of that query, i.e. a geometric random variable with no good concentration. Here is where I invoke a bit of lenience in my modeling: in practice, it is ridiculous to worry about repeating one of the last O(lg n) operations! The memory used by these recent updates will be fresh in the first level of cache, making a repetition cost essentially zero. (One may formally say that chaining has amortized O(1) running time w.h.p. in the external memory model with a cache of size Ω(lg n).)

A pretty way to understand the amortized bound is as a "buffer size" guarantee. The most demanding applications of hashtables are in analyzing a continuous stream of data, when operations need to be super-fast to keep up with the line speed. In such application, if the design is at all sensible, there will be a buffer between the network interface and our CPU. The goal is not necessarily to take O(1) time for every single operation, but to keep the buffer small. Our proof says that the buffer will not grow to more than T=O(lg n) w.h.p., if you can afford the average time/operation.

Remember from last time that we analyzed linear probing by building a binary tree over the array, and bounding the number of nodes that become dangerous (two-thirds full).

Let μ be the number of keys we expect under some node. First of all, if μ≫lg n, we do not need to worry about the node: it doesn't become dangerous w.h.p. Otherwise, we showed that the node becomes dangerous with probability O(1/μ2); if it does, we will pay a cost of μ.

Looking at T elements, I am dealing with ≤T nodes on each level, and I expect O(T/μ2) nodes to be dangerous. As long as T/μ2 ≥ clg n, Chernoff tells me that only O(T/μ2) nodes are dangerous w.h.p. Since we only deal with μ=O(lg n), I needed to set T=Ω(lg3n). With this bound, the number of memory locations accessed by the T operations is Σμ=2^i O(T/μ2)·μ = O(T) w.h.p.

I end with a question to my knowledgeable readers. By being more careful, I can prove T=Ω(lg1+εn) suffices for linear probing. Is it possible to prove T=Ω(lg n), as in chaining? (Perhaps I'm missing something obvious.)

Yes, there is some association to cell-probe, but I think it's not the main point. As I said in the post, I think it's more about the external memory model. The performance of hash tables is largely dictated by memory performance, and the cache really is much faster...

At some level, cell-probe upper bounds are the same as external memory upper bounds, if you define the cell to be of page size. Indeed, the external memory model tends to only focus on the number of locations probed, just like cell-probe. (With the extra useful guarantee that the algorithm is somewhat realistic, and doesn't invoke the halting problem in the middle or stuff like that...)

I've been enjoying your latest sequence of posts on chernoff bounds and hashing.

Just want to clarify your usage of worst case. I agree that worst case typically modifies the input (and not the coins). But what name do you give to the running time of a randomized algorithm with a performance guarantee that holds for the worst case input? "Worst case" doesn't seem quite right. I like your use of "worst case w.h.p."; would you use "expected worst case" if you only have the guarantee in expectation?

Well, I assume all bounds are for worst-case input, since this is what 99% of TCS is about. Whenever we are doing something else, we say it explicitly (average case algorithms, distributional complexity, adaptive algorithms, competitive algorithms, etc). So I don't think there's a need to specify that you analyze your algorithms for worst-case data.