2008-11-09

In a conversation over reddit, reader nostrademons suggested Cuckoo Hashing as a hash scheme with constant-time worst-case look-ups. References to the algorithm abound, but implementations are scarce, so I set out to write code. It turned out to be more difficult than I anticipated.

The original article describing Cuckoo Hashing explains the basic scheme better than the Wikipedia article does, in my opinion. It uses, however, two hash tables with probes going first to one and then to the other. A later article by the inventor shows a simpler scheme with only one key space. This paper is straightforward, and I set out to write code under its specifications. Instead of a hash map I preferred writing a hash set; it is possible to build the former with the latter storing key-value pairs.

This requires parameterizing the hash table with not only a data type but with an equality predicate and a hash function over that type; in effect this makes the hash table not polymorphic but an existential type, witnessed by those functions:

Of index I will say more later, but count is the number of elements stored in table, where the presence or absence of an element at a given position is given by the use of an option. Thanks to one of the (few) strange corners of OCaml's type system I can make both parameters optional by relying on the polymorphic equality and hash functions built-in to the standard library:

Note that create and not make is the builder function name in the standard module Hashtbl. The easy part of implementing a hash table are the administrative functions dealing with the specifics of an imperative data type. Clearing a table is very simple:

Again, the name copies the one given to the analogous function in Hashtbl. What about that index lurking in the table? To explain it, I must take a detour.

Hashing is hard

The second paper by Pagh I linked to above assumes that:

We have access to hash functions h1 and h2 such that any function value hi(x) is equal to a particular value in {1, …, r} with probability 1/r, and the function values are independent of each other […] Note that this is only possible if hash functions are somehow chosen in a random fashion. However, in this lecture note we will not describe how to do this.

And herein lies the first rub. How to choose two functions at random? The first paper speaks of a universal family of hash functions, which is no more and no less than a set of functions indexed by some mechanism with the property that the results of each member in the family is independent of each other. I couldn't find much data about how to build such a family, but I wasn't too diligent about it. What I did find was that cryptographic hashes are a good approximation to a universal family when keyed by the index.

A suitable index is readily available as the i for each of the two probes into the Cuckoo table. Independent of this, I had read about a new cryptographic hash family by Schneier et al., the Skein hash. What struck me about it is that it uses a really simple Feistel network to build the mixing rounds:

Now each of the two hashes would be keyed by a "random" number and fed to the shuffle:

let keys = [| 599290962; 771645345 |]

(If you look up those numbers, you'll see they're not that random, and they have something to do with the largest prime 230 - 35 that fits in an OCaml 31-bit machine integer). With that, the indexed hash looks like this:

And again, what's with the index? Well, there's another rub in Pagh's papers, of which I will say more later on; suffice to say that it serves as a way to build a double hash (indeed, multiple) in the case that insertion fails. This results in a multi-parametric hash family h(k, j, i), where k is the key, j is the index or hashing "round", and i selects one of both Cuckoo hashes.

I tested these hash functions for collisions with a smallish dictionary of words and found it satisfactory. I haven't found a simple, certified implementation of a hash family, and the few references that I've looked up were vague and inconclusive, or relied heavily on strong cryptographic primitives. This is the first failure of Pagh's "algorithm", in that it is under-specified. This hash might fail to fulfill the paper's preconditions as I quoted above, and so this code is not fit for productive usage!

The good: deletions

The surprise came when I had to implement deletion from the table. When I wrote on reddit, I was under the impression that it would need lazy deletion to skip over deleted entries. This turned out to be false, as I found by implementing the table with a variant type denoting both empty and missing slots, and seeing that both paths were identical. Further thought convinced me that, indeed, insert below does not inspect slots, and so there is no need to mark them as deleted as opposed to simply empty. So, the following code, even if not shown in the paper, posed no problem:

This tries twice to find the element v, first by hashing it with hash 0, then with hash 1. By the postcondition of insertion, it is guaranteed that, if not found in two tries the key is simply not stored in the table. The code looks needlessly imperative, with an exception replacing a short-circuiting or, but it is symmetrical to that for deletion above, and what you would write with imperative loops and breaks. Now for the last hurdle.

The bad: insertions

Pagh shows how to perform the basic shuffle of elements around the table so that the two-probe look-up succeeds. What he doesn't show is how to grow the table, or what to do if insertion finds a loop in the random graph induced by the hashes. He glosses over this with:

This is continued until the procedure finds a vacant position or has taken too long. In the latter case, new hash functions are chosen, and the whole data structure is rebuilt ("rehashed") … In this case the insertion procedure will loop n times before it gives up and the data structure is rebuilt ("rehashed") with new, hopefully better, hash functions.

(emphasis mine). This is where I started looking for implementations, to see how they resolved this oversight. I found none, and so here's my attempt. If the hash function above is good (random) enough, this procedure terminates; as I can't guarantee that the hash I've given above is universal, this function can very well fail, although with an out-of-memory exception and not an infinite loop. Caveat emptor!

First, I'll bound the number of attempts at rehashing with a new set of hash functions before giving up and growing the table:

let add h v =
let attempts = ref 0 in

I use an auxiliary function go; insert is a wrapper for it that gets it started with an initial hash value and a number of attempts equal to the size of the table:

let rec insert v =
go (Array.length h.table) (hash 0 h v) v

This function go takes the number n of attempts left, the current index pos and the value v to insert. If there aren't any more attempts left, there's no other choice than to rehash and try again to insert the value that failed. Otherwise, the current slot's content gets exchanged with the value to be inserted, and the old value, if any, is inserted recursively with the first hash that changes its position:

This is essentially the inner loop of the algorithm insert given in the paper. It is now necessary to flesh out the mysterious rehash.

The first paper analyzes the algorithm in depth, and it shows that, given that the hashes are universal and independent, a fill factor of less than 0.5 implies that the probability of a cycle is less than the number of entries in the table, and so by the pigeonhole it cannot happen. To account for the imperfect universality of the hashes, I try at most 5 attempts at rehashing, with successively higher values of index. I've found empirically that odd values of the multiplier work best. If the maximum number of attempts is reached, or the table is over-full, I simply double its size. Then, I iterate over the old table, successively inserting the found values in the new.

It is here that a pathologically bad hash family could lead to problems: insert calls rehash that calls insert that could call rehash again that… Here the procedure would double the table again and again, until it exceeds the limits of OCaml arrays. With that, the insertion is carried out if the entry is not already present in the table:

in if not (mem h v) then begin
insert v;
h.count <- h.count + 1
end

As you can see, the correctness of the algorithm relies on the statistical properties of the hash family, which in my view is a flimsier foundation than linear chaining. This makes the latter attractive in the presence of hash functions of arbitrary, and not always good, quality: at worst you have an O(n) look-up, instead of non-termination.

Trying it out

This said, the code works surprisingly well. To test it, I've found a list of the 500 most frequent English words:

Are wordles a novel visualization technique, or the Applet-equivalent of fridge magnet poetry? I'm not decided; meanwhile, I've compiled one of all my posts in the last year or so. Pure vanity, perhaps, but with pretty results:

About Me

Long-time software developer and architect by day, specializing in boring-but-crucial bits of interconnect between disparate architectures. OCaml is my language of choice, but I feel extremely comfortable with C, Java and SQL whenever the need arises.

By night, I'm a calligrapher and a seeker of the hidden light of understanding in the noise of daily living.