The other day I was thinking about how to implement a general-purpose Bloom filter for .NET. Bloom filters are based on hashing. All .NET objects support the ability to get a hash code, but the problem is that Bloom filters require multiple hash functions, and it would be unfriendly to require users to create their own set of type-specific hash functions just to use the Bloom filter. I would of course create optimized hash functions for all the typical key types (numbers, strings, dates, GUIDs, arrays of value types, etc.), but there would have to be a fallback method capable of working with any type.

One possibility is to take the standard .NET hash code which every object has and hash that. That's a poor idea in general because if two objects give the same .NET hash code, then they will collide under every hash function. (Of course a given object would get different values from different hash functions, but for any one function both objects would get the same value.) The hash functions would be way too dependent on each other. However, it suffices for Bloom filters of small-to-medium size because, assuming the built-in .NET hash functions are implemented well, the chance of two particular distinct values returning the same 32-bit hash code is very small (about one in four billion, although the birthday paradox makes a collision somewhere in a large set much more likely).

In any case, I still had the problem of creating the hash functions. The best hash functions are injective, meaning that they never map distinct input values onto the same output value. This is impossible whenever the input domain is larger than the output domain, but in my case I would be mapping integers onto integers. Since the input and output domains are the same size, an injective function would necessarily be bijective too. It struck me that what I really needed was a permutation. If I had N random permutations of the integers, then hashing the value k with the nth hash function would simply be a matter of taking the kth element of the nth random permutation.

This left me with the problem of coming up with a way to efficiently select, say, the billionth integer from the billionth random permutation of the integers. I thought about ways of using prime and coprime numbers to generate the permutations. (Given a prime p, a coprime c, and a starting value s, (s+c*n) mod p produces the nth value in a permutation of the numbers 0 through p-1.) Those permutations would be rather predictable, but that might not be a problem. I'd only need one prime number, which I could find beforehand (it would be the first prime greater than 2^32), but finding the coprimes seemed like it would be nontrivial. There's also the fact that I'd need to iterate if the output wasn't less than 2^32. It just seemed ugly.
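To make the prime-based idea concrete, here is a minimal sketch of it, not code I ever shipped. The names AffinePermutation, Step, Permute, and Permute32 are made up for illustration, and the constant 4294967311 (2^32 + 15) is assumed to be the first prime greater than 2^32. BigInteger is used only to keep c*n from overflowing; a real implementation would want something faster.

```csharp
using System;
using System.Numerics;

static class AffinePermutation
{
    // Assumption: 4294967311 (2^32 + 15) is the first prime greater than 2^32.
    public const ulong PrimeAbove32Bits = 4294967311UL;

    // One step of the permutation of 0..p-1: (s + c*n) mod p.
    // c must not be a multiple of p, or every input maps to the same value.
    public static ulong Step(ulong p, ulong c, ulong s, ulong n)
    {
        if (c % p == 0) throw new ArgumentException("c must not be a multiple of p", nameof(c));
        return (ulong)((s + (BigInteger)c * n) % p);
    }

    // Restricts the permutation of 0..p-1 to 0..limit-1 (limit <= p) by
    // re-applying Step until the result falls back inside the smaller range.
    public static ulong Permute(ulong p, ulong limit, ulong c, ulong s, ulong n)
    {
        if (n >= limit) throw new ArgumentOutOfRangeException(nameof(n));
        ulong x = n;
        do { x = Step(p, c, s, x); } while (x >= limit);
        return x;
    }

    // The 32-bit case: a permutation of all uint values.
    public static uint Permute32(ulong c, ulong s, uint n)
    {
        return (uint)Permute(PrimeAbove32Bits, 1UL << 32, c, s, n);
    }
}
```

For a small case, Step with p = 11, c = 3, s = 5 maps n = 0 through 10 onto 5, 8, 0, 3, 6, 9, 1, 4, 7, 10, 2. The loop in Permute is the iteration mentioned above: because Step is a permutation of the larger range, repeatedly applying it must eventually land back in the smaller one.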

Luckily, I stumbled upon the idea of using a block cipher. In fact, block ciphers meet the requirements almost perfectly. The whole purpose of a block cipher is to transform one N-bit value into another N-bit value, based on a K-bit key. It does so uniquely, with every possible N-bit input mapping to only one N-bit output, which is necessary for the cipher to be reversed. In effect, the key is used to select a particular pseudorandom permutation of the N-bit values, and the input value serves as an index into that permutation.
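To illustrate why a keyed cipher always yields a permutation, here is a toy 32-bit Feistel network, which is one standard way of building block ciphers. This is not the cipher I ended up writing; ToyFeistel, its round function, and its constants are invented for this sketch and have no cryptographic merit. The point is only that the structure is reversible no matter what the round function does, so each key selects a bijection on the 32-bit integers.

```csharp
using System;

static class ToyFeistel
{
    const int Rounds = 4;

    // Arbitrary mixing of a 16-bit half with the key and round number.
    // It does not need to be invertible; the Feistel structure handles that.
    static ushort Round(uint key, int round, ushort half)
    {
        unchecked
        {
            uint x = half + key * 0x9E3779B9u + (uint)round * 0x85EBCA6Bu;
            x ^= x >> 15;
            x *= 0x2C1B3C6Du;
            x ^= x >> 12;
            return (ushort)x;
        }
    }

    // Maps a 32-bit value to another 32-bit value; the key picks the permutation.
    public static uint Encrypt(uint key, uint value)
    {
        ushort left = (ushort)(value >> 16), right = unchecked((ushort)value);
        for (int r = 0; r < Rounds; r++)
        {
            ushort next = (ushort)(left ^ Round(key, r, right));
            left = right;
            right = next;
        }
        return ((uint)left << 16) | right;
    }

    // Runs the rounds backwards, recovering the original input.
    public static uint Decrypt(uint key, uint value)
    {
        ushort left = (ushort)(value >> 16), right = unchecked((ushort)value);
        for (int r = Rounds - 1; r >= 0; r--)
        {
            ushort previous = (ushort)(right ^ Round(key, r, left));
            right = left;
            left = previous;
        }
        return ((uint)left << 16) | right;
    }
}
```

Since Decrypt(key, Encrypt(key, x)) gives back x for every x, no two inputs can share an output under the same key, which is exactly the "index into a permutation" behavior described above.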

The only problem was that there didn't exist any block ciphers that fit my purposes. They are typically designed to be complex, cryptographically strong transformations. The output size is typically 64 or 128 bits, and the key size is typically much larger than that. (The reason should be clear: the number of potential permutations is 2^N factorial, while the number of keys is only 2^K. So the key size is much larger than the output size to allow more of the possible permutations to be accessible to the cipher.) So I set out to create my own block cipher. Or rather, I set about to strip down and optimize an existing block cipher to take an integer key and produce an integer output. I wanted a cipher that was simple to implement. The Tiny Encryption Algorithm (TEA) looked promising, but I was dissuaded by the fact that it suffers from equivalent keys. Instead, I created a stripped-down version of Skipjack, the NSA algorithm intended to be used in the Clipper chip, chosen because it was also simple, and because I assumed it would be more random. But I was enchanted with the simplicity of TEA and, bolstered by my experience converting Skipjack, I also converted XXTEA (a somewhat stronger but more complex form of TEA).
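To put a rough number on the gap between permutations and keys, the sketch below estimates how many key bits it would take to reach every permutation of the N-bit values, that is, log2 of 2^N factorial. It uses Stirling's approximation, and the name PermutationCount is just for illustration.

```csharp
using System;

static class PermutationCount
{
    // Approximates log2((2^n)!) with Stirling's formula:
    // log2(m!) ~ m*log2(m) - m*log2(e) + 0.5*log2(2*pi*m), where m = 2^n.
    public static double BitsToIndexAllPermutations(int n)
    {
        double m = Math.Pow(2, n);
        double log2e = 1.0 / Math.Log(2.0);
        return m * n - m * log2e + 0.5 * Math.Log(2 * Math.PI * m, 2.0);
    }
}
```

For 32-bit values the estimate comes out at roughly 1.3 x 10^11 bits, so a 32-bit key (2^32 choices) can only ever select a vanishingly small fraction of the possible permutations. That's fine for hashing, where a few well-mixed permutations are all that's needed.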

It works pretty well! I tested several versions of the ciphers by enciphering a bunch of zeros with an all-zero key in CBC mode and running the results through NIST's statistical test suite. The Skipjack-based cipher passed with consistently high scores, while the XXTEA-based cipher had high scores except for a few failures. Neither is suitable for encryption of course, but both manage to generate strongly random-looking output. More importantly, they both work well in the Bloom filter. The generic hashing approach works up to about 10 million items, when the minimum false positive rate starts to become unacceptably high (i.e. greater than 0.1%). To solve this, I implemented custom hash providers for all the common built-in types, but they use block ciphers too. So in short, if you need a way to represent permutations of a large space without having to pre-generate them, consider using block ciphers!
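For anyone wanting to reproduce that kind of test, here is a minimal sketch of generating the stream: in CBC mode each plaintext block is XORed with the previous ciphertext block before encryption, so all-zero plaintext just feeds each output back in as the next input. The CbcZeroStream name is hypothetical, the cipher is passed in as a delegate taking (key, data), and the initial value is an assumption since any IV could be used.

```csharp
using System;
using System.Collections.Generic;

static class CbcZeroStream
{
    // Yields the ciphertext blocks produced by encrypting all-zero
    // plaintext in CBC mode with a 32-bit block cipher.
    public static IEnumerable<uint> Generate(
        Func<uint, uint, uint> cipher, uint key, uint iv, int count)
    {
        uint previous = iv;
        for (int i = 0; i < count; i++)
        {
            uint plaintext = 0;
            // CBC: XOR the plaintext with the previous ciphertext, then encrypt.
            previous = cipher(key, plaintext ^ previous);
            yield return previous;
        }
    }
}
```

The resulting values would then be written out as bytes and handed to the test suite. A weak cipher tends to show up quickly here, because any short cycle in the permutation makes the stream repeat.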

Block ciphers aren't the only way to do this, though. Some techniques for creating random number generators can be used, especially those based on XOR and shuffling bits, as those don't lose information. You'd use them to create a random number generator with no internal state. In fact that's what a block cipher is.
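As a small sketch of that idea, the function below chains three kinds of steps that lose no information on 32-bit values: adding a key, XORing a value with its own upper half, and multiplying by an odd constant. StatelessMixer and its constant are made up for this example, and the mixing quality is untested; the inverse is included only to show that every step can be undone.

```csharp
using System;

static class StatelessMixer
{
    // Any odd multiplier is invertible modulo 2^32.
    const uint Odd = 0x9E3779B1u;

    public static uint Mix(uint key, uint x)
    {
        unchecked
        {
            x += key;      // undone by subtracting the key
            x ^= x >> 16;  // its own inverse, since the shift is half the width
            x *= Odd;      // undone by multiplying by the modular inverse
            x ^= x >> 16;
            return x;
        }
    }

    public static uint Unmix(uint key, uint y)
    {
        unchecked
        {
            y ^= y >> 16;
            y *= Inverse(Odd);
            y ^= y >> 16;
            y -= key;
            return y;
        }
    }

    // Modular inverse of an odd number mod 2^32 by Newton's iteration.
    static uint Inverse(uint a)
    {
        unchecked
        {
            uint inv = a; // a*a = 1 mod 8 for odd a, so 3 bits are already right
            for (int i = 0; i < 4; i++) inv *= 2u - a * inv; // doubles the correct bits
            return inv;
        }
    }
}
```

Mix(key, k) then plays the same role as the block cipher: the key chooses the permutation and k indexes into it, with no state carried between calls.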

Update: For my Bloom filter, although the block cipher approach described above worked pretty well, I eventually switched to some hashing code based on Bob Jenkins' lookup3 hash, which I optimized for 4, 8, and 16 byte values in addition to the general byte string case. It doesn't work for permuting integers or generating bijective functions — the subject of this article — because it has collisions in the output, and the output of the hash is slightly less uniform than the block cipher I was using, but it's faster and, honestly, more trustworthy. The block cipher I developed worked great for a given hash function, but there were some dependencies between hash functions sometimes. That's no slight against the idea of using block ciphers in general, though, and for the problem of generating bijective functions and permutations of integers, I still think they're a perfect fit. Even for hashing, they can work well, if you know how to develop a good one, but I'm not an expert in developing cryptographic algorithms.

Here is the (heavily inlined and nigh-unreadable) code for my own block cipher. I'm sure the round function is poorly designed, but it seems to suffice for permuting integers.

Finally, here is an example of how the block ciphers can be used, first in a generic hash and then in a type-specific hash. Obviously the generic approach suffers from the problem, described above, that the hash functions are not independent, but whatcha gonna do?

static int GetHashCode(int hashFunction, T item)
{
    int hash = EqualityComparer<T>.Default.GetHashCode(item);
    // hash function 0 will be the built-in .NET hash function. other hash functions will
    // be constructed using a weak 32-bit block cipher with the hash function as the key
    // and the .NET hash as the data to be encrypted. this effectively uses the .NET hash
    // value as an index into a random permutation of the 32-bit integers
    if(hashFunction != 0) hash = (int)syfer((uint)hashFunction, (uint)hash);
    return hash;
}

static int GetHashCode(int hashFunction, ulong item)
{
    // you could also special case hash function 0 here. use two different hash functions
    // to avoid the result always being zero when 'item' is zero. it would be more efficient
    // to develop fast block ciphers for 64-bit values rather than combining two 32-bit
    // values, but that's an exercise for the reader ;-)
    return (int)(syfer((uint)hashFunction, (uint)item) ^
                 syfer((uint)hashFunction+1, (uint)(item>>32)));
}
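For context, here is a minimal sketch of how a Bloom filter might consume a family of numbered hash functions like these. TinyBloomFilter is a hypothetical class written for this example, not my actual implementation, and the hash family is passed in as a delegate of the form (hashFunction, item) so the sketch stays self-contained.

```csharp
using System;
using System.Collections;

sealed class TinyBloomFilter<T>
{
    readonly BitArray bits;
    readonly int hashCount;
    readonly Func<int, T, int> hash; // (hashFunction, item) -> hash code

    public TinyBloomFilter(int bitCount, int hashCount, Func<int, T, int> hash)
    {
        bits = new BitArray(bitCount);
        this.hashCount = hashCount;
        this.hash = hash;
    }

    // Maps the hash code from the given hash function onto a bit position.
    int Index(int hashFunction, T item)
    {
        return (int)(unchecked((uint)hash(hashFunction, item)) % (uint)bits.Length);
    }

    // Sets one bit per hash function.
    public void Add(T item)
    {
        for (int i = 0; i < hashCount; i++) bits[Index(i, item)] = true;
    }

    // False means definitely absent; true means probably present.
    public bool MightContain(T item)
    {
        for (int i = 0; i < hashCount; i++)
        {
            if (!bits[Index(i, item)]) return false;
        }
        return true;
    }
}
```

A method with the shape of GetHashCode(int hashFunction, ulong item) above could be supplied as the delegate for a TinyBloomFilter<ulong>. The filter never gives a false negative; the false positive rate depends on the bit count, the number of hash functions, and how independent those functions really are.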

Sure, I don't mind posting some test vectors. I think if you copy and paste the code you'll get the same output that I do. In fact, I'm going to copy and paste the code in order to create the test vectors. Nonetheless, here they are. Each line gives the name of the algorithm, the key, and the first ten values from the permutation selected by that key.