Cessu's blog

Monday, September 10, 2007

Hashing with SSE2

Almost all software engineers will occasionally need a hash function, either for hash table lookups, as a source of data-dependent pseudo-randomness, or for a checksum of some sort. The classic one I and many others use is the "one-at-a-time" hash.
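For reference, a sketch of Bob Jenkins' one-at-a-time hash as it is commonly published. The function name, parameter names and the use of `uint32_t` are illustrative choices here, not necessarily those of the original listing.

```c
#include <stddef.h>
#include <stdint.h>

/* Bob Jenkins' one-at-a-time hash: 32 bits of state, one byte per round. */
uint32_t one_at_a_time(const unsigned char *key, size_t len)
{
    uint32_t hash = 0;

    for (size_t i = 0; i < len; i++) {
        hash += key[i];        /* consume eight bits of input */
        hash += hash << 10;    /* spread them towards the high bits */
        hash ^= hash >> 6;     /* and fold them back down */
    }

    /* Final avalanche so that the last few bytes affect every bit. */
    hash += hash << 3;
    hash ^= hash >> 11;
    hash += hash << 15;
    return hash;
}
```

Calling `one_at_a_time((const unsigned char *)"a", 1)` yields `0xca2e9442`.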

It's reasonably fast and pseudo-random for most everyday use. And because it's so simple, it is trivially modified for special purposes, such as hashing nul-terminated strings in C.
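As an example of such a modification, here is a minimal sketch of a variant for nul-terminated C strings. The name `one_at_a_time_str` is made up for this illustration; the only change is that the loop stops at the terminating nul instead of counting bytes.

```c
#include <stdint.h>

/* One-at-a-time hash over a nul-terminated string (hypothetical helper). */
uint32_t one_at_a_time_str(const char *s)
{
    uint32_t hash = 0;

    while (*s != '\0') {
        /* Cast so that bytes above 0x7f are not sign-extended. */
        hash += (unsigned char)*s++;
        hash += hash << 10;
        hash ^= hash >> 6;
    }

    hash += hash << 3;
    hash ^= hash >> 11;
    hash += hash << 15;
    return hash;
}
```

It returns the same value as the length-based version called with the string's length, so the two can be mixed in one hash table.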

The one-at-a-time hash could nevertheless be faster, and several better hash functions have been suggested by Bob Jenkins, Fowler, Noll and Vo, and Paul Hsieh. The one-at-a-time hash uses a 32-bit internal state stored in the variable hash and consumes eight bits of input per iteration. Hsieh's hash speeds this up by consuming 32 bits per iteration while still significantly improving the hash function's statistical properties. Jenkins' hash increases both the internal state and the consumption to 96 bits.
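For comparison, the Fowler-Noll-Vo approach can be sketched in a few lines. This is the widely published 32-bit FNV-1a variant with its standard offset basis and prime; the function name is illustrative. Like one-at-a-time it keeps 32 bits of state and consumes eight bits per iteration, but it mixes with a single multiplication instead of shifts and adds.

```c
#include <stddef.h>
#include <stdint.h>

/* 32-bit FNV-1a: XOR in one input byte, then multiply by the FNV prime. */
uint32_t fnv1a_32(const unsigned char *key, size_t len)
{
    uint32_t hash = 2166136261u;   /* FNV offset basis, 0x811c9dc5 */

    for (size_t i = 0; i < len; i++) {
        hash ^= key[i];
        hash *= 16777619u;         /* FNV prime, 2^24 + 0x193 */
    }
    return hash;
}
```

The multiplication is what carries a change in a low bit up into the more significant bits; nothing carries it back down, which is why the lowest bits of FNV hashes are the weakest.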

But I occasionally want to use hash values as checksums. Thirty-two bits is insufficient for that even if I don't require cryptographic resilience against malicious attacks (use SHA-2 for those purposes): by the birthday bound, a collision among 32-bit values is more likely than not after only some 77,000 items. Thirty-two bits is also becoming insufficient when indexing a hash table on a large server. Jenkins does offer a 64-bit variant of his hash, but mixing a large internal state becomes slow with 32-bit or even 64-bit wide registers.

So I decided to sacrifice portability and use Intel's multimedia extension SSE2, with its good supply of big fat 128-bit-wide registers and funny instructions for SIMD operations. The multiply-add instruction, which takes eight 16-bit integer pairs to four 32-bit integer results, is particularly handy for performing Fowler-Noll-Vo-style multiplication to propagate one-bit changes to at least 16 more significant bits. A little more work is needed to propagate those changes evenly to the other fifteen 16-bit entities, but SSE2 does have 128-bit bitwise operations, 64-bit arithmetic, 32-bit word shuffling and funny byte interleaving primitives (exact and more approachable documentation). Some research, a test driver to evaluate the statistical properties of the hash, a genetic algorithm to optimize the hypermystical constants in the code, and a weekend of CPU-time went into getting the details right - but hey, any hacker's goal is to save a few CPU-cycles regardless of how much computing it takes!
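To make the multiply-add step concrete, here is a minimal sketch of what the instruction (PMADDWD, exposed in C as the `_mm_madd_epi16` intrinsic) computes. The wrapper name `madd16` and the scalar fallback are assumptions added for illustration; this is not a piece of the hash itself, and no tuned constants are implied.

```c
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* out[i] = a[2i] * b[2i] + a[2i+1] * b[2i+1] for i = 0..3:
   eight signed 16-bit pairs in, four 32-bit sums out. */
void madd16(const int16_t a[8], const int16_t b[8], int32_t out[4])
{
#if defined(__SSE2__)
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_madd_epi16(va, vb));
#else
    /* Scalar fallback with the same result; the sum is formed in
       unsigned arithmetic so that it wraps like the instruction does. */
    for (int i = 0; i < 4; i++) {
        uint32_t lo = (uint32_t)((int32_t)a[2 * i] * b[2 * i]);
        uint32_t hi = (uint32_t)((int32_t)a[2 * i + 1] * b[2 * i + 1]);
        out[i] = (int32_t)(lo + hi);
    }
#endif
}
```

With `a = {1, 2, 3, 4, 5, 6, 7, 8}` and `b` all ones, `out` becomes `{3, 7, 11, 15}`. Using one operand as state and the other as a vector of odd multipliers gives four FNV-like multiplications per instruction, each widening a 16-bit lane into 32 bits.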

In order to save you from semi-trivial details I present only a simplified version of the hash, which assumes the number of bytes to hash is divisible by 32. I clocked my hash as 5-10% slower than Hsieh's, but I'm quite happy with its statistical properties: a one-bit flip in the input flips any bit in the 256-bit internal state by the end of each iteration with a probability within 0.03 of 50% in a test of 10^8 one-bit flips. The corresponding probability range for Hsieh's hash is from 3.2% to 25%, or, if the final avalanching is included, within 0.31 of 50%. By sacrificing some of the quality of the hash, my function could be modified to be even faster than Hsieh's hash.
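The kind of avalanche measurement quoted above can be sketched roughly as follows. This is not the test driver used for the numbers in this post: all names are hypothetical, the hash under test is the plain one-at-a-time hash as a stand-in, only the 32-bit output is observed rather than a 256-bit internal state, and the flipped input bit is chosen at random instead of being tracked per bit position.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t (*hash_fn)(const unsigned char *key, size_t len);

/* Stand-in hash under test: Bob Jenkins' one-at-a-time hash. */
uint32_t oaat(const unsigned char *key, size_t len)
{
    uint32_t h = 0;
    for (size_t i = 0; i < len; i++) {
        h += key[i];
        h += h << 10;
        h ^= h >> 6;
    }
    h += h << 3;
    h ^= h >> 11;
    h += h << 15;
    return h;
}

/* Small deterministic PRNG (xorshift32) so that runs are repeatable. */
static uint32_t xorshift32(uint32_t *s)
{
    uint32_t x = *s;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    return *s = x;
}

/* Per trial: hash a random key, flip one random input bit, hash again,
   and record which output bits changed. Returns the largest |p - 0.5|
   over the 32 output bits, where p is the observed flip frequency.
   len must be 1..64; invalid arguments return the worst value, 0.5. */
double worst_avalanche_bias(hash_fn h, size_t len, unsigned trials)
{
    unsigned char buf[64];
    unsigned long flips[32] = {0};
    uint32_t seed = 2463534242u;

    if (len == 0 || len > sizeof buf || trials == 0)
        return 0.5;

    for (unsigned t = 0; t < trials; t++) {
        for (size_t i = 0; i < len; i++)
            buf[i] = (unsigned char)(xorshift32(&seed) >> 24);
        uint32_t before = h(buf, len);

        uint32_t bit = xorshift32(&seed) % (uint32_t)(len * 8);
        buf[bit / 8] ^= (unsigned char)(1u << (bit % 8));
        uint32_t diff = before ^ h(buf, len);

        for (int b = 0; b < 32; b++)
            flips[b] += (diff >> b) & 1u;
    }

    double worst = 0.0;
    for (int b = 0; b < 32; b++) {
        double p = (double)flips[b] / trials;
        double dev = p < 0.5 ? 0.5 - p : p - 0.5;
        if (dev > worst)
            worst = dev;
    }
    return worst;
}
```

A call such as `worst_avalanche_bias(oaat, 16, 1000000)` returns a number between 0 and 0.5; the closer to 0, the closer every output bit is to flipping with probability 50%. A stricter driver would keep a separate counter for every pair of input bit and output bit.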

The code should work on all Pentia 4 and Athlon 64s and newer, but my hash will nevertheless become gradually obsolete (or at least subject to further improvement) because SSE4 (more precisely SSE4.2) will provide an instruction for computing CRC32, specifically the CRC-32C variant based on the Castagnoli polynomial.
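A minimal sketch of how that instruction could be used as a hash or checksum, assuming the `_mm_crc32_u8` intrinsic from `<nmmintrin.h>` when the compiler targets SSE4.2, and a slow bitwise fallback otherwise. The function name and the conventional all-ones initial value and final inversion are choices made for this example.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE4_2__)
#include <nmmintrin.h>
#endif

/* CRC-32C (Castagnoli), one byte at a time. */
uint32_t crc32c(const unsigned char *p, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;

    for (size_t i = 0; i < len; i++) {
#if defined(__SSE4_2__)
        crc = _mm_crc32_u8(crc, p[i]);
#else
        /* Bitwise fallback using the reflected polynomial 0x82F63B78. */
        crc ^= p[i];
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
#endif
    }
    return crc ^ 0xFFFFFFFFu;
}
```

The standard check value applies: `crc32c((const unsigned char *)"123456789", 9)` is `0xE3069283`. The instruction also comes in 16-, 32- and 64-bit operand widths, which is where the real speed-up would come from. Note that this is not the same polynomial as the CRC-32 used by zlib and Ethernet.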

About Me

A gradually middle-aging hacker, a software engineer currently working on my PhD in computer science at the Helsinki University of Technology, married, a happy father of two small daughters who are probably bound to outsmart me in everything except some arcane programming languages, a modest/low-carber, member of IKI and Skepsis, and someone who likes good discussions, elegant algorithms, terse software designs, puzzles with an element of surprise, cycling, cooking, and - given a relaxed schedule - building or renovating woodwork.