11/19/2010

11-19-10 - Hashes and Cache Tables

At first this test looks pretty cool, but on further examination it has a lot of flaws. For one thing, they're basically testing only english words.
If what you care about is short english text, then maybe the results are good, but if you care about other things they are not. I
do like the fact that they are testing the total time to compute the hash + the time to lookup. However, when you do that then the
details of the hash table become crucial, and it seems they are using an externally-chained hash table which is a severe bias (I'd
like to see tests using stlport std::hash_map, rdestl, khash, google's "dense hash", or the cblib hash_table, etc. - some standard reference hash
implementations). You also need to talk about your hash fill ratio, whether you store the hash value, whether the data is interned, etc.
These will affect which hash function is "best".

There are also major architectural issues. For example, the large multiplies of Murmur are not fast on all current platforms.
Furthermore, on some platforms you can't do unaligned DWORD reads fast, and you may have to do endian swaps. These issues do not
affect all hashes in the same way, so to compare hashes without considering these issues is a mistake.

Also, to do a fair comparison you really need to understand the details of each hash. Some are designed for use only in prime-size
tables. Some should be collapsed by xor folding, others should not. You cannot simply plug a bunch of different hash functions into
the same code and have a fair comparison.

But perhaps most crucially, the hashes are not optimized to the same amount. Some are unrolled, others are not. Some are reading
dwords, some are reading bytes. This is not a fair comparison. Really the only way to do a perfect comparison is to optimize each
hash function as much as possible, because hashes have a different degree of capacity for optimization. For example, Adler and
Fletcher (which are checksums not hashes, BTW) can easily be made fully 16-byte parallel (see for example
asmcommunity Adler32 ; there are also some very good
Altivec versions floating around), most hashes cannot be directly parallelized the same way (they can only be "trivially" parallized
in the sense of running multiple hashes at once in parallel on interleaved input streams).

(ADDENDUM : actually the biggest problem with the test is probably that the hash functions are called through function pointer, not inlined;
this is a severe penalty for the cheaper hash functions; in particular with properly inlined hash functions I ran a similar test and found
FNV to just destroy Murmur, which is the opposite of what they find)

Anyway, enough being a grouch, let's get to what I've been thinking about.

One piece of news : there's a (beta) update to Murmur2 called (unsurprisingly) Murmur3 :

Murmur3 improves the basic mixing a bit to fix some flaws in Murmur2. It also works on larger data elements, which is a mixed blessing. That will make
throughput on large keys faster, but hurts performance on small keys. One of the nice things about Murmur2 was how simple it was.

My final answer for file hashing was Bob Jenkin's Lookup3, modified to be 4-way SIMD. Since Lookup3 inherently
works on 12 bytes at a time, the 4-way simd makes it work on 48 bytes at a time, and I then unroll that 8 times
so that I can work on 384 bytes at a time (which is crucially 3 whole 128 byte cache lines).

So anyway. When you're talking about hashes it's important to distinguish what you're using them for. There are lots
of variations, but I think there are three primary categories :

1. Hash for hash table lookup (with full resolution, either by in-table probing or external chains).

In this case,
collisions are only bad in that they cause you to take more CPU time to probe. You will never fail to find what you
are looking up due to collisions. There's a very subtle trade off here between hash quality and the time it takes
to compute the hash - the winner will depend intricately on the exact data you are hashing and what kind of hash table
implementation you are using. I'm sure this is one of those scenarios where different authors will all claim a different
hash function is "best" because they were testing in different scenarios.

2. File checksum/integrity/corruption/hacking testing.

In this case you want the hash function to change any time the file
has been changed. There are then two subsets of this : catching accidental corruption and catching intentional corruption.
To check the ability to catch accidental corruption, you want to make sure your hash is robust to common types of damage,
such as file truncation, injection of zeros, single bit flips, etc. Catching intentional corruption protects you from attacks.
We're not talking about cryptographic security, but a good hash should make it very difficult to construct alternative data
which is both *valid* and has the same hash.

3. Hashes for "cache tables". This is the main thing I want to talk about.

A "cache table" is a hash table in which collisions are not resolved (or they are resolved only some finite number of times).
So far as I know there's not a standard term for this, but we found this Google Code project :

And I think "cache table" is a pretty good name for it. It's similar to CPU caches, but without a backing store. That is,
a traditional "cache" is basically a "cache table", but then if the data is not found in the cache, you go and get it out of
the slow location. We say "cache table" to make it clear that in this case there is no slow location - if it's not in the cache
you don't find it.

Cache tables are very popular in data compression. To my knowledge they were first used popularly in LZRW. I then made heavy use
of them for LZP (and PPMCB, which is very similar to the way modern compressors use them). They are now the basis of most modern compressors such as PAQ and related context mixers.
Cache tables in traditional data compression usage never resize, you simply set your memory usage and that's how big the cache table is.

Usage of a cache table is something like this :

h = hash(key)
look up table[h]
(optionally) check if table[h] is right for key
if so return its data
optionally :
reprobe, either by stepping h and/or by searching "ways"
# of reprobe steps is finite
if failed to find, return not in table

There are a few optional places. A crucial one is whether you do exact resolution or simply accept collisions. (in the old language
of LZP, this is "context confirmation"). That is, for minimum memory use, table[h] might not store anything about key. That means when
hashes collide, you can't tell, and you might get back an entry which does not correspond to your key.

Sometimes this is just fine. For example, in LZRW style string matchers, if you get back the wrong data you will simply not get a string
match. In other cases you might want some amount of collision resolution, but only probabilistically. For example, PAQ stores only a few
bits of key in the table, so when hash collisions occur it can detect them most of the time, but accepts a certain amount of false positives.
This issue should always be considered as a trade off of memory usage - is it better to use the memory to resolve collisions more accurately,
or just to have a larger table so that fewer collision occur?

The other issue is how you reprobe, and that is closely related to your insertion strategy. When you try to insert and find a collision,
you have a variety of choices. You can try all the possible reprobes for that hash value and pick the one that is "best" to replace under
some criteria (perhaps such as the last used, or simply the slot that's empty). If you're using "ways" you might just do round-robin insertion
in the ways, which is a probabilistic replacement strategy.

Anyway, I think "cache tables" is an interesting topic which I haven't seen explored much in the literature. It's particularly interesting
in compression, because when you get a false collision (eg. you retreive data from the table and think it applies to you when it doesn't),
that doesn't ruin your compressor, it just makes you code that event with more bits than ideal. So you can measure the cost of collisions as
bit excess, and you ideally want to tune collisions to occur for less important data. Hash function collisions in cache tables affect both
speed and the quality of results - eg. more collisions means more dropped data - but you also usually care exactly *which* data is dropped.

The question of reprobes within a linear tables vs. ways (multiple independent linear tables) is interesting, and sometimes the optimal solution
is to use both. For example, my "medium speed" string matcher for the RAD LZ77 uses two independent hash values (in the same table) and 2 ways. That is, there are two
linear tables, and you look at two spots within each table.

I think that Cuckoo hashing for cache tables is a particularly interesting possibility. "Cuckoo cache tables" could do the normal Cuckoo thing
(where you have two hashes for each key and you swap data around to make space for an insertion),
but you set a maximum number of swaps on insertion (something like 4 or 8), and if you have to do the cuckoo swap step and hit the maximum number, you
just stop and let some item fall out of the cache. You can also easily extend this by keeping track of age or quality or something as you do the
Cuckoo, and choose to drop the worst item instead of simply dropping the last one in the chain. You don't have to worry about the pathological
Cuckoo case of having a cycle you can't resolve - you simply drop an item in that case (in fact you don't even have to detect that case, you'll
just go around the cycle up to your maximum finite number of steps).

11 comments:

So, I have some data on using cuckoo for hash/cache tables (I too like that name).

A few note on cuckoo. You obviously don't want to use the original straight-up cuckoo stuff. You want each item to hash to 2 buckets (not slots), and each bucket should be cache-aligned, and you want to do linear probing within those buckets. Use power-of-2 table sizes.

For rehashing, I had a primary hash which was just the low bits (& mask), and the secondary hash folded in the higher bits depending on how large the table was. You might be able to just get away with >> 16 though. The problem is that you can get self-collisions, where an item will rehash to the same bucket as the hash. I think the solution to that is to do mod table_size-1, and use this as an offset from the original bucket. That way, you never have a collision, and the mod does an excellent job of mixing in the higher bits.

Anyway, there are lots of tricks like this. LMK if you want to know more.

I think you're being a bit hard on the strchr benchmark, I mean some of the things you're talking about are difficult to factor out. Like what would be a word-sized version of FNV? Or isn't it an advantage that Fletcher/Adler is more easily parallelizable? So it isn't perfect (what benchmark is) but I don't think it is egregiously so. I agree that the biggest issue is probably the test corpus.

"I think you're being a bit hard on the strchr benchmark, I mean some of the things you're talking about are difficult to factor out."

Meh, I just think it's depressing that they've done a lot of promising work, and the result is hardly any useful information. I mean seriously, what can you actually learn from that page that's trustworthy information? If you look at how close all the results are to each other, they are within the margin of platform variation and optimization variation, therefore you can say absolutely nothing about which one is best.

" Like what would be a word-sized version of FNV?"

They do have one, it's called "Whiz" on that page. It is not a correct word-sized FNV, in fact it's much weaker, but they find it to be very good (on short english text). But that's partly because their "Whiz" and "Jester" have been highly tweaked while the others haven't.

"Or isn't it an advantage that Fletcher/Adler is more easily parallelizable?"

It *is* an advantage, but they aren't using it. To compare speed without taking advantage of that makes the speed comparison bogus. The only fair way to test speed is to test the best version of each algorithm.

It bothers me a lot when the only real conclusion you can draw from a test is "highly tweaked code snippet beats completely untweaked code snippet!" .

But they're also calling their hash functions through *function pointers* which is a pretty disastrous mistake. Combine that with the cache misses from chaining, and cache misses for the strcmp, and you can see why the speed results show such little difference.

(for example, in a real test Murmur2 should not be so competitive with FNV on short words - FNV is way way faster than Murmur2 for short word hash tables)

By the way, I do still have my how to structure a hash table research coming.

I was hoping to just use a decent hash function and some trivial input to not worry about the hash function side of it, but experiments showed some hash functions were exploiting my trivial input, so I'm going to end up using random input in an attempt to factor the hash function out of things (but obviously any real hash function might then work worse than what I measure, sigh).

I've looked at Murmur3 a bit. It's actually quite a bit slower than Murmur2 on short items, and even Murmur2 is un-competitively slow on short items. The throughput is very high for long items, particularly on x64. It might be a good "file hash" but it's not a good "hash table" hash.

BTW another thing missing from the strchr test is that when hashing strings, you generally want the strlen() call to be integrated with the hash function, which they don't do. Again, some hash functions are more amenable to that. In fact this tends to bias you back towards preferring byte-at-a-time hash functions rather than big-chunk unrolled ones. (you can use the nice trick for checking "any zero bytes in a dword" though)

Oh, the Whiz thing is new. I should look at it, maybe, although maybe it's just terrible. I didn't know there had been updates to the strchr thing. Anyway, just post something to that board and he'll probably make those updates.

The way my hash function works is that I have a short inlined loop that deals with the last (up to) 16 bytes of the string, but the rest is not inlined and calls the big unrolled job. Actually, it is a bit more complicated than that, but the point is that it is a good idea to partially inline a string hash function so that you have a simple, small bit (FNV-1a with a stronger final mix) and a big fast bit (Murmur, lookup3, something with SSE2).

Some more details on my bucketing. I think I wrote something a little misleading in my first post: items DO hash into a slot, but that slot is part of a cache-aligned bucket of slots. Linear probing occurs within the bucket, not necessarily from the first slot in the bucket, wrapping around until you return to the original slot. I'm not sure whether it matters to vary what direction you probe in (I don't think so).

So check if the first bucket has space, otherwise check if the second bucket has space. If both are full, this means that your hash table is likely in need of rebalancing. One thing to do is be proactive, and go through the elements of both buckets and try to rehash them. Move them if they achieve better balance in their new home. I never implemented the full recursive cuckoo. I figure, if you've reached this crisis, then your table is hopelessly full and you might as well rehash into a bigger table.

In a cache table, another option is to evict something. But you don't have that many options. Effectively, your replacement strategy is going to be among a few buckets worth of elements rather than the whole table.

Anyway, with this cuckoo-like implementation, I took various pow2 sized tables (1-16.7M), and tried to insert as many elements as there were slots. I plotted when the first "crisis" occurred, and how full the tables eventually got.

Up to 128 elements, there was never a crisis. After that, I lost ~1% every doubling. The tables could get incredibly full; they almost never dropped below 98.5%. And the real joy is that reading those tables is still damn fast.

Ah, yeah, hashing to arbitrary spot and then linear looping around the cache line is one of the cache-aware-hashing strategies we've discussed.

So to look up, you have to look in two different buckets and linear probe around both?

"In a cache table, another option is to evict something. But you don't have that many options. Effectively, your replacement strategy is going to be among a few buckets worth of elements rather than the whole table. "

This is a big improvement over most cache tables which don't get many choices at all about what to evict (and you probably can't afford to consider too many things).

Yeah, you run around both buckets, but that's as far as the probing ever goes. In other open addressing scheme (dense uses quadratic), you can get some nasty behavior. It is nice that you can bound things to 2x bucket cache misses for every read at maximum load (which is the steady state for a cache, of course).

OK, so cuckoo-ish cache tables sound sane, but how much work can you afford in cache maintenance? I imagine in something like LZP where the cache is in the inner loops, you can't afford to do something like cuckoo, and you just go with direct mapped. In something like a disk cache, you're so much faster than the backing medium that it makes sense to doing something like full LRU or ARC.

It seems like "cache tables" might be related to bloom filters: probabilistic results (e.g. some small chance of false positives) and fixed storage space. (Simple bloom filters are just sets, but bloomier filters, also described on that wikipedia page, are associative arrays.)