Friday, February 27, 2009

Hash Table with Preemptive Bloom Filter

There's a useful hash table/Bloom filter trick I've been seeing across a number of papers. (If you don't know what a Bloom filter is, then I haven't been doing my job...) For the record, I'm not claiming this trick is mine; maybe it originated in the paper Longest Prefix Matching Using Bloom Filters, where it plays a key role. But I also can't recall a paper that clearly highlights the idea, so I'm doing it here.

Suppose you'll be doing lookups of (large) keys in a (slow) hash table, and many of your lookups will be for items that actually are not in the table. For example, you may be storing URLs on blacklist or dangerous packet signatures or prefixes of addresses or whatever else. Then it makes sense to keep a Bloom filter for the items in your hash table in (fast) memory. Before you do a lookup into the table, you check the Bloom filter. A false positive in the Bloom filter means you do a lookup when you didn't have to, but this allows you to avoid a lookup for a key that's not in your hash table the large majority of the time. The approach works for any type of hash table.

This should be considered a standard approach in hash table design, so I suggest we come up with a catchy name for it, so it doesn't have to be continually repeated in future papers. I'll suggest it be called a "hash table with preemptive Bloom filter", but feel free to suggest a cooler sounding name for it in the comments.

This example highlights the need for coming up with names for nice ideas that generalize to many situations. I've seen this idea described repeatedly in multiple papers precisely because nobody has thought to give it a good name.

Also, a fairly similar technique is used in database hash join implementations when the tables are much larger than disk. You can filter out tuples that won't join and avoid writing them back to disk partitions. Implemented in multiple commercial databases.

Anon 2: I agree in spirit that really this is just "using a Bloom filter", but it's helpful to give it a name so people will stop feeling the need to describe it. Although perhaps just a standard sentence, "We use a Bloom filter on the keys of the hash table to preempt most unsuccessful searches" should suffice.

Anon 6: Blooming is unsatisfactory. Bloom is the man's name. We don't call it a Karping reduction, even though that sounds kind of funny too.

Naming things like this puts up barriers to entry, thus hurting outsiders. It does make things easier for the crowd that knows the language, but the cost is too high. When we teach, we should just say "This is a Bloom Filter; it is named after a person and has nothing to do with flowers; here are some uses."