In this blog, you can read the 'aha' moments in my rendezvous with computers, computing, and electronics, HOWTOs and whys that you could not get an answer from your textbook and then once in a while, a strange mixture of global economics.

Thursday, May 05, 2011

Bloom Filter is an extremely space-efficient probabilistic data structure. You can read more about it in its wikipedia page: http://en.wikipedia.org/wiki/Bloom_filter. Bloom filter requires a very little storage space for a single key. An enormous amount of data can be stored in a small Bloom filter, if we can afford the risk of a very small percentage of false positive. Venti network storage system introduced in Plan 9 distributed operating system uses Bloom filter for storage.

A few days ago, Raymond Hettinger wrote a simple Bloom filter in ActiveState recipe. This reminded of the fact that I wrote a Bloom filter long time ago with a different hash function. So I forked Hettinger's code and extended the Bloom filter to allow union and intersection operations between two Bloom filters. Such operations are very easy, and allowed if both the Bloom filters use the same hash function. It may be used in distributed systems, where two Bloom filters are independently constructed and merged eventually. I also externalized the hash function to be given as the input to the Bloom filter class. So the user can test it with different hash function, if they have a yielding function to generate hash values.

You can find the Python source code for the Bloom filter in my github or ActiveState recipe.

Bloom filter is a very good data structure for space-efficient storage, whereas it is not suited if you insist on having no false positive in searches. Also if you want to do other operations like sorting the dataset or deleting elements from the dataset, Bloom filter is not suited.

However with some slight modification, sacrificing some space, we can achieve a few more things. For instance, if you want to count the number of times a given key is searched, you can allocate more than a bit for every hash value and increment the positional value as a counter for every search. Least positional value (minus 1) for a given string tells how many times that string has been searched. But this operation would increase the storage requirement. My implementation does not include any such variation.