A Bloom filter is a probabilistic set data structure used to test whether an element is a member of a set. Its advantage over traditional set data structures is fast operation, but at the cost of sacrificing correctness: false positives are possible, so a query returns either “possibly in set” or “definitely not in set”. A Bloom filter uses k predetermined hash functions; for every item added, each function maps the item to a position in the underlying bit array, and that bit is set. In effect the filter keeps only a sample of each element, so precision is lost in the process. Naturally, only the add operation is supported, not remove, because clearing a bit could also erase the trace of other elements that share it. Not surprisingly, the more elements are added to the set, the higher the probability of false positives.
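To make the last point concrete, here is the standard textbook approximation of the false positive rate. It assumes the k hash functions are independent and uniformly distributed. The symbols m (number of bits in the array) and n (number of elements added so far) are introduced here for illustration and do not come from the implementation.

```latex
% Probability that one given bit is still 0 after n insertions with k hashes each:
%   (1 - 1/m)^{kn} \approx e^{-kn/m}
% A false positive needs all k probed bits to be 1:
p \approx \left(1 - e^{-kn/m}\right)^{k}
```

For fixed m and k the expression grows with n. As a rough example, m = 1000 bits, n = 100 elements and k = 7 give a rate of roughly 0.8%.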

Picking k hash functions that are both independent and uniformly distributed is nontrivial. The simplified Java implementation described in this post adopts a double hashing strategy instead; the code is available on my GitHub: Bloom Filter.
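As a rough illustration of double hashing, the sketch below derives k bit positions from just two base hashes using g_i(x) = h1(x) + i * h2(x) mod bitSize. This is not the code from the repository. The class name, the method names, and the choice of Arrays.hashCode and FNV-1a as the two base hashes are all assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, not taken from the original implementation.
final class DoubleHashingSketch {

    // Returns k positions in [0, bitSize) for the given item,
    // computed as (h1 + i * h2) mod bitSize for i = 0 .. k-1.
    static int[] indexes(String item, int k, int bitSize) {
        byte[] bytes = item.getBytes(StandardCharsets.UTF_8);
        int h1 = Arrays.hashCode(bytes); // first base hash (assumed choice)
        int h2 = fnv1a(bytes);           // second base hash (assumed choice)

        int[] result = new int[k];
        for (int i = 0; i < k; i++) {
            // Use long arithmetic to avoid int overflow, and floorMod so
            // negative hash values still map to a non-negative position.
            long combined = (long) h1 + (long) i * h2;
            result[i] = (int) Math.floorMod(combined, (long) bitSize);
        }
        return result;
    }

    // 32-bit FNV-1a over the raw bytes.
    static int fnv1a(byte[] data) {
        int hash = 0x811C9DC5;      // FNV offset basis
        for (byte b : data) {
            hash ^= (b & 0xFF);
            hash *= 0x01000193;     // FNV prime
        }
        return hash;
    }
}
```

The appeal of this approach is that only two hash computations are needed per item regardless of k. One caveat of the sketch: if h2 happens to be a multiple of bitSize, all k positions collapse to the same bit.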

Elements of Implementation

The actual data structure representing a Bloom filter is a bit array; in Java, that is a BitSet of length bitSize.
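A minimal sketch of how a BitSet can play the role of the bit array is shown below. It only covers setting and probing bits at already computed positions. The class and method names are hypothetical and are not taken from the repository.

```java
import java.util.BitSet;

// Hypothetical wrapper around the bit array of a Bloom filter.
final class BitArraySketch {
    private final BitSet bits;
    private final int bitSize;

    BitArraySketch(int bitSize) {
        this.bitSize = bitSize;
        // BitSet can grow on demand, so bitSize is kept separately
        // to bound the valid positions.
        this.bits = new BitSet(bitSize);
    }

    // "Add": set every bit the hash functions point to.
    void mark(int[] positions) {
        for (int p : positions) {
            checkRange(p);
            bits.set(p);
        }
    }

    // "Query": true means "possibly in set", false means "definitely not in set".
    boolean allSet(int[] positions) {
        for (int p : positions) {
            checkRange(p);
            if (!bits.get(p)) {
                return false;
            }
        }
        return true;
    }

    // Number of bits currently set to 1.
    int setBitCount() {
        return bits.cardinality();
    }

    private void checkRange(int p) {
        if (p < 0 || p >= bitSize) {
            throw new IndexOutOfBoundsException("position " + p + " outside [0, " + bitSize + ")");
        }
    }
}
```

For example, after mark(new int[] {3, 17, 42}), a probe of positions 3 and 17 reports "possibly in set", while a probe that includes an unset position such as 18 reports "definitely not in set".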