How I stopped worrying about determinism and started loving randomness.

These past two days in some free time, I decided to explore this nifty C library called libevent.

Following the theme from the previous post, the first question is: ‘Why do we need it?’. If you are familiar with network programming, or any multithreaded programming which involves blocking IO, you already know the problem at hand. Right from the hardware level to the software level, a common problem that happens is: IO is slower than the CPU. If we have several tasks to finish, and the current task being executed is waiting for a blocking IO to finish, we should ideally work on the other tasks and let that blocking IO finish in the background, and check on it later.

When we have several such operations happening in the background, we need a way to figure out when a particular operation (such as read, write, accept a connection), can be performed without blocking, so that we can quickly do that, and return to other things. select(2), poll(2), epoll(4), kqueue(2) (on *BSD systems), are one of the several ways to do it. In essence, you register a set of file descriptors that you are interested in, and then call one of these ‘backends’. They would usually block until either one of the fd-s that you are interested in, is ready for data to be read or written to it. If none of them is ready, it would block for a configured amount of time and then return.

The problem that libevent solves is, it provides an easy to use library for notifying when an event happens on the file descriptors which you consider interesting. It also hides the real backend (select, epoll, kqueue) being used, and this helps you avoid writing platform-dependent code (eg., kqueue works only on *BSD) and if there were a new and improved backend in the future, your code would not change. It is like the JVM for asynchronous event notification system.

I only have experience with select, so my context is limited. Using select is very tedious.

select.c

123456789101112131415161718192021

// Essentially a bitmaskfd_setreadfds;while(true){// Zero out the bitmaskFD_ZERO(&readfds);// Enable the fd that we are interested inFD_SET(sockfd,&readfds);intret=select(sockfd+1,&readfds,NULL,NULL,&timeout);if(ret<0){fprintf(stderr,"select() returned %d\n",ret);}// Is the bit corresponding to the fd enabled?// If yes, that fd is ready to be read from.if(FD_ISSET(sockfd,&readfds)){// Do what we want here}}

In essence, what we do here is to create a set of file descriptors (fd_set). We then run a loop where every time we set all the file descriptors we are interested in, into that set. Then we call select(), and it either times out, or one of the bits in that set would be set. We have to check for each of the file descriptors. This makes it a little ugly. Other backends might be more pleasant to use, but libevent is way ahead of select in terms of usability. Here is some sample code:

An event_base represents a single event loop like the one that we saw in the select example. To subscribe to changes in a file descriptor, we will first create an event. This can be done using event_new, which takes in the event base we created, the file descriptor, flags signalling when the event is active, the callback method and arguments to the callback method. In this particular example, we ask that the acceptConnCob method be called when the file descriptor is ready for reading (EV_READ) and persist this event, i.e, even when the event fires, automatically add it for the next time (EV_PERSIST is to be used here). Note that we had to add the file descriptors in every iteration of the while loop of the select example, so using the EV_PERSIST flag is a slight convenience here. Once, I have created the event, I need to add that event to the event_base it should be processed by, along with a timeout, using the event_add method. If the fd doesn’t become active by the specified time, the callback will be called anyways. Finally, to get things running, we will ‘dispatch’ the event base, which will spawn a new thread to run the event loop.

Note that nowhere have I specified which backend to use. I don’t need to. However, there are ways to prefer or avoid certain backends in libevent, using the event_config struct. Refer to the link in the end.

I can add multiple events, and there are a lot of options that you can use. For instance, I can create a timer by passing -1 as a file descriptor with the EV_TIMEOUT and EV_PERSIST flags, and the required timeout in event_add. This would call the timeout callback every ‘timeout’ seconds. Example:

I created a simple fortune cookie server (one of my favorite demo examples), where I have a set of messages, and if someone asks me for a fortune cookie, I will give them the current fortune cookie. Every few seconds, I will pick a new fortune cookie to be returned. This is implemented by having two events, one for accepting connections and the other for having a timer. The code is here.

One small thing to keep in mind is that if the callbacks themselves to do something heavy, then it defeats the purpose of using libevent. This is because the callbacks are called from the same thread as the actual event loop. The longer the callback runs, the longer the event loop is not able to process other events.

libevent allows you to do a lot of customizations. In the above example, I have added callbacks to override the logging mechanism, so that I can use glog (Google’s logging library). There are several other features such buffered events and a lot of utility methods (such as creating simple http servers), that you can find in the libevent book.

There are other similar async event notification systems such as libev, and libuv, etc. but I haven’t had the time to figure out the differences. I hope to cover the very interesting wrapper around libevent in folly, Facebook’s open source C++ library, in a future post.

This is my first post on this blog, which doesn’t discuss any specific problem / technology. As I have been spending time writing code, I feel that it is a good practice to sit back and introspect once in a while. This post has some thoughts, which I hope will make people sit up and notice if they are making the same mistakes. Because this is a long-ish post, some might not read it completely, or some might not be able to relate to the content right now. Regardless, I feel eventually a significant number of us will make the mistakes that I have made in the past, and learn from it first hand. Whatever the case may be, if this makes any sense to you, please share your thoughts and experiences with me :)

As Engineers I feel we are often excited to work on new and ambitious projects. I am specifically talking about non-trivial projects which break new ground, and/or have a reasonable change of not succeeding. The latter could be because it is often the case that, these projects are complicated enough and its hard to be exact with respect to the benefits. These projects might also touch certain areas of the system which are hazy in general.

‘Hazy’ doesn’t really imply that that particular area / part of the system, is naturally hard to understand. It could be just that we don’t know the problem well enough, and how it interacts with those ‘hazy’ areas. I cannot stress enough that it is critical to understand the problem really well before hand. It seems clichéd, and has been repeated so many times, that it will probably not make a good enough impact. So, I will repeat this again in detail, so it stays with you and me, a little longer.

Understand The Problem

As per Prof. Bender, when giving a presentation, making sure that people understand why we did what we did, is the most important thing. Extending this backwards, ever wondered if that problem really needs to be solved in the first place? A lot of times, as a new CS graduate, working on my first full-time unsupervised big tasks, I would really be in awe of the supposed problem. Looking with rose-tinted glasses, you feel that this is what you had told the recruiter and interviewers that you wanted to do in the job. Excellent, lets start working on it. And if you do this, and just jump into this directly, you are going to have a bad time.

Often I did not spend enough time understanding why exactly was I doing what I was doing. Do benchmarks show that this is really needed? Do I have a good enough prototype which shows that if I do what I am going to do, it will give us significant benefits? Do people need this? Has this problem been solved before? What is the minimum I can do to solve this reasonably, and move on to other bigger problems?

This proactive research is what I feel is the difference between new and experienced engineers. In fact, I think, in some cases senior engineers write LESS code than the less experienced ones and still get more things done. Its now clear to me, that the actual coding should only take 10% of the time allocated to the project. If I spend enough time doing my due-diligence and am ‘lazy’, I can simply prune some potential duds much before they turn into huge time sinks. If I spend some more time on the problems which actually require my time, I can figure out things I can do to reduce the scope of the problem, or cleverly use pre-built solutions to do part/most of the work. All this can only come if we (and I will repeat again) Understand. The. Problem.

(Please let me know if you agree or disagree with me about what I said. I would love to hear back).

This week we planned to discuss Dynamic Programming. The idea was to discuss about 4-5 problems, however, the very first problem kept us busy for an entire hour. The problem is a well known one: ‘Given an infinite supply of a set of coin denominations, $C = {c_1, c_2, c_3, …}$, in how many ways can you make an amount N from those coins?’

The first question to be asked is, whether we allow permutations? That is, if, $c_1 + c_2 = N$, is one way, then do we count $c_2 + c_1 = N$, as another way? It makes sense to not allow permutations, and count them all as one. For example, if $N$ = 5, and $C$ = {1, 2, 5}, you can make 5 in the following ways: {1, 1, 1, 1, 1}, {1, 1, 1, 2}, {1, 2, 2}, {5}.

We came up with a simple bottom-up DP. I have written a top-down DP here, since it will align better with the next iteration. The idea was, $f(N) = \sum f(N-c[i])$ for all valid $c[i]$,
i.e., the number of ways you can construct $f(5) = f(5-1) + f(5-2) + f(5-5) \implies f(4) + f(3) + f(0)$. $f(0)$ is 1, because you can make $0$ in only one way, by not using any coins (there was a debate as to why $f(0)$ is not 0). With memoisation, this algorithm is $O(NC)$, with $O(N)$ space.

dp1.cpp

1234567891011121314151617181920212223

// table stores the results. All values are init to -1.inttable[MAXN];// cval is the value of coins available.intcval[MAXC];// N is the amount, C is the number of coins.intN,C;intsolve(intn){int&res=table[n];if(res!=-1){returnres;}elseif(n==0){returnres=1;}res=0;for(inti=0;i<C;i++){if(n-cval[i]>=0){res+=solve(n-cval[i]);}}returnres;}

This looks intuitive and correct, but unfortunately it is wrong. Hat tip to Vishwas for pointing out that the answers were wrong, or we would have moved to another problem. See if you can spot the problem before reading ahead.

The problem in the code is, we will count permutations multiple times, for example, for $n = 3$, the result is 3 ({1, 1, 1}, {1, 2} and {2, 1}). {1, 2} and {2, 1} are being treated distinctly. This is not correct. A generic visualization follows.

Assume we start with an amount $n$. We have only two types of coins of worth $1$ and $2$ each. Now, notice, how the recursion tree would form. If we take the coin with denomination $1$ first and the one with denomination $2$ second, we get to a subtree with amount $n-3$, and on the other side, if we take $2$ first, and $1$ next, we get a subtree with the same amount. Both of these would be counted twice with the above solution, even though, the order of the coins does not matter.

After some discussion, we agreed on a top-down DP which keeps track of which coins to use, and avoids duplication. The idea is to always follow a lexicographic sequence when using the coins. It doesn’t matter if the coins are sorted or not (actually yes, if you check all the coins that you are allowed to use, if they can be used). What matters is, always follow the same sequence. For example, if I have three coins {1, 2, 5}. Let’s say, if I have used coin $i$, I can only use coins $[i, n]$ from now on. So, if I have used coin with value $2$, I can only use $2$ and $5$ in the next steps. The moment I use 5, I can’t use 2 any more.

If you follow, this will allow sequences of coins, in which the coin indices are monotonically increasing, i.e., we won’t encounter a scenario such as {1, 2, 1}. This was done in a top-down DP as follows:

// This is a toy program, so please excuse the trivial flaws.#define MAXN 2000#define MAXC 20// An N*C array, hard-coding the max N = 2000, C = 20.intr[MAXN][MAXC];intcval[MAXC];intN,C;/*** O(N*C) solution.* N is the sum to make, and C is the number of coins you can use. In the method* n is the amount we have to make, and c denotes the coin number which is the* smallest coin we can use, i.e., we can only use coins [Cc, Cn-1]. This is to* prevent double counting.*/intsolve(intn,intc){int&res=r[n][c];if(res!=-1){returnres;}if(n==0){returnres=1;}res=0;for(inti=c;i<C;i++){intrem=n-cval[i];if(rem>=0){res+=f(rem,i);}}returnres;}intsolve(intn){memset(r,-1,sizeof(r));returnf(n,0);}

Now, this is a fairly standard problem. I decided to check on the interwebs, if my DP skills have been rusty. I found the solution to the same problem on Geeks-for-Geeks, where they present the solution in bottom-up DP fashion. There is also an $O(N)$ space solution in the end, which is very similar to our first faulty solution, with a key difference that the two loops are exchanged. That is we loop over coins in the outer-loop and loop over amount in the inner loop.

This is almost magical. Changing the order of the loops fixes the problem. I have worked out the table here step by step. Please let me know if there is a mistake.

Step 1: Calculating with 3 coins, uptil N = 10. Although we use an one-dimensional array, I have added multiple rows, to show how the values change over the iterations.

Step 2: Initialize table[0] = 1.

Step 3: Now, we start with coin 1. Only cell 0 has a value. We start filling in values for $n = 1, 2, 3, ..$. Since, all of these can be made by adding \$1 to the amount once less then that amount. Thus, the total number of ways right now, would be 1 for all, since we are using only the first coin, and the only way to construct an amount would be $1 + 1 + 1 + … = n$.

Step 4: Now, we will use coin 2 with denomination \$2. We will start with $n = 2$, since we can’t construct any amount less than \$2 with this coin. Now, the number of ways for making amount \$2 and \$3 would be $2$. One would be the current number of ways, the other would be removing the last two $1$s, and adding a two. Similarily, mentally (or manually, on paper) verify how the answers would be.

Step 5: We repeat the same for 3. The cells with a dark green bottom are the final values. All others would have been overwritten.

I was looking into where exactly are we maintaining the monotonically increasing order that we wanted in the top-down DP in this solution. It is very subtle, and can be understood, if you verify step 4 on paper, for $4, 5, 6, …$ and see the chains that they form.

In the faulty solution, when we compute amounts in the outer loop, when we reach to amount $n$, we have computed all previous amounts for all possible coins. Now, if you compute from the previous solutions, they have included the counts for all coins. If you try to calculate the count for $n$, using the coin $i$, and result for $n - cval[i]$, it is possible, that the result for $n - cval[i]$, includes the ways with coins > $i$. This is undesirable.

However, when we compute the other way round, we compute for each coin at a time, in that same lexicographical order. So, if we are using the results for $n - cval[i]$, we are sure, that it does not include the count for coins > $i$, because they haven’t been computed yet, since they would only happen after computing the result for $i$.

As they say, sometimes being simple is the hardest thing to do. This was a simple problem, but it still taught me a lot.

Instead of going into fractal trees directly, I am going to be posting a lot of assorted related material that I am going through. Most of it is joint work with Dhruv and Bhuwan.

This post is about the Static-to-Dynamic Transformation lecture by Jeff Erickson. The motivation behind this exercise is to learn, how can we use a static data-structure, (which uses preprocessing to construct itself and answer queries in future) and create a dynamic data-structure that can continue taking inserts continuously.

I won’t be formal here, so there would be a lot of chinks in the explanation. So either excuse me for that, or read the original notes.

A Decomposable Search Problem

A search problem $Q$ with input $\mathcal{X}$ over data-set $\mathcal{D}$ is said to be decomposable, if, for any pair of disjoint data sets, $D$ and $D’$, the answer over $D \cup D’$, can be computed from the answers over $D$ and $D’$ in constant time. Or:

$Q(x, D \cup D’) = Q(x, D) \diamond Q(x, D’)$

Where $\diamond$ is an associative and commutative function which has the same range, as $Q$. Also, we should be able to compute $\diamond$ in $O(1)$ time. Examples of such a function would be $+$, $\times$, $min$, $max$, $\vee$, $\wedge$ etc. (but not $-$, $\div$, etc.).

Examples of such a decomposable search problem can be a simple existence query, where $Q(x, D$ returns true if $x$ exists in $D$. Then the $\diamond$ function is the binary OR. Another example can be where the dataset is a collection of coordinates, and the query is the number of points which lie in a given rectangle. The $\diamond$ function here is $+$.

Making it Dynamic (Insertions Only)

If we have a static structure that can store $n$ elements by needing $S(n)$ space, after $P(n)$ preprocessing, and can answer a query in $Q(n)$ time. What we mean by a static data-structure is that we can only make inserts into the data-structure exactly once. But we can iterate through that data-structure (this is what the notes have missed, but is a requirement to get the bounds).

Then, we can construct a dynamic data-structure the space requirement of the dynamic structure would be $O(S(n))$, with query time of $O(\log n).Q(n)$, and insert time of $O(\log n).\frac{P(n)}{n}$ amortized.

How do we do this?

Query: Our data-structure has $l$ = $\lfloor{\lg{n}\rfloor}$ levels. Each level $i$ is either empty, or has a static-data structure with $2^i$ elements. Hence, since the search query is decomposable, the answer is simply $Q(D_{0}) \diamond Q(D_{1}) \diamond … \diamond Q(D_{l})$. It is easy to see why the total time taken for the query would be $O(\log n)$ Q(n).

An interesting point is, if $Q(n) > n^\epsilon$, where $\epsilon > 0$ (which essentially means, if $Q(n)$ is polynomial in $n$), then the total query time is $O(Q(n))$. So, for example, if $Q(n) = n^2$, with the static data-structure, the query time with the dynamic data-structure would be $O(Q(n))$. Here is the proof, the total query time is: $\sum{Q(\frac{n}{2^i})} \implies \sum{(\frac{n}{2^i})^\epsilon} \implies n^\epsilon \sum{(\frac{1}{2^i})^\epsilon} \implies n^\epsilon . c \implies O(n^\epsilon) \implies O(Q(n))$ .

Insert: For insertion, we find the smallest empty level $k$, and build $L_k$ with all the preceding levels ($L_0$, $L_1$, …, $L_k$) and the new element, and discard the preceding levels. Since it costs $P(n)$ to build a level, and each element will participate in the array building process $O(\log n)$ times (or jump levels that many times), we will pay the $P(n)$ cost $O(\log n)$ times. Over $n$ elements, that is $O(\log n).\frac{P(n)}{n}$ per element amortized. Again, if $P(n) > n^{1+\epsilon}$ for any $\epsilon > 0$, the amortized insertion time per element is $(O(P(n)/n)$. The proof is similar to what we described above for the query time.

Interesting Tidbit
Recollect what we mean when we say that a data-structure is static. A Bloom Filter is a static data-structure in a different way. You can keep inserting elements into it dynamically up to a certain threshold, but you can’t iterate on those elements.

The strategy to make it dynamic is very similar, we start with a reasonably sized bloom-filter, keep inserting into it as long as we can. Once it is too full, we allocate another bloom-filter of twice the size, and insert elements into that, from now on. And so on. The queries are done on all the bloom-filters and are a union of their individual results. An implementation is here.

What Next: Deamortization of this data-structure with insertions as well as deletions. Then we will move on to Cache-Oblivious Lookahead Arrays, Fractional Cascading, and eventually Fractal Trees. This is like a rabbit hole!

It is fascinating, how simple data structures can be used to build web-scale systems (Related: A funny video on MongoDB).
If this doesn’t make sense to you yet, allow me to slowly build up to the story. One of the most simple, and yet powerful algorithm a programmer has in his toolbox is the Binary Search. There are far too many applications to it. Consider reading this Quora answer for simple examples. I personally use it in git bisect to hunt down bad commits in a repository with tens of thousands of commits.

The humble sorted array is a beautiful thing. You can search over it in $O(\log n)$ time. There is one trouble though. You cannot modify it. I mean, you can, but then you will spoil the nice property of it being sorted, unless you pay an $O(n)$ cost to copy the array to a new location, and insert the new element. If you have reserved a large enough array before hand, you don’t need to copy to a new array, but still have to shift elements and that will still be an $O(n)$ cost.

Also, if we were allowed to plot complexities on a graph, we can plot the insert complexity on the X-axis and search complexity on the Y-axis. Then all the suitable data-structures would hopefully be bound by the square with edges on <$O(1)$, $O(1)$> and <$O(n)$, $O(n)$>. The sorted array with <$O(n)$, $O(\log n)$> would lie somewhere on the bottom right corner, whereas, a simple unsorted array would be on the top-left with <$O(1)$, $O(n)$>. You can’t do insertions better than $O(1)$ and you can’t do searches better than $O(\log n)$ (although the bases and constants matter a lot, in practice).

Now, how do we use a static structure, so that we retain the goodness of a sorted array, but allow ourselves the ability to add elements in an online fashion? What we have here, is a ‘static’ data-structure, and we are trying to use it for a ‘dynamic’ usecase. Jeff Erickson’s notes on Static to Dynamic Transformation are of good use here. The notes present results related to how to use static data-structures to build dynamic ones. In this case, you compromise a bit on the search complexity, to get much better insert complexity.

The notes present inserts-only and inserts-with-deletions static to dynamic transformations. I haven’t read the deletions part of it, but the inserts-only transformation is easy to follow. The first important result is:

If the static structure has a space complexity of $S(n)$, query complexity of $Q(n)$, and insert complexity of $P(n)$, then the space complexity of the dynamic structure would be $O(S(n))$, with query complexity of $O(\log n).Q(n)$, and insert complexity of $O(\log n).\frac{P(n)}{n}$ amortized.

Then the notes present the lazy-rebuilding method by Overmars and van Leeuwen. Which improves the first result’s insertion complexity by getting the same complexity in the worst case instead of amortized. (Fun fact: Overmars is the same great fellow who wrote Game Maker, a simple game creating tool, which I used when I was 12! Man, the nostalgia :) I digress..)

The inserts-only dynamic structure, is pretty much how LSM trees work. The difference is the $L_0$ array starts big (hundreds of MBs, or a GB in some cases), and resides in memory, so that inserts are fast. This $L_0$ structure is later flushed to disk, but does not need to be immediately merged with a bigger file. That is done by background compaction threads, which run in a staggered fashion, so as to minimize disruption to the read workload. Read the BigTable paper, to understand how simple sorted arrays sit at the core of the biggest databases in the world.

I really like the concept of Fortune Cookies in *nix systems, and I absolutely love the Hindi movie Andaz Apna Apna. So, I thought I would write up a simple fortune-cookie server which serves random quotes from the movie. A lot of my friends liked it.

Using Elixir, it is extremely simple to write a fortune-cookie server which serves quotes from multiple quote databases. All you need to do is create the a file that contains the quotes you want to serve, one per line, and give it a name like foo.quotes. Place it in the directory where the server was started from, and those quotes would be serve from the /foo endpoint.

To make it more fun, /foo?f=cowsay returns the quote in the cowsay format! Something like this

Implementation Note: To implement the feature of keeping tab on the quote databases, without having to restart the server, one way was to use the Inotify subsystem in Linux, using Go. But this didn’t work for OSX. So I wrote up a quick and dirty implementation which does ioutil.ReadDir(".") periodically, and filters all the files which have a .quotes extension.

It has almost been a year since I wrote something. For the most part, I have been too busy to share something. Since the past one year, I have been working on HBase at Facebook, which is the backbone of Facebook Messages, and many other important services at Facebook. Also, I’ve moved to Mountain View, California, and I absolutely love the surroundings.

I have been trying my hand at different things, like learning some ML, Go, trying to learn how to play an Electric Guitar, and other things. One thing I want to follow this year would be to continue doing new things, and keep sharing my experiences.

Finally, I have also ditched WordPress in favor of Octopress, a Markdown-style blogging framework built on top of Jekyll. What this means is, I just need to worry about the content, which I can write in simple Markdown format. Octopress generates a completely static website for me to use. I don’t have to setup the Apache-MySQL monstrosity to serve a simple blog.

However, the transition isn’t very smooth.

I had to get my posts from WordPress into a Markdown format. For this, I used exitwp.

I had to copy the images I had uploaded to my WP setup to my Octopress sources directory, and manually change the image tags to point to the right location in all the markdown files.

Obviously, I lost the ability to receive comments natively, and lost the older comments on my posts. This is fine by me, since I don’t receive too many comments anyways. I will enable Disqus on the blog for future comments.

Also, WP has this ridiculous URL scheme for posts, which is something like yourblog.com/?p=XYZ, where XYZ is a number, while Octopress has a more sensible, :year/:month/:date/:title scheme. Google had indexed my blog according to the older scheme and now, and anybody who has linked to a specific blog post, will now be redirected to the main page. In short, its not pleasant.

However, the big win is that it is super-easy for me to write posts and host a blog. Earlier, this blog was put up on a free shared hosting site, and it was very clumsy to manage my blog. And as of the time of writing this post, I am hosting multiple blogs on a very lightweight VPS, and as long as I don’t get DDoS-ed, this machine is more than capable of hosting several such static-only blogs. Because, after all, how hard is it to serve static HTML :)

I have a lot to share about what I learnt in the last year, so keep looking :)

I had done a visualisation using R, where I plotted the locations of my friends and followers on Twitter on a map. I re-did this visualisation using D3.js. It is a fantastic visualisation tool, which a lot of features. I could only explore a fraction of the API for my purpose, but you can see a lot of cool examples on their page.

I wanted to resize the bubbles when there are a lot of points in the vicinity, but I’ll leave that for later.

You can have a look at the D3.js code, and the python script which gets the lat-long pairs using the Twitter API, and the Google GeoCoding API (Yahoo! decided to commercialize their GeoCoding API sadly).

I have been trying to learn some Go in my free time. While, I was trying to code up a simple Bloom Filter, I realized that once a Bloom Filter gets too full (i.e., the false positive rate becomes high), we cannot add more elements to it. We cannot simply resize this original Bloom Filter, since we would need to rehash all the elements which were inserted, and we obviously don’t maintain that list of elements.

A solution to this is to create a new Bloom Filter of twice the size (or for that matter, any multiple >=2), and add new elements to the new filter. When we need to check if an element exists, we need to check in the old filter, as well as the new filter. If the new filter gets too full, we create another filter which is a constant factor (>=2) greater in size than the second filter. bit.ly uses a solution similar to this (via Aditya).

We can see that if we have N elements to insert in our ‘Scalable’ Bloom Filter, we would need log N filters (base r, where r is the multiple mentioned above). It is also easy to derive the cumulative false positive rate for this new Bloom Filter.

If the false positive rate of each of the individual constituent bloom filters is f, the probability that we do not get a false positive in one of the filters is (1-f).

Therefore the probability that we do not get a false positive in any of the q filters is, (1-f)^q.

Hence, the probability that we get a false positive in any of these q filters is:

1 - (1-f)^q.

Some rough estimates show that this cumulative false positive rate is around: q * f (only if f is small). Where, q is about log N, as we noted above. Therefore, if you have four filters, each with a false positive rate of 0.01, the cumulative false positive rate is about 4 * 0.01 = 0.04. This is exactly what we want.

What is beautiful in this construction is, the false positive rate is independent of how fast the filter sizes grow. If you maintain a good (small) false positive rate in each of the constituent filter, you can simply add up their false positive rates to get an estimate of the cumulative false positive rate (only if f is small).

You can simply grow your filters fast (like around 5x), each time one filter becomes too full, so as to keep the number of filters small.

I have implemented this ‘Scalable Bloom Filter’ (along with the vanilla Bloom Filter and the Counting Bloom Filter) in Go. Please have a look, and share any feedback.

Today I was trying to link together code which uses C++ Templates. The usual accepted pattern is to put the declarations in the header, and the definitions in a separate .hpp / .cpp file. However, I was unable to get them to link it together.

To my surprise, I discovered (or re-discovered, probably) that, when dealing with C++ Templates, you need to put the definitions in the header file. A good explanation of why it is so, is here.