Saturday, January 07, 2012

Let S be some finite set with a probability distribution on it. Here's a diagram showing some example probabilities for the set S={A, B, C, D, E}:

How can we generate lots of random samples from this distribution? A popular method involves first storing the cumulative distribution function in a table like so:

We then use any of a number of popular methods to generate uniform pseudorandom numbers in the range [0,1) and for each one walk through the table until we find the first entry greater than our pseudorandom number. The symbol above the number is the one we generate. So if we generated 0.512, we'd pick symbol C. It's straightforward to prove this gives the correct probability distribution.

As described this algorithm can be slow. If the size of the table is N we may have to walk through up to N entries to generate each sample.

One approach to accelerating this algorithm is to quantise our pseudorandom number and use it to look up, in a precomputed table, a jump into our cumulative distribution table. I've used this several times in my visual effects career. But today I'm going to take a different approach.

Another natural way to speed up sample generation is to use a binary search to find the appropriate point in the cumulative distribution, for example the C++ standard template library's upper_bound method will do the job.

But what is a binary search algorithm going to do? Typically it's going to start by comparing our pseudorandom number with the midpoint of the table. If our number is bigger then it will recursively use binary search on the left (looking next at the midpoint of the left half), otherwise on the right, and so on. If we're generating many samples from the same distribution we're going to be repeatedly looking up the same midpoints in the table. At the end of the day, the process can be described by a decision tree. So we may as well throw away the table and build the decision tree up front. Here's what it might look look like:

But maybe we can do better. For example C is three times as likely as A but they both take the same amount of time to generate as they both require walking down to depth 2. Meanwhile D has a probability of 0.25 and requires walking to depth 3. We'd like to rebalance the tree. Note also that there's no reason to list our original PDF in the order I gave. Different orderings might give different trees.

It's straightforward to describe what we want to optimise. We want to place sample i at depth di so that the expected value of di is as small as possible. In other words we want to minimise Σpidi. But this is precisely the problem solved by Huffman coding.

And that's the conclusion I wanted to reach: Huffman coding gives the optimal decision tree to generate random samples. It also tells us this interesting consequence: if we use a decision tree method then the performance of our algorithm is bounded by the entropy of the probability distribution. I find this connection between entropy and algorithmic complexity pretty surprising.

I learnt the above during my interview at Google!

Why is there this connection between a compression algorithm and the generation of random samples? It took me a little longer to realise why but it's quite straightforward.

Huffman coding tries to compress text one letter at a time on the assumption that each letter comes from some fixed and known probability distribution. If the algorithm is successful then we'd expect the compressed text to look like a uniformly distributed sequence of bits. If it didn't then there'd be patterns that could be used for further compression.

So when we're decompressing Huffman encoded data we have a machine that takes as input uniformly distributed bits and which outputs letters sampled from some probability distribution. But that's exactly the same problem that I posed above. So, at least from some perspective, decompression is precisely the same thing as generating samples from a probability distribution.

Or to put it another way: there is a class of algorithm whose purpose is to convert samples from one probability distribution into samples from another. When we use one of these algorithms to convert from samples from some distribution P that's easy to generate samples from, into samples from Q, we call this "an algorithm to sample from distribution Q". When we use one of these algorithms to convert from some distribution to a uniform distribution, and the function given by the algorithm is invertible, then we call it "lossless compression".

I see how this tells us about the optimal number of decisions you have to make while generating a random number, but we still cannot generate the numbers by just taking a random bit string and decompressing it with the Huffman tree over the desired distribution. E.g. consider a distribution {a:0.4, b:0.6} - it will have a Huffman tree with just 2 leaves, and decompression will give {a:0.5, b:0.5}. Am I missing something?

I came to this same realization when I started studying data compression (almost 30 years ago -- yikes!) seriously. Build a decoder for the appropriate arithmetic code, and feed it random bits. You get your output distributed according to the probability model used to build the arithmetic code.

Hi I hate to jump to a different subject. But a while back you had some wonderful notes on Target Enumeration and a Haskell implementation.My question is how could you extend what you did such that our boundary points lie on real number coordinates...so to refer to the pixel examples our area of coverage would contain the counted 'holes' plus the partially contained 'holes' or say pixel areas? How could we get the exact coverage based on the way you implemented? I appreciate any information as I know this was a while back---but I JUST found the blog if you have any resources please feel free to contact me at:jcarrola@swri.org ---thanks for any time you can spare. Regards

I don't think Huffman coding yields suitable trees for this problem, since the trees for random variate generation have the additional requirement of being ordered. Optimal alphabetic binary trees or optimal BSTs require a different algorithm that is not a variation of Huffman coding:

"This is also known as the Hu-Tucker problem, after the authors of the paper presenting the first linearithmic solution to this optimal binary alphabetic problem, which has some similarities to Huffman algorithm, but is not a variation of [Huffman coding]. These optimal alphabetic binary trees are often used as binary search trees."