So you have an infinite stream of uniform random binary digits, and want to use it to produce an infinite stream of uniform random base $n$ digits.

The obvious, really easy way to do it is to find the smallest $k$ such that $2^k \ge n$, and generate $k$-bit numbers in the range $0 \dots 2^k-1$.

If the number you generate is less than $n$, yield it; otherwise chuck it away.

This works, but has the problem that you might go a long time before you generate a number that you don’t throw away.
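A minimal sketch of this plain rejection sampling in Python (function names are my own, illustrative choices):

```python
import random

def rejection_digits(n):
    """Yield uniform base-n digits by plain rejection sampling."""
    k = (n - 1).bit_length()          # smallest k with 2**k >= n
    while True:
        x = 0
        for _ in range(k):            # assemble k random bits into a number
            x = 2 * x + random.getrandbits(1)
        if x < n:
            yield x                   # usable digit
        # otherwise the sample is simply thrown away
```

For $n = 5$ this throws away, on average, $3$ of every $8$ samples.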

So what can we do with the numbers that are thrown away?

Subtract $n$ from them; the results are uniform in the range $0 \dots 2^k-n-1$, so use them as an infinite stream of base $(2^k-n)$ digits.

You can then repeat the trick: combine enough of those digits that their combined range is at least $n$, and reject-sample from the combined values. Values less than $n$ are yielded; otherwise, the trick is repeated yet again!
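One step of that combining, sketched in Python (a hypothetical helper, assuming the recycled digits arrive as an iterator):

```python
def combine(digit_stream, m, n):
    """Combine uniform base-m digits into one uniform value whose
    combined range m**j is at least n; return (value, range)."""
    value, rng = 0, 1
    while rng < n:
        value = value * m + next(digit_stream)
        rng *= m
    return value, rng

# e.g. two recycled base-3 digits 1, 0 combine to the value 3 in range 9;
# 3 < 5, so it can be yielded as a base-5 digit
```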

Example

Here’s an example, generating base $5$ digits:

The smallest power of $2$ greater than $5$ is $2^3 = 8$. So take digits from the binary stream three at a time.

There are three unusable numbers we can generate: $5,6,7$. So subtract $5$ from those, giving $0,1$ or $2$, and treat them as digits from a stream of base $3$ digits.

The smallest power of $3$ greater than $5$ is $3^2 = 9$, so take digits from that stream two at a time. Numbers $5,6,7,8$ from that range can’t be used, so subtract $5$ from them, giving $0,1,2$ or $3$, and treat them as base $4$ digits.

The smallest power of $4$ greater than $5$ is $4^2 = 16$, so take digits from that stream two at a time. The offcasts from that stream should be considered as base $16-5 = 11$ digits. $11$ is bigger than $5$ already, so take digits from that stream one at a time.

The offcasts from the base $11$ stream look like base $6$ digits. There’s only one unusable number from that range, $5$, and as a base-$1$ digit it carries no information, so stop being clever at that point and throw those numbers away.
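The chain of bases in this walkthrough can be computed mechanically; here’s a small sketch (my own code, not from the post):

```python
def accumulator_chain(n, depth=10):
    """Bases of successive accumulators for target base n: each base m is
    raised to the smallest power m**k >= n, and the rejects feed a new
    accumulator of base m**k - n."""
    chain, m = [], 2
    for _ in range(depth):
        k, power = 1, m
        while power < n:              # smallest k with m**k >= n
            k += 1
            power *= m
        chain.append((m, k, power - n))
        m = power - n
        if m <= 1:                    # a base-1 reject stream carries no information
            break
    return chain

# Each tuple is (base, digits taken at a time, base of the reject stream).
# For base 5 this reproduces the walkthrough above:
# accumulator_chain(5) == [(2, 3, 3), (3, 2, 4), (4, 2, 11), (11, 1, 6), (6, 1, 1)]
```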

At the moment (after consuming the example bits 110.101.011.111.001), our stream of base $5$ digits is $[3,3,1]$ and we have a $2$ in the stream of base $3$ digits. We can keep generating numbers like this indefinitely, and we hardly ever throw information away. It would probably take many more steps before the bigger accumulators come into play.

Working code

I’ve written some Python code which implements the algorithm above to produce an infinite stream of digits of any base, from a stream of binary digits provided by Python’s random module.
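That code isn’t reproduced here, so the following is my own sketch of the algorithm: each accumulator buffers its incoming digits and cascades rejects to a lazily created child. The lazy creation matters, because for some $n$ (such as $6$, where $2^3 - 6 = 2$) the reject base equals the original base and an eagerly built chain would never terminate.

```python
import random

class Accumulator:
    """Uniform base-n digits from buffered uniform base-m digits; rejected
    samples cascade to a lazily created child accumulator of base m**k - n.
    A sketch of the algorithm in the post, not the author's actual code."""

    def __init__(self, n, m):
        self.n, self.m = n, m
        self.k, self.power = 1, m
        while self.power < n:          # smallest k with m**k >= n
            self.k += 1
            self.power *= m
        self.buffer = []               # unconsumed base-m digits
        self.child = None              # created on first reject

    def push(self, digit):
        """Feed one base-m digit; return a base-n digit if one is ready."""
        self.buffer.append(digit)
        if len(self.buffer) < self.k:
            return None
        x = 0
        for d in self.buffer:          # combine k digits into one sample
            x = x * self.m + d
        self.buffer.clear()
        if x < self.n:
            return x                   # usable digit
        leftover = self.power - self.n
        if leftover > 1:               # recycle the reject downstream
            if self.child is None:
                self.child = Accumulator(self.n, leftover)
            return self.child.push(x - self.n)
        return None                    # base-1 rejects carry no information

def base_n_digits(n):
    """Endless uniform base-n digits from Python's random module."""
    top = Accumulator(n, 2)
    while True:
        out = top.push(random.getrandbits(1))
        if out is not None:
            yield out
```

Feeding the example bits 110.101.011.111.001 into `Accumulator(5, 2)` yields $[3,3,1]$, with a $2$ left buffered in the base $3$ accumulator, matching the state described above.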


Comments

I think this is information theory, and you might want to have a look at the kind of algorithm used for zipping files for clues.

Assuming you want letters to be equally likely: if you have an alphabet of $k$ letters and generate $b$ random bits, you should (in principle) be able to encode a message of length $\frac{b}{\log_2(k)}$ (e.g., with $32$ letters and $10$ bits, you could encode two letters).

Quite how you do it, I’m not sure – I suspect you’d need a splitting tree with rules to use if you get to a non-letter node.

An on-the-fly thought: can you consider the binary stream as a binary ‘decimal’ and simply convert it into a ternary (or base n) ‘decimal’?

For example: if you have 0.101110, you can place limits on it digit by digit:

0.1 means the number is between 1/2 and 1
0.10 … 2/4 to 3/4
0.101 … 5/8 to 6/8
0.1011 … 11/16 to 12/16 (in particular, between 2/3 and 1 so first ternary place is 2; between 6/9 and 7/9 so second ternary place is 0)
0.10111 … 23/32 to 24/32
0.101110 … 46/64 to 47/64 (between 19/27 and 20/27, so third ternary place is 1).
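This interval-narrowing idea can be sketched in Python with exact fractions (my own illustrative code, not a tuned range coder): emit a base-$n$ digit as soon as the interval pinned down by the bits so far fits inside a single base-$n$ cell.

```python
from fractions import Fraction

def binary_to_base(bits, n):
    """Read binary 'decimal' digits of a number in [0, 1) and yield the
    base-n digits of the same number as they become determined."""
    lo, hi = Fraction(0), Fraction(1)          # interval pinned by the bits
    out_lo, out_hi = Fraction(0), Fraction(1)  # interval of emitted digits
    for b in bits:
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if b else (lo, mid)
        while True:
            width = (out_hi - out_lo) / n
            d = (lo - out_lo) // width         # candidate next base-n digit
            cell_lo = out_lo + d * width
            cell_hi = cell_lo + width
            if cell_lo <= lo and hi <= cell_hi:
                yield int(d)                   # interval fits one cell: emit
                out_lo, out_hi = cell_lo, cell_hi
            else:
                break                          # need more bits
```

For the bits $1,0,1,1,1,0$ above, this emits the ternary digits $2, 0, 1$.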

Coding theory is indeed relevant here.
The first method is closely related to Huffman coding.
You pad the initial alphabet with symbols that will not get emitted and then construct the binary Huffman tree.
A standard method to increase the efficiency is to encode pairs, triples, etc.
For this application you can also encode a list with each symbol repeated twice, three times, etc.

The method interpreting the binary stream as a number in [0, 1) is essentially arithmetic coding. This can be implemented using only integers (range coding).

The method from your original post can be refined to wring all the entropy out in a less drastic way than Colin’s arithmetic coding; this refinement is also what your “random bit unbiasing” link does. Namely, the information you’re discarding is whether each attempt of a given accumulator to emit a digit succeeds or passes through to the next one.

In your example, converting the binary string 110.101.011.111.001 to base 5, you should not just be yielding [3,1] directly and passing [1,0,2] to a base 3 accumulator, but also passing [1,1,0,1,0] to a new accumulator of biased binary digits which are 1 with probability 3/8.
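A sketch of that three-way split (my own illustrative code): each group of $k$ bits contributes either an accepted base-$n$ digit or a recycled reject, plus one indicator bit for the biased accumulator.

```python
import itertools

def split_stream(bits, n):
    """Split a binary stream into (digits, rejects, pattern): accepted
    base-n digits, recycled rejects, and indicator bits that record
    whether each attempt rejected (1, with probability (2**k - n)/2**k).
    Trailing bits that don't fill a k-bit group are dropped."""
    k = (n - 1).bit_length()           # smallest k with 2**k >= n
    digits, rejects, pattern = [], [], []
    bits = iter(bits)
    while True:
        group = list(itertools.islice(bits, k))
        if len(group) < k:
            return digits, rejects, pattern
        x = 0
        for b in group:                # combine k bits into one sample
            x = 2 * x + b
        if x < n:
            digits.append(x)           # accepted digit
            pattern.append(0)
        else:
            rejects.append(x - n)      # recycled reject
            pattern.append(1)
```

For the string above with $n = 5$, this gives digits [3,1], rejects [1,0,2] and pattern [1,1,0,1,0].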