C/C++

Random Numbers in a Range Using Generic Programming

Source Code Accompanies This Article. Download It Now.

The Greatest Common Divisor algorithm computes the maximal number that divides its two arguments—and can be used as an efficient random-number generator.

Michael is a doctoral candidate in the Computer Science Department of Ben-Gurion University. He can be contacted at orlovm@cs.bgu.ac.il.

Consider the generation of random numbers in an interval, say the generation of an integer in {0,1,...,m-1} where the probability of each integer is 1/m. Usually, you have a source of randomness that provides random bits in chunks of architecture word sizesomething like the wrapper around the mtrand library (bedaux.net/mtrand); see Listing One.

Mtrand implements the Mersenne Twister random-number generator (www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/emt.html), which is a high-quality source of randomness. By "high quality," I mean that you don't need to consider low-order bits of words produced by the generator to be of inferior quality than high-order bits. In principle, you can take the value returned by next()modulom to produce an integer in the range [0,m-1].

However, this approach is problematic when m is comparable in size to the number of values representable by value_t. For instance, assume that the type of value_t is an unsigned 10-bit integer, and can thus contain values in the [0,1023] range. Consider m=684, which is approximately 2/3 of 1024. Taking a random value_tmodulom results in the lower half of [0,m-1] range having twice the probability of its upper half.

The regular approach to this problem is to ignore it (effectively assuming that m is low enough that the difference in probabilities is insignificant), or to reject the values in the "remainder" portion of the value_t range; see Listing Two.

Taking the previous example with m=684, the constants in Listing Two are:

M=1023, the maximal value representable by value_t.

rem=340, the remainder size.

last=683, the last value in the range of "safe" values returned by next().

Is the remainder values rejection approach costly? It is, especially for values of m close to, but above, the half-range of value_t. For such values, the expected number of calls to next() for each call to next(max) is 2 (this follows from the probability of rejection being 1/2). Because the cost of random bits varies between "moderate" (for Mersenne Twister) to "expensive" (for cryptography-oriented generators, especially hardware implementations), the running time of the relevant program portions doubles in such cases (unless you settle for the low-quality linear congruential generators).

Is it wasteful? The rejected values are thrown away, but perhaps their inherent entropy can still be usedmaybe to reduce the probability of rejection on the next iteration of the do/while loop? For instance, in the example, once rnd is rejected, and just before the next loop iteration, rnd-last-1 is uniformly distributed in the [0,rem-1] range. You could partition the [0,m-1] range into exact rem buckets, and in the next iteration draw a value in the corresponding bucket using m=rem. What's problematic in this solution is that the conditions for its application are quite rare because it requires that rem divides m. It can happenwith a value_t type of 10 bits, two possibilities are m=768 and m=320however, it is possible to do a little better using the Greatest Common Divisor (GCD) algorithm.

Recall that the GCD algorithm computes the maximal number that divides its two arguments, in logarithmic number of arithmetic operations. Let's return to m=684 and rem=340. Both are divisible by 4 (which is the GCD). Thus, if rnd is rejected in the first loop iteration, you can find out in which quarter of the remainder values rnd is located, map it to the corresponding quarter of values in the [0,m-1] range, and let m=684/4=171 in the next iteration. Then, the probability of rejection becomes 1/6 instead of 1/3 (and can in general drop to a much lower value).

It turns out, however, that for randomness sources that are only moderately expensive, the GCD computation cancels any running time benefit attained by reducing the expected number of calls to next(). If you are writing this code for a library, you could specify that it assumes usage with only "expensive" random-number generators, or implement memoization. However, in the first case, the code's generic nature suffers, and in the second case, a penalty is incurred on users who don't need perfect distribution of values in the [0,m-1] range.

This gloomy situation does have a win-win solution if you are using a language that allows compile-time computations (C++, for instance), and if you are willing to limit the approach outlined earlier to values of m known at compile time.

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task.
However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

Video

This month's Dr. Dobb's Journal

This month,
Dr. Dobb's Journal is devoted to mobile programming. We introduce you to Apple's new Swift programming language, discuss the perils of being the third-most-popular mobile platform, revisit SQLite on Android
, and much more!