
Algorithm Alley: Truly Random Numbers

Source code accompanies this article.

Random numbers are essential in cryptography. This month, Colin Plumb discusses the random-number generator he helped devise for the Pretty Good Privacy (PGP) e-mail security program.

Why do we need random numbers? There are a lot of reasons. You might be designing a communications protocol and need random timing parameters to prevent system lockups. You might be conducting a massive Monte Carlo simulation and need random numbers for various parameters. Or you might be designing a computer game and need random numbers to determine the results of different actions.

As common as they are, random numbers can be infuriatingly hard to generate on a computer. The very nature of a computer--a deterministic, digital Turing machine--is contrary to the notion of randomness.

One application where random numbers are essential is in cryptography. The security of a cryptographic system often hinges on the randomness of its keys. In this month's "Algorithm Alley," Colin Plumb discusses the random-number generator in the Pretty Good Privacy (PGP) e-mail security program. Colin is one of the designers and programmers of PGP, and has spent a lot of time thinking about this problem. His solution is elegant, efficient, effective, and has applications well beyond e-mail security.

The ANSI C rand() function does not return random numbers. This is not a bug; it's required by the ANSI C standard. Instead, the values returned are determined by the seed supplied to srand(). If you run the same program with the same seed, you get the same "random" numbers. The pattern may not be obvious to the casual observer, but if Las Vegas ran this way, there'd be fewer bright lights in the big city.
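The determinism described above is easy to observe. The following is a minimal sketch; the helper name same_seed_repeats() is hypothetical and does not come from the article's listings. It seeds the generator twice with the same value and checks that both runs agree.

```c
#include <stdlib.h>

/* Hypothetical helper: returns 1 if two runs of rand() started from
   the same seed agree on their first n values (n is capped at 16),
   and 0 otherwise. */
int same_seed_repeats(unsigned seed, int n)
{
    int first[16];
    int i;

    if (n > 16)
        n = 16;

    srand(seed);                 /* first run: record the values */
    for (i = 0; i < n; i++)
        first[i] = rand();

    srand(seed);                 /* second run: same seed again */
    for (i = 0; i < n; i++)
        if (rand() != first[i])
            return 0;            /* would mean rand() is not repeatable */

    return 1;
}
```

On a conforming implementation this should return 1 for any seed, which is exactly why rand() alone is unsuitable for generating keys.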

John von Neumann once said that "anyone who considers arithmetical methods of producing random digits is, of course, in a state of sin." Sometimes you want truly random numbers. Any number of security and cryptographic applications require them. When it comes to checking for viruses, for example, CRCs are convenient and fast, but you can easily fake out a known polynomial. The Strongbox secure loader from Carnegie Mellon University, however, uses a random polynomial to achieve security while keeping the speed advantages of a CRC (see Camelot and Avalon: A Distributed Transaction Facility, edited by Jeffrey L. Eppinger).

There are ways to produce random bits in hardware, such as sampling a quantum effect like radioactive decay. However, such sources are hard to calibrate and require special hardware.

On the other hand, software in a properly working computer is deterministic--the antithesis of random. Still, a computer generally has to interact and receive input from real-world events, so it is possible to make use of a very unpredictable part of most computer systems: the person typing at the keyboard.

Although keystrokes are somewhat random, compression utilities illustrate just how predictable most text is. While it would be foolish to ignore this entropy, anticipating what someone types is akin to password guessing: difficult, but if you have the computational horsepower, conceivable.

A more fruitful source is timing. Many computers can time events down to the microsecond. And while typing patterns on familiar words or phrases are repeatable enough to be used for identification, there is still a large window of available noise. Our basic source for entropy comes from sampling system timers on every keystroke.

The problem that remains is to turn these timer values, which have a nonuniform distribution, into uniformly distributed random bits. This is where the software comes in.

Theory of Operation

The file randpool.c (see Listing Four) uses cryptographic techniques to "distill" the essential randomness from an arbitrary amount of sort-of-random seed material. As the file name suggests, the program maintains a pool of hopefully random bits, into which additional information is "stirred." The goal here is that if you have n bits of entropy in the pool ("Shannon information," if you're familiar with information theory), any n bits of the output are truly random.

The stirring operation (actually nothing more than an encryption pass) is central. If you know the key and initial vector, you can reverse it to get back the initial state. Since the encryption is reversible, stirring the pool obviously does not lose information. So all the information in the initial state is there, it's just masked by the encryption.

Since we don't need to reverse the encryption, the key is then destroyed. It is then reinitialized with data taken from the pool that was just stirred. This makes it essentially impossible to determine the previous state of the random-number pool from what is left in memory.

The cipher is Peter Gutmann's Message Digest Cipher using MD5 as a base. This is fast and simple (as strong ciphers go), especially on 32-bit machines. In this application, the large key size also helps efficiency. (For another application of this cipher, see Gutmann's shareware MS-DOS disk encryptor, "SFS." Every commercial MS-DOS disk encryptor I've seen--Norton Diskreet, for example--has appalling cryptography. Their only advantage is that you can get the data back with a few weeks' work if you lose the key. If you lose the key with SFS, it's lost.)

The output of the generator is taken from the pool, starting after the 64 bytes used for the next stirring key. If you reach the end of the pool, stir again and restart. It is theoretically possible to examine the output and determine the key, which would reveal the complete state of the generator and let you predict its output forever. That, however, would require breaking the cipher by deriving the key from the data before and after encryption; the difficulty of doing so is an adequate guarantee of security.

Input is more interesting. To ensure that each bit of seed material affects the entire pool, the seed material is added (using XOR) to the key buffer. When you reach the end of the key buffer, stir the pool and start over. The difficulty of cryptanalysis (deriving the key from the prior and following states of the pool) ensures that regularities in the seed material do not produce regularities in the pool.

Adding bytes to the key sets the take position to the end of the pool, so the newly added data will be stirred in before any bytes are returned. Thus, you can add and remove bytes in any order.
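The bookkeeping described above can be sketched as follows. This is not the code from randpool.c: the structure layout, the function names, and the 384-byte pool size are assumptions made for illustration, and toy_mix() is a deliberately simple placeholder standing in for the MD5-based encryption pass. It is not secure and only shows where the cipher fits.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define POOL_BYTES 384   /* assumed pool size, for illustration only */
#define KEY_BYTES   64   /* stirring key size given in the article */

struct pool {
    uint8_t data[POOL_BYTES];
    uint8_t key[KEY_BYTES];
    size_t  add_pos;     /* next key byte to XOR seed material into */
    size_t  take_pos;    /* next pool byte to hand out */
};

/* PLACEHOLDER, NOT A CIPHER: a keyed, CFB-style byte mixer that stands
   in for the real encryption pass. Reversible if the key is known. */
static void toy_mix(uint8_t *buf, size_t len, const uint8_t *key, size_t klen)
{
    uint32_t h = 2166136261u;
    size_t i;

    for (i = 0; i < len; i++) {
        h = (h ^ key[i % klen]) * 16777619u;
        buf[i] ^= (uint8_t)(h >> 24);
        h = (h ^ buf[i]) * 16777619u;    /* feed the output forward */
    }
}

void pool_init(struct pool *p)
{
    memset(p, 0, sizeof *p);
    p->take_pos = POOL_BYTES;            /* nothing may be taken before a stir */
}

void pool_stir(struct pool *p)
{
    toy_mix(p->data, POOL_BYTES, p->key, KEY_BYTES);
    memset(p->key, 0, KEY_BYTES);        /* destroy the old key */
    memcpy(p->key, p->data, KEY_BYTES);  /* re-key from the stirred pool */
    p->add_pos  = 0;
    p->take_pos = KEY_BYTES;             /* output starts after the next key */
}

void pool_add_byte(struct pool *p, uint8_t b)
{
    if (p->add_pos == KEY_BYTES)
        pool_stir(p);                    /* key buffer full: stir, start over */
    p->key[p->add_pos++] ^= b;           /* seed material is XORed into the key */
    p->take_pos = POOL_BYTES;            /* force a stir before the next output */
}

uint8_t pool_get_byte(struct pool *p)
{
    if (p->take_pos == POOL_BYTES)
        pool_stir(p);                    /* end of pool: stir and restart */
    return p->data[p->take_pos++];
}
```

Because pool_add_byte() moves the take position to the end of the pool, a call to pool_get_byte() that follows any addition stirs first, so additions and removals can be interleaved freely.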

The code mostly works with bytes, but since MD5 works with 32-bit words, a standard byte ordering is used. This way you can use it as a pseudorandom number generator seeded with a passphrase.

If you want to use the hash directly, md5.c (see Listing Two) includes a full implementation of the MD5 algorithm. (It is similar to the hash presented in "SHA: The Secure Hash Algorithm," by William Stallings, DDJ, April 1994.) If you have a large amount of low-grade seed material, you can use MD5 to pre-reduce it. For example, you can feed mouse-position reports into MD5, then periodically add the resultant 16-byte digest to the pool. Even faster algorithms are possible--based on CRCs and scrambler polynomials--if you have real-time constraints.

Practice of Operation

The file noise.c (see Listing Three) samples a variety of system timers and adds them to the random-number pool. It also returns the number of highest-resolution ticks since the previous call, which you can use to estimate the entropy of this sample. On an IBM PC, only 16 bits are returned; this underestimates the result if the calls are more than 1/18.2 seconds apart, but that is not a security problem.

The code also works under UNIX. You may have to find the frequency of a timer that only returns ticks; noiseTickSize() finds the resolution of a timer (the gettimeofday() function) that only returns seconds.

The main driver is in randtest.c (see Listing One). A flashy effect is provided by funnyprint(). Of more value is randRange(), which illustrates a way to generate uniformly distributed random numbers in a range not provided by the generator. The problem is akin to generating numbers from 1 to 5 using a six-sided die. The solution amounts to rerolling if you get a 6.
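The rerolling idea can be sketched as follows. This is not the randRange() from Listing One; the names byte_source, counter_source(), and range_from_bytes() are hypothetical, and the sketch assumes the generator delivers one uniformly distributed byte at a time.

```c
#include <stdint.h>

/* Hypothetical byte source; stands in for the pool's output routine. */
typedef uint8_t (*byte_source)(void *state);

/* Deterministic stand-in source for demonstration: returns the low
   byte of a counter and increments it. Not random at all. */
uint8_t counter_source(void *state)
{
    unsigned *counter = state;
    return (uint8_t)((*counter)++);
}

/* Returns a value in [0, range), for 1 <= range <= 256. The result is
   uniform provided the source's bytes are uniform. */
unsigned range_from_bytes(byte_source next, void *state, unsigned range)
{
    /* Largest multiple of range that fits in the 256 byte values.
       Bytes at or above it would bias the low residues, so they are
       discarded: the equivalent of rerolling the 6. */
    unsigned limit = 256u - (256u % range);
    unsigned b;

    do {
        b = next(state);
    } while (b >= limit);

    return b % range;
}
```

For a range of 5, the limit is 255, so only the byte value 255 is rejected and each result from 0 to 4 corresponds to exactly 51 byte values. Simply taking the byte modulo 5 without the rejection step would make 0 slightly more likely than the other results.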

The most interesting part is randAccum(), which accumulates a specified amount of entropy from the keyboard. It uses the number of ticks returned by the noise() function to estimate the entropy. It assumes that inter-keystroke times vary pretty uniformly over a range of 15 percent or so. Thus, it divides the tick count by 6 to get the fraction of the interval that is random, then takes the logarithm to get the number of bits of entropy.

The integer number of bits comes from normalizing the number and counting the shifts. The entropy is kept to four fractional bits using a few iterations of an integer-logarithm algorithm.
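One way to compute such a logarithm is sketched below. This is an assumption about the general technique, not the code from randtest.c: the function names are hypothetical, and the fractional bits are produced by the standard repeated-squaring method, which may differ in detail from the listing.

```c
#include <stdint.h>

/* Base-2 logarithm of x as a fixed-point value with four fractional
   bits (that is, floor(16 * log2(x)) up to rounding in the squaring).
   Returns 0 for x == 0 as a convenience. */
unsigned log2_fixed4(uint32_t x)
{
    unsigned result;
    unsigned int_bits = 31;
    int i;

    if (x == 0)
        return 0;

    /* Normalize: shift left until the top bit is set, counting shifts.
       Afterward x is a mantissa m in [1,2) scaled by 2^31. */
    while (!(x & 0x80000000u)) {
        x <<= 1;
        int_bits--;
    }
    result = int_bits;

    /* Each squaring yields one more fractional bit of the logarithm. */
    for (i = 0; i < 4; i++) {
        uint64_t sq = (uint64_t)x * x;       /* m*m scaled by 2^62 */
        result <<= 1;
        if (sq & ((uint64_t)1 << 63)) {      /* m*m >= 2: bit is 1 */
            result |= 1;
            x = (uint32_t)(sq >> 32);        /* keep m*m/2 in [1,2) */
        } else {
            x = (uint32_t)(sq >> 31);        /* keep m*m in [1,2) */
        }
    }
    return result;
}

/* Entropy estimate in sixteenths of a bit, following the article's
   divide-by-six rule. The integer division is an assumption. */
unsigned entropy_estimate_x16(uint32_t ticks)
{
    uint32_t usable = ticks / 6;
    return usable ? log2_fixed4(usable) : 0;
}
```

For example, log2_fixed4(3) returns 25, or 25/16 = 1.5625 bits, which is the true value of about 1.585 truncated to sixteenths. A tick count of 48 gives 48/6 = 8, so entropy_estimate_x16(48) returns 48, or exactly 3 bits.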

Weaknesses

I don't know of any exploitable holes in this approach to generating random numbers, but in cryptography, only a fool is sure he has a good algorithm. I believe the following points need further examination:

The divide-by-six approximation in randAccum(). This was chosen so a machine with only a 60-Hz clock would produce at least one bit per keystroke; not a very good reason. A much better technique is suggested by Ueli Maurer's paper from Crypto '90, "A Universal Statistical Test For Random Bit Generators." However, this technique is slow to decide that the input is trustworthy and requires large tables.

The "leakage" rate of information from the pool. Because the stirring key is drawn from the pool itself, collisions are possible. These are states of the pool which, after stirring, result in the same output state. This reduces the information content of the pool.

The use of MD5 as a cipher. If you are using this as a cryptographic PRNG and producing large amounts of output from a smaller seed, the cipher at the heart of the stirring may be broken. The amount of known plaintext available from any given stirring is quite low (a few hundred bytes), the key space is dauntingly large (512 bits), and no such attacks on MD5 have appeared in the civilian literature; however, MD5 was not designed for use as a cipher and has had less study in this mode.
