A Million Random Digits

The Rand Corporation published A Million Random Digits with 100,000 Normal Deviates back in 1955, when generating random numbers was hard.

The random digits in the book were produced by rerandomization of a basic table generated by an electronic roulette wheel. Briefly, a random frequency pulse source, providing on the average about 100,000 pulses per second, was gated about once per second by a constant frequency pulse. Pulse standardization circuits passed the pulses through a 5-place binary counter. In principle the machine was a 32-place roulette wheel which made, on the average, about 3000 revolutions per trial and produced one number per second. A binary-to-decimal converter was used which converted 20 of the 32 numbers (the other twelve were discarded) and retained only the final digit of two-digit numbers; this final digit was fed into an IBM punch to produce finally a punched card table of random digits.

I have a copy of the original book; it's one of my library's prize possessions. I had no idea that the book was reprinted in 2002; it's available on Amazon. But even if you don't buy it, go to the Amazon page and read the user reviews. They're hysterical.

The meat of the book is the "Table of Random Digits." It lists them in five-digit groups -- "10097 32533 76520 13586 ..." -- 50 on a line and 50 lines on a page. The table goes on for 400 pages and, except for a particularly racy section on page 283 which reads "69696," makes for a boring read.

Comments

In most cryptographic algorithms, there is a need for "random" constants. For example, in MD5, SHA1, SHA512, AES, and DES we can see many arbitrarily chosen constants. If these were chosen to have certain special values, that could simplify the algorithm and make it vulnerable to an attack. When a cryptographer designs an algorithm, she can choose values from this book to ensure that they are not nefariously chosen by your enemy.
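As an aside (not from the original comment): a related convention is "nothing up my sleeve" numbers, where constants are derived from well-known mathematical values so that nobody could have steered them. For instance, the SHA-256 round constants are the first 32 bits of the fractional parts of the cube roots of the first 64 primes; a quick sketch:

```python
import math

def nums_constant(prime: int) -> int:
    """First 32 bits of the fractional part of the cube root of a prime."""
    frac = prime ** (1.0 / 3.0) % 1.0
    return math.floor(frac * 2 ** 32)

# The first two SHA-256 round constants, derived from the primes 2 and 3,
# match the values published in FIPS 180-4.
print(hex(nums_constant(2)))  # 0x428a2f98
print(hex(nums_constant(3)))  # 0x71374491
```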

"When a cryptographer designs an algorithm, she can choose values from this book to ensure that they are not nefariously chosen by your enemy."

It's less that, and more that people are worried that the designer deliberately chooses weak values, that he designs in a back door. (Back in the 1970s and 1980s this was a big worry about DES: that the NSA deliberately chose weak constants.)

Choosing a natural constant, or numbers from the Rand book, is less of a security measure against this than you might think. I'll bet that if I had a particular type of constant I needed to weaken an algorithm, I could come up with a plausible random-number generation scheme using the Rand tables to get them.

The suggestion, that '69696' is in the least bit salacious or "racy", must strike a thinking person as a statement having missed the point by half.

A moment's reflection on the 8 (or is it just five?) digits under consideration, readily reveals, that not only do we see a failure of the generator (to generate a truly-verifiable set of random numbers), but as well, a rather more than lucid syllogism in argument of intelligent design, I should suggest.

The three 6's . . . that's an accident? No way. Beyond that, the digits (the important ones, the 6s) are broken up into a trinity.

One could go on. But "racy"? Unless by "racy" is meant sort of appertaining to race cars. In which case, 'racy' as in 'mechanical' is an apt expression. Nothing happens without a cause, without within a pattern.

So, yeah, racy. At first I'd read it, the word, as an arguably impolitic reference to miscegenative couplings. . . . or at least dating. In which case I had wondered, why suddenly is this guy going racial?

There's a great anecdote from Press et al., "Numerical Recipes", in the chapter on random number generation, in the section on why one should be suspicious of system-supplied RNGs:

--- Quote: ---
One infamous such routine, RANDU, with a = 65539 and m = 2^31,
was widespread on IBM mainframe computers for many years, and widely copied
onto other systems. One of us recalls producing a "random" plot with only 11
planes, and being told by his computer center's programming consultant that he
had misused the random number generator: "We guarantee that each number is
random individually, but we don't guarantee that more than one of them is random."
Figure that out.

This book frequently turns up when you search Amazon for an ISBN number that is slightly wrong or mis-typed. I've often thought that you could write a very silly PRNG with OCR and Amazon's "view a random page" link.

This reminds me of an amusing little story about an encounter between the English mathematician Godfrey H. Hardy and the Indian mathematical genius Srinivasa Ramanujan:

"I remember once going to see Ramanujan when he was lying ill at Putney. I had ridden in taxi cab number 1729 and remarked that the number seemed to me rather a dull one, and that I hoped it was not an unfavorable omen. 'No,' he replied, 'it is a very interesting number; it is the smallest number expressible as the sum of two cubes in two different ways.'"

@V: The next time you need some random constants for a crypto algorithm, instead of unilaterally specifying some arcane way to extract them from this book, pi, or the canonical gzipped fulltext Moby-Dick, you should bring together a number of respected cryptologists, plus an EFF representative, the North Korean ambassador, a NSA spokesperson, and a 9/11 conspiracy theorist. Have each of these fine people bring a piece of cardboard on which they have written their favorite 128-bit random number. On the count of three everyone turns his card face up, and then you XOR all the numbers together. The inventor of the algorithm has specified in advance some statistical tests to run on the constant to see if it turns out catastrophic. If they fail, start over with the same people but new favorite random numbers. If the result fails K times in succession, discard the algorithm as too sensitive to the quality of its parameters.
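Tongue-in-cheek as it is, the XOR construction above is sound: if at least one participant's 128-bit number is uniformly random and chosen independently of the others, the XOR of all of them is uniformly random, no matter how the rest were picked. A minimal sketch (participant names hypothetical):

```python
import secrets
from functools import reduce

# Each participant reveals a 128-bit number; XOR them all together.
# Even if some contributions are adversarial, one honest independent
# random contribution makes the result uniformly random.
contributions = {
    "cryptologist": secrets.randbits(128),   # honest random pick
    "eff_rep": secrets.randbits(128),
    "nk_ambassador": 0x69696969696969696969696969696969,  # adversarial
}
constant = reduce(lambda a, b: a ^ b, contributions.values())
print(f"constant = {constant:032x}")
```

The face-up-on-the-count-of-three step matters: it acts as a simultaneous commitment, preventing the last revealer from choosing his number as a function of everyone else's.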

I got the book (2002 printing) almost a year ago purely as a keepsake of sorts. Granted it's not an original printing, but the book itself is a time marker.

And think about it, the most recent memorization record for pi is one million decimal places -- the equivalent of this book (minus the deviates). Gives one new respect for the brain that pulled off that feat.

Ok. So everyone has made light of (or, what is worse, positively ignored) my suggestion about the "guiding intelligence" evidently behind Mr. Schneier's citation. That makes no difference.

Just look at the post above from "henning makholm". You see the "gzipped fulltext Moby-Dick" note? That is precisely what I was getting at. Melville. The one guy who saw it. You remember the scene where the sailor skins a whale's penis and makes a raincoat of it? (chap. 95) Seriously. Go have a look at the Gutenberg Project text. Only, they get the word "bishopric" wrong.

Crazy, crazy! Like Brando walking out of a very deep shadow in his barrack in Apocalypse.

anyway.

Anyway. Melville makes this exceedingly strange allusion to somebody or other dressing himself up in a "whale-penis-raincoat" and then receiving a 'bishopric' (wish I could do footnotes here and take you to the earlier English rendering of the word . . . runs something like "bishop-prick") in exchange . . . or something for their costume.

But wait.

A friend just stopped in and read my postings on this topic and told me that the whole idea is the "69" thing. He said, that Americans think about, how shall I put this, about a sexual encounter involving simultaneous fellatio-cunnilingus, when the word/number is spoken. . . which is to say that one forms a circle with one's partner. How disgusting!

Reminds me of a weird American song: "will the circle be unbroken . . . by and by . . . etc. etc." Never could figure out what that meant.

Possibly a dumb question here, but does anybody actually use *printed* tables of random numbers? As opposed to just letting the computer generate a random number?

I teach this course that involves the study of probability and the book makes a really big deal out of how to get random numbers off a printed table, but it seems to me that I'd just use Excel and the randbetween() function if I needed a five-digit random number.

@Robert:
> it seems to me that I'd just use Excel and the randbetween() function if I needed a five-digit random number.

Robert, random numbers are a much bigger topic than that!

The problem which the RAND table was addressing is that computer "random number" functions actually return "pseudo-random numbers", that is, they are generated by a mathematical formula which is intended to produce a series that "looks random" and has no obvious structure.

However there are several significant problems that can occur with these pseudorandom number generators, or PRNGs.

* Firstly, given the same "seed" or starting value, they always produce the same output sequence. This can usually be mitigated by using the system clock as the seed.

* Secondly, having a finite state, they are necessarily periodic, i.e. will eventually repeat. Many have a period on the order of 2^32, which is good enough for many purposes but not for large scale simulations or computer security. Cryptographic ones usually have a period of 2^64, 2^128, or even more. Some inferior ones have a period of the order of 2^15 which is really only good enough for games.

* Thirdly, while they may not have any "obvious" structure, the fact that they are generated from a formula means that they must have some type of structure. Often this will not correlate in any significant way with the simulation you are running, but sometimes it will, and then seriously pathological results will arise. An infamous example is that one very popular type of PRNG, the linear congruential random number generator or LCRNG, tends to produce successive values which define points on a small number of parallel hyperplanes in n-dimensional space. If an LCRNG is used in a careless manner to simulate random positions in space or space-time, it can produce totally spurious results. In another example, the FFTs of many PRNG sequences show obvious "spectral lines".
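The classic concrete case is RANDU, quoted from Numerical Recipes earlier in the thread. Because its multiplier is 65539 = 2^16 + 3, every three successive outputs satisfy an exact linear relation modulo 2^31, which is why 3-D plots of its triples collapse onto a handful of planes. A short sketch:

```python
def randu(seed, n):
    """RANDU: x_{k+1} = 65539 * x_k mod 2^31 (do not use for real work)."""
    xs, x = [], seed
    for _ in range(n):
        x = (65539 * x) % 2 ** 31
        xs.append(x)
    return xs

xs = randu(1, 1000)

# Since 65539^2 = 6*65539 - 9 (mod 2^31), every triple of successive
# outputs satisfies x_{k+2} = 6*x_{k+1} - 9*x_k (mod 2^31): the points
# (x_k, x_{k+1}, x_{k+2}) all lie on a few parallel planes.
assert all((xs[k + 2] - 6 * xs[k + 1] + 9 * xs[k]) % 2 ** 31 == 0
           for k in range(len(xs) - 2))
```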

When we ask for random numbers there are at least six types of problems we commonly want to solve:

* in some security protocols, we often require a number that must be unpredictable even by a very smart opponent. In this case, the number must be generated with a significant amount of "truly random" entropy. The usual process is to combine every available source of difficult-to-guess (high entropy) data, using an entropy-preserving formula (i.e. one that makes the final number as difficult to guess as correctly guessing _all_ of the input numbers). There also exist hardware RNG modules which generate "true" RNs directly from unpredictable physical processes, like thermal noise in a transistor.

* in other security protocols, we require a large number of random numbers, such that the opponent cannot guess the next one after having seen all the previous ones. The solution to this is a "cryptographically strong pseudorandom number generator" or CSPRNG. Finding any structure in a good CSPRNG sequence is as difficult as breaking an underlying cipher. However, CSPRNGs tend to be much slower than regular PRNGs. For more information see: http://en.wikipedia.org/wiki/Cryptographically_secure_pseudorandom_number_generator

* in many applications, we only require that the numbers be "well distributed" across their range, and not repeat within the application domain. An LCRNG of adequate period is usually sufficient for this, which--along with simplicity and speed--is why they are so common in computer software.

* for statistics and simulations, we require that successive numbers do not correlate in any measurable way with parameters under test. It may also be desirable that the sequence can be easily reproduced by colleagues. This is the problem which the RAND tables were intended to address.

* in a very small number of security protocols, we don't care about the opponent predicting the number, but we want to be able to prove to him that WE couldn't predict it. The RAND tables have also been used for this purpose. Numbers generated in this way are called "nothing up my sleeve numbers". An interesting article on this topic can be found at: http://en.wikipedia.org/wiki/Nothing_up_my_sleeve_number

* finally, in large scale statistics and simulations, we need numbers without pathological correlations, but we need a really enormous number of them. CSPRNGs have been used for this purpose but provide additional unnecessary assurances and are often too slow. The state of the art in this area is the Mersenne twister, see: http://en.wikipedia.org/wiki/Mersenne_twister
The Mersenne twister is not cryptographically secure but it has a huge period, very little correlation, and is pretty fast.
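The "entropy-preserving formula" in the first bullet is, in most real systems, simply a cryptographic hash over the concatenated sources: the output is as hard to guess as guessing all of the inputs at once. A minimal sketch (the particular sources are illustrative, not a recipe):

```python
import hashlib
import os
import time

# Pool several hard-to-guess inputs and hash them together. An attacker
# must predict every input to predict the output, since SHA-256 mixes
# them irreversibly.
sources = [
    os.urandom(32),                        # OS entropy pool
    time.time_ns().to_bytes(8, "big"),     # high-resolution clock
    str(os.getpid()).encode("ascii"),      # process id (low entropy, but free)
]
seed = hashlib.sha256(b"".join(sources)).digest()
print(seed.hex())  # 256-bit seed for a PRNG or key derivation
```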

Prior to 2003 the rand() generator in Excel was a defective modification of an LCRNG, known to have a number of quite serious defects and a period of only 2^24. A much improved version was shipped in Excel 2003 (although with a bug not fixed until Jan 2004). In this version, rand() is the mod 1.0 sum of three 15-bit LCRNGs scaled to [0,1), and is fairly good as such things go -- good enough for most non-security purposes except large scale simulations. The period is just under 2^45.
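That description of three combined 15-bit LCRNGs matches the classic Wichmann-Hill generator; a sketch of that algorithm (whether Excel 2003 used exactly these parameters is an assumption here):

```python
class WichmannHill:
    """Three small LCGs; output is the sum of their scaled states mod 1.0."""

    def __init__(self, s1=100, s2=100, s3=100):
        self.s1, self.s2, self.s3 = s1, s2, s3

    def random(self) -> float:
        self.s1 = (171 * self.s1) % 30269
        self.s2 = (172 * self.s2) % 30307
        self.s3 = (170 * self.s3) % 30323
        return (self.s1 / 30269 + self.s2 / 30307 + self.s3 / 30323) % 1.0

rng = WichmannHill()
samples = [rng.random() for _ in range(10)]
assert all(0.0 <= x < 1.0 for x in samples)
```

The combined period is far longer than any single 15-bit generator could manage (roughly the product of the three component periods, on the order of 10^12-10^13).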

I do not know the function behind randbetween() but one source states it was not updated in 2003 and is still based on the older, defective rand() function.

In a nutshell, AFAIK, ideal random numbers are ones which are chosen in a manner where you can't tell there's a pattern to them.

In other words, one number on its own really doesn't fit most of these definitions of random - it applies to sequences.

In practical cryptographic implementations -- particularly software -- this randomness won't be perfect. The idea is to force an attacker to observe as many values as possible before he can even figure out that the source isn't "ping-pong ball"/"dice" random, much less do anything useful with it.

What I want to know is, how'd RAND do? How did this elaborate physical process measure up to today's random number generation? Is it less -- or even more! -- measurably random than what I get from a "cat /dev/random"?

Yes; in the book's introduction, the results of some of RAND's own validation tests are given, and so it was known to be less-than-perfect even when it was published. However the biases are too small to matter for most purposes.
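If you want to measure it yourself, a first-pass check is Pearson's chi-squared statistic on the digit frequencies; a sketch (you would feed it the downloaded table as one long digit string):

```python
from collections import Counter

def chi_squared_uniform(digits: str) -> float:
    """Pearson chi-squared statistic of digit counts against uniformity."""
    counts = Counter(digits)
    n = len(digits)
    expected = n / 10
    return sum((counts.get(str(d), 0) - expected) ** 2 / expected
               for d in range(10))

# With 9 degrees of freedom, values much above ~21.7 would be significant
# at the 1% level; a perfectly even digit string scores exactly 0.
print(chi_squared_uniform("0123456789" * 100))  # 0.0
```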

> I assume that its small size would not allow you to determine its period.

As it is generated from a hardware source of randomness which--under present understanding of quantum mechanics--is believed to be truly random, the sequence should be aperiodic.

> What about their technique? Can it be faulted?

There are a couple of possible criticisms.

First, they are really vague about the "random frequency pulse source" which is the real heart of the machine. It is widely believed to have been a Geiger counter mounted near a radioactive source of suitable size (presumably somewhere on the order of 30 microcuries), but this doesn't seem to have been actually recorded, and references to the machine's statistical quality "running down" after "one month of continuous operation without adjustment" tend to suggest otherwise. It is possible that the pulse generator was actually some sort of astable oscillator. In that case, it may well have been chaotic rather than random, making analysis of the rest of the processing, and the consequent quality of the output "randomness", much more difficult. Additionally, a chaotic oscillator might tend to end up synchronising with some accidental external driving field, which would make its output highly predictable.

A second point is that a 5-bit counter was driven from the pulse source and sampled once per second. In other words, each sample was the 5 least significant bits of the count of events in the last second. However, a count of randomly distributed events over a set period is not uniformly distributed; it follows a Poisson distribution. Reducing the count modulo 32 will largely correct this bias, but perhaps not entirely; the standard deviation in this case is about 320, only ten times the modulus, so one can easily see residual biases (on the order of a few percent) remaining.

Next, there is a curious anomaly in the process of converting 5 bit numbers to digits base 10. Basically the numbers are reduced modulo 10, but because 32 is not evenly divisible by 10, we have to discard some samples so as to avoid bias. To avoid bias only 2 values (e.g. 30 and 31) need to be discarded, however they actually discard 12. Since this would increase by 50% the duration of an operation which must have taken on the order of a fortnight, it must have been reasonably important; but so far as I know there has never been an explanation why. What was wrong with counts in the twenties?
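The bias being avoided here is easy to exhibit; a sketch of why at least two of the 32 counter values must be thrown away before reducing modulo 10:

```python
from collections import Counter

# Reducing a uniform 5-bit value (0..31) mod 10 is biased: the digits
# 0 and 1 each come from four source values, the other digits from three.
naive = Counter(n % 10 for n in range(32))
assert naive[0] == 4 and naive[1] == 4 and naive[2] == 3

# Discarding just the values 30 and 31 removes the bias exactly:
# every digit then comes from precisely three source values.
fixed = Counter(n % 10 for n in range(30))
assert all(fixed[d] == 3 for d in range(10))
```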

Finally, to correct for biases that were detectable when the machine had "run down", they summed pairs of digits modulo 10 (this is what is meant by "rerandomization of the basic table"). The reason was that this "transformation was expected to, and did, improve the distribution in view of a limit theorem to the effect that sums of random variables modulo 1 have the uniform distribution over the unit interval as their limiting distribution."

This is correct so far as it goes but contains two hidden assumptions. Firstly, that theorem is only true if the two random variables are independent. That would seem like a reasonable assumption if they, say, generated two tables of a million digits each and then digitwise summed them. But what they actually did is produce one table of a million digits and then sum each digit with the one 50 places behind it (what they did with the first 50 isn't mentioned, I presume they wrapped around). So if any correlations in the device were capable of lasting across 50 seconds, the procedure is not mathematically valid, even if the output "looks" more random. Worse, because this trick produces the same number of output digits as input, it follows that the entropy per digit in the output cannot be any higher than the input. Since the entropy per digit of a biased distribution is less than a uniform one, it follows that the output sequence is just as biased as the input, only in a more complicated way.
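The flattening effect of summing digits modulo 10 can be computed exactly by convolution. A sketch with a deliberately biased toy distribution (not the actual RAND biases), showing that the sum of two *independent* draws is much closer to uniform:

```python
# A biased digit distribution (probabilities sum to 1).
biased = [0.13, 0.13, 0.12, 0.12, 0.10, 0.10, 0.08, 0.08, 0.07, 0.07]

# Exact distribution of (X + Y) mod 10 for independent X, Y ~ biased.
summed = [0.0] * 10
for i, p_i in enumerate(biased):
    for j, p_j in enumerate(biased):
        summed[(i + j) % 10] += p_i * p_j

worst_before = max(abs(p - 0.1) for p in biased)   # 0.03
worst_after = max(abs(p - 0.1) for p in summed)
assert worst_after < worst_before  # the sum is flatter...
# ...but if X and Y are correlated (e.g. digits 50 places apart from the
# same machine), this guarantee evaporates, as the comment above notes.
```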

You people just don't get it! They offer the table for download! As a zipped file! By definition random data cannot be zipped! It's all a huge fraud! BS is in cahoots with the NSA, they want you to use these random numbers!!!one!eleven!

So Roger, if you were advising a person of limited technical skills how to produce random numbers to protect themselves from the rubber hose gang, how would you suggest they do it? If the protection needed to hold for the duration of the person's life, how would the answer change?

You have to be very careful with using an electronic roulette wheel, as there are some interesting side effects due to sampling, both in theory and in real circuits.

Another thing is that radioactive sources contain bias, not just from the Poisson distribution effects but also from the half-life of the source. That is, if you get an average of 1000 counts per second today, the average will have dropped to 500 per second one half-life later. For some sources with half-lives measured in thousands of years this might not seem important, but it is.

Back to using two oscillators and some of the problems involved; I'll just quickly outline some of them ;)

If you use two oscillators and sample one with the other, you are basically making a heterodyne mixer like the one you can find in any radio. The result is that you end up (to a lesser or greater extent) with four frequencies at the output of the mixer,

F1, F2, F1+F2, F1-F2

You then filter out the one you want (normally F1-F2) and put it through an amplifier etc. to get it to a useful level.

In an ordinary portable radio, the radio station's signal is (we assume) of high stability, while the oscillator in the radio is of low stability and probably of high noise as well. There is a well known problem in RF engineering circles of oscillator noise messing up the desired signal to the point that it is unintelligible (the eye diagram is used to show this on digital systems). In the case of your roulette wheel, the output signal contains the noise and frequency variation of both oscillators.

What is not immediately obvious to most people (and yes, that includes a lot of design engineers) is that the four frequencies at the output of the mixer have real energy that has to go somewhere. Unless you are very, very careful it will end up where you don't want it, i.e. in with the desired signal...

How this happens is that when you select the desired frequency with an ordinary filter and reject the others, they bounce back into the mixer and generate even more frequency components. Some of these new frequencies end up back in the same frequency range you are looking at and add to the desired frequency, causing its zero crossing point to move. Some go back out to the oscillators, causing more frequency generation in the oscillator circuits. All of these new frequencies tend to bounce back into the mixer, and around they go again.

Also, if a generated frequency gets reflected back into a variable frequency oscillator close to the frequency it is operating at, it can pull the oscillator onto its frequency (see loose-locked oscillators such as those used in PAL and NTSC chroma circuits).

The result is that a mixer can be regarded like a pressure vessel: eventually the pressure is going to go out the only exits available to it, unless it has a safety mechanism to remove it safely. If you block up the ones going to the oscillators, then the energy has only one place to go, and that is onto your chosen signal as some form of modulation...

Another problem is the filter. As part of its function it contains energy storage components (capacitors and inductors), all of which have defects and in combination have resonant frequencies. The result is that the amplitude of the signal will be changed by the filter, not just at the filter band edges but within the pass band. So as your "random signal" moves up and down the filter bandwidth, it is Amplitude Modulated (AM) by the filter.

Now what you may not know is that an AM signal also consists of four different frequency components: two of the same frequency but differing phase, plus the two "sideband" frequencies. The phase difference does not usually matter, unless one or both of the frequencies is changing; then you get a rotational effect due to vector addition, which causes further AM effects on the frequency components, etc., etc., etc.

Also, a changing phase means that the resulting signal is being Phase Modulated (PM), which means the zero crossing point of the signal you want is being moved...

Guess what: to drive a counter you need to have a reasonably accurate edge; sine waves just don't hack it, the signal needs "squaring up". The easy (and wrong) way to do it is to put the signal through a limiter.

A limiter is usually an over-driven amplifier where the output signal crashes (and bounces) into the supply rails. I'll leave you to imagine what this does in terms of cross-modulating between the signal and the supply line noise, and vice versa. Let's just say it's messy...

Also, guess what: squaring a sine wave up by limiting its amplitude moves the energy within the signal around, and again the energy has to go somewhere; the result is AM to PM conversion. Or, to put it more simply, the variation in amplitude of the signal caused by the filter, on passing through the limiter, moves the zero crossing point of the signal. Also, the amplifier has its own frequency characteristic, and the signal smacking into the rails causes ringing at one of the amplifier's resonant frequencies...

The right way to do it is through a zero crossing or threshold detector, where the zero crossing / threshold point is derived from the signal average (but this also has its own problems).

Also, there is the not so minor problem of how much the electronics distort the signal during amplification, before it is converted to a square wave. The answer is a lot: your average active component has a power law curve, and the design engineer biases the device to get a desired effect, either gain or minimum distortion (but not both). Again this leads to cross modulation of the signal with power supply noise and the generation of harmonic components. One trick engineers use to try to reduce distortion is feedback, but this, due amongst other things to time delays, has frequency selective problems, giving other AM and PM type effects to a variable frequency signal...

There are a whole load of other effects you can bump into as well but you would need several books to describe them.

The net result of all of this is that your original signal now has lots and lots of noise on it at the zero crossing point. If you analyse it (and people have), you will find that in most cases it can be shown to consist of a number of predominant frequency components moving the crossing point back and forwards in a predictable manner. Oops: this is a decidedly non-random effect which is not only going to be visible as such in the generator output, but can also add an offset or bias. This effect is well known to engineers who have tried to design Base Band Output Direct Conversion Receivers (one of the reasons heterodyne receivers with Intermediate Frequency (IF) circuits were designed in the first place).

In general the bias can be removed by the use of a simple digital circuit however the non random modulation signal cannot be so easily removed.

So as you can see designing a simple "true random" generator is not an easy engineering task.

Then there are other difficulties to do with selecting the bandwidths of the signal path and the effects the filters have. That is, they turn a Gaussian white noise signal into a bandwidth-limited signal; those in the audio engineering game refer to this as white-to-pink noise conversion... Bandwidth-limited signals from sampled signals have their own problems to deal with, as DSP engineers are well aware.

I could go on with other effects but I think you get the idea.

A lot of people, when faced with designing a "True Random" generator, decide that to save cost and effort they will just fudge the whole issue and take the not-quite-random output and feed it through a crypto or hashing function; that way it is going to be (they think) not a problem...

One of my CS professors told us a story about when he had wanted a lot of random numbers for a simulation, in the 1970s. He approached the Post Office to see if they could give him some from the electronic random number source used for Premium Bonds (an odd British cross between an investment and a lottery). They refused to give him numbers, because they were afraid he would find bias in the results.

Boring? Boring!?! Surely not! Information theory tells us that a source of complete randomness is the most surprising thing that there can be! Truly this is one of the most surprising books ever written.

"You people just don't get it! They offer the table for download! As a zipped file! By definition random data cannot be zipped! It's all a huge fraud! BS is in cahoots with the NSA, they want you to use these random numbers!!!one!eleven!"

Like, mon, dem table is in ASCII or someting, right. Dat stuffs got loadsa re-dun-dan-seeeee, cuz for dem numbaz one ter ten an stuff he only uses like fifteen or sixteen out of der two-fiver-six possibuls, so dem can com-per-ess it.

"You people just don't get it! They offer the table for download! As a zipped file! By definition random data cannot be zipped! It's all a huge fraud! BS is in cahoots with the NSA, they want you to use these random numbers!!!"

"I teach this course that involves the study of probability and the book makes a really big deal out of how to get random numbers off a printed table, but it seems to me that I'd just use Excel and the randbetween() function if I needed a five-digit random number."

The requirements for cryptographic random numbers are much, much more serious than the requirements for statistical random numbers. Excel's function is fine for Monte Carlo simulations and the like, but it is awful for cryptography.