What does RANDPERM miss?

Abstractly, a random permutation generator (RPG) is a function that maps a random bit-string of length b into a permutation of {1,2, ..., n}.

There are 2^b distinct bit-strings of length b and n! distinct permutations of length n. Hence we must have 2^b > n! Using Stirling's approximation for n! and taking logs of both sides of the inequality we get b > n*log2(n).

Now RANDPERM can generate a permutation of length n = 10^6 in 0.2 sec and part of its output looks like this: 383882, 905744, 760491, 138901. However, to generate all of the (10^6)! ~ 10^(5,565,709) possible permutations the inequality above says that b > 10^6*Log2(10^6) ~ 2*10^7 bits are necessary. But RANDPERM uses indirectly the Mersenne Twister generator whose internal state has about 625*32 = 2*10^4 bits and so it cannot possibly generate all (10^6)! permutations. Indeed, it must miss most of these permutations.

My question is: What permutations does RANDPERM (or any other RPG) miss? Are these misses spread uniformly over the permutation space?

The randperm I have (R2010b) looks like this:
[~,p] = sort(rand(1,n));
So in the case that n = 2 (for example), rand(1,n) will generate a huge number of different streams, but the sort will project these down on to only two different permutations.

Products

4 Answers

You are right: The Mersenne-Twister cannot create all permutations from a certain limit. The period of 10^6000 means that the sequence of the created 32 bit numbers is repeated after this number of steps. RANDPERM uses 53 bit doubles, which are built from two 32 integers, such that the half period must be taken.

While the sequence of numbers is repeated after 0.5*10^6000 RNG calls only, the values of the 53 bit floating point number are repeated earlier. RANDPERM uses Matlab's SORT, which is a stable quicksort implementation. Here stability means, that the order of equal inputs is kept in the output. This leads to a lovely bias, which can be found without waiting for years by reducing the RNG range drastically:

However, for 53 bits and periods of 0.5*10^6000 the determination of the bias will lead to other problems: There are only about 10^80 nucleons in the universe. This limits the possibilities to store a list of already chosen permutations. So there are some permutations not reachable by RANDPERM, but the chance to suffer from this is tiny.

As you, Derek, know, I suggest not to use RANDPERM for large data sets, but the Fisher-Yates-Shuffle, e.g. implemented in FEX: RPG Lab or FEX: Shuffle. I've implemented the KISS RNG in Shuffle, which has a period of 10^37 and thought of using Marsaglias' CMWC4096 with a period of 10^33000. But as far as I can see, the limits of the universe are a stricter bound of the number of computable permutations.

I'd looking forward to know, if any advanced mathematician shares this estimation.

A small point: Marsaglia says that the period of his CMWC4096
generator is 10^39460 or about 10^33000 times as large as MT's
10^6000.

In what follows I will assume that the integers {1,2,...,n} are to
be shuffled or permuted, that KISS has a period T = 10^37, MT a
period T = 10^6000, and CMWC4096 a period T = 10^39460.

If an RPG uses an RNG with period T then, to cover (gain access to)
the whole permutation space, n should be restricted so that n! < T.
This means your Shuffle should have n < 26, my GRPfys or RANDPERM
should have n < 1840, and if CMWC is used then n < 9878.

I'm not too worried that there are a mere 10^80 particles in the
known universe because I will never ask an RPG to generate all
permutations of size n = 10^6. What I would like is a very small
sample ns = 10,100,1000 from the gigantic space of (10^6)!
permutations of size 10^6 and that these permutations be selected
or drawn uniformly from this gigantic space.

If an RPG violates the n! < T restriction then I do not know what
happens because I do not know which parts of the permutation space
the RPG cannot access. This is where the mathematicians come in.

In the meantime, I urge you to implement Marsaglia's CMWC4096 which
would extend the range of your FEX:Shuffle and my FEX:RPGlab. Also,
take a look at Richard Brent's site http://maths.anu.edu.au/~brent/
If anyone can answer our questions, he can.

Here is what Brent says in his 2006 paper "Some long-period random
number generators using shifts and xors" contained in
http://wwwmaths.anu.edu.au/~brent/ftp/random/xorgens305.zip

'Any RNG based on a finite state must eventually cycle, but it is
desirable for RNGs to have a very long period T (the cycle length).
Most generators fail certain statistical tests if more than about
T^(1/2) random numbers are used [14], or perhaps even about T^(1/3)
numbers for the “birthday spacings” test.'

This suggests that the restrictions on n should be even smaller
than the ones I calculated above.

Thanks for finding the bad period - you are a careful reader. I have an CMWC4096 already, but no clean method to get the necessary 131072 bit with high entropy for seeding. I could use a cheap KISS stream, and AES-CBC encryption of a weaker stream or a true random stream from an internet service. All methods are ugly to include in the monolithic Shuffle.c. Marsaglia states, that CMWC can be initialized by a sufficient warm-up also, but this will makes the repeatability of a simulation very expensive.
The seeding gets important also: You cannot reach a large number of permutations, if you initialize the RNG with 32 bits only.
I cannot see, how your argument agrees with this one: The period of KISS is 10e37, such that the Fisher-Yates-Shuffle can create 3.85e36 permutations before the the RNG is exhausted. For n=26 there are 4.03e26 different permutations, such that I assume that FYS+KISS finds all of them.

The limits of the universe and the RNG states reduce both the number of computable permutations. Is it useful to analyse the different limits separately? You know, I'm physicist...

Sorry for the n = 26 error. It should be n = 33 which gives n! =
8.683e36 ~ 10^37 = T, the period of KISS.

On the "limits of the universe... etc", I can only repeat what I
said above: I want to look a small number of permutations (atoms)
of size n (=10^6 say) selected UNIFORMLY from the permutation space
of size (10^6)! (the universe).

Matsumoto, Brent, Marsaglia, L'Ecuyer, etc., state that their
generators have periods T = 10^6000, T = 10^39460, etc., without
any reference to the "limits of the universe". They don't seem to
worry about the universe. But, they are not physicists ...

@Derek: The difference between the mathematical and physical view is the size of infinity. The physicist tends to accept 10^6! as infinity and then the term _uniformly_ looses its meaning, because there is no physical mechanism to proof that a distribution is not uniformly distributed. (Sorry, this is not 100% exact. A physicist will reject the assumption that a uniform RNG with infinite precision can create the sequence {1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0 ...}.)
GUIDs are accepted to be unique, although this is mathematically wrong. Checksums like SHA-256 are actually RNGs with a wide low-entropy seed. Nevertheless the degree of uniqueness is sufficient for using them as trusted fingerprints.
Without doubt you are correct, that the determination of permutations is limited, especially the RANDPERM method (therefore I've voted the question). My meta arguments conern only the fact, that it will be very hard to create a real-world application, which suffers from this limitations.

Assuming that the Fisher-Yates Shuffle produces permutations
uniformly if it uses a good RNG with period T, and if we use
Brent's & Matsumoto's recommendation that we generate no more than
T^(1/3) RNs then we must restrict n so that n! < T^(1/3). This
gives

1. n = 11 for KISS, T = 10^37

2. n = 702 for MT, T = 10^6000

3. n = 3690 for CMWC4096, T = 10^39460

These values of n are very small.

I have come to the pessimistic conclusion that we may never be able
to generate large permutations, with a guarantee of uniformity.

The same conclusion applies to generating random samples: there are
C(N,n) possible samples of size n from a population of size N. Thus
a function that uses KISS to generate samples of size 500 from an
array of size 10^6 cannot possibly cover the C(10^6,500) ~ 10^1866
possible samples.

J. Gentle, in his book "Random Number Generation and Monte Carlo
Methods", 2nd Ed., 2003, has the following comment at the bottom of
page 220:

`In no other application considered in this book does the
limitation of a finite period stand out so clearly as it does in
the case of generation of random samples and permutations.'

Also, he points out that these limitations have been known for many
years. See:

Greenwood, J. Arthur (1976a), The demands of trivial combinatorial
problems on random number generators, "Proceedings of the Ninth
Interface Symposium on Computer Science and Statistics"

A final problem: empirically testing an RPG for uniformity seems to
be difficult, even for small values of n.

A couple of people in this thread have mentioned the Fisher-Yates-Durstenfeld(-...) shuffle algorithm.

New in R2011b, the RANDPERM function lets you specify a second argument, as in, "give me a random selection of size k from a random permutaion of 1:n". This not only simplifies and speeds up the typical

perm = randperm(n);
r = perm(1:k)

that has been recommended in the past, but also uses F-Y-D. So of course you can ask for a random permutation of all of 1:n as

perm = randperm(n,n)

and have it generated using F-Y-D with the Mersenne Twister underneath. For backwards compatibility, randperm(n) continues to use the existing algorithm based on sort(rand(1,n)).

(Not that demonstrates which is which, but you have it right.) Also note that

>> rng default
>> randperm(5,3)
ans =
5 3 2

so you can see that randperm(n,k) is not simply the first k elements of randperm(n,n) -- the latter is never generated, which leads to fast generation of things like

>> randperm(2^30,3)
ans =
521168135 859294610 152349296

By the way, randperm(n,k) can be thought of as "sampling without replacement", as opposed to randi(n,k,1), which can be thought of as "sampling with replacement. Also of potential interest, new in R2011b in Statistics Toolbox if you have it, is the datasample function <http://www.mathworks.com/help/toolbox/stats/datasample.html>, which facilitates both of those, including weighted versions of both (it is intended as an improved version of the existing randsample function in Statistics Toolbox).

It doesn't answer your question exactly. Your question seems to be more is the Mersenne Twister a good random number generator. The answer to that is, yes, it is one of the better general purpose random number generators available today. Yes, there are sequences it will not generate, however, its repeat time is so long that it is basically impossible to detect that from the sequence itself.

You are "putting words in my mouth": I do not question the goodness
of the Mersenne Twister. It is one of the best of the "modern"
generators and has been tested and used by many. Its period is
about 2^{19937} ~ 10^{6000} and is equidistributed in 623 dimensions.

There are better generators, according to George Marsaglia
(http://groups.google.com/group/sci.crypt/browse_thread/thread/305c507efbe85be4)

It seems to me that if we use any RNG whose internal state has b
bits then we should restrict the size of the permutations generated
by any RPG that uses it so that n*log2(n) < b. If an RPG uses the
MT generator then it should not be allowed to generate permutations
of size greater than about n = 1830, with n*log2(n) = 19833. If we
allow n to be greater than 1830 then the RPG must miss some of the
permutations. Hence my question.

As long as randperm() is implemented as a stable sort of random numbers that can be duplicates, you cannot trust randperm() requesting more than 1 value.

Suppose you rand(1,2) and both results happen to come out exactly the same. If they cannot come out exactly the same then that would be a test that could be applied to easily distinguish pseudo-random numbers from true-random numbers: true random numbers can have duplicates (and the standard birthday-paradox analysis would make short work of detecting the problem.)

If the two values of rand(1,2) can come out exactly the same (no matter what the period is for full repetition), then the stable sort that is applied to the random numbers is always going to choose the same ordering, not a permutation of that.

This suggests an recursive "saferrandperm" algorithm

[vals, idx] = sort(rand(1,N));

diff(vals)

Look for runs of 0 in the diff: these indicate groups if identical values, and identifying them is an order-N operation.

For any one group of identified identical values, extract the indices, take the length of that, and saferrandperm() that length. Use the indices saferrandperm() returns on the subset size to shuffle the extracted indices in the vector to be returned; if there are more groups of identical values, do the same things for those.

When saferrandperm detects ties in the generated rand() values, it does the same recursive shuffle.

The proportion of duplicated values will get smaller and smaller and as log as the RNG does not generate all-identical values, eventually any one subset will be suitably permuted with the ordering *not* stable.