2018/12/10 8:30 PM · fluffy (fluffy at beesbuzz dot biz)

A frequent thing that people want to do in making games or interactive applications is to shuffle a list. One common and intuitive approach that people take is to simply sort the list, but use a random number generator as the comparison operation. (For example, this is what’s recommended in Fuzzball’s MPI documentation, and it is a common answer that comes up on programming forums as well.)

This way is very, very wrong.

First, there are two approaches which work really well, which I’ll describe briefly. One is the Fisher-Yates shuffle algorithm, which works by walking down the array and randomly swapping the current element with any of the elements which follow it (including itself). This is an \(\mathcal{O}(N)\) operation which is trivial to implement in pretty much any imperative language and can be done in-place.

If you are in an environment which does not allow array-level swapping like that (such as, say, a functional programming language or a spreadsheet), an alternate approach is to map the array to one where each element is prefixed with a random number, then sort based on that random prefix, and then remove the prefix. Here is how to do it in Excel, and how to do it in Fuzzball MPI. This is generally an \(\mathcal{O}(N \log_2 N)\) operation: not as efficient, but it still generates high-quality results.
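The same decorate-sort-undecorate idea can be sketched in C++ as well (names are mine; any ordinary comparison-based sort works here):

```cpp
#include <algorithm>
#include <random>
#include <utility>
#include <vector>

// Random-prefix shuffle: pair each element with a random key drawn once,
// sort by the keys, then strip them. O(N log N), but unbiased, because each
// element's key is independent and fixed before the sort begins.
template <typename T>
std::vector<T> random_prefix_shuffle(std::vector<T> items, std::mt19937& rng) {
    std::uniform_real_distribution<double> unit(0.0, 1.0);
    std::vector<std::pair<double, T>> keyed;
    keyed.reserve(items.size());
    for (T& item : items)
        keyed.emplace_back(unit(rng), std::move(item));
    // The comparator is now an ordinary, transitive ordering on fixed keys,
    // so the sort's contract is satisfied.
    std::sort(keyed.begin(), keyed.end(),
              [](const auto& a, const auto& b) { return a.first < b.first; });
    std::vector<T> out;
    out.reserve(keyed.size());
    for (auto& kv : keyed)
        out.push_back(std::move(kv.second));
    return out;
}
```

The crucial difference from the coin-flip sort: the randomness is attached to the data once, up front, instead of being re-rolled at every comparison.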

But why are these approaches better? Well, both of them have an equal chance of any input cell ending up in any output cell; there is absolutely no bias to where things might end up, because all of their placements are independent from one another, and each placement happens effectively once.

But the coin-flip sorting approach introduces a lot of bias; cells can move an arbitrary number of times, and where they can move to depends on where they are located within the array to begin with.

To quantify how biased each approach is, we can use a \(\chi^2\) (chi-squared) test. Briefly, \(\chi^2\) measures how close your output distribution is to an expected distribution. If we were shuffling our lists uniformly, then we’d expect every value to end up in every cell about the same number of times; for an array of length \(N\), with a shuffle repeated \(K\) times, you’d expect each value to be in each cell \(\frac{K}{N}\) times.
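As a sketch of the measurement in C++ (this is the standard \(\chi^2\) statistic; it is not necessarily the exact normalization behind the numbers reported below):

```cpp
#include <vector>

// Chi-squared statistic for shuffle uniformity. counts[v][c] tallies how
// often value v landed in cell c over `trials` shuffles; a uniform shuffler
// expects trials / N in every tally, where N is the array length.
double shuffle_chi_squared(const std::vector<std::vector<long>>& counts,
                           long trials) {
    const double expected =
        static_cast<double>(trials) / counts.size();  // K / N per tally
    double chi2 = 0.0;
    for (const auto& row : counts)
        for (long observed : row) {
            const double diff = observed - expected;
            chi2 += diff * diff / expected;  // sum of (O - E)^2 / E
        }
    return chi2;
}
```

A perfectly uniform tally gives \(\chi^2 = 0\); the further the observed counts stray from \(\frac{K}{N}\), the larger the statistic grows.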

Here is the \(\chi^2\) distribution for the good algorithms, with \(N=235\) and \(K=100000\):

And here it is for three different random-sort algorithms (bubble sort, std::stable_sort, and merge sort), all using a coin-toss ordering function:

(The reason I went with std::stable_sort instead of std::sort is that the latter uses some optimizations that assume that the sort is transitive – which isn’t the case when you’re randomly flipping a coin. So std::sort tends to crash when this constraint is violated.)

For a better comparison of how the different algorithms stack up as \(N\) changes, here’s a chart:

| N | Fisher-Yates | Random prefix | Bubble sort | std::stable_sort | Merge sort |
|-----|-------------|---------------|-------------|------------------|------------|
| 4 | 0.00005 | 0.00004 | 0.139569 | 0.158157 | 0.00004 |
| 5 | 0.00002 | 0.00005 | 0.218803 | 0.273358 | 0.0487565 |
| 6 | 0.00006 | 0.00008 | 0.29487 | 0.400601 | 0.0330905 |
| 7 | 0.0000 | 0.00006 | 0.364549 | 0.527574 | 0.0342361 |
| 8 | 0.00006 | 0.00006 | 0.430965 | 0.660637 | 0.00005 |
| 10 | 0.0001 | 0.00010 | 0.557523 | 0.93249 | 0.0335068 |
| 12 | 0.000127262 | 0.000108561 | 0.674842 | 1.19897 | 0.024186 |
| 15 | 0.000148504 | 0.000128046 | 0.831853 | 1.60815 | 0.0178444 |
| 18 | 0.000186571 | 0.000189251 | 0.971268 | 2.01685 | 0.0259307 |
| 22 | 0.000212744 | 0.000221075 | 1.14734 | 2.56097 | 0.0321786 |
| 27 | 0.000249649 | 0.000268954 | 1.34988 | 3.24243 | 0.0271932 |
| 33 | 0.000348344 | 0.000328459 | 1.56487 | 4.06064 | 0.0116926 |
| 41 | 0.000417497 | 0.000404196 | 1.82521 | 5.1443 | 0.0301371 |
| 51 | 0.000494652 | 0.000514062 | 2.11672 | 6.50646 | 0.026951 |
| 63 | 0.000612439 | 0.000614844 | 2.4303 | 8.14154 | 0.0057482 |
| 78 | 0.000753832 | 0.000759035 | 2.7903 | 10.1829 | 0.0270242 |
| 97 | 0.000952913 | 0.000986311 | 3.19424 | 12.7711 | 0.022091 |
| 121 | 0.0011958 | 0.0012227 | 3.65048 | 16.0263 | 0.0137545 |
| 151 | 0.00147693 | 0.0015004 | 4.16317 | 2.2788 | 0.026883 |
| 188 | 0.00185188 | 0.00186338 | 4.7293 | 2.72105 | 0.0242407 |
| 235 | 0.00231242 | 0.00233842 | 5.3744 | 3.24268 | 0.0189719 |

And here is a logarithmic graph showing how the total \(\chi^2\) error changes as \(N\) increases:

So, as N increases, you would expect the \(\chi^2\) to increase if \(K\) remains constant, as it did in this case (as there are fewer overlapping samples). And in fact the bubble sort and std::stable_sort approaches increase pretty much in parallel with the Fisher-Yates and random prefix approaches (aside from the discontinuity at \(N=128\) due to the performance-tuning heuristics that algorithm uses) – but this is in a logarithmic graph, meaning that the random sorts increase orders of magnitude faster than the other ones.

Merge sort is a bit interesting, in that it is highly dependent on the actual size of the array, and in particular how close it is to a power of 2, with perfect powers of 2 having the same \(\chi^2\) as the ideal algorithms. (I leave this as an exercise to the reader to consider why this would be.) But there’s still a lot of inherent bias, and it’s not even casually-predictable bias.

Anyway, this is why when shuffling a list, you should use Fisher-Yates if possible, or sort based on a randomly-generated unassociated number if not.