One is a division-remainder (mod) bias due to the fact that random number generators have a fixed range of possible input values, which are often unrelated to the size of the requested output range.

The other is an uneven frequency distribution of outputs as the range gets larger. Some outputs appear much more often than they should, while other outputs appear to not be possible at all.

1. MOD bias

This is a very small bias which is hard to detect for small ranges, but becomes more apparent at larger range sizes. Because of the way the bias fluctuates as the range size changes, I don't believe this bias is caused by the huge ranges in this example; instead, the larger ranges make its presence more obvious.

The max value I've been able to return from $rand(0,999999999) is approximately 6 million short of the string of 14 9's, so I'm assuming the intent is to allow returning values from 0 to 14 9's. The minimum non-zero value I found from this same range was also approximately 6 million greater than zero, but that's related to issue#2.

This issue#1 bias appears to be a modulo bias that's caused when (the range-size of input values received from the random generator) MOD (the range-size of requested output values) is not zero. The bias favors values toward the low end of the range. For smaller size ranges, the bias can be harder to detect, but it's technically there.

For a simplistic explanation, it's similar to what would happen if you dealt out an entire deck of cards to different group sizes of players. If the number of players is a factor of 52 like 52-26-13-4-2-or-1, everyone gets the same number of cards, but for all other numbers of players at least 1 player gets an extra card. If there are 12 players, most receive 1 fewer card. If there are 14 players, most receive 1 extra card.

For another example, assume a number generator calculates a perfectly random value from 0-255, and you're trying to get a random number from 0-9. There's 256 possible inputs but only 10 possible outputs, so some outputs map to 25 inputs but some outputs map to 26 inputs. If taking the 0-255 random input MOD 10 and returning (low_value + remainder 0-9), then outputs 0-5 have 26 inputs mapped to each of them, but outputs 6-9 have 25 inputs mapped to each of them. This would cause 0-5 to appear 4% more often than 6-9.
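The 0-255 example above can be checked by brute force. This is a minimal sketch in Python rather than mIRC script, and the function name inputs_per_output is made up for illustration; it simply counts how many of the 256 raw inputs land on each output when the remainder is taken.

```python
from collections import Counter

def inputs_per_output(in_size, out_size):
    """Count how many raw inputs 0..in_size-1 map to each output under value % out_size."""
    return Counter(value % out_size for value in range(in_size))

counts = inputs_per_output(256, 10)
for output in range(10):
    # Outputs 0-5 each get 26 inputs, outputs 6-9 each get 25.
    print(output, counts[output])
```

The six leftover inputs 250-255 are the ones that give outputs 0-5 their extra share.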

If the input is a 16-bit value from 0-65535, there's still a bias toward the lowest values of a requested range of 10, but it's a much smaller bias because the difference between having 6553 vs 6554 inputs is a much smaller percent change. However, if the range were increased to 0-49151, the lowest 16384 outputs have 2 inputs while the other 32768 outputs have only 1 input, causing some outputs to appear 100% more often than others. The number of outputs being favored changes depending on how close the output range is to 1/2, 1/4, 1/8, etc. of 65536.
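The same numbers fall out of a single division with remainder, without enumerating every input. A rough Python sketch, where modulo_bias is a hypothetical helper name and the input is assumed to be perfectly uniform over 0..in_size-1:

```python
def modulo_bias(in_size, out_size):
    """Return (favored_outputs, inputs_per_favored_output, inputs_per_other_output)."""
    base, extra = divmod(in_size, out_size)
    if extra == 0:
        return 0, base, base          # range divides evenly, no bias
    return extra, base + 1, base      # the lowest `extra` outputs get one more input

for out_size in (10, 49152):
    favored, high, low = modulo_bias(65536, out_size)
    print(out_size, favored, high, low, f"{(high - low) / low:.4%}")
```

For a range of 10 the favored outputs are only about 0.015% more likely, while for a range of 49152 they are 100% more likely.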

The solution to avoiding this type of bias is to determine a value, based on the sizes of the input and output ranges, above which any input value is going to be the extra input given to only a portion of the outputs. If the random generator returns a number above this value, it should be discarded, and a replacement random value should be requested from the RNG until the RNG returns a value within the range.

Assuming the input value comes from a range from zero to %in_max, below should work to determine the range of inputs that need to be discarded. This could also be combined with my suggestion to allow the output range to be either partly or entirely below zero.

If you substitute an output range size that's a factor of 256, such as 128 or 64 or even 1, the %throwaway_above value is always going to be the same as %in_max, so nothing is ever thrown away. By throwing away these extra highest values, it eliminates the bias described above. The number of values thrown away can never be larger than half the values output by the RNG, so throwing away RNG output should be rarely needed in most cases.
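A minimal sketch of the discard-and-redraw idea, written in Python rather than mIRC script. The names throwaway_above and unbiased_rand mirror the wording above but are illustrative, the default in_max of 255 is just the small example, and random.randint stands in for whatever raw generator is actually used.

```python
import random

def throwaway_above(in_max, out_size):
    """Largest raw input that is kept; anything above it is an 'extra' input."""
    return in_max - ((in_max + 1) % out_size)

def unbiased_rand(low, high, in_max=255, source=None):
    """Return an integer in low..high, redrawing raw values above the cutoff."""
    if source is None:
        source = lambda: random.randint(0, in_max)   # stand-in for the real RNG
    out_size = high - low + 1
    if out_size < 1 or out_size > in_max + 1:
        raise ValueError("output range must fit inside the input range")
    cutoff = throwaway_above(in_max, out_size)
    while True:
        raw = source()
        if raw <= cutoff:                 # keep only inputs below the extra ones
            return low + raw % out_size   # low may be negative

print(throwaway_above(255, 10))    # 249, so raw values 250-255 are redrawn
print(throwaway_above(255, 128))   # 255, so nothing is ever thrown away
```

Because the kept inputs are a whole multiple of the output range size, every output ends up with the same number of inputs.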

--

In this next example, you can see the bias changing as you change the 0.996. At 1.00 any bias is not very visible, nor at 0.50 where it's half of 10^15. However, at 0.996 the 1st pocket attracts twice as many random numbers as the other pockets, and when changing 0.996 to 0.66 this effect spreads to half the pockets. The bias gets smaller as you repeat your tests after changing the 14 into 13, but the effect is still there at around 50% more outputs instead of 100% more outputs.

The above dealt only with how the output from the random number generator is translated into an output value, and not about the quality of random values returned by the generator itself. Even at the largest ranges, each of the regions of the output appeared to have similar total number of outputs.

However at much smaller ranges than that, the random generator has gaps where some outputs are rarely if ever returned, while other outputs appear much more often than they should. 0-16777215 is a large range, but it's small enough to be a range used by scripts, such as choosing a random RGB color index.

$rand(0,16777215) should return numbers 0-255 an average of 1 time per approx. 65536 random numbers, and appears to be close to doing so. However, while random numbers should not have a completely smooth distribution, the number of times each number is returned should be closer to the mean than what $rand is returning. For the first 2560 times that numbers 0-255 were returned by $rand(0,16777215), in sequential order of 0 to 255, the number of times each number was returned by $rand was:

The most frequent value was '0' appearing 39 times, nearly 4 times the mean of 10. 76 of the 256 output numbers did not appear at all in the first 2560 random values which were in the 0-255 range, which is very unlikely in a random sample of this size.
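To put a number on "very unlikely": assuming an ideal uniform generator, the expected count of values that never show up when 2560 samples are spread over 256 possible values can be worked out directly. A small Python sketch (expected_empty_bins is just an illustrative name):

```python
def expected_empty_bins(bins, samples):
    """Expected number of bins left empty after `samples` uniform draws into `bins` bins."""
    return bins * (1 - 1 / bins) ** samples

# With 256 values and 2560 draws, each value is missed with probability (255/256)^2560.
print(expected_empty_bins(256, 2560))
```

The result is on the order of 0.01 missing values, so seeing 76 of them missing is far outside what an even distribution would produce.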

As the output range increases, the possible returned values are increasingly spaced apart, but the low value continues to appear much more frequently than the others. When the range is $rand(0,99999999999999) which is greater than 2^46, the '0' value appears slightly more frequently than once every 2^24 numbers, which is not significantly less often than '0' appeared in $rand(0,16777215).

I've been aware of scaling artifacts in $rand(), but these samples are pretty wild. I wouldn't have expected 24 bits (2^24-1) to be so profound with this effect.

I do wonder if there might be a better, faster even, pseudo-random generator mIRC might use. AutoHotkey is using Mersenne Twister MT19937 by (C) 1997 - 2002, Makoto Matsumoto and Takuji Nishimura (which only needs a copyright notice in the help file; steal it from AutoHotkey's help file.) http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/emt.html ((Oh, looks like they don't have any license requirements anymore.))


Thanks for your bug report. This has been an ongoing topic since $rand() was first introduced :-) I have changed the random number algorithm many times over the years. Unfortunately, every time I change it, someone eventually performs an analysis and reports that it is biased. That you've found the biases are minor is great. Usually, someone finds something drastically wrong with the RNG.

Currently, the RNG code in mIRC uses a Marsaglia variant. Also tested but commented out are Mersenne Twister and WELL512. I would have to look through older versions to list all the RNG variants I have tried. On my to-do list are PCG variants. However, the odds are that all RNGs contain issues of some kind, which is why it is such a thriving field of research - the perfect PRNG has not been found yet.

I was able to reproduce the mod bias issue with your script. Looking through the code, it appears to be due to a mod/pow combination that over-truncated the resulting value, losing significant bits. I have changed the mod/pow call and it now seems to distribute correctly, including when using 0.66 instead of 0.996. That said, this will need further tests as it is difficult to predict the side-effects of changes like this.

In addition to this change, I have also added $rands() which returns a cryptographically secure random value using Microsoft's SystemFunction036() API.

Looking into this a little more: the Marsaglia RNG that mIRC currently uses is quite old and I can find no reviews for the variant being used. So I have decided to change the RNG algorithm to the 64bit version of Jenkins's Small PRNG. This RNG is established and well-reviewed, the code is short and simple, and it passes a wide range of tests, eg. PractRand. This will be in the next version.

In the new rand, I'm seeing evidence of uneven bias at huge ranges, but not the same kind as before. The bias seems identical for both $rand and $rands, so it's more likely caused by identical post-processing of the random inputs.

So far I've tested only the lowest bit, as it's easiest to test odd/even since $isbit doesn't work above 2^32. At largest ranges, there can be a significant difference depending on whether the range size itself is odd or even.

The Jenkins PRNG is 64bit output, and the SystemFunction036 allows the app to determine the size of the random input, but it appears to allow 64bits and even more. $rand now returns values as large as the 2^53 doubles range instead of chopping the range near 2^46.5, however the odd/even range size can determine whether all or mostly even numbers are returned.

In all these examples, the low end of the range is zero, and if $2 is 's' it uses $rands instead of $rand, showing the effects appear in both identifiers.

At the largest range, it can return numbers very close to the top of the range. However while this example shows an even split between odd/even numbers, the odd numbers are at the top of the range and the even numbers are all at the bottom of the range.

/randbiasdemo3 2^53-1

--

For 2^52-2 and all other range maxes in both directions where the -2 is replaced by a negative or positive even number, all returned values are even.

/randbiasdemo3 2^52-2

--

The odd/even bias comes-and-goes as the range size shrinks, like the previous modulo bias did. In this example and other even-numbered range maxes in both directions, 2/3rds of returned numbers are even.

In the new rand, I'm seeing evidence of uneven bias at huge ranges, but not the same kind as before. The bias seems identical for both $rand and $rands, so it's more likely caused by identical post-processing of the random inputs.

That is very likely the reason for it. In both cases, mIRC needs to round an int64 to a double, as doubles are used in most calculations, eg. $calc().
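As a rough illustration of what rounding an int64 to a double can do (this is Python, and it does not reproduce mIRC's internal conversion, which isn't shown here): a double has a 53-bit mantissa, so above 2^53 odd integers cannot be represented and get rounded to a neighboring even value.

```python
def survives_double(n):
    """True if integer n is unchanged by a round trip through a 64-bit double."""
    return int(float(n)) == n

print(survives_double(2**53 - 1))   # True: still inside the exact-integer range
print(survives_double(2**53 + 1))   # False: rounds to the even neighbor 2**53
print(float(2**64 - 1) == 2**64)    # True: the largest uint64 rounds up past its own range
```

Any conversion that lets values exceed 53 significant bits before scaling them down can therefore lose the lowest bit, which is one plausible way to end up with an odd/even imbalance.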

On the other hand, are you saying that when $rand() is used with values in a reasonable range, ie. not huge values that are going to overflow due to internal conversions, that it is working well?

Regarding %throwaway_above, I am not convinced that this is something mIRC should be doing. The scripter can easily cause calculations to overflow throughout the scripting language, if they choose large enough numbers.

Or are you saying that $rand() and $rands() should be mapping all int64 results to a specific maximum value?

Hmm. Interesting. I just tested out your script with a number of different methods of converting uint64s to doubles. The method used in the beta was recommended in a number of places and is even used in a well-known RNG. However, it turns out that a different method actually resolves the issue you are reporting. The different method results in an equal distribution of odd and even numbers and a good distribution overall. I will be changing to this method in the next beta, so let's see how that works out. Thanks for the test script!

With random numbers, it's hard to state that something is 'working well' just because you can't easily see problems. Problems being visible at huge ranges but not at small ranges doesn't mean the problem isn't there at the smaller ranges.

If it's not reasonable to try to obtain a uint64 %throwaway_above where the range value could be a 53bit integer, a better method than trying to round to a double would be something similar to the XOR folding mentioned on the FNV site. If trying to reduce a 64bit data input to a number in the range 0 to 2^53-1, where $calc can accurately do math, it can first split the 64 bits into group #1 having 53 bits and group #2 having 11 bits. The 11 bits of group #2 could be XOR'ed against the lowest 11 bits of group #1. This would create a 53 bit value where each of the values 0-1FFFFFFFFFFFFF (hex) has the same 2^11 number of possible inputs.
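A minimal sketch of that XOR fold in Python (not mIRC script; xor_fold and its parameters are illustrative names). It takes the low 53 bits as group #1 and XORs the top 11 bits, group #2, into the lowest 11 bits of group #1. The small 8-bit to 5-bit run at the end is only there because the full 64-bit case can't be enumerated.

```python
from collections import Counter

def xor_fold(value, total_bits=64, keep_bits=53):
    """Fold a total_bits-wide integer down to keep_bits by XORing the top bits into the low bits."""
    value &= (1 << total_bits) - 1            # ignore anything above total_bits
    low = value & ((1 << keep_bits) - 1)      # group #1: the low keep_bits bits
    high = value >> keep_bits                 # group #2: the remaining high bits
    return low ^ high                         # high bits land on the lowest bits of group #1

# Small analog of 64 -> 53: fold every 8-bit value down to 5 bits and count the inputs per output.
counts = Counter(xor_fold(v, total_bits=8, keep_bits=5) for v in range(256))
print(len(counts), set(counts.values()))
```

In the small analog every 5-bit output is reached by the same number of 8-bit inputs, which is the property wanted from the 64-bit to 53-bit fold.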

Once you have the 53 bit value, then $calc can accurately calculate the %throwaway_above value needed for the $rand(N1,N2) range as long as none of %range or N1 or N2 is outside the accuracy range.

Once the 64bit number has been narrowed to a 53bit number, using the %throwaway_above value ensures that all outputs have an equal number of inputs, because it throws away extra inputs only available to some of the outputs. Where %in_max is a 52/53 bit number, the number of values being thrown away can never exceed half of %in_max and can never exceed (%range -1). Worst case scenario is having to throw away the first random number nearly half the time, and for smaller ranges it should rarely come into play.

Any method of taking a 2^N size input and applying it against an output range whose size is not a 2^N size is going to have some bias in it, even if it's not easily detectable at 'normal' ranges. Other uses like the 64-bit salt/iv used by $encode, or the 128-bit salt used by $zip, could use the raw input used by $rands().

In the latest beta, I'm not detecting the older behavior of thinned-out output which had caused nearly a third of outputs in a 2^24 range to not exist within more than 2^27 random numbers. Each bit increasing this range size would double the testing time, so it's the kind of thing that's hard to detect in larger ranges without a huge number of tests. I'm not detecting the odd/even bias from the latest alias randbiasdemo3.

Modulo bias always exists when the output range's size is not a divisor of the input range's size, due to some outputs matching more inputs than other outputs do. Increasing the size of the input range doesn't eliminate it, it just makes observing the bias difficult because the number of biased inputs becomes a smaller fraction of the total inputs. The number of biased inputs thrown away is always less than the output range's size.

The modulo bias in the latest beta is hard to observe without either looking at a huge number of inputs or looking at a very large output range. It can be seen with the earlier randbiasdemo3 alias using either $rand or $rands, though dividing the inputs into smaller odd/even groups makes it a little harder to see.

/randbiasdemo3 0.75*2^52
/randbiasdemo3 0.75*2^52 s

The modulo bias is solved by discarding enough inputs to effectively shrink the input range until it becomes a multiple of the output range. The number of biased inputs to discard can be calculated either as a %throwaway_above or as a %throwaway_below value. In the simplistic example of returning a range of 10 outputs from the 256 input values 0-255, there are 6 extra inputs that need to be discarded to bring the total inputs down to 250. The two ways are:

These always result in a keep-range which is the highest multiple of out_range <= %in_range, and don't throw away any inputs when out_range and the input range size (in_max + 1) are both powers of 2. In the example of outputting a range-size 10 random number from an input of 0-255:
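The two cutoffs for the 0-255 example can be sketched as follows. This is a Python reconstruction under my own assumptions, not the original mIRC formulas: keep_limits and kept_counts are hypothetical names, throwaway_above is taken to mean "discard raw inputs above this value" and throwaway_below to mean "discard raw inputs below this value".

```python
from collections import Counter

def keep_limits(in_max, out_range):
    """Return (throwaway_above, throwaway_below) for raw inputs 0..in_max."""
    in_range = in_max + 1
    extra = in_range % out_range        # number of inputs that must be discarded
    throwaway_above = in_max - extra    # keep raw <= throwaway_above
    throwaway_below = extra             # keep raw >= throwaway_below
    return throwaway_above, throwaway_below

def kept_counts(in_max, out_range, use_above=True):
    """Count the kept inputs per output, using either of the two cutoffs."""
    above, below = keep_limits(in_max, out_range)
    if use_above:
        kept = range(0, above + 1)
    else:
        kept = range(below, in_max + 1)
    return Counter(raw % out_range for raw in kept)

print(keep_limits(255, 10))                            # (249, 6)
print(set(kept_counts(255, 10, True).values()))        # {25}
print(set(kept_counts(255, 10, False).values()))       # {25}
```

Either way 250 inputs are kept, and each of the 10 outputs is left with exactly 25 inputs.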

The $rand_withoutmodulobias alias below removes the bias by ignoring the biased inputs when reducing a 53-bit 'double' down to the requested output range. It assumes there wasn't any bias in how the double was created, and that each of the 2^53 outputs has an equal number of inputs. It would be difficult to detect small loss of precision without a debugging parameter which would force the first fetched 64bit value to be a specific value, to see what happens from various neighboring inputs and how they are used in various output range sizes.

A 53-bit int could be created without bias from a 64-bit int by taking as many bits as can be held in a double and optionally XOR'ing them with the remaining bits so that all 64 bits can have an effect on the outcome.

The $rand_WithoutModuloBias alias calculates %throwaway_above after making sure that %in_max and the size of the output range both fit within a 53-bit double. It then handles reducing the 53-bit double down to the requested output range. If the $rand_WithoutModuloBias were substituted in place of $rand in the other aliases, it would eliminate the modulo bias as a source of any output bias, but wouldn't remove the hi-odd/lo-even bias in the prior beta since that exists in the 2^53-1 output range. It should always output the identical output that $rand would've otherwise returned except in cases where the double was originally filled with one of the biased outputs that it now throws away.

Because the modulo bias is independent of the random source, it also exists if the randbiasdemo4 alias is edited to use $rands. Modulo bias favoring smaller values:

/randbiasdemo4 0.75*2^53-1
/randbiasdemo4 0.75*2^51-1

Editing randbiasdemo to use $rand_withoutmodulobias instead of $rand (2 places) eliminates the modulo bias by throwing away the extra/biased inputs. This rand_WithoutModuloBias can also be edited in 2 places to use $rands instead of $rand.