Thoughts on programming and more

Yesterday I wrote about the terrible quality of the default pseudo-random number generators in the C++ standard library. The article concluded that they are all more or less terrible and should generally be avoided, yet it didn't offer any alternatives. So today, I'll present three very good alternatives to the existing generators in the C++ standard library.

They're almost fully compliant with the standard and should be a near drop-in replacement for the old generators.
There is one minor change in the interface, related to how a generator is seeded: instead of passing in just a seed value, you pass in the std::random_device (or any class that implements operator() and returns an unsigned int) you want to use. This way, I can make sure each generator is seeded properly.
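As an illustration of that seeding interface, here is a sketch of what such a generator might look like. The class name and constructor shape are hypothetical (the actual classes are in the code at the end of the article); the mixing step uses the well-known SplitMix64 constants from Vigna's reference implementation, and the 64-bit state is built from two 32-bit draws of the seed source:

```cpp
#include <cstdint>
#include <random>

// Hypothetical sketch of the seeding interface described above: the
// constructor takes any seed source whose operator() returns an
// unsigned int, e.g. std::random_device.
struct splitmix {
    using result_type = std::uint64_t;

    template <class SeedSource>
    explicit splitmix(SeedSource &seed)
        : state_((std::uint64_t{seed()} << 32) | seed()) {}

    // One SplitMix64 step (standard constants from the reference
    // implementation).
    result_type operator()() {
        std::uint64_t z = (state_ += 0x9E3779B97F4A7C15ull);
        z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ull;
        z = (z ^ (z >> 27)) * 0x94D049BB133111EBull;
        return z ^ (z >> 31);
    }

    std::uint64_t state_;
};

// Usage: seed from std::random_device instead of a plain integer.
// std::random_device rd;
// splitmix gen{rd};
```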

Each of them passes both the PractRand and BigCrush test suites. The raw PractRand results can be found here, and the code is at the end of the article.

Comparison

For comparison, I've reproduced the comparison tables from the previous article. The newly added generators are highlighted in bold.

| generator | 512M | 1G | 2G | 4G | 8G | 16G | 32G | 64G | 128G | 256G | 512G | 1T | 2T |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ranlux24_base | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |
| ranlux48_base | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |
| minstd_rand | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |
| minstd_rand0 | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |
| knuth_b | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |
| mt19937 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✗ | ✗ | ✗ | ✗ |
| mt19937_64 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✗ | ✗ | ✗ |
| ranlux24 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| ranlux48 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| **splitmix** | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| **xorshift** | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| **pcg** | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |

All three generators have excellent statistical quality, a smaller footprint, and better performance. As a bonus, each of them can be implemented in fewer than 10 lines of code, making them very easy to port to other environments or languages.

Code
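To give an idea of how small these generators are, here is a sketch of a truncated xorshift64* generator, one plausible reading of the "xorshift" row above. The shifts and multiplier are the standard xorshift64* parameters from Vigna's paper, not necessarily the exact variant used in the tests:

```cpp
#include <cstdint>

// Sketch of a truncated xorshift64* generator: 64 bits of state,
// 32-bit output taken from the high half of the multiplied state.
struct xorshift {
    using result_type = std::uint32_t;

    // The state must never be zero, so map a zero seed to 1.
    explicit xorshift(std::uint64_t seed) : state_(seed ? seed : 1) {}

    result_type operator()() {
        state_ ^= state_ >> 12;
        state_ ^= state_ << 25;
        state_ ^= state_ >> 27;
        return static_cast<result_type>(
            (state_ * 0x2545F4914F6CDD1Dull) >> 32);
    }

    std::uint64_t state_;
};
```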

tl;dr: Do not use <random>'s generators. There are plenty of good alternatives, like PCG, SplitMix, truncated Xorshift32*, or Xorshift1024*.

A recent tweet reminded me of the <random> facilities in the C++ standard library. Having just recently studied and implemented a couple of random number generators myself, I was curious how they hold up under a modern test.
From the get-go, I knew this was going to end badly, but I gave it a shot anyway.
I built a little test program to feed into PractRand:

The test program writes unsigned 32-bit numbers to stdout, which are piped into the RNG_test executable from the PractRand suite.
I compiled a single executable for each generator and started the tests.

Results

Most of the tests were over in just a couple of seconds.

| generator | 256M | 512M | 1G | 2G | 4G | 8G | 16G | 32G | 64G | 128G | 256G | 512G | 1T |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ranlux24_base | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |
| ranlux48_base | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |
| minstd_rand | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |
| minstd_rand0 | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |
| knuth_b | ✓ | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |
| mt19937 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✗ | ✗ | ✗ |
| mt19937_64 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✗ | ✗ |
| ranlux24 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| ranlux48 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |

The table shows after how much generated data each generator fails the statistical tests.
The earlier it fails, the worse it is. Modern PRNGs, like PCG, are known to not fail these statistical tests at all.
That's only half the story, though.
While the table above shows the statistical quality of each generator, the following one lists their other properties:

| Name | Predictability | Performance | State | Period |
|---|---|---|---|---|
| Mersenne Twister | Trivial | Ok | ~2.5 KiB | 2^19937 − 1 |
| LCG (minstd) | Trivial | Ok | 4–8 bytes | ≤ 2^32 |
| LFG (ranluxXX_base) | Trivial | Slow | 8–16 bytes | ≤ 2^32 |
| Knuth | Trivial | Ok | 1 KiB | ≤ 2^32 |
| ranlux | Unknown? | Super slow | ~120 bytes | 10^171 |

While both the 32-bit and the 64-bit Mersenne Twister (mt19937 & mt19937_64) only fail at a seemingly high amount of data (256G and 512G, respectively), they are fundamentally flawed: they're easily predictable, large (about 2.5 KiB of state), and multiple times slower than the alternatives.

If you must use a generator from <random>, use either ranlux24 or ranlux48, the only two that showed reasonable results in the test, albeit at the expense of performance.

While <random> offers the building blocks to construct a decent generator with good statistical quality, doing so requires domain knowledge and is far from trivial. The predefined generators didn't stand the test of time and are weak throughout. To make matters worse, default_random_engine often defaults to mt19937, which, while not as bad as the other choices, is neither fast, lean, nor unpredictable.
There is a certain irony in the fact that rand() is widely known to be a terrible generator, while everybody seems fine with C++ including almost exactly the same generator (rand() is usually implemented as an LCG).

Raw Test Results

Following are links to the raw test results, for your viewing pleasure.

On Windows, debug symbols aren't stored side-by-side with the executable code in the same file. They're stored in a .pdb file (short for "program database"). This is especially great if you distribute your program to end users but still want to be able to debug any crashes: just keep the .pdb file somewhere safe, and any crash log you get sent can easily be translated back into source locations.

On Linux, debug symbols are traditionally stored inside the executable and stripped (using strip(1)) before distribution. This takes away the possibility of debugging any crash reported against the stripped executable.

Today I discovered a neat little trick to create something resembling PDBs on Linux, by making use of objcopy(1). In this example, I have already compiled an executable a.out, which I want to distribute to my users:

$ ls -lah a.out
-rwxr-xr-x 1 woot woot 30K May 5 11:26 a.out

As we can see, the executable, including debug symbols, has a size of 30K.

Now we extract the debug symbols into another file, using objcopy(1):

$ objcopy --only-keep-debug a.out a.out.pdb
$ strip a.out

The debug symbols are now extracted into a.out.pdb and removed from a.out.

The following tweet sparked my interest in investigating the different ways of mapping and unmapping memory on Windows, and trying to find the "best" one:

Easy gamedev improvement for a frame allocator: Uses two virtual address ranges. At the end of the frame, decommit (unmap) physical memory of the range that was first used, switch to other virtual address range, and then commit (map) physical memory to it. 1/2

Mapping Memory

VirtualAlloc is pretty straightforward to use: you simply pass in the desired address (or NULL, if you are fine with letting the OS decide), the size (as a multiple of dwAllocationGranularity), and flags, and you get back the address of the newly allocated memory region.

CreateFileMapping is a little more complicated, since its actual purpose is to map files on disk into memory. However, if you pass in a size and INVALID_HANDLE_VALUE instead of a valid file handle, you get a file mapping that is backed by the system's page file. The resulting mapping handle then has to be mapped into the address space with MapViewOfFile.
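The two mapping paths just described can be sketched roughly as follows (error handling omitted; the 64 KiB size is only an example, and this is a reconstruction of the described calls, not the author's benchmark code):

```cpp
#include <windows.h>

int main() {
    const SIZE_T size = 64 * 1024;  // example size, one allocation granule

    // Path 1: VirtualAlloc reserves and commits in a single call.
    void *p1 = VirtualAlloc(NULL, size, MEM_RESERVE | MEM_COMMIT,
                            PAGE_READWRITE);

    // Path 2: CreateFileMapping with INVALID_HANDLE_VALUE creates a
    // mapping backed by the page file; MapViewOfFile maps it in.
    HANDLE mapping = CreateFileMappingW(INVALID_HANDLE_VALUE, NULL,
                                        PAGE_READWRITE,
                                        0, (DWORD)size, NULL);
    void *p2 = MapViewOfFile(mapping, FILE_MAP_ALL_ACCESS, 0, 0, size);

    // ... use p1 / p2 ...

    UnmapViewOfFile(p2);
    CloseHandle(mapping);
    VirtualFree(p1, 0, MEM_RELEASE);
    return 0;
}
```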

Since Windows, unlike Linux, does not support overcommitting, all allocated memory, regardless of method, must always be backed by either the swap file[1] or physical memory.

Unmapping Memory

Now that we know how to map memory on Windows, let's look into unmapping. Contrary to what one might expect, there are four ways of unmapping, not just two:

VirtualFree, with MEM_RELEASE

VirtualFree, with MEM_DECOMMIT

VirtualAlloc, using MEM_RESET & MEM_RESET_UNDO

UnmapViewOfFile

UnmapViewOfFile is the only legal way to unmap a region mapped with MapViewOfFile; the former three ways are legal for any memory allocated using VirtualAlloc.
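The four variants, again as a sketch rather than the author's code, with each call shown in isolation (in real code you would pick exactly one per region):

```cpp
#include <windows.h>

// 1. Release the whole region: both the address range and the
//    physical pages are given back to the OS.
void release(void *p) {
    VirtualFree(p, 0, MEM_RELEASE);  // size must be 0 with MEM_RELEASE
}

// 2. Decommit only: the address range stays reserved, the physical
//    pages are dropped and can be re-committed later.
void decommit(void *p, SIZE_T size) {
    VirtualFree(p, size, MEM_DECOMMIT);
}

// 3. Mark the contents as unneeded: pages are only reclaimed under
//    memory pressure, and MEM_RESET_UNDO can take the hint back.
void reset(void *p, SIZE_T size) {
    VirtualAlloc(p, size, MEM_RESET, PAGE_READWRITE);
}

// 4. The only legal way to unmap a view created by MapViewOfFile.
void unmap_view(void *view) {
    UnmapViewOfFile(view);
}
```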

Benchmark

One might think that all of these functions behave roughly the same: each of them should technically only flip a couple of bits in some kernel data structure. Apparently, this is not the case.

Every benchmark consists of mapping the region, overwriting it (using either memset or touching the first byte of each 64K block, more on that later) to make sure all of the memory is actually committed to physical RAM, and then unmapping the region again.
The cost of the pure memset operation is also benchmarked and subtracted from all results.

Benchmarks were run on my desktop machine at home (i7-5960X, 64 GiB DDR4-2133 RAM, Samsung 960 Pro). The memory region used was 300 MiB in size (64K × 4800).

Further Investigation

Further investigation revealed that MEM_RESET unmaps the pages lazily[2], dropping them only under memory pressure, while the other methods actively unmap (and probably zero) the memory. This would explain the difference in perceived performance.
Releasing the memory will try to "hide" the cost of zeroing, as explained in this fantastic blog post by Bruce Dawson.

Conclusion

If the intention is to re-use the pages in the near future, prefer marking them as unused with MEM_RESET. Otherwise, simply releasing the pages is best and gives Windows a better opportunity to re-use them.
In general, though, I'd advise against relying on any of these methods, since their performance characteristics are not suited for anything close to (soft) realtime.