Random/LFSR on P2

Comments

Sebastiano Vigna kindly emailed me all the xoroshiro32 characteristic polynomials, essential info for creating jump functions. The weight is the number of terms in the polynomial, and in theory a weight close to 16 is preferable. Our best triplet [14,2,7] has a weight of 11 and the second best [15,3,6] has a weight of 13.

I hadn't looked closely at all the full-period triplets before. For every [a,b,c] there is a corresponding [c,b,a] that shares the same characteristic polynomial. Knowing this in advance means brute-force searches can be halved.
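To make the [a,b,c] notation concrete, here is a minimal C sketch of the xoroshiro32 engine step with the triplet passed in as parameters (the plain state engine only, no output scrambler; the function names are mine):

```c
#include <stdint.h>

/* 16-bit rotate left */
static uint16_t rotl16(uint16_t x, int k) {
    return (uint16_t)((x << k) | (x >> (16 - k)));
}

/* One step of the 32-bit-state xoroshiro engine with triplet [a,b,c].
   [a,b,c] and its mirror [c,b,a] share a characteristic polynomial,
   so a brute-force search only needs to try one of each pair. */
static void xoroshiro32_step(uint16_t s[2], int a, int b, int c) {
    uint16_t s0 = s[0];
    uint16_t s1 = s[1] ^ s0;
    s[0] = (uint16_t)(rotl16(s0, a) ^ s1 ^ (uint16_t)(s1 << b));
    s[1] = rotl16(s1, c);
}
```

With a full-period triplet such as [14,2,7] the two-word state cycles through all 2^32-1 nonzero values.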

Hey Tony, thanks for getting that. It perfectly matches our 84 brute-forced findings. I've only just looked at it properly now.

And the recommendation of targeting only the ones of closely matching "degree" with word size will be valuable with what's coming up ... I've spoken with Sebastiano again and, same as you, I just had to ask and he immediately popped me a complete list of full-period candidates for a Xoroshiro64 iterator! Amazing stuff. It isn't nicely sorted in mirror pairs like your one was though.

Excellent news! Thanks for doing that, Evan. I could sort the list into mirror pairs, if you wish.

Tony,
In what I presented to Chip, I chose to revert to just the earlier iterator-only type function, so as to limit the logic/flops burden of the larger number of bits involved in the instruction. Also, it keeps the execution to a single clock cycle, which I suspect is not the case with your approach.

If there are no timing issues with doing a couple of XORs before a 32-bit addition then my XORO64 idea will work.

EDIT:
Two parallel xoroshiro32s would be replaced by one xoroshiro64. In XORO32, the inputs to one of the two parallel 16-bit additions have a two-xoroshiro delay.
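For illustration, a C sketch of a xoroshiro64 engine with a '+' scrambler, showing that the output path really is just a couple of XORs feeding one 32-bit addition. The [26,9,13] constants here are Vigna's published xoroshiro64 triplet, not necessarily the one that would be chosen for XORO64:

```c
#include <stdint.h>

static uint32_t rotl32(uint32_t x, int k) {
    return (x << k) | (x >> (32 - k));
}

/* xoroshiro64 engine, two 32-bit state words.  The output is one
   32-bit add; the XORs update the state alongside it. */
static uint32_t xoroshiro64plus(uint32_t s[2]) {
    uint32_t s0 = s[0];
    uint32_t s1 = s[1];
    uint32_t r  = s0 + s1;            /* the single 32-bit addition */
    s1 ^= s0;
    s[0] = rotl32(s0, 26) ^ s1 ^ (s1 << 9);
    s[1] = rotl32(s1, 13);
    return r;
}
```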

What we have now is ideal, I think, and it's as easy to use as can be. 32 bits is plenty for most uses, I imagine.

And if a *real* random number is needed, you can always just do a GETRND, without any contingencies.

XORO64 could produce 32-bit PRNs in three instructions. There is a new, simple and optional post-processing that should improve its output. I cannot say any more at the moment. We are getting tremendous help from Sebastiano Vigna, who has been testing the following xoroshiro[32]64 triples:

All xoroshiro generators (without additional backend reworking) at some point have some Hamming dependency kicking in. That is, if you look at the number of ones and zeros in a sequence of, say, 10 consecutive outputs, the resulting 10-tuple doesn't have exactly the distribution it should, as linear generators tend to create correlation between the number of 1s of consecutive outputs.
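The bookkeeping behind that check can be sketched in C: tally the total number of 1 bits over each window of 10 consecutive outputs, then compare the histogram to the binomial(320, 1/2) expectation. The xorshift32 stand-in generator here is purely illustrative, and actually seeing the bias takes the very large runs described below:

```c
#include <stdint.h>

/* stand-in 32-bit generator (xorshift32), not the P2 PRNG itself */
static uint32_t next32(uint32_t *x) {
    *x ^= *x << 13;
    *x ^= *x >> 17;
    *x ^= *x << 5;
    return *x;
}

/* Total 1 bits across n consecutive outputs: the "n-tuple" statistic
   whose distribution drifts from binomial for linear generators. */
static int window_ones(uint32_t *x, int n) {
    int ones = 0;
    for (int i = 0; i < n; i++)
        ones += __builtin_popcount(next32(x));
    return ones;
}
```

Histogramming window_ones(&x, 10) over many windows should give a binomial shape centred on 160 ones for an unbiased source; the Hamming dependency shows up as a deviation from that shape.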

In xoroshiro128 this happens after several TB of data (at least 32 TB), so it's not a real concern. But in smaller-sized generators it will happen sooner (e.g., in the "DC6" test from PractRand).

In this case, testing is fundamental because theory cannot really say much beyond which triplets are full period and which are not.

That email address was just a throwaway and it's actually been stopped by the provider now; I seriously dislike the "social media" tendrils. Not to mention the horrid comment-entry interface that Google Groups has. I'll be a very sad puppy if Parallax ever tries to use such rubbish for its forums.

Note: The full-width 32-bit tests top out at 32 GB. All failed with a massive brick-wall BRank. I believe this is due to the summed bit1 being shifted into the more sensitive bit0 position for PractRand.

Example:
[Low1/32]BRank(12):3K(1) R=+21249 p~= 9e-6398 FAIL !!!!!!!!

EDIT:
The second column of scores, [31:1], what I've also called Word1 before, shows more substantial results. These all failed on the DC6-9x1Bytes-1 test in PractRand.

PS: The byte sized sampling variants (except for LSByte) feel like they're going to go a lot further. I tested a pair of low quality candidates out to 4 TB (took more than a day) and it wasn't giving up there.

LSByte, [7:0], always dies at 8 GB (exactly a quarter of the full 32-bit sampling). These have the same massive BRank fails as the full-width variant.
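For clarity on the slice notation: each sampling variant feeds only a fixed bit window of every output to PractRand. A few hedged C helpers (the names are mine, following the labels used above):

```c
#include <stdint.h>

/* Bit windows used in the runs above: [31:0] is the full output,
   [31:1] ("Word1") drops the LSB, and [7:0] ("LSByte") keeps only
   the low byte. */
static uint32_t full32(uint32_t r) { return r; }            /* 32 bits */
static uint32_t word1(uint32_t r)  { return r >> 1; }       /* 31 bits */
static uint8_t  lsbyte(uint32_t r) { return (uint8_t)r; }   /*  8 bits */
```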

EDIT2:
When placing the parity at bit0, the BRank test fails the [31:0] variant at 512 GB instead of 32 GB. There is a mix of DC6 failures at 256 GB with these. The group of good scores is diminished and not as obvious.

Yeah, it's probably not as bad as it looks, because I think the processed score value gets scaled down more for larger datasets, but those were bad enough to all fail DC6 at the 512 MB mark as well as failing BRank.

EDIT: The reason I included them is that the 35.6 is the lowest. Just another data point to consider.
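One plausible reading of the "parity at bit0" variant, sketched in C: overwrite the weak summed bit0 with the parity of the whole sum. This is my guess at the construction for illustration only, not a confirmed description of the actual post-processing:

```c
#include <stdint.h>

/* Hypothetical sketch: replace bit0 of the '+' scrambler's sum with
   the parity (XOR reduction) of all 32 sum bits. */
static uint32_t parity_at_bit0(uint32_t sum) {
    uint32_t p = (uint32_t)__builtin_parity(sum);
    return (sum & ~1u) | p;
}
```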

Every cog gets a uniquely-picked/ordered/inverted set of 32 bits from the 63-bit source. Every smart pin gets 8 such bits. Smart pins need these for DAC dithering and noise generation. Cogs need them for the GETRND instruction. I think cogs and pins will all be happy with this arrangement. These patterns really only need to serve as apparently-uncoordinated white noise sources.
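The per-cog selection described above can be sketched as a table-driven bit gather in C. The index table and inversion mask here are illustrative placeholders; the real per-cog patterns came from the randomizing Spin program mentioned at the end:

```c
#include <stdint.h>

/* Build one cog's 32-bit long from the 63-bit source x[62:0]:
   output bit b takes source bit idx[b], then the whole long is
   XORed with a per-cog inversion pattern.  No source bit repeats
   within one long. */
static uint32_t gather32(uint64_t x, const uint8_t idx[32], uint32_t inv) {
    uint32_t r = 0;
    for (int b = 0; b < 32; b++)
        r |= (uint32_t)((x >> idx[b]) & 1u) << b;
    return r ^ inv;
}
```

A smart pin's 8-bit noise source would be the same gather over 8 table entries instead of 32.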

On start-up, the PRNG will be 1/4 seeded maybe 64 times with 4 bits of pure thermal noise, along with another 27 process/voltage/temperature-dependent bits.

A few questions:

1. Has the parity improvement been applied to xoroshiro128+? That would make a 64-bit source and could mean each bit is used the same number of times across the different cogs and smart pins.

2. Is there enough time to do the 64-bit addition in a single clock?

3. Have the different cog and smart pin scramblings been made public? (I don't want to hold up Chip with this one.)
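For reference against question 1, here is Vigna's published xoroshiro128+ step in C, with the standard [55,14,36] constants. It produces one 64-bit addition per output, which is what question 2's single-clock concern is about:

```c
#include <stdint.h>

static uint64_t rotl64(uint64_t x, int k) {
    return (x << k) | (x >> (64 - k));
}

/* xoroshiro128+: two 64-bit state words, one 64-bit add per output */
static uint64_t xoroshiro128plus(uint64_t s[2]) {
    uint64_t s0 = s[0];
    uint64_t s1 = s[1];
    uint64_t r  = s0 + s1;
    s1 ^= s0;
    s[0] = rotl64(s0, 55) ^ s1 ^ (s1 << 14);
    s[1] = rotl64(s1, 36);
    return r;
}
```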

A few bits get used 9 times, while the rest get used 8 times in those longs. This is because, as you know, we only have 63 bits in x[62:0]. No long uses the same bit twice (cog GETRND sources), nor does any byte (smart pin noise sources). I think the distribution is good enough that adding that parity fix wouldn't make any meaningful difference.

Here is the Spin program for the Prop1 that I wrote to randomly come up with these patterns: