EDIT: The "Bit" column exhibits a perfect scaling of scores! That blows me away. I haven't a clue why it alone scales so rigidly like that. Pity it's also the scoring variant that takes many times longer to run.

"There's no huge amount of massive material
hidden in the rings that we can't see,
the rings are almost pure ice."

As MUL takes the same time as ADD on the P2 there would be no performance penalty. It's not clear to me whether the product could be truncated to 16 bits as well as 8 bits. If so, would it perform better in randomness tests than the 16-bit sum? If not, two iterations would be needed to get a high-quality 16-bit result, but that's just the same as with adding.

I can certainly replace the summing with multiplying. Don't even have to re-run the full period searches.
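A minimal Python sketch of the swap being discussed. It assumes a xoroshiro32-style state of two 16-bit words and an illustrative triplet (14, 2, 7) that isn't checked for full period here; the function names are hypothetical. Only the output scrambler changes: the sum keeps 16 bits, while the product of two 16-bit words is 32 bits wide, so either half could be kept.

```python
MASK16 = 0xFFFF

def rotl16(x, k):
    """Rotate a 16-bit word left by k bits (0 < k < 16)."""
    return ((x << k) | (x >> (16 - k))) & MASK16

def step(s0, s1, a=14, b=2, c=7):
    """One xoroshiro iteration on two 16-bit words (triplet is an assumed example)."""
    s1 ^= s0
    new_s0 = rotl16(s0, a) ^ s1 ^ ((s1 << b) & MASK16)
    new_s1 = rotl16(s1, c)
    return new_s0, new_s1

def scramble_sum(s0, s1):
    """The usual '+' output: 16-bit truncated sum of the two state words."""
    return (s0 + s1) & MASK16

def scramble_product(s0, s1, keep_high=True):
    """Hypothetical '*' output: the 32-bit product, keeping its high or low 16 bits."""
    product = s0 * s1
    return product >> 16 if keep_high else product & MASK16
```

One thing the sketch makes plain: the product is zero whenever either word is zero, and bit 0 of the product is 1 only when both words are odd (about a quarter of the time), so the low half is biased. That may be part of why a straight swap scores badly, though it's a guess rather than a test result.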

Tony,
After a quick look at that webpage, I'm not making much sense of how the author has done the Xorshift algorithm. I suspect she's just looking at the state data directly, because otherwise I'd think PractRand would be throwing a wobbly on our testing quick smart.

... It's not clear to me whether the product could be truncated to 16 bits as well as 8 bits. If so, would it perform better in randomness tests than the 16-bit sum? If not, two iterations would be needed to get a high-quality 16-bit result, but that's just the same as with adding.

Assuming it's XORO32-based, why would multiplying two 16-bit values produce only 8 bits? Or is that not what you were talking about?

Well, simply replacing the addition with a multiply produces rubbish PractRand scores. Amusingly, the "bit" variant still gets the same scores, but most everything else barely makes it past the starting blocks.

I've added the author's name to my previous post. It's hard to be sure, but I think her "xoroshiro16star8" uses two 8-bit states that are multiplied by a 16-bit magic number ("MCG multiply"), with only the high byte of the product kept.
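One possible reading of that description, as a Python sketch. Everything specific here is an assumption: the triplet (4, 1, 3), the multiplier 0x9E37 (just an odd placeholder, not the constant from her code), and the choice to multiply the first state word and keep bits 15..8 of the product truncated to 16 bits.

```python
MASK8 = 0xFF
MULT = 0x9E37  # placeholder odd 16-bit multiplier, NOT taken from the original source

def rotl8(x, k):
    """Rotate an 8-bit word left by k bits (0 < k < 8)."""
    return ((x << k) | (x >> (8 - k))) & MASK8

def xoroshiro16star8_next(state, a=4, b=1, c=3):
    """Return one 8-bit output and advance state = [s0, s1] in place.
    Output: s0 times MULT, truncated to 16 bits ("MCG multiply"), high byte kept."""
    s0, s1 = state
    out = ((s0 * MULT) & 0xFFFF) >> 8
    s1 ^= s0
    state[0] = rotl8(s0, a) ^ s1 ^ ((s1 << b) & MASK8)
    state[1] = rotl8(s1, c)
    return out
```

Keeping the high byte rather than the low one matters: the high bits of a multiply collect influence from every lower bit of the input, whereas the low bits only see the bits beneath them.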

Dang, it just dawned on me, far too late, that DRAM prices have been going through the roof all this year! And even with those prices, supply appears to have gone to virtually nil.

One of my new 16GB DIMMs failed and there are no exact replacements in stock. I've got full credit for them (I returned the pair) and can claim it as a refund, but the options for replacement are slim pickings.

I'm back using me old PC for the moment, and the remainder of the year looks bleak.

Well, things have progressed. I got the full refund for the faulty RAM and went elsewhere to get something faster. I figured if I was going to be paying inflated prices to get a decent set of DIMMs I may as well get something faster than I had. The alternative was something slower but still more expensive than the originals.

I've since discovered, and verified, that at least one program was having issues with the SMT flaw in the early Ryzens. I would have been happy to just disable SMT in the BIOS, but the silly BIOS has a problem with that: it insists there was an unexpected power-down every time it starts, so I have to press F1, ESC and ENTER every time I power it up with SMT disabled!

So now I'm sending the CPU back for a free exchange tomorrow. I've never done anything international like this before; AMD have emailed me a FedEx number directly to book it with. Conveniently, there is a depot only ten minutes' drive away ...

Yep, I've got a set of score tables for all those. Size 20 is the largest I've done though. Anything bigger took forever to brute force search for full period candidates.
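For anyone wanting to reproduce the small sizes, a brute-force full-period check can be sketched in Python like this, assuming the standard xoroshiro update with word size w (so the state is 2w bits); the function names are illustrative. The update is invertible and zero maps to zero, so every non-zero state lies on a cycle, and a triplet is full period exactly when the cycle through one non-zero seed has length 2^(2w) - 1. The cost per triplet grows with 2^(2w), which is why sizes much past 20 bits get painful.

```python
def period(w, a, b, c, seed=(1, 0)):
    """Count iterations until the xoroshiro state (two w-bit words) returns to seed.
    Assumes 0 < a, b, c < w and a non-zero seed."""
    mask = (1 << w) - 1

    def rotl(x, k):
        return ((x << k) | (x >> (w - k))) & mask

    s0, s1 = seed
    steps = 0
    while True:
        t = s1 ^ s0
        s0 = rotl(s0, a) ^ t ^ ((t << b) & mask)
        s1 = rotl(t, c)
        steps += 1
        if (s0, s1) == seed:
            return steps

def full_period_triplets(w):
    """Brute-force every (a, b, c) and keep those whose period is 2**(2*w) - 1."""
    full = (1 << (2 * w)) - 1
    return [(a, b, c)
            for a in range(1, w)
            for b in range(1, w)
            for c in range(1, w)
            if period(w, a, b, c) == full]
```

For example, full_period_triplets(8) covers a 16-bit state: 343 triplets at up to 65535 steps each, which takes a while in pure Python.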

I'll zip them up when home ...

I've done Xoroshiro40+ on the Z80 and the triplet numbers were quite friendly. A final one I'd like to try is Xoroshiro48+, as every bit in two sets of three 8-bit registers would be used. There must be a trick for finding full-period triplets without brute force, since brute force would take forever with Xoroshiro128+; if only we knew what the trick is.

Yes, Xoroshiro128+ and 1024+ both have "jump" functions supplied in the sources from the authors.

Starting from bit 0 of what looks to be a specially crafted key, it XORs the current state (both word halves) into a working state whenever that key bit is set, then calls the iterator, stepping the current state by one. This repeats for every bit of the key, and the key is two words long, i.e. the full state length.
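The loop described above can be mirrored on a toy generator. This Python sketch assumes xoroshiro with 8-bit words and an arbitrary triplet (4, 1, 3); make_next and jump are illustrative names. Read the key as polynomial coefficients: bit k set means "XOR in the state as it is after k steps", so a key with only bit k set is exactly k calls to next(). As I understand it, the published JUMP constants are x^(2^64) reduced modulo the generator's characteristic polynomial, which is how 128 loop passes can stand in for 2^64 calls; the sketch doesn't compute those constants.

```python
def make_next(w, a, b, c):
    """Return a next() function for xoroshiro with two w-bit words (triplet assumed)."""
    mask = (1 << w) - 1

    def rotl(x, k):
        return ((x << k) | (x >> (w - k))) & mask

    def nxt(state):
        s0, s1 = state
        s1 ^= s0
        return (rotl(s0, a) ^ s1 ^ ((s1 << b) & mask), rotl(s1, c))

    return nxt

def jump(state, key_words, w, nxt):
    """Mirror of the reference jump(): walk the key bit by bit from bit 0 of the
    first word, XOR the current state into an accumulator when the bit is set,
    and step the current state once per key bit."""
    acc0 = acc1 = 0
    for word in key_words:
        for bit in range(w):
            if (word >> bit) & 1:
                acc0 ^= state[0]
                acc1 ^= state[1]
            state = nxt(state)
    return (acc0, acc1)
```

So jump(s, [1 << 3, 0], 8, nxt) lands on the same state as three calls to nxt, and a key with several bits set gives the XOR of the corresponding stepped states, which is still a valid state further along the sequence because the generator is linear.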

Somewhere, somehow, some mathematician can prove that some pseudo random number generator algorithm with some parameters will run through all, or most, of its combinations of states before returning to where it started.

I have no idea how.

Of course, running through a huge sequence does not imply a random looking sequence. A simple binary counter will run through such a sequence, step by step in order. We need something better to defeat the tests of "randomness".

I think understanding C is the least of the problems in understanding how these things work.

Certainly brute force running and checking them does not help. The periods can be longer to work through than the age of the universe!

What is certainly true is that tweaking the number of bits in the state or messing with the parameters can ruin a good PRNG.

I'm starting to believe that the search for a better pseudo random number generator drives people insane. As this thread shows, I felt it myself before I got help from "PRNG Anonymous". For example, this from Stanford:

The top three lines of the source are a comment, so that's plain English at least. Basically, this particular jump() is no use for finding full-period triplets, because it is equivalent to only 2^64 calls to next(), so it would still have to be called 2^64 times to hop all the way around the 2^128 - 1 period.

So, presumably it's possible to build jump functions that step further than the square root of the period. But they're not inclined to explain, or maybe it's just really hard to work out. I have no idea.

What is certainly true is that tweaking the number of bits in the state or messing with the parameters can ruin a good PRNG.

Full period may be solved at a higher analytical level, but quality certainly isn't. Tony's link to Melissa's visualisation work is a good demo of how quality is assessed. And that's also why PractRand and BigCrush are so heavily relied on.

It's a bit of a dense read, but I think what David Roberts does is use xoroshiro128+ to create a 64-bit sum, then split that into a pair of 32-bit values to get two PRNs for the price of one iteration. He replaces bit 0 of the low long with its parity, on the basis that parity should be random.
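A Python sketch of that reading. Assumptions: the current reference triplet (24, 16, 37) for xoroshiro128+, "its parity" taken as the parity of the low 32-bit half (the same shape as the sum[15:0] version for XORO32), the high half returned unchanged, and a hypothetical function name.

```python
M64 = (1 << 64) - 1

def rotl64(x, k):
    """Rotate a 64-bit word left by k bits."""
    return ((x << k) | (x >> (64 - k))) & M64

def parity(x):
    """1 if x has an odd number of set bits, else 0."""
    return bin(x).count("1") & 1

def next_pair(s):
    """One xoroshiro128+ iteration on s = [s0, s1] (updated in place), returning
    (lo, hi) 32-bit values; bit 0 of lo is replaced by the parity of lo."""
    total = (s[0] + s[1]) & M64
    s1 = s[1] ^ s[0]
    s[0] = rotl64(s[0], 24) ^ s1 ^ ((s1 << 16) & M64)
    s[1] = rotl64(s1, 37)
    hi = total >> 32
    lo = total & 0xFFFFFFFF
    lo = (lo & 0xFFFFFFFE) | parity(lo)
    return lo, hi
```

The parity folds in bits 1..31 of the sum, which depend on carries, so unlike the original bit 0 it isn't a purely linear function of the state. Whether that's enough to satisfy PractRand's binary rank tests is for testing to decide.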

This would take two extra instructions with XORO32 on the P2, one to get the parity of sum[15:0] into C and the second to put C into sum[0]. To be honest, it wouldn't take a lot more to do two iterations but that would waste half the XORO32 bits.

I'm wondering whether the parity of the state could be used instead. If this produced equally random results and if the XORO32 instruction flagged parity (and zero) the same way as AND/OR/XOR/TEST, then only one extra instruction BITC sum,#0 would be needed to improve the XORO32 output, with no change to the underlying algorithm.
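The two XORO32 options can be compared side by side in a Python sketch of a single xoroshiro32+ step. Assumptions: a 16-bit output from the sum of two 16-bit state words, the illustrative triplet (14, 2, 7), and a hypothetical function name. A caution rather than a result: the parity of the state is an XOR of state bits, i.e. a linear function just like bit 0 of the sum, so it may share the same LFSR weakness, whereas the parity of the sum picks up carry-dependent bits.

```python
MASK16 = 0xFFFF

def rotl16(x, k):
    """Rotate a 16-bit word left by k bits."""
    return ((x << k) | (x >> (16 - k))) & MASK16

def parity(x):
    """1 if x has an odd number of set bits, else 0."""
    return bin(x).count("1") & 1

def xoroshiro32plus(state, bit0_source="sum", a=14, b=2, c=7):
    """Return the 16-bit sum with bit 0 replaced, then advance state = [s0, s1].
    bit0_source="sum": parity of sum[15:0]; "state": parity of the 32-bit state."""
    s0, s1 = state
    out = (s0 + s1) & MASK16
    if bit0_source == "sum":
        p = parity(out)
    else:
        p = parity(s0) ^ parity(s1)
    out = (out & 0xFFFE) | p
    s1 ^= s0
    state[0] = rotl16(s0, a) ^ s1 ^ ((s1 << b) & MASK16)
    state[1] = rotl16(s1, c)
    return out
```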

I've arbitrarily tried a few of those ideas myself. Most end up with total fails on PractRand, some have given equivalent quality ... hmm, I've not tried to use all bits at once though, since that part I'd long moved outside the generator code. Good point, I'll have another fiddle ...

That, and I've only just got the replacement CPU on warranty exchange from AMD. They took a tad longer than promised to get it out to me. I'm still trying to verify if the SMT flaw is completely solved or not. I think everyone is worried that AMD are not telling the whole story with it.

In my latest testing it has hiccuped once already, but under a specific set of circumstances that may not be directly related. Lots more testing still to go on this one as well ...

I ran tests dumping just the lower Dword of the Qword output and tests dumping just the upper Dword of the Qword output. It seemed that BRank failures only occurred with the lower Dword.

I did some further reading and found this with regard to Xoroshiro128+ and, I suspect, it is true for Xorshift128+ as well.

Beside passing BigCrush, this generator passes the PractRand test suite up to (and included) 16TB, with the exception of binary rank tests, which fail due to the lowest bit being an LFSR; all other bits pass all tests. We suggest to use a sign test to extract a random Boolean value.

So, we have an errant bit 0 in the Qword.
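Why bit 0 in particular: in s0 + s1 nothing carries into bit 0, so it equals bit 0 of s0 XOR s1, a purely linear function of the state, which is what makes it behave like an LFSR. Every higher bit also receives a carry, and the carries involve AND, which is nonlinear. A small Python check of that (function names are just for illustration):

```python
M64 = (1 << 64) - 1

def sum_bit0(s0, s1):
    """Bit 0 of the 64-bit truncated sum s0 + s1."""
    return ((s0 + s1) & M64) & 1

def xor_bit0(s0, s1):
    """Bit 0 of s0 XOR s1: nothing carries into bit 0, so this is the same bit."""
    return (s0 ^ s1) & 1

def sum_bit1(s0, s1):
    """Bit 1 of the sum written out: XOR of the bit 1s plus the carry out of bit 0,
    and that carry is an AND of the two bit 0s, which is not linear."""
    carry0 = s0 & s1 & 1
    return ((s0 >> 1) ^ (s1 >> 1) ^ carry0) & 1
```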

If we have two Dwords, X random and Y not random, then X Xor Y will be random. That is why the Xor method worked. However, bit 0 will still be errant for the double precision values.

We could simply discard the lower Dword, saving the time spent on the Xor method. However, we would then have to go past the 'engine' twice to collect two 'good' Dwords for the double precision values; not good for a 64-bit generator. Alternatively, we can try to rectify bit 0.
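The options above, sketched in Python for a single 64-bit output word x (hypothetical function names). The to_double conversion is, as I recall, the one Vigna recommends for these generators: shift right by 11 and scale by 2^-53, which never uses bit 0. That assumes each double is built from one 64-bit output rather than from two Dwords as in the tests here.

```python
import math

M32 = 0xFFFFFFFF

def fold_xor(x):
    """The Xor method: combine the upper and lower Dwords of a 64-bit output."""
    return ((x >> 32) ^ x) & M32

def upper_dword(x):
    """Discard the lower Dword (and its errant bit 0) entirely."""
    return x >> 32

def to_double(x):
    """Double in [0, 1) from the top 53 bits; bits 0..10, including bit 0, are dropped."""
    return math.ldexp(x >> 11, -53)
```

With to_double, one generator iteration still yields one double and the weak low bit never reaches it, at the cost of the 11 bits that wouldn't fit in the mantissa anyway.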

I tried the suggestion of using a sign test but all I got was a toggling of bit 0, which simply moved the location of a potential failure.

"We suggest to use a sign test ..." confused me for a while as it could be interpreted as a suggestion for solving the bit 0 "problem" by using the sign bit, but all it really means is choose the msb if you want a single random bit (because that's the most random one).

... but all it really means is choose the msb if you want a single random bit (because that's the most random one).

Yeah, I'm guessing that's entirely common knowledge in the field for this type of algorithm. Melissa refers to it that way. No doubt there is a math theory to back this up.

Chip asked for the msb as one of my score variants, and it's the only bit position I've singly sampled in my tests. PractRand scores on this variant have, without fail, squared for each extra pair of bits added to the state size. It's a little eye-popping for me.

BTW: The reason the "bit" variant takes so much longer to score is that it iterates 8 times and stuffs 8 msbs into one byte before piping that byte out to PractRand. PractRand is multithreaded and I've got 8 cores, so PractRand can absorb the data stream as fast as the generator can feed it.
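A Python sketch of that "bit" scoring variant, assuming a xoroshiro32+ style generator with the illustrative triplet (14, 2, 7) and msb-first packing (the bit order used in the actual harness isn't stated); the function names are hypothetical. It makes the cost obvious: eight generator iterations per output byte, where sampling the full 16-bit sum would give two bytes per iteration.

```python
MASK16 = 0xFFFF

def rotl16(x, k):
    """Rotate a 16-bit word left by k bits."""
    return ((x << k) | (x >> (16 - k))) & MASK16

def step(state, a=14, b=2, c=7):
    """Return the 16-bit xoroshiro32+ sum and advance state = [s0, s1] in place."""
    s0, s1 = state
    out = (s0 + s1) & MASK16
    s1 ^= s0
    state[0] = rotl16(s0, a) ^ s1 ^ ((s1 << b) & MASK16)
    state[1] = rotl16(s1, c)
    return out

def msb_byte(state):
    """Run the generator 8 times and pack each output's msb (the 'sign test' bit)
    into one byte, first sample in bit 7."""
    byte = 0
    for _ in range(8):
        byte = (byte << 1) | (step(state) >> 15)
    return byte

def msb_stream(state, n_bytes):
    """n_bytes of packed msbs, e.g. to write to sys.stdout.buffer and pipe to PractRand."""
    return bytes(msb_byte(state) for _ in range(n_bytes))
```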
