Chip,
Totally difference matter - I was just reading the original Xoroshiro128+ source again and noted a point stated about compilers will optimise the convoluted shifts and masks into a rotate instruction where they can. I knew that wasn't much help for my code since the word size is definable and therefore often won't fit the target processor ... then it dawned on me that that's exactly what a rotate instruction should be able to handle!

Hmm, I'm pondering the idea of adding a definable mask to the output of the ALU. Have a hidden register that holds the mask.

Its impact would be wide ranging and the details will take time to sort out - Something to think about for the Prop3 I guess.

"There's no huge amount of massive material
hidden in the rings that we can't see,
the rings are almost pure ice."

Totally difference matter - I was just reading the original Xoroshiro128+ source again and noted a point stated about compilers will optimise the convoluted shifts and masks into a rotate instruction where they can. I knew that wasn't much help for my code since the word size is definable and therefore often won't fit the target processor ... then it dawned on me that that's exactly what a rotate instruction should be able to handle!

What exactly are you suggesting here ?

There are two forms of Rotate in MCUs, one rotates a single-step, and another takes the rotate-count as a parameter.
It would be practical to modify the first form to have a reach, so allow ROTATE of any element width from 2-32 bits, and that would have immediate use for Rotate of 8b and 16b elements.
However, I'm not sure a Rotate with two params, one for rotate-count, and another for rotate-reach would be practical, seems to be quite logic hungry.

The rotate logic probably would be bit hideous, haha.
Don't worry, I was just musing on record.

ok.
The single-bit form could be useful, (and compact logic) if combined with REP.
Trades off a little speed for logic savings.
Saves logic, code is compact and covers any reach and any count, for those cases that require it.
I think REP can manage multiple lines, for > 32b, N shifts ?

I've been running some more tests and not doing so well ... I sat down and went over the numbers and worked out that, give or take quite a large margin for case variations, it's a 32x time multiplier per word size increase to check every combination. So the difference in time taken between a s16 full combination search and a s32 full combination search is a 32^16 = 1.2x10^24 (quadrillion, apparently) multiplier!

This issue applies heavily to the max-iteration search. I've tried to reduce the burden by limiting the combination ranges but it's not anywhere near enough. The inner most loop is a 4x per bit as a starting point.

"There's no huge amount of massive material
hidden in the rings that we can't see,
the rings are almost pure ice."

That's the whole iterator, now realized in the XORO32 instruction. It uses any cog register to track the 32-bit PRNG state. Evanh found that the ROL/SHL settings used in this implementation cause the PRNG to pass all random quality tests out to 1GB, whether from using the top bit, the top byte, or the top 15 bits of the sum of the two 16-bit fields that comprise the state. This is golden.

This all means that you can now have high-quality seedable and repeatable pseudo-random number generators in PASM, without having to implement any algorithm. This was modeled after the best PRNG topology known on the interwebs, and the very best settings were determined through a lot of testing. Thanks to Ahle, Evanh, and whoever came up with the xoroshiro128+ algorithm!

So that's just another instruction now, as in it's executed through the ALU using D source and D result. It's quite neat to think how compact that becomes.

PS: I've been offline for a week or so. Had the flu - couldn't sleep for a couple of days, then managed to sleep with an eyelash, or something in me eye. Damaged my eye enough that I couldn't sleep another night, went to doctor, got cleaned up but couldn't bare even looking at a display. Just slept for days.

All good now though.

"There's no huge amount of massive material
hidden in the rings that we can't see,
the rings are almost pure ice."

So that's just another instruction now, as in it's executed through the ALU using D source and D result. It's quite neat to think how compact that becomes.

PS: I've been offline for a week or so. Had the flu - couldn't sleep for a couple of days, then managed to sleep with an eyelash, or something in me eye. Damaged my eye enough that I couldn't sleep another night, went to doctor, got cleaned up but couldn't bare even looking at a display. Just slept for days.

I may as well dump out all the scores I've generated myself I guess ... score tables for word sizes 12 to 20 are attached. s20 took a couple of days on a Ryzen 8-core running at 3800MHz on all cores.

Next step is look into using GPU for doing the full-period candidate searches ... The CPU takes enormous time when the word size is notched up - With respect to the word size, the per-combination run time is a square law 4^x exponential law and the number of combinations to search is a cube law. Overall candidate search time multiplies the two together.

Yeah, that was a tad short-hand labelling:
- "Word" is the full summed word size minus the lsb, for s16 (Xoroshiro32+) that is the top 15 bits [15:1].
- "Byte3" is the most significant 8 bits of the summed word, for s16 that is bits [15:8].
- "Byte2" is half way down the summed word, for s16 that is bits [11:4].
- "Byte1" is always bits [8:1] for all word sizes.
- "Bit" is msb.

PS: You'll note I've included Xoroshiro's predecessor, Xorshift, for comparison. It performs notably worse on the PractRand scores even though the author indicates it's still a perfectly okay algorithm. To me that says that Xoroshiro is exceptionally good for it's compactness and speed.

"There's no huge amount of massive material
hidden in the rings that we can't see,
the rings are almost pure ice."

Thanks Doug,
The mathematicians that do the real work formulating these algorithms must work at an entirely different level though. Somehow, for a period of 2^128-1, the author of Xoroshiro must have picked full-period candidates without empirically ever having tested the period on even one combination.

"There's no huge amount of massive material
hidden in the rings that we can't see,
the rings are almost pure ice."