Random/LFSR on P2

Comments

There will be a way to do the jumping using a double iterator function. But it's not like there is a great need for this with a Xoroshiro32. It's just too small a state space. At 160 MHz and using the REP instruction, a cog can step XORO32 through 64k states in 0.4096 ms. So, about 13.1 ms to produce those 32 seeds above.
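For anyone wanting to check the timing figures, here is a minimal sketch of the arithmetic. It assumes an average of one clock per state (for example a two-clock XORO32 that advances two states per instruction), which is what the quoted numbers imply; the constant names are just for illustration.

```python
CLOCK_HZ = 160_000_000       # system clock quoted in the post
CLOCKS_PER_STATE = 1         # assumption: 2-clock XORO32 advancing 2 states
STATES_PER_SEED = 64 * 1024  # 64k states between seeds
SEEDS = 32                   # number of seeds mentioned in the post


def seconds_for_states(states, clocks_per_state=CLOCKS_PER_STATE, clock_hz=CLOCK_HZ):
    """Time needed to step through `states` generator states."""
    return states * clocks_per_state / clock_hz


per_seed_ms = seconds_for_states(STATES_PER_SEED) * 1e3
total_ms = per_seed_ms * SEEDS
print(f"{per_seed_ms:.4f} ms per seed, {total_ms:.1f} ms for {SEEDS} seeds")
```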

The jump function needs single-stepping of states which XORO32 can no longer do, unless it is modified. The easiest way, avoiding use of CZ opcode bits, would be to move XORO32 to an empty D,S slot with S selecting single or double iterations.

Any jump, no matter how big, would take under 2 µs at 160 MHz. Here's a good one:

I've iterated xoroshiro32+ for the full 2^32-1 period to check each bit of the sum has one more '1' than '0' (as zero state is invalid). This is guaranteed with the original algorithm but we have changed bit 0 and I tested the other bits anyway. Each of sum[16:0], the parity of sum[15:0] and the parity of sum[16:0] do indeed have one more '1'.
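A small sketch of why the balance is guaranteed for the unmodified sum. A full-period generator visits every nonzero state exactly once, so tallying over all nonzero (s0, s1) pairs gives the same counts as iterating the whole period. This uses a reduced word width to keep it fast, and it only covers the low `width` bits of the plain sum and their parity, not the carry bit or a modified bit 0. The function name is made up for the example.

```python
def sum_bit_balance(width):
    """Return (ones - zeros) for each bit of (s0 + s1) mod 2**width and for
    the parity of that sum, tallied over every nonzero (s0, s1) state."""
    mask = (1 << width) - 1
    bit_balance = [0] * width
    parity_balance = 0
    for s0 in range(1 << width):
        for s1 in range(1 << width):
            if s0 == 0 and s1 == 0:
                continue  # the all-zero state is invalid
            total = (s0 + s1) & mask
            for bit in range(width):
                bit_balance[bit] += 1 if (total >> bit) & 1 else -1
            parity_balance += 1 if bin(total).count("1") & 1 else -1
    return bit_balance, parity_balance


bits, parity = sum_bit_balance(8)
print(bits, parity)
```

Each entry is the surplus of ones over zeros, so a value of 1 means "one more '1' than '0'".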

Clearly, you have some code or tool at your disposal for jump key generation for the regular version of Xoroshiro. I wonder what it would take to modify that to handle a double iterating version like XORO32?

BTW: I tested the REP idea and it functionally works without error and at estimated speed. The S port insertion at the ALU doesn't corrupt the working state.

Chip,
Many thanks for the CORDIC! It's so painless and convenient to use that you'd hardly know it wasn't in the Cog.

Thanks. With the built-in K-factor compensation, it really became nice to use. To preserve the compensation and meet timing for OnSemi, I had to add 16 stages to it. That seemed like a lot, but to have lost the compensation would have been huge. That thing does stuff (rotate, vector, log, exp) that would take very lengthy routines to accomplish. Having those things in hardware is like sci-fi, to me. You can put a few instructions together and do amazing things. We really need the silicon with analog I/O to demonstrate what can be done with the CORDIC.

Tony,
For me, the main use for a key generator would be finding all the full-period triplets of larger word sizes, e.g. Xoroshiro48. By the looks of it, it can reach any arbitrary position in the state sequence.

Though we'd have to verify the behaviour when jumping past the end of a given period.

With each CORDIC iteration, X and Y grow. With infinite iterations, you get a fixed growth factor of about 1.6468, which is the product of sqrt(1 + 2^-2i) over all iterations i.

After maybe a dozen iterations, you are pretty much there. We've got 32 iterations.

Anyway, this growth needs to be compensated for by multiplying the X and Y results by ~0.607. I made a BASIC program to find at what stages we could subtract a right-shifted amount of the current value from the current value, in order to arrive at the right compensation. It took 16 of these simple subtractors, but they each needed a stage, as I couldn't stack up two 40-bit adders in a single clock cycle. The reason the adders are 40 bits, instead of 32, is so that 8 fractional bits can be maintained. With those fractional bits and rounding, you get binary-perfect results.
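The growth factor and the shift-and-subtract idea can be sketched as below. This is not Chip's BASIC program: the greedy search is an assumption made for illustration, it models "value minus right-shifted value" as a multiply by (1 - 2^-s) in floating point, and it may well choose a different number of stages than the 16 used in the silicon.

```python
import math


def cordic_gain(iterations=32):
    """Product of sqrt(1 + 2**(-2*i)) for i = 0 .. iterations-1."""
    gain = 1.0
    for i in range(iterations):
        gain *= math.sqrt(1.0 + 2.0 ** (-2 * i))
    return gain


def shift_subtract_stages(target, max_shift=32):
    """Greedy search: at each shift s, apply value -= value >> s (modelled as
    multiplying by 1 - 2**-s) for as long as the running factor stays >= target."""
    factor = 1.0
    shifts = []
    for s in range(1, max_shift + 1):
        step = 1.0 - 2.0 ** -s
        while factor * step >= target:
            factor *= step
            shifts.append(s)
    return shifts, factor


gain = cordic_gain()
shifts, factor = shift_subtract_stages(1.0 / gain)
print(f"gain = {gain:.6f}, 1/gain = {1.0 / gain:.6f}")
print(f"{len(shifts)} stages at shifts {shifts}, factor = {factor:.9f}")
```

Comparing `cordic_gain(12)` with `cordic_gain(32)` shows how little the factor moves after a dozen iterations.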

Huh, that one is not in the Wikipedia disambiguation list for K-factor. I'll have a go at adding it ... I put an entry in the talk page instead. I think it would have needed a separate article otherwise.

Brute force and ignorance of how to do it properly. Choose a start state, then iterate xoroshiro32+ by the desired jump size to get an end state. Then, for each possible 32-bit pattern, run the jump function from the start state and test whether the resulting jump state equals the end state, stopping when a match is found or the patterns run out.

If the pattern bits are shifted out one at a time, the jump function can exit early when the remaining bits are all zero to save time. Finding the 64K and 128K jump bit patterns was quick as they are fairly close to the starting bit pattern count of zero, however I couldn't find any matching 32-bit pattern for 256K and 512K jumps, which took quite a while to fail.
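The brute-force approach above can be sketched like this. It is a Python illustration, not the x86 code: it assumes the standard xoroshiro state update with the [14,2,7] triplet quoted in this thread, and the function names are invented for the example. The jump function single-steps the state once per key bit, whether or not the bit is set, and exits early once the remaining key bits are all zero.

```python
MASK16 = 0xFFFF
A, B, C = 14, 2, 7  # triplet quoted in this thread for xoroshiro32+


def rotl16(x, k):
    return ((x << k) | (x >> (16 - k))) & MASK16


def step(state):
    """Advance the (s0, s1) state by one iteration (assumed standard update)."""
    s0, s1 = state
    s1 ^= s0
    return (rotl16(s0, A) ^ s1 ^ ((s1 << B) & MASK16), rotl16(s1, C))


def iterate(state, n):
    for _ in range(n):
        state = step(state)
    return state


def jump(state, key):
    """XOR together the states reached at each '1' bit position of key."""
    j0 = j1 = 0
    while key:               # early exit once the remaining bits are all zero
        if key & 1:
            j0 ^= state[0]
            j1 ^= state[1]
        state = step(state)  # single-step regardless of the bit value
        key >>= 1
    return (j0, j1)


def find_key(start, distance, limit=1 << 16):
    """Count keys up from 1 until one reproduces `distance` single steps."""
    target = iterate(start, distance)
    for key in range(1, limit):
        if jump(start, key) == target:
            return key
    return None


start = (1, 0)
key = find_key(start, 12)
print(hex(key))
```

The search is only practical here because the jump distance is tiny; counting up through 32-bit patterns for a 64K jump is far too slow in Python.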

The tool was MASM and the code x86 assembler. The jump function needs single-stepping through the states, as '1' bits could be anywhere in the bit pattern, not just the even bits. I think a double-iterating jump function is impossible. XORO32 that can double- and single-step is possible.

Well, the Prop can already do a 32-bit parity in a single instruction. So if we went back to the first incarnation of XORO, and have the second half of the state being forwarded on to the following instruction's S-port, then a XORO64 would work quite well.

Sebastiano Vigna kindly emailed me all the xoroshiro32 characteristic polynomials, essential info for creating jump functions. The weight is the number of terms in the polynomial and in theory a weight close to 16 is preferable. Our best triplet [14,2,7] has a weight of 11 and the second best [15,3,6] has a weight of 13.

I hadn't looked closely at all the full-period triplets before. For every [a,b,c] there is a corresponding [c,b,a] that shares the same characteristic polynomial. Knowing this in advance means brute force searches can be reduced by half.

Hey Tony, thanks for getting that. It perfectly matches our 84 brute forced findings. I only just really looked at it now.

And the recommendation of targeting only the ones of closely matching "degree" with word size will be valuable with what's coming up ... I've spoken with Sebastiano again and, same as you, I just had to ask and he immediately popped me a complete list of full-period candidates for a Xoroshiro64 iterator! Amazing stuff. It isn't nicely sorted in mirror pairs like your one was though.

We need only two registers, for 32 bits of state and 32 bits of PRN.
At six clocks per iteration and 160 MHz, the full period takes over 20000 years.
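A quick check of that period figure, assuming a full period of 2^64 - 1 states and one state per six-clock iteration (both assumptions read off the post, and the constant names are illustrative):

```python
CLOCK_HZ = 160_000_000
CLOCKS_PER_ITERATION = 6
PERIOD = 2**64 - 1                     # assumption: full-period 64-bit state
SECONDS_PER_YEAR = 365.25 * 24 * 3600

years = PERIOD * CLOCKS_PER_ITERATION / CLOCK_HZ / SECONDS_PER_YEAR
print(f"about {years:,.0f} years")
```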

Only outstanding issue is parity for bit 0. XORO64 could report changed bit 0 from normal addition in the carry flag.

I think my XORO64 idea has not been understood fully. There is no need to save s1 separately because prn = s0 + s1. The three instructions above are the entire code if original sum[0] is acceptable. If XORO64 could flag when sum[0] and full parity are different (Z might be better choice than C), then code would increase to five instructions.
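To illustrate the "no need to save s1" point, here is a sketch comparing a conventional step that keeps s0 and s1 with one that keeps only s0 and prn, recovering s1 as prn - s0. The rotate and shift constants are an assumption borrowed from Vigna's xoroshiro64* and are not from this thread, and this is plain Python, not the three-instruction P2 sequence.

```python
MASK32 = 0xFFFFFFFF
A, B, C = 26, 9, 13  # assumption: constants from Vigna's xoroshiro64*, not from this thread


def rotl32(x, k):
    return ((x << k) | (x >> (32 - k))) & MASK32


def step_three_values(s0, s1):
    """Conventional form: keep s0 and s1, return the new state and its sum."""
    s1 ^= s0
    s0 = rotl32(s0, A) ^ s1 ^ ((s1 << B) & MASK32)
    s1 = rotl32(s1, C)
    return s0, s1, (s0 + s1) & MASK32


def step_two_registers(s0, prn):
    """Keep only s0 and prn; recover s1 as prn - s0 before stepping."""
    s1 = (prn - s0) & MASK32
    s0, s1, prn = step_three_values(s0, s1)
    return s0, prn


# Both forms should stay in lockstep from the same starting state.
s0, s1 = 1, 2
r0, prn = s0, (s0 + s1) & MASK32
for _ in range(1000):
    s0, s1, expected = step_three_values(s0, s1)
    r0, prn = step_two_registers(r0, prn)
    assert (r0, prn) == (s0, expected)
```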