Bug Description

When running benchmarks on the POWER8 platform, which offers as many as 160 concurrent threads (20 physical CPU cores * 8 threads per core), we've seen sysbench spending huge amounts of CPU time in lrand48(). The effect was visible as decreased performance in OLTP read/write benchmarks at high concurrency. This is partly an effect of sysbench itself eating CPU, and partly an effect of cache line pollution caused by the shared global RNG state.

Attached is a patch that changes the sb_rnd() function to use a thread-local RNG. It's a straightforward implementation of a 32-bit LCG; more specifically, I used the first parameter set from https://en.wikipedia.org/wiki/Linear_congruential_generator. The implementation uses autoconf to check for thread-local storage. If the build environment doesn't provide it, it falls back to the old behavior.
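
Roughly, the idea looks like this (a sketch only, not the actual patch; I'm showing the Numerical Recipes constants a = 1664525, c = 1013904223, m = 2^32, and the seeding and SB_MAX_RND value below are simplified placeholders):

    /*
     * Sketch of a per-thread 32-bit LCG. The modulus 2^32 is implicit
     * in the uint32_t wrap-around; __thread is what the autoconf
     * thread-local storage check would detect.
     */
    #include <stdint.h>

    #define SB_MAX_RND 0x80000000u               /* placeholder value */

    static __thread uint32_t sb_rnd_state = 1;   /* thread-local RNG state */

    static inline uint32_t sb_rnd_local(void)
    {
      sb_rnd_state = sb_rnd_state * 1664525u + 1013904223u;
      return sb_rnd_state % SB_MAX_RND;          /* project onto the sysbench range */
    }
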

With the patch applied the peak throughput went up by 20% from 20K tps to 24K tps. I'd like to see this change in sysbench trunk soon.

1. SB_MAX_RND is *not* the modulus of the LCG. The modulus is 2^32, but the output of the LCG is projected onto a smaller range by the (seed % SB_MAX_RND) operation. There is no way this could cause the resulting period length to become shorter than SB_MAX_RND itself.

2. The current implementation does the same thing. In sysbench.h, sb_rnd() is essentially defined as (lrand48() % SB_MAX_RND), meaning the 31 random bits returned from lrand48() are projected onto a smaller range.

Furthermore, I think the modulus operation above is wrong. If we want to project a random number from a larger range into the interval [0..SB_MAX_RND] (limits included), then we should use (lrand48() % (SB_MAX_RND+1)). This is also more efficient, because whenever SB_MAX_RND+1 is a power of two it can be implemented as (rand_input & SB_MAX_RND), thus avoiding a division.
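
A quick sanity check of that equivalence (it only holds when SB_MAX_RND has the all-bits-set form 2^k - 1; the value used below is just an example, not the real sysbench constant):

    /* Demonstrates that x % (max + 1) == (x & max) when max + 1 is a
     * power of two. */
    #include <assert.h>
    #include <stdint.h>
    #include <stdlib.h>

    int main(void)
    {
      const uint32_t max_rnd = 0x3FFFFFFFu;      /* example: 2^30 - 1 */
      for (int i = 0; i < 1000000; i++)
      {
        uint32_t r = (uint32_t) lrand48();       /* 31 random bits */
        assert(r % (max_rnd + 1) == (r & max_rnd));
      }
      return 0;
    }
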

I didn't read all the code, but do you remember why there is a fixed value of SB_MAX_RND at all? The only place in the code where I see it being used is when the result from sb_rnd() is transformed into a DOUBLE value in [0..1]. Wouldn't it be much better to choose SB_MAX_RND according to the RNG that is picked by autoconf? Then SB_MAX_RND would be 2^31-1 for lrand48() (which returns 31 bits) and 2^32-1 for sb_rnd_local() (which returns 32 bits).
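
For illustration, the conversion I mean is essentially the following (the names and constant here are illustrative, not the actual sysbench code):

    /* Illustrative only: map the RNG output to a double in [0, 1].
     * With SB_MAX_RND tied to the generator, the divisor would be
     * 2^31 - 1 for lrand48() and 2^32 - 1 for a 32-bit LCG. */
    #include <stdlib.h>

    #define MAX_RND_LRAND48 0x7FFFFFFFu     /* 2^31 - 1, lrand48() range */

    static inline double rnd_double_example(void)
    {
      return (double) lrand48() / (double) MAX_RND_LRAND48;
    }
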

> 1. SB_MAX_RND is *not* the modulus of the LCG. The modulus is 2^32, but
> the output of the LCG is projected onto a smaller range by the (seed %
> SB_MAX_RND) operation. There is no way this could cause the resulting
> period length to become shorter than SB_MAX_RND itself.
>

Right, I missed that the result of (seed % SB_MAX_RND) is not assigned back to seed.

Anyway, I don’t want to change the default RNG behavior. All changes
should be optional and off by default.

Just for the record, there is another option to attack this problem: reduce the number of lrand48() calls. 31 random bits are enough to generate 9 decimal digits using division (possibly slower), or 7 decimal digits using bitwise shifts (possibly faster).

Calling lrand48() 7x or 9x less often per transaction should solve this problem, at least for the given core/thread count.
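
Roughly what I mean, for the division variant (the helper is made up for illustration; the last digit extracted would be slightly biased because the lrand48() range, 2^31, is not a multiple of 10^9):

    /* Sketch: peel 9 decimal digits out of a single lrand48() call
     * instead of calling it once per digit. */
    #include <stdlib.h>

    static void rnd_digits9(char *buf /* at least 9 bytes */)
    {
      long r = lrand48();                /* 31 random bits, < ~2.1e9 */
      for (int i = 0; i < 9; i++)
      {
        buf[i] = (char) ('0' + r % 10);
        r /= 10;
      }
    }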