This question has been asked on stackoverflow. It uses additional random entropy and a hash method (among others) to try to create a cryptographically secure pseudo-random number generator for PHP. PHP seems to use a Mersenne Twister algorithm with a large internal state and a long period, but Wikipedia assures me that Mersenne Twister is not cryptographically secure.

Q: Could somebody please indicate what vulnerabilities arise from using the PHP Mersenne Twister implementation as if it were cryptographically secure?

Additional Q: It would be very nice if somebody could go over to stackoverflow to see if the solution of H M is any better than using the default method and if it can be improved. Of course, the only really good way is to hold it against BSI and NIST test suites, but any improvement on the default implementation may be useful.

The source code in the stackoverflow question should be pretty easy to read, even for persons that are more mathematically inclined. Just as reference, I've included the source of the current methods within Zend PHP (which indeed seems to lack any kind of cryptographical algorithms).

Code is a partial copy of the Zend source for cryptographic analysis only. The ZEND source is protected by the PHP 3.01 license.

PS: the algorithm is initialized with just a 32 or 64 bit seed, which is why I proposed on stackoverflow to at least reseed it thoroughly. I wonder whether that would be enough, though; I proposed it mostly because having users create their own cryptographic algorithms is frowned upon.

1 Answer

Well, the chief vulnerability is that if an attacker is given a large enough sample of Mersenne Twister output, he can then predict future (and past) outputs. This is a gross violation of the properties that a cryptographically secure random number generator is supposed to have (where you're supposed to not even be able to tell if the random bit string could have been produced by the RNG in question).

As for how this weakness may be exploited, well, consider what happens if you use the output of the MT as a keystream for a stream cipher (that is, you exclusive-or it with the plaintext to form the ciphertext). Then, if the attacker sees an encrypted message and correctly guesses a part of the plaintext which is 2496 bytes long, he can immediately recover the internal MT state (and decrypt the entire message). Even worse, if the attacker has a section of ciphertext 19968 bytes long for which he knows that the corresponding plaintext had all MSBits clear (say, he knows that section was standard ASCII), then he learns 19968 keystream bits and can use them to decrypt the entire message (even though he initially had no idea beyond that what the contents of that section were).
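The known-plaintext step can be sketched in Python, whose `random` module also happens to use a Mersenne Twister. This is only an illustration under assumptions: the toy cipher, the helper names `mt_keystream` and `xor_bytes`, the seed and the message are all made up here and are not taken from PHP or from the stackoverflow code.

```python
import random

def mt_keystream(seed, n_bytes):
    """Toy keystream: concatenated 32-bit Mersenne Twister outputs (insecure)."""
    rng = random.Random(seed)
    out = bytearray()
    while len(out) < n_bytes:
        out += rng.getrandbits(32).to_bytes(4, "little")
    return bytes(out[:n_bytes])

def xor_bytes(a, b):
    """XOR two byte strings of equal length."""
    return bytes(x ^ y for x, y in zip(a, b))

# Victim: encrypts by XORing the plaintext with the MT keystream.
plaintext = b"Dear Bob, the meeting is at noon. " * 100
keystream = mt_keystream(2012, len(plaintext))
ciphertext = xor_bytes(plaintext, keystream)

# Attacker: knows (or guesses) 2496 plaintext bytes and sees the ciphertext.
known_plaintext = plaintext[:2496]
recovered_keystream = xor_bytes(ciphertext[:2496], known_plaintext)
```

Here `recovered_keystream` is exactly the raw generator output for that section, which is all the attacker needs as input for the state recovery described in the rest of this answer.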

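The first weakness (2496 bytes of consecutive output allow the attacker to recover the MT state) is fairly straightforward to explain. Consider the last four operations of the MT immediately before the output: the state word s1 is exclusive-ored, in turn, with a copy of itself shifted right by 11 bits, a copy shifted left by 7 bits and masked with a constant, a copy shifted left by 15 bits and masked with another constant, and finally a copy shifted right by 18 bits.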

If we look closely, we see that these operations are all invertible; that is, if we were given a 4 byte output of the MT, we can compute the value s1 had before these operations took place.
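As a rough Python sketch of that inversion (not the Zend C source): `temper` applies the standard MT19937 output transformation, and `untemper` undoes it step by step. The function names are my own; the shift amounts and masks are the usual MT19937 constants.

```python
MASK32 = 0xFFFFFFFF

def temper(s1):
    """The four output operations of MT19937, applied to one state word."""
    s1 ^= s1 >> 11
    s1 ^= (s1 << 7) & 0x9D2C5680
    s1 ^= (s1 << 15) & 0xEFC60000
    s1 ^= s1 >> 18
    return s1 & MASK32

def undo_right(y, shift):
    """Invert x ^= x >> shift. The top `shift` bits of y are already correct;
    every pass fixes another `shift` bits below them."""
    x = y
    for _ in range(32 // shift + 1):
        x = y ^ (x >> shift)
    return x

def undo_left(y, shift, mask):
    """Invert x ^= (x << shift) & mask, working up from the low bits."""
    x = y
    for _ in range(32 // shift + 1):
        x = y ^ ((x << shift) & mask)
    return x & MASK32

def untemper(y):
    """Recover the state word from one 32-bit output (steps in reverse order)."""
    y = undo_right(y, 18)
    y = undo_left(y, 15, 0xEFC60000)
    y = undo_left(y, 7, 0x9D2C5680)
    y = undo_right(y, 11)
    return y
```

Each step only mixes a word with shifted copies of itself, so some bits of the input always survive unchanged in the output, and the rest can be peeled off from there.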

Now, what MT does is update its state, and then output its entire state (disguised by the above transformation). Now, an attacker can invert this transformation, and so directly observe the MT state. All he needs is 624 words of output (or 2496 bytes), and he has learned the entire state; he can advance the MT forwards (or backwards; the update function is invertible), and generate as much output as he wants.
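To make this concrete, here is a sketch that clones Python's own Mersenne Twister (the `random` module) from 624 observed 32-bit outputs. It targets Python's generator rather than PHP's, purely for illustration; `clone_mt` and the helper names are hypothetical, and the seed is arbitrary.

```python
import random

def undo_right(y, shift):
    """Invert x ^= x >> shift."""
    x = y
    for _ in range(32 // shift + 1):
        x = y ^ (x >> shift)
    return x

def undo_left(y, shift, mask):
    """Invert x ^= (x << shift) & mask."""
    x = y
    for _ in range(32 // shift + 1):
        x = y ^ ((x << shift) & mask)
    return x & 0xFFFFFFFF

def untemper(y):
    """Undo the MT19937 output transformation for one 32-bit word."""
    y = undo_right(y, 18)
    y = undo_left(y, 15, 0xEFC60000)
    y = undo_left(y, 7, 0x9D2C5680)
    return undo_right(y, 11)

def clone_mt(outputs):
    """Build a generator that continues the stream after 624 observed words."""
    assert len(outputs) == 624
    state = tuple(untemper(word) for word in outputs)
    clone = random.Random()
    # Position 624 means "all words consumed": the next call reloads the state.
    clone.setstate((3, state + (624,), None))
    return clone

victim = random.Random(123456789)                      # seed unknown to the attacker
observed = [victim.getrandbits(32) for _ in range(624)]
clone = clone_mt(observed)

predicted = [clone.getrandbits(32) for _ in range(10)]
actual = [victim.getrandbits(32) for _ in range(10)]
```

After this runs, `predicted` and `actual` hold the same ten values: the attacker never saw the seed, only 2496 bytes of output.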

Now, one possible nit is that, for this straightforward approach to work, these 2496 bytes must correspond to one MT reload cycle (that is, start at a multiple of 2496 bytes from the start of the stream). While this would be easy to work around (if we're given 4991 consecutive bytes of output, then there will be a 2496 byte section that lives entirely within a reload cycle), it turns out that we needn't bother, because it's broken even worse than that.

The second weakness is that, if we can get just about any sample of 19968 bits of output, we can also recover the MT state, and it doesn't matter if all those bits are from the same MT reload cycle or spread across multiple cycles. As for how this works, well, when we examine the MT update and disguise functions, we see that they're all linear in GF(2); this means that all outputs are linear functions of the 19968 initial state bits. Hence, if we get 19968 output bits, we can form 19968 linear equations on the initial state bits, and solve that set of linear equations. That gives us the initial state, and from that, we can compute everything. Now, some sets of 19968 linear equations might not turn out to be linearly independent; in practice, all that means is that we might need a couple of extra known bits, or just have to deal with a couple of different possibilities for the initial state; neither of these is a significant difficulty for an attacker.
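The linearity claim is easy to check empirically. The sketch below again uses Python's `random` module as a stand-in for any MT19937 implementation; `raw_state` and `generator_from_words` are hypothetical helper names. If the state of one generator is the XOR of the states of two others, then every output word should be the XOR of the other two outputs, including across reloads.

```python
import random

N = 624  # number of 32-bit words in the MT19937 state

def raw_state(seed):
    """The 624 state words Python's MT holds right after seeding."""
    return list(random.Random(seed).getstate()[1][:N])

def generator_from_words(words):
    """A generator starting from the given state words (position N = reload next)."""
    rng = random.Random()
    rng.setstate((3, tuple(words) + (N,), None))
    return rng

a = raw_state(1)
b = raw_state(2)
c = [x ^ y for x, y in zip(a, b)]   # bitwise sum of the two states over GF(2)

gen_a = generator_from_words(a)
gen_b = generator_from_words(b)
gen_c = generator_from_words(c)

# 2000 outputs span several reload cycles (624 words each).
linear = all(
    gen_c.getrandbits(32) == gen_a.getrandbits(32) ^ gen_b.getrandbits(32)
    for _ in range(2000)
)
```

Because `linear` comes out true, each output bit is a fixed XOR of state bits, which is exactly what lets the attacker write down and solve the system of equations.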

Hence, the bottom line is that Wikipedia is right: Mersenne Twister is not even slightly cryptographically secure, and should not be used as if it were.

Thanks! Would simply hashing the output of the algorithm, together with some time & I/O based information, be enough to get a "secure enough" output? Is it possible for you to see if the method of "H M" at stackoverflow is at all valid? I presume it will be safer because of the hash hiding the state, but that is not enough to say it is a provably secure algorithm IMHO.
– Maarten Bodewes, Mar 31 '12 at 19:37

@owlstead: Do you have a link to the stackoverflow question? If you're just giving the output of MT to a hash function, that'd probably be safe (if a bit slow). On the other hand, just hashing the value of a counter is also practically speaking safe (and large counters both have long cycle times and are even faster to update than MT). As for "provability", that's generally in reference to some cryptographic assumption (e.g. factoring is hard); is "you can't find preimages to SHA-512 given related plaintexts" an acceptable assumption?
– poncho, Apr 3 '12 at 18:37
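A minimal sketch of the hashed-counter idea from the comment above, in Python. The class name `HashCounterRng`, the 32-byte seed and the 16-byte counter encoding are assumptions made for illustration; in particular, this sketch assumes the counter is hashed together with a secret seed taken from the operating system, which the comment does not spell out.

```python
import hashlib
import os

class HashCounterRng:
    """Output block i is SHA-512(seed || i); the seed must stay secret."""

    def __init__(self, seed=None):
        # Assumption: 32 bytes from the OS entropy source when no seed is given.
        self._seed = os.urandom(32) if seed is None else seed
        self._counter = 0

    def next_block(self):
        """Return the next 64 output bytes and advance the counter."""
        counter_bytes = self._counter.to_bytes(16, "big")
        self._counter += 1
        return hashlib.sha512(self._seed + counter_bytes).digest()

rng = HashCounterRng()
block = rng.next_block()   # 64 bytes per call
```

Unlike the MT, the only state that changes is the counter, and recovering the seed from the output would require attacking the hash itself.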

Ah, the link on top is not that visible I guess: here it is again. I deleted my own answer, as it was wrong. He's mixing the output and does some other things with it as well, but the main part is indeed a secure hash method. You can just hash the MT output, but an attacker might be able to get at the MT in other ways and retrieve the internal state. In that case hashing does not help.
– Maarten Bodewes, Apr 3 '12 at 23:14