Primal Fear: Demuddling The Broken Moduli Bug

There’s been a lot of talk about this supposed vulnerability in RSA, independently discovered by Arjen Lenstra and James P. Hughes et al, and Nadia Heninger et al. I wrote about the bug a few days ago, but that was before Heninger posted her data. Let’s talk about what’s one of the more interesting, if misunderstood, bugs in quite some time.

The “weak RSA moduli” bug is almost (and possibly) exclusively found within certificates that were already insecure (i.e. expired, or not signed by a valid CA).

This attack almost certainly affects not a single production website.

The attack utilizes a property of RSA whereby if half the private key material is shared between two public keys, the private key is leaked. Researchers scaled this method to cross-compare every RSA key on the Internet against every other RSA key on the Internet.

The flaw has nothing to do with RSA or “multi-secret” systems. The exact same broken random number generator would play just as much havoc, if not more, with “single-secret” algorithms such as ECDSA.

DSA, unlike RSA, leaks the private key with every signature under conditions of faulty entropy. That is arguably worse than RSA, which leaks its private key only during generation, only if a similar device emits the same key, and only if the attacker finds both devices’ keys.

The first major finding is that most devices offer no crypto at all, and even when they do, the crypto is easily man-in-the-middled due to a presumption that nobody cares whether the right public key is in use.

Cost and deployment difficulty drive the non-deployment of cryptographic keys even while almost all systems acquire enough configuration for basic connectivity.

DNSSEC will dramatically reduce this cost, but can do nothing if devices themselves are generating poor key material and expecting DNSSEC to publish it.

The second major finding is that it is very likely that these findings are only the low hanging fruit of easily discoverable bad random number generation flaws in devices. It is specifically unlikely that only a third of one particular product had bad keys, and the rest managed to call quality entropy.

This is a particularly nice attack in that no knowledge of the underlying hardware or software architecture is required to extract the lost key material.

Recommendations:

Don’t panic about websites. This has very little to absolutely nothing to do with them.

When possible and justifiable, generate private key material outside your embedded devices, and push the keys into them. Have their surrounding certificates signed, if feasible.

Audit smartcard keys.

Stop buying or building CPUs without hardware random number generators.

Revisit truerand, an entropy source that only requires two desynchronized clocks, possibly integrating it into OpenSSL and libc.

When doing global sweeps of the net, be sure to validate that a specific population is affected by your attack before including it in the vulnerability set.

Start seriously looking into DNSSEC. You are deploying a tremendous number of systems that nobody can authenticate.

INTRODUCTION
If there’s one thing to take away from this entire post, it’s the following line from Nadia Heninger’s writeup:

Only one of the factorable SSL keys was signed by a trusted certificate authority and it has already expired.

What this means, in a nutshell, is that there was never any security to be lost from crackable RSA keys; due to failures in key management, almost certainly all of the affected keys were vulnerable to being silently “swapped out” by a Man-In-The-Middle attacker. It isn’t merely the fact that all the headlines proclaiming “0.2% of websites using RSA are insecure” are straight up false, because the flaws are concentrated on devices. It’s also the case that the devices themselves were already using RSA insecurely to begin with, merely by being deployed outside a key management system.

If there’s a second thing to take away, it’s that this is important research with real actions that should be taken by the security community in response. There’s no question this research is pointing to things that are very wrong with the systems we depend on. Do not make the mistake of discounting this work as merely academic. We have a problem with random numbers that is even larger than what Lenstra, Hughes, and Heninger are finding with their particular mechanism.

As a refresher, RSA works by:

1) Generating two random primes, p and q. These are private.
2) Multiplying p and q together into n. This becomes public, because figuring out the exact primes used to create n is Hard.
3) Content is signed or decrypted with p and q.
4) Content is verified or encrypted with n.
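
The four steps can be sketched in a few lines of Python. This is a toy under stated assumptions: textbook RSA with no padding, tiny primes, and an illustrative public exponent e = 17. None of these values come from a real key, and the private exponent d is simply the usual way of packaging what knowledge of p and q buys you.

```python
# Toy RSA walk-through (textbook RSA, no padding; never use this for real data).
p, q = 61, 53                 # step 1: two primes, kept private
n = p * q                     # step 2: the public modulus
e = 17                        # public exponent (small illustrative choice)
phi = (p - 1) * (q - 1)       # only computable by someone who knows p and q
d = pow(e, -1, phi)           # private exponent (modular inverse, Python 3.8+)

message = 65
ciphertext = pow(message, e, n)              # step 4: encrypt with the public values
recovered = pow(ciphertext, d, n)            # step 3: decrypt with the private exponent

signature = pow(message, d, n)               # step 3: sign with the private exponent
verified = pow(signature, e, n) == message   # step 4: verify with the public values
```

Anyone who learns p (and therefore q) can recompute phi and d, which is why leaking a single prime is equivalent to leaking the whole private key.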

It’s obvious that if p is not random, q can be calculated: n/p==q, because p*q==n. What’s slightly less obvious, however, is that if either p or q is repeated across multiple n’s, then the repeated value can be trivially extracted using an ancient algorithm by Euclid for computing the Greatest Common Divisor. The trick looks something like this (you can follow along with Python, if you like):
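
The setup can be sketched as follows, using only the standard library rather than gmpy. The value of p matches the output shown in this post; q and q2 are stand-in primes chosen purely for illustration.

```python
from math import gcd

p = 136093819        # the shared prime that appears in this post's output
q = 1000000007       # illustrative stand-in for the first key's second prime
q2 = 998244353       # illustrative stand-in for the second key's second prime

n = p * q            # first public modulus
n2 = p * q2          # second public modulus, generated with the same p

shared = gcd(n, n2)  # Euclid's algorithm, run on public values only
```

Here `shared` comes out equal to p, and `n // shared` and `n2 // shared` give back q and q2 respectively.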

Ah, now we have n2, which is the combination of a previously used p, and a newly generated q2. But look what an attacker with n and n2 can do:

>>> gcd(n,n2)
mpz(136093819)
>>> p
mpz(136093819)

Ta-da! p falls right out, and thus q as well (since again, n/p==q). So what these teams did was gcd every RSA key on the Internet, against every other RSA key on the Internet. This would of course be quite slow — six million keys times six million keys is thirty six trillion comparisons — and so they used a clever algorithm to scale the attack such that multiple keys could be simultaneously compared against one another. It’s not clear what Lenstra and Hughes used, but Heninger’s implementation leveraged some 2005 research from (who else?) Dan Bernstein.
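
To give a feel for how many keys can be compared at once, here is a simplified sketch of the batch idea. It is not the researchers’ implementation: Bernstein’s method uses product and remainder trees to keep the big-number arithmetic fast, while this version keeps only the central identity and multiplies everything together directly. The function name `batch_gcd` and the tiny moduli are assumptions made for illustration.

```python
import math

def batch_gcd(moduli):
    """For each modulus, return its gcd with the product of all the others."""
    product = math.prod(moduli)
    results = []
    for n in moduli:
        # (product mod n^2) // n equals (product of the other moduli) mod n,
        # so a single gcd per key compares it against every other key at once.
        others_mod_n = (product % (n * n)) // n
        results.append(math.gcd(n, others_mod_n))
    return results

# 143 = 11 * 13 and 187 = 11 * 17 share a prime; 437 = 19 * 23 does not.
shared_factors = batch_gcd([143, 187, 437])
```

A result greater than 1 flags a key that shares a factor with some other key in the set. If two keys are fully identical, the result for each is the whole modulus rather than a single prime, so duplicates need to be handled separately.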

That is the math. What is the impact, upon computer security? What is the actionable intelligence we can derive from these findings?

Uncertified public keys or not, there’s a lot to worry about here. But it’s not at all where Lenstra and Hughes think.

Our conclusion is that the validity of the assumption is questionable and that generating keys in the real world for “multiple-secrets” cryptosystems such as RSA is significantly riskier than for “single-secret” ones such as ElGamal or (EC)DSA which are based on Diffie-Hellman.

The argument is essentially that, had those 12,000 keys been ECDSA instead of RSA, then they would have been secure. In my previous post on this subject (a post RSA Inc. itself is linking to!) I argued that this paper, while chock-full of useful and important data, failed to make the case that the blame could be laid at the feet of the “multiple-secret” RSA. Specifically:

Risk in cryptography is utterly dominated, not by cipher selection, but by key management. It does not matter the strength of your public key if nobody knows to demand it. Whatever the differential risk is that is introduced by the choice of RSA vs. ECDSA pales in comparison to whether there’s any reason to believe the public key in question is the actual key of the desired target. (As it turns out, few if any of the keys with broken moduli gave anyone a reason to trust them in the first place.)

There aren’t enough samples of DSA in the same kind of deployments to compare failure rates from one asymmetric cipher to another.

If one is going to blame a cipher for implementation failures in the field, it’s hard to blame the maybe dozens of implementations of RSA that caused 12,000 failures, while ignoring the one implementation of ECDSA that broke millions upon millions of Playstation 3’s.

But it wasn’t until I read the (thoroughly excellent) blog post from Heninger that it became clear that, no, there’s no rational way this bug can be blamed on multi-secret systems in general or RSA in particular.

However, some implementations add additional randomness between generating the primes p and q, with the intention of increasing security:

If the initial seed to the pseudorandom number generator is generated with low entropy, this could result in multiple devices generating different moduli which share the prime factor p and have different second factors q. Then both moduli can be easily factored by computing their GCD: p = gcd(N1, N2).

OpenSSL’s RSA key generation functions this way: each time random bits are produced from the entropy pool to generate the primes p and q, the current time in seconds is added to the entropy pool. Many, but not all, of the vulnerable keys were generated by OpenSSL and OpenSSH, which calls OpenSSL’s RSA key generation code.
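
The failure mode described in this passage can be simulated with a toy key generator. Everything here is an assumption for illustration: `toy_keygen`, the 32-bit primes, and the use of Python’s `random.Random` with a reseed step standing in for an entropy pool. It is not OpenSSL’s actual code.

```python
import math
import random

def is_prime(n: int) -> bool:
    """Trial division; adequate for the small toy primes used here."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    i = 3
    while i * i <= n:
        if n % i == 0:
            return False
        i += 2
    return True

def next_prime(n: int) -> int:
    """Smallest prime greater than or equal to n."""
    while not is_prime(n):
        n += 1
    return n

def toy_keygen(seed: int, clock: int):
    """Hypothetical device: p depends on the seed alone, q also on the clock."""
    prng = random.Random(seed)                     # low-entropy seed, same on every device
    p = next_prime(prng.getrandbits(32) | (1 << 31))
    prng.seed(prng.getrandbits(64) ^ clock)        # time mixed in between the two primes
    q = next_prime(prng.getrandbits(32) | (1 << 31))
    return p, q, p * q

# Two devices with the same seed, booted a few seconds apart.
p1, q1, n1 = toy_keygen(seed=42, clock=1_329_000_000)
p2, q2, n2 = toy_keygen(seed=42, clock=1_329_000_007)
common = math.gcd(n1, n2)
```

Both devices always produce the same p. The second primes will almost certainly differ because the clocks differ, and in that case `common` is exactly the shared p, which factors both moduli.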

Would the above code become secure, if the asymmetric cipher selected was the single-secret ECDSA instead of the multi-secret RSA?

prng.seed(seed);
ecdsa_private_key=prng.generate_random_integer();

No. All public keys emitted would be identical (as identical as they’d be with RSA, anyway). I suppose it’s possible we could see the following construction instead:

…but I don’t think anyone would claim that ECDSA, DSA, or ElGamal is secure in the circumstance where half its key material has leaked. (Personal note: Prodded by Heninger, I’ve personally cracked a variant of RSA-1024 where all bits set to 1 were revealed. It wasn’t difficult; cryptosystems fail quickly when their fundamental assumptions are violated.)

Hughes has made the argument that RSA is unique here because the actions of someone else — publishing n composed of your p and his q — impact an innocent you. Alas, you’re no more innocent than he is; you’re republishing his p, and exposing his q2 as much as he’s exposing your q. Not to mention, in DSA, unlike RSA, you don’t need someone else’s help to expose your private key. A single signature, emitted without a strong RNG, is sufficient. As we saw from the last major RNG flaw, involving Debian:

Furthermore, all DSA keys ever used on affected Debian systems for signing or authentication purposes should be considered compromised; the Digital Signature Algorithm relies on a secret random value used during signature generation.
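
To see why a single signature is enough, here is a sketch of private key recovery from one DSA signature whose per-signature secret k is known to the attacker. The domain parameters (p = 23, q = 11, g = 4), the private key, and the helper names are toy assumptions for illustration; real parameters are thousands of bits long, but the algebra is the same.

```python
import hashlib

# Toy DSA domain parameters: q divides p - 1, and g has order q modulo p.
p, q, g = 23, 11, 4

x = 7   # signer's private key
k = 3   # per-signature secret that a broken RNG made predictable

def digest(message: bytes) -> int:
    """Message hash reduced modulo q."""
    return int.from_bytes(hashlib.sha256(message).digest(), "big") % q

def sign(message: bytes):
    r = pow(g, k, p) % q
    s = (pow(k, -1, q) * (digest(message) + x * r)) % q
    return r, s

def recover_private_key(message: bytes, r: int, s: int, known_k: int) -> int:
    # From s = k^-1 * (H(m) + x*r) mod q it follows that x = (s*k - H(m)) * r^-1 mod q.
    return ((s * known_k - digest(message)) * pow(r, -1, q)) % q

message = b"firmware update"
r, s = sign(message)
recovered = recover_private_key(message, r, s, k)
```

The attacker needs only the public signature (r, s), the message, and a correct guess at k. No second device and no second key are involved.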

The fact that Lenstra and Hughes found a different vulnerability rate for DSA keys than RSA keys comes from the radically different generator for the former — PGP keys as generated by GNU Privacy Guard or PGP itself. This is not a thing commonly executed on embedded hardware.

There is a case to be made for the superiority of ECDSA over RSA. Certainly this is the conclusion of various government entities. I don’t think anyone would be surprised if the latter was broken before the former. But this situation, this particular flaw, reflects nothing about any particular asymmetric cipher. Under conditions of broken random number generation, all ciphers are suspect.

The headlines have indeed been grating. “Only 99.8% of the worlds PKI uses secure randomness”, said TechSnap. “EFF: Tens of thousands of websites’ SSL ‘offers effectively no security’”, from Boing Boing. And, of course, from the New York Times:

The operators of large Web sites will need to make changes to ensure the security of their systems, the researchers said.

The potential danger of the flaw is that even though the number of users affected by the flaw may be small, confidence in the security of Web transactions is reduced, the authors said.
…
“Some people may say that 99.8 percent security is fine,” he added. That still means that approximately as many as two out of every thousand keys would not be secure.

The operators of large web sites will not need to make any changes, as they are excluded from the population that is vulnerable to this flaw. It is not two out of every thousand keys that is insecure. Out of those keys that had any hope of being secure to begin with — those keys that participate in the (flawed, but bear with me) Certificate Authority system — approximately none of these keys are threatened by this attack.

It is not two out of every 1000 already insecure keys that are insecure. It is 1000 of 1000. But that is not changed by the existence of broken RSA moduli.

To be clear, these numbers do not come from Lenstra and Hughes, who have cautiously refused to disclose the population of certificates, nor from the EFF, who have apparently removed those particular certificates from the SSL Observatory (interesting side project: Diff against the local copy). They come from Heninger’s research, which not only discloses who isn’t affected (“Don’t worry, the key for your bank’s web site is probably safe”), but analyzes the population that is:

Finally, we get to why this research matters, and what actions should be taken in response to it. Why are large numbers of devices — of security devices, even! — not even pretending to successfully manage keys? And worse, why are the keys they do generate (even if nobody checks them) so insecure?

Practically every device on a network is issued an IP address. Most of them also receive DNS names. Sometimes assignment is dynamic via DHCP, and sometimes (particularly on corporate/professionally managed networks) addressing is managed statically through various solutions, but basic addressing and connectivity across devices from multiple vendors is something of a solved problem.

Basic key management, by contrast, is a disaster.

You don’t have to like the Certificate Authority system. I certainly don’t; while I respect the companies involved, they’re pretty clearly working with a flawed technology. (I’ve been talking about this since 2009, see the slides here. Slide 22 talks about the very intermediate certificate issuance that people are now worrying about with respect to Trustwave and GeoTrust.) However…

Out of hundreds of millions of servers on the Internet with globally routable IP addresses, only about six million present public keys to be authenticated against. And of those with keys, only a relatively small portion — maybe a million? — are actually enrolled into the global PKI managed by the Certificate Authorities.

The point is not that the CA’s don’t work. The point is that, for the most part, clients don’t care, nothing is checking the validity of device certificates in the first place. Most devices, even security devices, are popping these huge errors every time the user connects to their SSL ports. Because this is legitimate behavior — because there’s no reason to trust the provenance of a public key, right or wrong — users click through.

And why are so many devices, even in the rare instance that they have key material on their administrative interfaces, unenrolled in PKI? Because it requires this chain of cross organizational interaction to be followed. The manufacturer of the device could generate a key at the factory, or on first boot, but they’re not the customer so they can’t have certificates signed in a customer’s namespace (which might change, anyway). Meanwhile, the customer, who can easily assign IP addresses and DNS names without constantly asking for anyone’s permission, has to integrate with an external entity, extract key material from the manufacturer’s generated content, provide it to that entity, integrate the response back into the device, pay the entity, and do it all over again within a year or three. Repeat for every single device.

Or, they could just not do all that. Much easier to simply not use the encrypted interface (it’s the Internal Network, after all), or to ignore the warning prompts when they come up.

One of the hardest things to convince people of in security, is that it’s not enough to make security better. Better doesn’t mean anything, it’s a measure without a metric. No, what’s important is making security cheaper. How do we start actually delivering value, without imagining that customers have an infinite budget to react with? How do we even measure cost?

I can propose at least one metric: How many meetings need to be held, to deploy a particular technology? Money is one thing, but the sheer time required to get things done is a remarkably effective barrier to project completion. And the further away somebody is from a deploying organization (another group, another company, another company that needs to get paid, thus requiring a contract and Purchasing’s involvement), the worse things are.

Devices are just not being enrolled into the CA system. Heninger found 19,610 instances of a single firewall — a security device — and not one of those instances was even pretending to be secure. That’s a success rate not of 99.8%, but of 0.00%.

The solution, of course, is DNSSEC. It’s not entirely true that third parties aren’t involved with IP assignment; it’s just that once you’ve got your IP space, you don’t have to keep returning to the well. It’s the same with DNS — while yes, you need to maintain your registration for whatever.com, you don’t need to interact with your registrar every time you add or remove a host (unless you’d like to, anyway). In the future, you’ll register keys in your own DNS just as you register IP addresses. It will not be drama. It will not be a particularly big deal. It will be a simple, straightforward, and relatively inexpensive mechanism by which devices are brought into your network, and then if you so choose, recognized as yours worldwide.

This will be very exciting, and hopefully, quite boring.

It’s not quite so simple, though. DNSSEC offers a very nice model — one that will be completely vulnerable to the findings of Lenstra, Hughes, and Heninger, unless we start dealing with these badly generated private keys.

DNSSEC will make key management scale. That is not actually a good thing, if the keys it’s spreading are in fact insecure! What we’re finding here is that a lot of small devices are generating keys insecurely. The biggest mistake we can make here is thinking that only the keys vulnerable to this particular attack were badly generated.

There are many ways to fail at key generation that aren’t nearly as trivially discoverable as a massive group GCD. Heninger found that Firewall Vendor Z had over a third of their keys either crackable via GCD or repeated elsewhere (as you’d expect, if both p and q were emitted from identical entropy). One would have to be delusional to expect that the code is perfectly secure two thirds of the time, and only fails in this manner every once in a while. The primary lesson from this incident is that there’s a lot more where the Debian flaw came from. According to Heninger, much of the hardware they saw was actually running OpenSSL, pretty much the gold standard in available libraries for open cryptographic work.

It still failed.

This particular attack is nice, in that it’s independent of the particular vagaries of a device implementation. As long as any two nodes share either p or q, both p and q will be disclosed. But this is a canary in the coal mine — if this went wrong, so too must a lot more have. What I predict is that, when we crack devices open, we’re going to see mechanisms that rarely repeat, but still have an insufficient amount of entropy to “get the job done” — nodes seeding with MAC addresses (far fewer than 48 bits of entropy, when you think about it), the ever classic clock seeds, even fixed “entropic starting points” stored on the file system are going to be found.
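
As an illustration of why a clock seed is so weak, here is a sketch of an attacker recovering the seed of a hypothetical device that seeds its generator with the boot time alone. The names `device_keygen` and `recover_seed`, the 128-bit output, and the two-hour search window are all assumptions for the example, not a description of any real product.

```python
import random

def device_keygen(boot_time: int) -> int:
    """Hypothetical device: key material derived from a clock-seeded PRNG."""
    return random.Random(boot_time).getrandbits(128)

def recover_seed(observed: int, earliest: int, latest: int):
    """Try every second in the window until the output matches."""
    for candidate in range(earliest, latest + 1):
        if device_keygen(candidate) == observed:
            return candidate
    return None

true_boot_time = 1_329_312_345
observed = device_keygen(true_boot_time)

# The attacker knows only that the device booted within roughly an hour either way.
found = recover_seed(observed, true_boot_time - 3600, true_boot_time + 3600)
```

Two keys generated this way would never collide unless two devices booted in the same second, so a GCD sweep would not flag them, yet each one falls to a search of a few thousand candidates.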

Tens of thousands of random keys being not quite so random is one thing. But that a generic mechanism found this many issues strongly implies a device specific mechanism will be even more effective. We should have gone looking after the Debian bug. I’m glad Lenstra, Hughes, and Heninger et al. did.

1) Obviously, don’t panic about any website being vulnerable. This is pretty much a web interface problem.

2) When possible and justifiable, generate private key material outside your embedded devices, and push the keys into them. Have their surrounding certificates signed, if feasible.

This is cryptographic heresy, of course. Private keys should not be moved, as bits are cheap. But, you know, some bits are cheaper than others, and at this point I’ve got a lot more faith in a PC generating key material than in some random box. (No way I would have argued this a few weeks ago. Like I said from the beginning, this is excellent work.) In particular, I’d regenerate keys used to authenticate clients to servers.

3) Audit smartcard keys.

The largest population of key material that Lenstra, Hughes, and Heninger could never have looked at is the set of public keys contained within the world’s smartcard population. Unless you’re the manufacturer, you have no way to “look inside” to see if the private keys are being reasonably generated. But you do have access to the public keys, and most likely you don’t have so many of them that the slow mechanism (gcd’ing each modulus against every other modulus) is infeasibly slow. Ping me if you’re interested in doing this.
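
For a population of that size, the slow mechanism fits in a few lines. This is a minimal sketch, assuming the moduli have already been extracted from the cards’ public keys into a list of Python integers; the function name and the tiny sample values are illustrative.

```python
from itertools import combinations
from math import gcd

def pairwise_shared_factors(moduli):
    """Compare every modulus against every other; report pairs sharing a factor."""
    findings = []
    for (i, a), (j, b) in combinations(enumerate(moduli), 2):
        g = gcd(a, b)
        if g > 1:
            findings.append((i, j, g))
    return findings

# 143 = 11 * 13 and 187 = 11 * 17 share a prime; 437 = 19 * 23 is unrelated.
findings = pairwise_shared_factors([143, 187, 437])
```

The work grows with the square of the number of keys, which is fine for thousands of cards and is exactly what stops being practical at Internet scale.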

4) Stop buying or building CPUs without hardware random number generators.

Alright, CPU vendors. There’s not many of you, and it’s time for you to stop being so deterministic. By 2014, if your CPU does not support cryptographically strong random number generation, it really doesn’t belong in anything trying to be secure. Since at this point, that’s everything, please start providing a simple instruction that isn’t hidden in some random chipset somewhere for developers and kernels to use to seed proper entropy. It doesn’t even need to be fast.

I hear Intel is adding a Hardware RNG to “Ivy Bridge”. One would also be nice for Atom.

5) Revisit truerand, an entropy source that only requires two desynchronized clocks, possibly integrating it into OpenSSL and libc.

This is the most controversial advice I’ve ever given, and most of you have no idea what I’m talking about. From the code, written originally seventeen years ago by D. P. Mitchell and modified by the brilliant Matt Blaze (who’s probably going to kill me):

* Truerand is a dubious, unproven hack for generating “true” random
* numbers in software. It is at best a good “method of last resort”
* for generating key material in environments where there is no (or
* only an insufficient) source of better-understood randomness. It
* can also be used to augment unreliable randomness sources (such as
* input from a human operator).
*
* The basic idea behind truerand is that between clock “skew” and
* various hard-to-predict OS event arrivals, counting a tight loop
* will yield a little bit (maybe one bit or so) of “good” randomness
* per interval clock tick. This seems to work well in practice even
* on unloaded machines…

* Because the randomness source is not well-understood, I’ve made
* fairly conservative assumptions about how much randomness can be
* extracted in any given interval. Based on a cursory analysis of
* the BSD kernel, there seem to be about 100-200 bits of unexposed
* “state” that changes each time a system call occurs and that affect
* the exact handling of interrupt scheduling, plus a great deal of
* slower-changing but still hard-to-model state based on, e.g., the
* process table, the VM state, etc. There is no proof or guarantee
* that some of this state couldn’t be easily reconstructed, modeled
* or influenced by an attacker, however, so we keep a large margin
* for error. The truerand API assumes only 0.3 bits of entropy per
* interval interrupt, amortized over 24 intervals and whitened with
* SHA.

Disk seek time is often unavailable on embedded hardware, because there simply is no disk. Ambient network traffic rates are often also unavailable because the system is not yet on a production network. And human interval analysis (keystrokes, mouse clicks) may not be around, because there is no human. (Remember, strong entropy is required all the time, not just during key generation.)

What’s also not around on embedded hardware, however, is a hypervisor that’s doling out time slices. And consider: What makes humans an interesting thing to build randomness off of, is that we’re operating on a completely different clock and frequency than a computer. We are a “slow oscillator” that chaotically interacts with a fast one.

Well, most every piece of hardware has at least two clocks, and they are not in sync: The CPU clock, and the Real Time Clock. You can add a third, in fact, if you include system memory. All three may be made “slow oscillators” relative to the rest, if only by running in loops. At the extreme scale, you’re going to find slight deviations that can only be explained not through deterministic processes but through the chaotic and limited tolerances of different clocks running at different frequencies and different temperatures.
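
The general shape of the idea can be sketched as below. This is not the truerand code, which is C and relies on an interval timer; it is a rough Python analogue that counts how many loop iterations fit inside one tick of a clock and then whitens the counts with SHA-256. The function names, the one-millisecond interval, and the sample count are assumptions, and no claim is made here about how much entropy the output actually contains.

```python
import hashlib
import time

def jitter_sample(interval_ns: int = 1_000_000) -> int:
    """Count tight-loop iterations completed before the clock passes a deadline."""
    deadline = time.monotonic_ns() + interval_ns
    count = 0
    while time.monotonic_ns() < deadline:
        count += 1
    return count

def jitter_bytes(samples: int = 64) -> bytes:
    """Fold many loop counts together and whiten them with SHA-256."""
    h = hashlib.sha256()
    for _ in range(samples):
        h.update(jitter_sample().to_bytes(8, "little"))
    return h.digest()

seed_material = jitter_bytes()
```

Output like this is best treated as one more input stirred into an existing pool, not as a replacement for a proper entropy source.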

On the one hand, I might be wrong. Perhaps this research path only ends in tears. On the other hand, it is the most persistent delusion in all of computing that there’s only one computer inside your computer, rather than a network of vaguely cooperative subsystems — each with their own clocks that are in no way synchronized against one another.

It wouldn’t be the first time asynchronous behavior made a system non-deterministic.

Ultimately, the truerand approach can’t make things worse. One of the very nice things about entropy sources is that you can stir all sorts of total crap in there, and as long as there’s at least 80 or so bits that are actually difficult to predict, you’ve won. I see no reason we shouldn’t investigate this path, and possibly integrate it as another entropy source to OpenSSL and maybe libc as well.

(Why am I not suggesting ‘switch your apps over to /dev/random’? Because if your developers did enough to break what is already enabled by default in OpenSSL, they did so because tests showed the box freezing up, and aren’t going to appreciate some security yutz telling them to ship something that locks on boot.)

6) When doing global sweeps of the net, be sure to validate that a specific population is affected by your attack before including it in the vulnerability set.

7) Start seriously looking into DNSSEC. You are deploying a tremendous number of systems that nobody can authenticate.

It’s 2012, and we’re still having problems with Random Number Generators. Crazy, but we’re still suffering (mightily) from SQL injection and that’s a “fixed problem” too. This research, even if accompanied by some flawed analysis, is truly important. There are more dragons down this path, and it’s going to be entertaining to slay them.