Wednesday, March 25, 2015

x86 is a high-level language

Just so you know, x86 machine-code is now a "high-level" language. What instructions say, and what they do, are very different things.

I mention this because of those commenting on this post on OpenSSL's "constant-time" calculations, designed to avoid revealing secrets due to variations in compute time. The major comment is that it's hard to do this perfectly in C. My response is that it's hard to do this even in x86 machine code.

Consider registers, for example. Everyone knows that the 32-bit x86 was limited to 8 registers, while 64-bit expanded that to 16 registers. This isn't actually true. The latest Intel processors have 168 registers. The name of the register in x86 code is really just a variable name, similar to how variables work in high-level languages.

So many registers are needed because the processor has 300 instructions "in flight" at any point in time in various stages of execution. It rearranges these instructions, executing them out-of-order. Everyone knows that processors can execute things slightly out-of-order, but that's understated. Today's processors are massively out-of-order.

Consider the traditional branch pair of a CMP (compare) followed by a JMPcc (conditional jump). While this is defined as two separate instructions as far as we humans are concerned, it's now a single instruction as far as the processor is concerned.

Consider the "xor eax, eax" instruction, which is how we've traditionally cleared registers. This is never executed as an instruction, but just marks "eax" as no longer used, so that the next time an instructions needs the register, to allocate a new (zeroed) register from that pool of 168 registers.

Consider "mov eax, ebx". Again, this doesn't do anything, except rename the register as far as the processor is concerned, so that from this point on, what was referred to as ebx is now eax.

The processor has to stop and wait 5 clock cycles to read something from L1 cache, 12 cycles for L2 cache, or 30 cycles for L3 cache. But because the processor is massively out-of-order, I can continue executing instructions in the future that don't depend upon this memory read. This includes other memory reads. Inside the CPU, the results always appear as if the processor executed everything in-order, but outside the CPU, things happen in strange order.

This means any attempt to get smooth, predictable execution out of the processor is very difficult. That means "side-channel" attacks on x86 leaking software crypto secrets may always be with us.

One solution to these problems is the CMOV, "conditional move", instruction. It's like a normal "MOV" instruction, but succeeds or fails based on condition flags. It can be used in some cases to replace branches, which makes pipelined code more efficient in some cases. Currently, it takes constant time. When moving from memory, it still waits for data to arrive, even when it knows it's going to throw it away. As Linus Torvalds famously pointed out, CMOV doesn't always speed up code. However, that's not the point here -- it does make code execution time more predictable. But, at the same time, Intel can arbitrarily change the behavior on future processors, making it less predictable.

The upshot is this: Intel's x86 is a high-level language. Coding everything up according to Agner Fog's instruction timings still won't produce the predictable, constant-time code you are looking for. There may be some solutions, like using CMOV, but it will take research.

14 comments:

Inside the CPU, the results always appear as if the processor executed everything in-order, but outside the CPU, things happen in strange order. - shouldn't inside and outside be swapped in this sentence?

@richie: "inside the CPU results appear..." means "from the machine code point of view the model is sequential execution" whereas "outside the CPU, things happen in strange order" means "if you look at the accesses to the main memory for instance, then the order will not match what you could expect after having seen the machine code".

not to mention numbers will always be pseudo-random, so the very act of generating the numbers might be revealing of system time (for the seed). not sure if it would be possible to glean, but perhaps with enough samples.

I have a proposition crypto people. Lets all auto-negotiate DH AES on the fly. By chunk. Lets all agree on some number. Say: 4096. And use that as a "chunk" size. We've all got psuedo-random number generators, right? Our banks run IvyBridge or later Gen cores and can generate actual crypto-grade random numbers, right? Lets all agree to disagree and use a top layer of 4096 RSA to authenticate and 4096 individually DH'd AES chunks to talk to the bank. Problem solved?

Seems like one defense against timing attacks would be to schedule your reply. Regardless of how long the string compare took, you'll respond with the result x msec from now. That seems like it'd work for a remote service at least, as long as you leave enough time to actually perform the compare.

Er, s/took/will take/. You schedule the reply before doing the compare. You might schedule it for (x * #chars) msec from now, so that you don't run out of time for longer compares, if that's a concern (and leaking the length of the input string isn't).

So do what you'd do in a high-level language - use that language's crypto library. No?

Processors already contain "hardware" (often a lesser or greater amount of µcode) AES implementations and RNGs. The requirements of crypto are quite different from those of normal code. Processors need to explicitly support it.