Thursday, July 26, 2018

NetSpectre: not much of a PowerPC threat either

In the continuing death march of Spectre side-channel variants for stealing data, all of the known attacks thus far have relied upon code running locally on the computer (so don't run sketchy programs, which have much better ways of pwning your Power Mac than slow and only occasionally successful data leaks). As you'll recall, it is possible for Spectre to succeed on the G5 and 7450 G4e, but not on the G3 and 7400.

The next generation is making Spectre go remote. Remote attacks were long hypothesized but never demonstrated until the newest, uh, "advance" called NetSpectre (PDF). The current iteration comes in two forms.

The first and more conventional version is like Spectre in that it relies on CPU cache timing. A victim application would have to have something called a "leak gadget," similar to the one in Spectre: network-facing code that processes a network packet, checks a condition that's usually true, and then sets a flag based on a data bit of interest in memory. The attacker first trains the branch predictor and then induces it to mispredict. In the paper's example, the attacker sends packets with multiple normal bitstream lengths and then suddenly sends one with an abnormal or out-of-bounds one. The flag never observably changes, because the processor throws the mispredicted work away, but the speculative access still pulls the flag into the CPU cache if the data bit was set. Later on, the application executes a "transmit gadget" that uses that flag to do a network-observable operation. If the flag is in the cache, the transmit gadget runs just a little bit faster, and the attacker can infer that data bit from the timing.
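
To make the moving parts concrete, here is a toy Python model of the leak and transmit gadgets. It is bookkeeping only, not real speculative execution, and everything in it is an assumption of mine for illustration: the ToyCpu class, the made-up latencies, and the idea that three in-bounds requests are enough to train the predictor. None of it comes from the paper's code.

```python
class ToyCpu:
    """Toy model of a bounds-check leak gadget. Bookkeeping, not real speculation."""

    FAST, SLOW = 1, 10   # made-up latencies in arbitrary units
    TRAINED_AFTER = 3    # made-up: in-bounds requests needed to fool the predictor

    def __init__(self, memory, bound):
        self.memory = memory          # list of bits; indexes >= bound are off limits
        self.bound = bound
        self.in_bounds_streak = 0     # stand-in for branch predictor state
        self.flag_cached = False      # is the flag currently in the cache?

    def leak_gadget(self, index):
        in_bounds = index < self.bound
        predicted_in_bounds = self.in_bounds_streak >= self.TRAINED_AFTER
        # The body runs for real when in bounds, or only speculatively when the
        # predictor guesses wrong. Either way the cache remembers the access.
        if (in_bounds or predicted_in_bounds) and self.memory[index]:
            self.flag_cached = True
        self.in_bounds_streak = self.in_bounds_streak + 1 if in_bounds else 0

    def evict_flag(self):
        self.flag_cached = False

    def transmit_gadget(self):
        latency = self.FAST if self.flag_cached else self.SLOW
        self.flag_cached = True       # reading the flag pulls it into the cache
        return latency


def leak_bit(cpu, secret_index):
    """Attacker's view: train, evict, trigger the misprediction, time the reply."""
    for i in range(ToyCpu.TRAINED_AFTER):
        cpu.leak_gadget(i % cpu.bound)   # legitimate requests train the predictor
    cpu.evict_flag()                     # push the flag back out of the cache
    cpu.leak_gadget(secret_index)        # out-of-bounds request, mispredicted
    return 1 if cpu.transmit_gadget() == ToyCpu.FAST else 0


# Indexes 0-3 are "public", 4-7 are the secret bits the attacker wants.
cpu = ToyCpu(memory=[0, 1, 0, 1, 1, 0, 1, 1], bound=4)
recovered = [leak_bit(cpu, i) for i in range(4, 8)]   # [1, 0, 1, 1]
```

Notice that every single bit needs a fresh round of training and eviction, and in this toy world the timing difference is clean and noise-free, which a real network never is.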

This sounds very slow and error-prone, and it is. In fact, it would be even worse on our slower systems. Besides the fact that it presupposes the machine is vulnerable to Spectre in the first place (G3 and 7400 systems don't seem to be), we would generate packets much more slowly than a modern system. That means the attacker would have to wait even longer to differentiate a response, and the difference between the flag being and not being in the cache is likely to be drowned out by the other code that needs to execute to generate a network response. Looking at the histogram for the ARM core they tested, which is more comparable to the PowerPC than an Intel CPU, there is substantial overlap between the '1' and '0' timings; if network latency intervenes, it could take literally millions of measurements to extract even a single bit. And that's assuming the attacker knows enough about the innards of your network-facing application (like TenFourFox, or what have you) to even know the memory location they're looking for. Even with that sizeable advantage, and even when attacking a far faster computer over a local network, it took 30 minutes for the researchers to exfiltrate just a single byte of data. Under the best possible conditions for such an attack, a Quad G5 would probably require several times longer; a 7450 would take longer still.
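
For a rough feel of why overlapping timings translate into so many measurements, here is a back-of-the-envelope sketch in Python. It assumes independent Gaussian noise, which real network jitter is not, and the samples_needed helper and the numbers fed to it are hypothetical; they are not figures from the paper.

```python
import math


def samples_needed(delta, sigma, z=3.0):
    """Measurements per case ('0' and '1') so that two mean timings `delta`
    apart end up z standard errors apart, given noise of std deviation `sigma`.

    Derivation: the standard error of a difference of two means over n samples
    each is sigma * sqrt(2 / n); requiring delta >= z * that gives the bound.
    """
    if delta <= 0 or sigma < 0:
        raise ValueError("delta must be positive and sigma non-negative")
    return math.ceil(2 * (z * sigma / delta) ** 2)


# Hypothetical numbers, in nanoseconds: a 50 ns cached-vs-uncached difference
# buried under 50,000 ns (50 microseconds) of jitter.
per_case = samples_needed(delta=50, sigma=50_000)   # 18,000,000
```

The count grows with the square of the noise-to-signal ratio, so anything that adds jitter or shrinks the timing difference punishes the attacker disproportionately.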

The researchers, however, recognized this and looked for other kinds of network-observable side channels that could be faster to work with than the CPU cache. The vast majority of modern CPUs these days have some sort of SIMD instruction set for working on big chunks of data at once. We have the 128-bit AltiVec (VMX) in Power Mac land on G4s and G5s, for example, and later Power ISA chips like the POWER9 have an extension called VSX; Intel for its part historically offered MMX and the SSE series of instructions all the way up to things like AVX2. AltiVec and VSX are pretty well-designed and reasonably power-efficient extensions but only work on 128 bits of data at once, whereas AVX and AVX2 work on 256 (AVX-512 even supports 512-bit registers). Intel's larger SIMD implementations require more power to run, and the processor actually turns off the circuitry operating on the upper 128 bits of its AVX vector registers when they aren't needed. With that crucial bit of knowledge you can probably write the end of this paragraph already: turning on the upper 128 bits is not instantaneous and can incur a noticeable penalty on execution if the upper bits aren't already activated. If you can get the processor to speculatively execute an AVX2 instruction operating on the upper bits based on the data bit of interest, you can then infer from how quickly a subsequent AVX2 instruction executed what the data bit was. That execution time is itself inferred from a later network-visible operation that also uses the AVX2 upper unit. The AVX2 upper unit cycles on and off with roughly a 1 ms latency, an eternity in computing, but the attack requires very few network measurements to distinguish bits and reduces the time to exfiltrate a byte to around 8 minutes in the paper.
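
The same idea can be sketched as a toy Python model of a vector unit that naps when idle. The 1 ms idle timeout is the rough figure mentioned above; the ToyAvxUnit class, the warm-up penalty, the base cost and the timings between steps are all assumptions I made up for illustration, not measurements of any real Intel part.

```python
class ToyAvxUnit:
    """Toy model of an upper-half vector unit that powers down when idle."""

    IDLE_TIMEOUT_US = 1000   # roughly 1 ms, per the text above
    WARMUP_US = 10           # made-up penalty for waking the unit
    BASE_US = 1              # made-up cost of the instruction itself

    def __init__(self):
        self.last_used_us = None   # None means never used, so powered down

    def is_powered(self, now_us):
        return (self.last_used_us is not None
                and now_us - self.last_used_us < self.IDLE_TIMEOUT_US)

    def execute(self, now_us):
        """Run one upper-half instruction at time now_us and return its cost."""
        cost = self.BASE_US if self.is_powered(now_us) else self.BASE_US + self.WARMUP_US
        self.last_used_us = now_us
        return cost


def victim_leak_gadget(unit, secret_bit, now_us):
    # Stand-in for the mispredicted branch: the vector instruction only runs
    # (speculatively) when the secret bit is set, which wakes the unit.
    if secret_bit:
        unit.execute(now_us)


def leak_bit(unit, secret_bit, now_us):
    """Attacker's view: trigger the gadget, then time a later vector operation."""
    victim_leak_gadget(unit, secret_bit, now_us)
    latency = unit.execute(now_us + 100)   # transmit gadget shortly afterwards
    return 1 if latency == ToyAvxUnit.BASE_US else 0


unit = ToyAvxUnit()
secret = [1, 0, 0, 1]
# Wait 2 ms between bits so the unit has time to power back down.
recovered = [leak_bit(unit, bit, now_us=i * 2000) for i, bit in enumerate(secret)]
```

In this model the wait between bits is the whole cost: each bit is easy to read, but the attacker has to let the unit fall asleep again before trying the next one.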

No PowerPC chip used in any Power Mac behaves in this fashion, even with AltiVec instructions. The G3 doesn't have AltiVec (duh), and the AltiVec units in the 7400/G5 (they use similar designs) and the 7450 are always active. AltiVec instructions weren't implemented on "big POWER" until the POWER6, and even for the POWER6 through POWER9, I can't find anything in IBM's technical documentation that says any chip-internal functional unit, whether FPU, LSU, vector unit or otherwise, is dynamically powered down when not in use.