Wednesday, January 31, 2018

And everything under the sun is in tune
But the sun is eclipsed
By the moon

(Taken with a Canon PowerShot SX1 IS between 3:45 and 5am Pacific time from Orbiting Floodgap HQ in Southern California, processed on my trusty Quad G5. The timelapse came out wonderfully with a bit of manual correction.)

Sunday, January 28, 2018

I'm typing this in an early build of TenFourFox Feature Parity Release 6. This version contains speculative fixes for hangs on Facebook and crashes with textboxes on some systems, plus tuned-up UI, accelerated video frame colour conversion and -- the biggest feature -- basic adblock integrated directly into the browser core. The basic adblock is effective enough that I've even started running "naked" without Bluhell Firewall and it's so much quicker that there's no add-on overhead. I have some more features planned with a beta somewhere around mid-February. Watch for it.

I bought the Talos II because I wanted something non-x86 without lurking proprietary obscenities like the Intel Management Engine (or even AMD's Platform Secure Processor) that was nevertheless powerful enough to match those chips in power, and the only thing practical and even close to it is modern Power ISA. It had to be beefier than the Quad G5 I'm typing this on, which is why beautiful but technically underwhelming systems like the AmigaOne X5000 were never an option because this 11-year-old Quad mops the floor with it (no AltiVec, wtf!). It had to be practical, i.e., in a desktop form factor with a power draw that wouldn't require a second electrical meter, and it had to actually exist. Hello, Talos. It was pretty clear even before I decided on the specific machine that my next desktop computer wasn't going to be a Mac; I briefly toyed with gritting my teeth and waiting around for whatever the next iteration of the Mac Pro would be, but eventually concluded pro users just weren't a priority demographic to Apple's hardware designers anymore. After all, if we were, why would they make us wait so long? And why should I wait and pay buck$$$ for another iteration of an architecture I don't like anyway?

That's the absolute last straw for me. Sure, as someone who actually uses them on my internal network, I could reinstall them or anything else Apple starts decommissioning in Homebrew (right up until Apple takes some other component away that can't be easily restored in that fashion, or decides to lock down the filesystem further). Sure, hopefully if I upgrade (I use this term advisedly) my Haswell i7 MacBook Air from Sierra to High Sierra, I might not have too many bugs, and Apple might even fix what's left, maybe, or maybe in 10.14, maybe. Sure, I could vainly search for 64-bit versions of the tools I use, some of which might not exist or be maintained, and spend a lot of time trying to upgrade the ones I've written which work just fine now (breaking the unified build I do on my G5 and being generally inconvenient), and could click through the warning you'll now get in 10.13.4 whenever you open a 32-bit app and leap whatever hoops I have to jump through on 10.14 to run them "without compromise."

Sure, I could do all that. And I could continue to pay a fscking lot of money for the privilege of doing all that, too. Or, for the first time since 1987, in over thirty years of using Macs starting with my best friend's dad's Macintosh Plus, I could say I'm just totally done with modern Macs. And I think that's what I'll be doing.

Because the bottom line is this: Apple doesn't want users anymore who just want things to keep working. Hell, on this Quad in 10.4, I can run most software for 68K Macs! (in fact, I do -- some of those old tools are very speedy). But Classic ended with the Intel Macs, and Rosetta crapped out after 10.6. Since then every OS release has broken a little here, and deprecated a little there, and deleted a little somewhere else, to where every year when WWDC came along and Apple announced what they were screwing around with next that I dreaded the inevitable OS upgrade on a relatively middling laptop I dropped $1800 on in 2014. What was it going to break? What new problems were lurking? What would be missing that I actually used? There was no time to adapt because soon it was onto next year's new mousetrap and its own set of new problems. So now, with the clusterflub that Because I Got High Sierra's turned out to be, I've simply had enough. I'm just done.

So come on, you Apple apologists. Tell me how Apple doesn't owe me anything. Tell me how every previous version of OS X had its bugs, and annual major OS churn actually makes good sense. Tell me how it's unfair that poor, cash-starved Apple should continue to subsidize people who want to run perfectly good old software or maintain their investment in peripherals. Tell me how Apple's doing their users a favour by getting rid of those crufty niche tools that "nobody" uses anyway, and how I can just deal. If this is what you want from your computer vendor, then good for you because by golly you're getting it, good and hard. For me, this MacBook's staying on Sierra and I'll wipe it with Linux or FreeBSD when Sierra doesn't get any more updates. Maybe there will be a nice ARMbook around by then because I definitely won't be buying another Mac.

Saturday, January 20, 2018

TenFourFox Feature Parity Release 5 final is available for testing (downloads, hashes, release notes). There are no other changes other than the relevant security updates and the timer resolution reduction for anti-Spectre hardening. Assuming no major issues, it will become live on Monday evening Pacific time.

For FPR6, there will be some bug fixes and optimizations, and I'm also looking at the feasibility of basic CSS Grid support, requestIdleCallback() (using our Mach factor code to improve system utilization), date-time pickers and some JavaScript speedups. Also, I'd like to welcome fresh meat new contributors Ken and Raphael who have submitted fixes for compiling TenFourFox with gcc 4.8.5 and are working with me on support for gcc 6.4.0, where we currently have a startup crash if optimization is enabled. If you're interested in helping us investigate, see issue 464.

Recall from our most recent foray into the Spectre attack that I believed the G3 and 7400 would be hard to successfully exploit because of their unusual limitations on speculative execution through indirect branches. Also, remember that this PoC assumes the most favourable conditions possible: that it already knows exactly what memory range it's looking for, that the memory range it's looking for is in the same process and there is no other privilege or partition protection, that it can run and access system registers at full speed (i.e., is native), and that we're going to let it run to completion.

miniupnp's implementation uses the mftb(u) instructions, so if you're porting this to the 601, you weirdo, you'll need to use the equivalent on that architecture. I used Xcode 2.5 and gcc 4.0.1.

Let's start with, shall we say, a positive control. I felt strongly the G5 would be vulnerable, so here's what I got on my Quad G5 (DC/DP 2.5GHz PowerPC 970MP) under 10.4.11 with Energy Saver set to Reduced Performance:

Twiddling CACHE_HIT_THRESHOLD to any value other than 1 caused the test to fail completely, even on the working scenarios.

These results are frankly all over the map and only two scenarios fully work, but they do demonstrate that the G5 can be exploited by Spectre. That said, however, the interesting thing is how timing-dependent the G5 is, not only to whether the algorithm succeeds but also to whether the algorithm believes it succeeded. The optimized G5 versions have more trouble recognizing if they worked even though they do; the fastest and most accurate is actually -arch ppc970 -O0. I mentioned the CPU speed for a reason, too, because if I set the system to Highest Performance, I get some noteworthy changes:

The speed increase causes one more scenario to succeed, but which ones do differ and it even more badly tanks some of the previously marginal ones. Again, twiddling CACHE_HIT_THRESHOLD to any value other than 1 caused the test to fail completely, even on the working scenarios.

What about more recent Power ISA designs? Interestingly, my AIX Power 520 server configured as an SMT-2 two-core four-way POWER6 could not be exploited if CACHE_HIT_THRESHOLD was 1. If it was set to 80 as the default exploit has, however, on POWER6 the exploit recovers all bytes successfully (compiled with -O3 -mcpu=power6). IBM has not yet said as of this writing whether they will issue patches for the POWER6.

I should also note that the worst case on the G5 took nearly seven seconds to complete at reduced power (-arch ppc7400 -O0), though the best case took less than a tenth of a second (-arch ppc970 -O0). The POWER6 took roughly three seconds. These are not fast attacks for the limited number of bytes scanned.

Given that we know the test will work on a vulnerable PowerPC system, what about the ones we theorized were resistant? Why, I have two of them right here! Let's cut to the chase, friends, your humble author's suspicions appear to be correct. Neither my strawberry iMac G3 with Sonnet HARMONi CPU upgrade (600MHz PowerPC 750CX) running 10.2.8, nor my Sawtooth G4 file server (450MHz PowerPC 7400) running 10.4.11 can be exploited with any of ppc, ppc750 or ppc7400 at any optimization level. They all fail to recover any byte despite the exploit believing it worked, so I conclude the G3 and 7400 are not vulnerable to the proof of concept.

The attacks are also quite slow on these systems. To run on the lower clock speed Sawtooth took almost 5 seconds in realtime, even at -arch ppc7400 -O3 (seven seconds in the worst case), and pegged the processor during the test. Neither system has power management and ran at full speed.

That leaves the 7450 G4e, which as you'll recall has notable microarchitectural advances from the 7400 G4 and differences in its ability to speculatively execute indirect branches. What about that? Again, some highly timing-dependent results. First, let's look at my beloved 1GHz iMac G4 (1GHz PowerPC 7450), running 10.4.11:

This is also all over the place, but quite clearly demonstrates the 7450 is vulnerable and actually succeeds more easily than the 970MP did. (This iMac G4 does not have power management.) Still, maybe we can figure out under which circumstances it is, so what about laptops? Let's get out my faithful 12" 1.33GHz iBook G4 (PowerPC 7447A), running 10.4.11 also. First, on reduced performance:

This almost completely succeeds! Even the scenarios that are wrong are still mostly correct; these varied a bit from run to run and some would succeed now and then too. The worst case timing is an alarming eighth of a second.

What gets weird is the DLSD PowerBook G4, though. Let's get out the last and mightiest of the PowerBooks with its luxurious keyboard, bright 17" high-resolution LCD and 1.67GHz PowerPC 7447B CPU running 10.5.8. The DLSD PowerBooks are notable for not allowing selectable power management ("Normal" or automatic equivalent only), and it turns out this is relevant here too:

This is an upgraded stepping of the same basic CPU, but the attack almost completely failed. It failed in an unusual way, though: instead of using the question mark placeholder it usually uses for an indeterminate value, it actually puts in some apparently recovered nonsense bytes. These bytes are almost always garbage, though one did sneak in in the right place, which leads me to speculate that the 7447B is vulnerable too but something is mitigating it.

This DLSD is different from my other systems in two ways: it's got a slightly different CPU with known different power management, and it's running Leopard. Setting the iBook G4 to use automatic ("Normal") power management made little difference, however, so I got down two 12" PowerBook G4s with one running 10.4 with a 1.33GHz CPU and the other 10.5.8 with a 1.5GHz CPU. The 10.4 12" PowerBook G4 was almost identical to the 10.4 12" in terms of vulnerability, but it got interesting in on the 10.5.8 system. In order, low, automatic and highest performance:

Leopard clearly impairs Spectre's success, but the DLSDs do seem to differ further internally. The worst case runtime on the 10.5 1.5GHz 12" was around 0.25 seconds. The real test would be to put Tiger on a DLSD, but I wasn't willing to do so with this one since it's my Leopard test system.

Enough data. Let's irresponsibly make rash conclusions.

The G3 and 7400 G4 systems appear, at minimum, to be resistant to Spectre as predicted. I hesitate to say they're immune but there's certainly enough evidence here to suggest it. While there may be a variant around that could get them to leak, even if it existed it wouldn't do so very quickly based on this analysis.

The 7450 G4e is more vulnerable to Spectre than the G5 and can be exploited faster, except for the DLSDs which (at least in Leopard) seem to be unusually resistant.

Power management makes a difference, but not enough to completely retard the exploit (again, except the DLSDs), and not always in a predictable fashion.

At least for these systems, cache size didn't seem to have any real correlation.

Spectre succeeds more reliably in Tiger than in Leopard.

Later Power ISA chips are vulnerable with a lot less fiddling.

Before you panic, though, also remember:

These were local programs run at full speed in a test environment with no limits, and furthermore the program knew exactly what it was looking for and where. A random attack would probably not have this many advantages in advance.

Because the timing is so variable, a reliable attack would require running several performance profiles and comparing them, dramatically slowing down the effective exfiltration speed.

This wouldn't be a very useful Trojan horse because sketchy programs can own your system in ways a lot more useful (to them) than iffy memory reads that are not always predictably correct. So don't run sketchy programs!

No 7450 G4 is fast enough to be exploited effectively through TenFourFox's JavaScript JIT, which would be the other major vector. Plus, no 7450 can speculatively execute through TenFourFox's inline caches anyway because they use CTR for indirect branching (see the analysis), so the generated code already has an effective internal barrier.

Arguably the Quad G5 might get into the speed range needed for a JavaScript exploit, but it would be immediately noticeable (as in, jet engine time), not likely to yield much data quickly, and wouldn't be able to do so accurately. After FPR5 final, even that possibility will be greatly lessened as to make it just about useless.

I need to eat dinner. And a life. If you've tested your own system (Tobias reports success on a 970FX), say so in the comments.

Friday, January 5, 2018

UPDATE: IBM is releasing firmware patches for at least the POWER7+ and forward, including the POWER9 expected to be used in the Talos II. My belief is that these patches disable speculative execution through indirect branches, making the attack much more difficult though with an unclear performance cost. See below for why this matters.

Most of the reports on the Spectre speculative execution exploit have concentrated on the two dominant architectures, x86 (in both its AMD and Meltdown-afflicted Intel forms) and ARM. In our last blog entry I said that PowerPC is vulnerable to the Spectre attack, and in broad strokes it is. However, I also still think that the attack is generally impractical on Power Macs due to the time needed to meaningfully exfiltrate information on machines that are now over a decade old, especially with JavaScript-based attacks even with the TenFourFox PowerPC JIT (to say nothing of various complicating microarchitectural details). But let's say that those practical issues are irrelevant or handwaved away. Is PowerPC unusually vulnerable, or on the flip side unusually resistant, to Spectre-based attacks compared to x86 or ARM?

For the purposes of this discussion and the majority of our audience, I will limit this initial foray to processors used in Power Macintoshes of recent vintage, i.e., the G3, G4 and G5, though the G5's POWER4-derived design also has a fair bit in common with later Power ISA CPUs like the Talos II's POWER9, and ramifications for future Power ISA CPUs can be implied from it. I'm also not going to discuss embedded PowerPC CPUs here such as the PowerPC 4xx since I know rather less about their internal implementational details.

First, let's review the Spectre white paper. Speculative execution, as the name implies, allows the CPU to speculate on the results of an upcoming conditional branch instruction that has not yet completed. It predicts future program flow will go a particular way and executes that code upon that assumption; if it guesses right, and most CPUs do most of the time, it has already done the work and time is saved. If it guesses wrong, then the outcome is no worse than idling during that time save the additional power usage and the need to restore the previous state. To do this execution requires that code be loaded into the processor cache to be run, however, and the cache is not restored to its previous state; previously no one thought that would be necessary. The Spectre attack proves that this seemingly benign oversight is in fact not so.

To determine the PowerPC's vulnerability requires looking at how it does branch prediction and indirect branching. Indirect branching, where the target is determined at time of execution and run from a register rather than coding it directly in the branch instruction, is particularly valuable for forcing the processor to speculatively execute code it wouldn't ordinarily run because there are more than two possible execution paths (often many, many more, and some directly controllable by the attacker).

The G3 and G4 have very similar branch prediction hardware. If there is no hinting information and the instruction has never been executed before (or is no longer in the branch history table, read on), the CPU assumes that forward branches are not taken and backwards branches are, since the latter are usually found in loops. The programmer can add a flag to the branch instruction to tell the CPU that this initial assumption is probably incorrect (a static hint); we use this in a few places in TenFourFox explicitly, and compilers can also set hints like this. All PowerPC CPUs, including the original 601 and the G5 as described below, offer this level of branch prediction at minimum. Additionally, in the G3 and G4, branches that have been executed then get an entry in the BHT, or branch history table, which over multiple executions records if the branch is not taken, probably not taken, probably taken or taken (in Dan Luu's taxonomy of branch predictors, this would be two-level adaptive, local). On top of this the G3 and G4 have a BTIC, or branch target instruction cache, which handles the situation of where the branch gets taken: if the branch is not taken, the following instructions are probably in the regular instruction cache, but if the branch is taken, the BTIC allows execution to continue while the instruction queue continues fetching from the new program counter location. The G3 and 7400-series G4 implement a 512-entry BHT and 64-entry, two-instruction BTIC; the 7450-series G4 implements a 2048-entry BHT and a 128-entry, four-instruction BTIC, though the actual number of instructions in the BTIC depends on where the fetch is relative to the cache block boundary. The G3 and 7400 G4 support speculatively executing through up to two unresolved branches; the 7450 G4e allows up to three, but also pays a penalty of about one cycle if the BTIC is used that the others do not.

The G5 (and the POWER4, and most succeeding POWER implementations) starts with the baseline above, though it uses a different two-bit encoding to statically hint branch instructions. Instead of the G3/G4 BHT scheme, however, the G5/970 uses what Luu calls a "hybrid" approach, necessary to substantially improve prediction performance in a CPU for which misprediction would be particularly harmful: a staggering 16,384-entry BHT but also an additional 16,384-entry table using an indexing scheme called gshare, and a selector table which tells the processor which table to use; later POWER designs refine this further. The G5 does not implement a BTIC probably because it would not be compatible with how dispatch groups work. The G5 can predict up to two branches per cycle, and have up to 16 unresolved branches.

The branch prediction capabilities of these PowerPC chips are not massively different from other architectures'. The G5's ability to keep a massive number of unresolved branch instructions in flight might make it actually seem a bit more subject to such an attack since there are many more opportunities to load victim process data into the cache, but the basic principles at work are much the same as everything else, so none of our chips are particularly vulnerable or resistant in that respect. Where it starts to get interesting, however, is when we talk about indirect branches. There is no way in the Power ISA to directly branch to an address in a register, an unusual absence as such instructions exist in most other contemporary architectures such as x86, ARM, MIPS and SPARC. Instead, software must load the instruction into either of two special purpose registers that allow branches (either the link register "LR" or the counter register "CTR") with a special instruction (mtctr and mtlr, both forms of the general SPR instruction mtspr) and branch to that, which can occur conditionally or unconditionally. (We looked at this in great detail, with microbenchmarks, in an earlier blog post.)

To be able to speculatively execute an indirect branch, even an unconditional one, requires that either LR or CTR be renamed so that its register state can be saved as well, but on PowerPC they are not general purpose registers that can use the regular register rename file like other platforms such as ARM. The G5, unfortunately in this case, has additional hardware to deal with this problem: to back up the 16 unresolved branches it can have in-flight, LR and CTR share a 16-entry rename mapper, which allows the G5 to speculatively execute a combination of up to 16 LR or CTR-referencing branches (i.e., b(c)lr and b(c)ctr). This could allow a lot of code to be run speculatively and change the cache in ways the attacker could observe. Substantial preparation would be required to get the G5's branch history fouled enough to make it mispredict due to its very high accuracy (over 96%), but if it does, the presence of indirect branches will not slow the processor's speculative execution down what is now the wrong path. This is at least as vulnerable as the known Spectre-afflicted architectures, though the big cost of misprediction on the G5 would make this type of blown speculation especially slow. Nevertheless, virtually all current POWER chips would fall in this hole as well.

But the G3 and G4 situation is very different. The G3 actually delays fetch and execution at a b(c)ctr until the mtctr that leads it has completed, meaning speculative execution essentially halts at any indirect branch. The same applies for the LR, and for the 7400. CTR-based indirect branching is very common in TenFourFox-generated code for JavaScript inline caches, and code such as mtlr r0:blr terminates nearly every PowerPC function call. No fetch, and therefore no speculative execution, will occur until the special purpose register is loaded, meaning the proper target must now be known and there is less opportunity for a Spectre-based attack to run. Even if the processor could continue speculation past that point, the G3 and 7400 implement only a single rename register each for LR and CTR, so they couldn't go past a second such sequence regardless.

The 7450 is a little less robust in this regard. If the instruction sequence is an unconditional mtlr blr, the 7450 (and, for that matter, the G5) implements a link stack where the expected return address comes off a stack of predicted addresses from prior LR-modifying instructions. This is enough of a hint on the 7450 G4e to possibly allow continued fetch and potential speculation. However, because the 7450 also has only a single rename register each for LR and CTR, it also cannot speculatively execute past a second such sequence. If the instruction sequence is mtlr bclr, i.e., there is a condition on the LR branch, then execution and therefore speculation must halt until either the mtlr completes or the condition information (CR or CTR) is available to the CPU. But if the special purpose register is the CTR, then there is no address cache stack available, and the G4e must delay at an mtctr b(c)ctr sequence just like its older siblings.

Bottom line? Spectre is still not a very feasible means of attack on Power Macs, as I have stated, though the possibilities are better on the G5 and later Power ISA designs which are faster and have more branch tricks that can be subverted. But the G3 and the G4, because of their limitations on indirect branching, are at least somewhat more resistant to Spectre-based attacks because it is harder to cause their speculative execution pathways to operate in an attacker-controllable fashion (particularly the G3 and the 7400, which do not have a link stack cache). So, if you're really paranoid, dust that old G3 or Sawtooth G4 off. You just might have the Yosemite that manages to survive the computing apocalypse.

Thursday, January 4, 2018

UPDATE: Yes, TenFourFox will implement relevant Spectre-hardening features being deployed to Firefox, and the changes to performance.now will be part of FPR5 final. We also don't support SharedArrayBuffer anyway and right now are not likely to implement it any time soon.

UPDATE the 2nd: This post is getting a bit of attention and was really only intended as a quick skim, so if you're curious whether all PowerPC chips are vulnerable in the same fashion and why, read on for a deeper dive.

If you've been under a rock the last couple days, then you should read about Meltdown and Spectre (especially if you are using an Intel CPU).

Meltdown is specific to x86 processors made by Intel; it does not appear to affect AMD. But virtually every CPU going back decades that has a feature called speculative execution is vulnerable to a variety of the Spectre attack. In short, for those processors that execute "future" code downstream in anticipation of what the results of certain branching operations will be, Spectre exploits the timing differences that occur when certain kinds of speculatively executed code changes what's in the processor cache. The attacker may not be able to read the memory directly, but (s)he can find out if it's in the cache by looking at those differences (in broad strokes, stuff in the cache is accessed more quickly), and/or exploit those timing changes as a way of signaling the attacking software with the actual data itself. Although only certain kinds of code can be vulnerable to this technique, an attacker could trick the processor into mistakenly speculatively executing code it wouldn't ordinarily run. These side effects are intrinsic to the processor's internal implementation of this feature, though it is made easier if you have the source code of the victim process, which is increasingly common.

Power ISA is fundamentally vulnerable going back even to the days of the original PowerPC 601, as is virtually all current architectures, and there are no simple fixes. So what's the practical impact to Power Macs? Well, not much. As far as directly executing an attacking application, there are a billion more effective ways to write a Trojan horse than this, and they would have to be PowerPC-specific (possibly even CPU-family specific due to microarchitectural changes) to be functional. It's certainly possible to devise JavaScript that could attack the cache in a similar fashion, especially since TenFourFox implements a PowerPC JIT, but such an attack would -- surprise! -- almost certainly have to be PowerPC-specific too, and the TenFourFox JIT doesn't easily give up the instruction sequences necessary. Either way, even if the attacker knew exactly the memory they wanted to read and went to its address immediately, the attack would be rather slow on a Power Mac and you'd definitely notice the CPU usage whether it succeeded or not.

There are ways to stop speculative execution using certain instructions the processor must serialize, but this can seriously harm performance: speculative execution, after all, is a way to keep the processor busy with (hopefully) useful work while it waits for previous instructions to complete. On PowerPC, cache manipulation instructions, some kinds of special-purpose register accesses and even instructions like b . (branch to the next instruction, essentially a no-op) can halt speculative execution with a sometimes notable time penalty. I think there may be some ways we can harden the TenFourFox JIT with these instructions used selectively to reduce their overhead, though as I say, I don't find such attacks very practical on our geriatric machines in general.

Anyway, you can sleep well, because everybody's all in the same boat. Perhaps it's time to dust off those old strict CPUs. The world needs a port of Classilla to the Commodore 64. :)