Thursday, October 20, 2016

Ars Technica is reporting an interesting attack that uses a side-channel exploit in the Intel Haswell branch target buffer, or BTB (kindly ignore all the political crap Ars has been posting lately; I'll probably not read any more articles of theirs until after the election). The idea is to break through ASLR, or address space layout randomization, to find pieces of code one can string together or directly attack for nefarious purposes. ASLR defeats a certain class of attacks that rely on knowing the exact address of code in memory. With ASLR, an attacker can no longer count on code being in a constant location.

Intel processors since at least the Pentium use a relatively simple BTB to speed up finding the target of a branch instruction. The buffer is essentially a dictionary with virtual addresses of recent branch instructions mapping to their predicted target: if the branch is taken, the chip has the new actual address right away, and time is saved. To save space and complexity, most processors that implement a BTB index it with only part of the address (or they hash the address), which reduces the overhead of maintaining the BTB but also means some addresses will map to the same index into the BTB and cause a collision. If the addresses collide, the processor will recover, but it will take more cycles to do so. This is the key to the side-channel attack.
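To make the collision idea concrete, here's a toy Python model of a BTB that indexes on only the low bits of the branch address. This is a sketch and not how any real chip is wired: the ToyBTB class, its method names and the sample addresses are all made up for illustration, and the 30-bit index width is simply borrowed from the Haswell figure quoted further down.

```python
INDEX_BITS = 30                      # assumed index width for this toy model
INDEX_MASK = (1 << INDEX_BITS) - 1


class ToyBTB:
    """Dictionary from truncated branch address to predicted target."""

    def __init__(self):
        self.entries = {}

    def index(self, branch_addr):
        # Only the low INDEX_BITS of the virtual address pick the slot.
        return branch_addr & INDEX_MASK

    def record(self, branch_addr, target):
        # Remember where this branch went. The full address is kept only so
        # the model can report collisions; real hardware has no such luxury
        # and finds out later, when the predicted target turns out wrong.
        self.entries[self.index(branch_addr)] = (branch_addr, target)

    def lookup(self, branch_addr):
        """Return (predicted_target, collided)."""
        entry = self.entries.get(self.index(branch_addr))
        if entry is None:
            return None, False
        owner, target = entry
        return target, owner != branch_addr


btb = ToyBTB()
victim_branch = 0xFFFFFFFF81234560   # made-up kernel-style address
spy_branch = 0x0000000001234560      # made-up address sharing the low 30 bits

btb.record(victim_branch, 0xFFFFFFFF81240000)
predicted, collided = btb.lookup(spy_branch)
print(hex(predicted), collided)
```

The spy's lookup lands on the entry the victim left behind, which in real silicon means a wrong prediction and extra cycles to recover.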

(For the record, the G3 and the G4 use a BTIC instead, or a branch target instruction cache, where the table actually keeps two of the target instructions so the processor can be executing them while the rest of the branch target loads. The G4/7450 ("G4e") extends the BTIC to four instructions. This scheme is highly beneficial because these cached instructions essentially extend the processor's general purpose caches with needed instructions that are less likely to be evicted, but it is more complex to manage. That complexity is probably why the BTIC was dropped in the G5, where the idea doesn't work well with the instruction dispatch groups; the G5 uses a three-level hybrid predictor which is unlike either of these schemes. Most PowerPC implementations also have a return address stack for optimizing the blr instruction. With all of these unusual features Power ISA processors may be vulnerable to a similar timing attack but certainly not in the same way and probably not as predictably, especially on the G5 and later designs.)

To get around ASLR, an attacker needs to find out where the code block of interest actually got moved to in memory. Certain attributes make kernel ASLR (KASLR) an easier nut to crack. For performance reasons usually only part of the kernel address is randomized; in open-source operating systems this randomization scheme is often known; and the kernel is always loaded fully into physical memory and doesn't get swapped out. While the location it is loaded to is also randomized, the kernel is mapped into the address space of all processes, so if you can find its address in any process you've also found it in every process. Haswell makes this even easier because all of the bits the Linux kernel randomizes fall within the low 30 bits of the virtual address Haswell uses for the BTB index, so the index covers the entire kernel address range and any kernel branch address can be determined exactly. The attacker finds branch instructions in the kernel code that service a particular system call (such as by disassembling it) and computes all the possible locations that branch could be at, which is feasible due to the smaller search space. The attacker then creates a "spy" function with a branch instruction positioned to force a BTB collision by mapping to the same BTB index, executes the system call, and then executes the spy function. If the spy process (which times itself) determines its branch took longer than an average branch, it logs a hit, and the delta between ordinary execution and a BTB collision is unambiguously high (see Figure 7 in the paper). Now that you have the address of that branch, you can deduce the address of the entire kernel code block (because it's generally in the same page of memory due to the typical granularity of the randomization scheme), and try to get at it or abuse it. The entire process can take just milliseconds on a current CPU.
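The search itself is simple enough to sketch in Python as a pure simulation. Nothing here touches real hardware or real timing: the kernel region start, the 2 MiB slot size, the 512 slots and the branch offset are all assumed values picked for illustration, and simulated_spy_is_slow is a stand-in for "run the system call, then time the spy's branch."

```python
import random

INDEX_MASK = (1 << 30) - 1           # low 30 bits index the toy BTB
KERNEL_REGION = 0xFFFFFFFF80000000   # assumed start of the kernel text region
SLOT_SIZE = 1 << 21                  # assumed randomization granularity (2 MiB)
NUM_SLOTS = 512                      # assumed number of possible kernel bases
BRANCH_OFFSET = 0x1A2B30             # hypothetical offset of the branch, "from disassembly"


def candidate_branch_addresses():
    # Every place the chosen branch could be, one per possible kernel base.
    return [KERNEL_REGION + slot * SLOT_SIZE + BRANCH_OFFSET
            for slot in range(NUM_SLOTS)]


def simulated_spy_is_slow(spy_addr, secret_branch_addr):
    # Simulated timing oracle: the spy is "slow" exactly when its branch
    # maps to the same BTB index as the kernel branch that just ran.
    return (spy_addr & INDEX_MASK) == (secret_branch_addr & INDEX_MASK)


def recover_kernel_base(secret_branch_addr):
    for guess in candidate_branch_addresses():
        spy_addr = guess & INDEX_MASK    # user-space address, same low 30 bits
        if simulated_spy_is_slow(spy_addr, secret_branch_addr):
            return guess - BRANCH_OFFSET
    return None


secret_base = KERNEL_REGION + random.randrange(NUM_SLOTS) * SLOT_SIZE
found = recover_kernel_base(secret_base + BRANCH_OFFSET)
print(hex(secret_base), hex(found))
```

With only a few hundred candidates under these assumptions, the loop finishes almost instantly, which is the whole problem.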

The kernel is often specifically hardened against such attacks, however, and there are more tempting targets though they need more work. If you want to attack a user process (particularly one running as root, since that will have privileges you can subvert), you have to get your "spy" on the same virtual core as the victim process, otherwise they won't share a BTB -- in the case of the kernel, the system call always executes on the same virtual core via context switch, but that's not the case here. This requires manipulating the OS's process scheduler or running lots of spy processes, which slows the attack but is still feasible. Also, since you won't have a kernel system call to execute, you have to get the victim to do a particular task with a branch instruction, and that task needs to be something repeatable. Once this is done, however, the basic notion is the same. Even though only a limited number of ASLR bits can be recovered this way (remember that in Haswell's case, bit 30 and above are not used in the BTB, and full Linux ASLR uses bits 12 to 40, unlike the kernel), you can dramatically narrow the search space to the point where brute-force guessing may be possible. The whole process is certainly much more streamlined than earlier ASLR attacks which relied on fragile things like cache timing.
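Here's the back-of-the-envelope arithmetic for how much guessing is left over, as a small Python sketch. I'm assuming bits are numbered from zero and that "bits 12 to 40" is an inclusive range; if the range is exclusive at the top, knock one bit off. The function name is made up, and the kernel bit range in the last line is a hypothetical example rather than a figure from the paper.

```python
def unknown_bits_after_btb_leak(aslr_low_bit, aslr_high_bit, btb_index_bits):
    """Count the randomized bits a BTB collision cannot reveal.

    The ASLR range is treated as inclusive (an assumption); the BTB index
    is taken to cover bits 0 through btb_index_bits - 1.
    """
    first_hidden = max(aslr_low_bit, btb_index_bits)
    return max(0, aslr_high_bit - first_hidden + 1)


total_bits = 40 - 12 + 1                                  # randomized bits overall
hidden_bits = unknown_bits_after_btb_leak(12, 40, 30)     # bits 30..40 stay unknown
print(total_bits, hidden_bits, 2 ** hidden_bits)

# Hypothetical kernel-style range that sits entirely below bit 30:
print(unknown_bits_after_btb_leak(21, 29, 30))
```

Under those assumptions the attacker goes from 29 unknown bits to 11, or a couple thousand candidates instead of half a billion, while a range entirely below bit 30 leaves nothing to guess at all.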

As it happens, software mitigations can blunt or possibly even completely eradicate this exploit. Brute-force guessing addresses in the kernel usually leads to a crash, so anything that forces the attacker to guess the address of a victim routine in the kernel will likely cause the exploit to fail catastrophically. Get a couple of those random address bits outside the 30 bits Haswell uses in the BTB index and bingo, a relatively simple fix. One could also make ASLR more granular to occur at the function, basic block or even single instruction level rather than merely randomizing the starting address of segments within the address space, though this is much more complicated. However, hardware is needed to close the gap completely. A proper hardware solution would be to use most or all of the virtual address in the BTB to reduce the possibility of a collision, to add a random salt to whatever indexing or hashing function is used for BTB entries that varies from process to process so a collision becomes less predictable, or both. Either needs a change from Intel.
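The salt idea can be sketched in a few lines of Python. This is only an illustration of the principle, not a proposal for actual silicon: the function names, the sample addresses and the plain XOR mixing are my own assumptions, and real hardware would presumably want a stronger mixing function than a bare XOR.

```python
import secrets

INDEX_BITS = 30                      # assumed index width for this toy model
INDEX_MASK = (1 << INDEX_BITS) - 1


def plain_index(addr):
    # Unsalted scheme: low bits of the address, identical for every process.
    return addr & INDEX_MASK


def salted_index(addr, process_salt):
    # Mix in a per-process salt before truncating (XOR keeps the sketch short).
    return (addr ^ process_salt) & INDEX_MASK


def new_process_salt():
    # A fresh random salt, as might be chosen when a process is created.
    return secrets.randbits(INDEX_BITS)


victim_addr = 0x00007F3A12345678     # made-up victim branch address
spy_addr = victim_addr & INDEX_MASK  # spy branch sharing the low 30 bits

victim_salt, spy_salt = 0x155, 0x2AA  # fixed here so the demo is repeatable

print(plain_index(victim_addr) == plain_index(spy_addr))
print(salted_index(victim_addr, victim_salt) == salted_index(spy_addr, spy_salt))
```

Without the salt, the spy can line itself up with the victim by arithmetic alone. With different salts per process, the same pair of addresses no longer shares an index, and the spy can't predict which of its own addresses would collide without learning how the two salts differ.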

This little fable should serve to remind us that monocultures are bad. The exploit in question is viable and potentially ugly but can be mitigated. That's not the point: the point is that the attack, particularly upon the kernel, is made more feasible by particular details of how Haswell chips handle branching. When everything gets funneled through the same design and engineering optics and ends up with the same implementation, if someone comes up with a simple, weapons-grade exploit for a flaw in that implementation that software can't mask, we're all hosed. This is another reason why we need an auditable, powerful alternative to x86/x86_64 on the desktop. And there's only one system in that class right now.

Okay, okay, I'll stop banging you over the head with this stuff. I've got a couple more bugs under investigation that will be fixed in 45.5.0, and if you're having the issue where TenFourFox is not remembering your search engine of choice, please post your country and operating system here.

Let's not mince words, however: it's also not cheap, and you're gonna plunk down a lot if you want this machine. The board runs $4100 and that's without the CPU, which is pledged for separately though you can group them in the same order (this is a little clunky and I don't know why Raptor did it this way). To be sure, I think we all suspected this would be the case but now it's clear the initial prices were underestimates. Although some car repairs and other things have diminished my budget (I was originally going to get two of these), I still ponied up for a board and for one of the 190W octocore POWER8 CPUs, since this appears to be the sweet spot for those of us planning to use it as a workstation (remember each core has eight threads via SMT for a grand total of 64, and this part has the fastest turbo clock speed at 3.857GHz). That ran me $5340. I think after the RAM, disks, video card, chassis and PSU I'll probably be all in for around $7000.

Too steep? I don't blame you, but you can still help by donating to the project and enabling those of us who can afford to jump in first to smooth the way out for you. Frankly, this is the first machine I consider a meaningful successor to the Quad G5 (the AmigaOne series isn't quite there yet). Non-x86 doesn't have the economies of scale of your typical soulless Chipzilla craptop or beige box, but if we can collectively help Raptor get this project off the ground you'll finally have an option for your next big machine when you need something free, open and unchained -- and there are a lot of chains in modern PCs that you don't control. You can donate as little as $10 and get this party started, or donate $250 and get to play with one remotely for a few months. Call it a rental if you like. No, I don't get a piece of this, I don't have stock in Raptor and I don't owe them a favour. I simply want this project to succeed. And if you're reading this blog, odds are you want that too.

The campaign ends December 15. Donate, buy, whatever. Let's do this.

My plans are, even though I confess I'll be running it little-endian (since unfortunately I don't think we have much choice nowadays), to make it as much a true successor to the last Power Mac as possible. Yes, I'll be sinking time into a JIT for it, which should fully support asm.js to truly run those monster applications we're seeing more and more of, porting over our AltiVec code with an endian shift (since the POWER8 has VMX), and working on a viable and fast way of running legacy Power Mac software on it, either through KVM or QEMU or whatever turns out to be the best option. If this baby gets off the ground, you have my promise that doing so will be my first priority, because this is what I wanted the project for in the first place. We have a chance to resurrect the Power Mac, folks, and in a form that truly kicks ass. Don't waste the opportunity.

Now, having said all that, I do think Raptor has made a couple tactical errors. Neither is fatal, but neither is small.

First, there needs to be an intermediate pledge level between the bare board and the $18,000 (!!!!) Warren Buffett edition. I have no doubt the $18,000 machine will be the Cadillac of this line, but like Cadillacs, there isn't $18,000 worth of parts in it (maybe, maybe, $10K), and this project already has a bad case of sticker shock without slapping people around with that particular dead fish. Raptor needs to slot something in the middle that isn't quite as wtf-inducing and I'll bet it'll appeal to those people willing to spend a little more to get a fully configured box. (I might have been one of those people, but I won't have the chance now.)

Second, the pledge threshold of $3.7 million is not ludicrous when you consider what has to happen to manufacture these things, but it sure seems that way. Given that this can only be considered a boutique system at this stage, it's going to take a lot of punters like yours truly to cross that point, which is why your donations even if you're not willing to buy right now are critical to get this thing jumpstarted. I don't know Raptor's finances, but they gave themselves a rather high hurdle here and I hope it doesn't doom the whole damn thing.

On the other hand, doesn't look like Apple's going to be updating the Mac Pro any time soon, so if you're in the market ...

On to 45.5.0 beta 2 (downloads, hashes). The two major changes in this version are that I did some marginal reduction in the overhead of graphics primitives calls, and completed converting to AltiVec all of the VP9 inverse discrete cosine and Hadamard transforms. Feel free to read all 152K of it, patterned largely off the SSE2 version but still mostly written by hand; I also fixed the convolver on G4 systems and made it faster too. These transforms probably account for the biggest share of CPU time while decoding frames. I can do some more by starting on the intraframe predictors but that will probably not yield speedups as dramatic. My totally unscientific testing is yielding these recommendations for specific machines:

I'd welcome your own assessments, but since VP8 (i.e., MediaSource Extensions off) is "good enough" on the G5 and actually currently better on the G4, I've changed my mind again and I'll continue to ship with MSE turned off so that it still works as people expect. However, they'll still be able to toggle the option in our pref panel, which also was fixed to allow toggling PDF.js (that was a stupid bug caused by missing a change I forgot to pull forward into the released build). When VP9 is clearly better on all supported configurations then we'll reexamine this.

No issues have been reported regarding little-endian JavaScript typed arrays or our overall new hybrid endian strategy, or with the minimp3 platform decoder, so both of those features are go. Download and try it.

Saturday, October 8, 2016

I've been waist-deep in AltiVec intrinsics for the last week converting some of those big inverse discrete cosine and Hadamard transforms for TenFourFox's vectorized PowerPC VP9 codec. The little ones cause a noticeable but minor improvement, but when I got the first large transform done there was a big jump in performance on this quad G5. Note that the G5, even though its vector unit is based on the 7400 and therefore weaker than the 7450's, likes long strings of sequential code it can reorder, which is essentially what that huge clot of vector intrinsics is, so I have not yet determined if I've just optimized it well for the G5 or it's generalizable to the G4 too. My theory is that even though the improvement ratio is about the same (somewhere between 4:1 and 8:1 depending on how much data they swallow per cycle), these huge vectorized inverse transforms accelerate code that takes a lot of CPU time ordinarily, so it's a bigger absolute reduction. I'm going to work on a couple more this weekend and see if I can get even more money out of it. 720p playback is still out of the question even with the Quad at full tilt, but 360p windowed is pretty smooth and even 360p fullscreen (upscaled to 1080p) and 480p windowed can keep up, and it buffers a lot quicker.

The other thing I did was to eliminate some inefficiencies in the CoreGraphics glue we use for rendering pretty much everything (there is no Skia support on 10.4) except the residual Cairo backend that handles printing. In particular, I overhauled our blend and composite mode code so that it avoids a function call on every draw operation. This is a marginal speedup but it makes some types of complex animation much smoother.

Overall I'm pretty happy with this and no one has reported any issues with the little-endian typed array switchover, so I'll make a second beta release sometime next week hopefully. MSE will still be off by default in that build but unless I hear different or some critical showstopper crops up it will be switched on for the final public release.

When I sat down at my G5 this warm Southern California Saturday morning, however, I noticed that MenuMeters (a great tool to have if you don't already) showed the Quad was already rather occupied. This wasn't a new thing; I'd seen what I assumed was a stuck cron job or something for the last several Saturday mornings and killed it in the Activity Monitor. But this was the sixth week in a row it had happened and it looked like it had been running for over three hours wasting CPU time, so enough was enough.

The offending process was something running /usr/bin/find to find, well, everything (that wasn't in /tmp or /usr/tmp), essentially iterating over the whole damn filesystem. A couple of ps -wwjp (What Would Jesus Post?) later showed it was being kicked off as part of the update system for an old Unix dragon of yore, locate.

There are no fewer than three possible ways to find files from the command line in macOS. One is the venerable find command, which is the slowest of the lot (it uses no cache) and has predicates that can be somewhat confusing to novices, but is guaranteed to be up-to-date because it doesn't rely on a pre-existing database and will find nearly anything. The second is of course Spotlight, which is accessible from the Terminal using the mdfind command. There are man pages for both.

The third way is locate, which is easier than find and faster because it uses a database for quick lookups, but less comprehensive than Spotlight/mdfind because it only looks for filenames instead of within file content as well, and the updater has to run periodically to stay current. (There's a man page for it too.) It would seem that Spotlight could completely supersede locate, and Apple thinks so too, because the locate updater was turned into a launchd.plist in 10.6 (look at /System/Library/LaunchDaemons/com.apple.locate.plist) and disabled by default. That's not the case for 10.5 and previous, however, and I have so many files on my G5 by now that the runtime to update the locate database is now close to five hours -- on an SSD! And that's why it was still running when I sat down to do work after breakfast.

I don't use locate because Spotlight is more convenient and updates practically on demand instead of weekly. If you don't either, then niced or not it's wasted effort and you should disable it from running within your Mac's periodic weekly maintenance tasks. (Note: on 10.3 and earlier, since you don't have Spotlight, you may not want to do this unless locate's update process is tying up your machine also.) Here's how:

On 10.5, the weekly periodic script can be told specifically not to run locate.updatedb. Edit /etc/defaults/periodic.conf as root (such as sudo vi /etc/defaults/periodic.conf -- you did fix the sudo bug, right?) and set weekly_locate_enable to "NO".
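If you'd rather script that edit than do it by hand in vi, here's a minimal Python sketch of the same change expressed as a text transformation. The function name and the sample lines are made up for illustration; I'm not reproducing the real contents of the file here, and writing the result back to /etc/defaults/periodic.conf would still have to be done as root.

```python
import re

# Matches a weekly_locate_enable assignment at the start of a line and
# captures the "name=" part so whatever follows the value (such as a
# trailing comment) is left alone.
LOCATE_SETTING = re.compile(
    r'^([ \t]*weekly_locate_enable=)"?[^"\s#]*"?',
    re.MULTILINE,
)


def disable_weekly_locate(conf_text):
    """Return conf_text with weekly_locate_enable set to "NO"."""
    return LOCATE_SETTING.sub(r'\1"NO"', conf_text)


# Hypothetical excerpt in the style of a periodic.conf file.
sample = (
    'weekly_locate_enable="YES"\t# Update locate weekly\n'
    'weekly_whatis_enable="YES"\n'
)
print(disable_weekly_locate(sample))
```

Only the locate line changes; every other setting, and any comment after the value, passes through untouched.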

On 10.4 and before (I checked this on my 10.2.8 strawberry iMac G3 as well, so I'm sure 10.3 is the same), the weekly script doesn't offer this option. However, it does check to see if locate.updatedb is executable before it runs it, so simply make it non-executable: sudo chmod -x /usr/libexec/locate.updatedb
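For the curious, here's roughly what that chmod does, written out as a Python sketch using the standard os and stat modules. It behaves like chmod a-x (it clears the execute bit for owner, group and other regardless of umask), the function names are my own, and the demo deliberately runs on a throwaway temporary file rather than on locate.updatedb itself.

```python
import os
import stat
import tempfile

EXEC_BITS = stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH


def clear_execute_bits(path):
    # Keep every permission bit except the three execute bits.
    mode = stat.S_IMODE(os.stat(path).st_mode)
    os.chmod(path, mode & ~EXEC_BITS)


def is_executable_by_anyone(path):
    return bool(os.stat(path).st_mode & EXEC_BITS)


def demo():
    # Make a scratch file, mark it executable, then strip the execute bits.
    with tempfile.NamedTemporaryFile(delete=False) as handle:
        path = handle.name
    try:
        os.chmod(path, 0o755)
        before = is_executable_by_anyone(path)
        clear_execute_bits(path)
        after = is_executable_by_anyone(path)
        final_mode = stat.S_IMODE(os.stat(path).st_mode)
        return before, after, final_mode
    finally:
        os.remove(path)


print(demo())
```

Once the execute bits are gone, the weekly script's "is it executable?" check fails and the update step is skipped, which is all we wanted.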