Monday, January 17, 2011

Lies, damn lies, statistics, and the G5: a beta 9 postmortem

Beta 9 has been an ... interesting release. Besides a serious blocker that escaped Mozilla's notice and causes large portions of some pages not to render at all (bug 623852), and a flickery title bar which is also a Mozilla bug (partially fixed for Tiger in issue 16; see also bug 621762), we are also seeing significantly faster screen display times and the first community statistics for the nanojit.

For those who have been sleeping under a bridge and not reading this blog or the last couple entries, the nanojit is a component of Mozilla's JavaScript interpreter that turns execution "traces" into machine language, a limited and specific form of Just-In-Time compilation (hence the name). When the trace is executed again, the compiled code is run instead, which -- at least in theory -- should be significantly faster. The last couple entries go into this in nauseating detail, so read them if you'd like to get up to speed on the concept.
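To make the "compile once, run the cached code afterwards" idea concrete, here is a deliberately tiny Python toy. It is not how the nanojit is actually implemented: the class name, the trace key and the "compiler" (which just builds a Python closure) are all made up for illustration. The only thing it shows is that the compile cost is paid on the first run of a trace and that the compiled result then sits in a cache.

```python
# Toy model only: a "trace" is identified by a hashable key, and "compiling"
# it just means building a Python closure once and reusing it afterwards.
# None of these names come from Mozilla's code.

class ToyTraceCache:
    def __init__(self):
        self.compiled = {}      # trace key -> compiled callable
        self.compile_count = 0  # how many times the compile cost was paid

    def run(self, key, build, arg):
        fn = self.compiled.get(key)
        if fn is None:
            # First execution of this trace: "compile" it and cache the result.
            fn = build()
            self.compiled[key] = fn
            self.compile_count += 1
        # Every execution, including the first, runs the compiled version.
        return fn(arg)


def build_sum_to_n():
    # Stand-in for emitting machine code for a recorded loop.
    def sum_to_n(n):
        total = 0
        for i in range(n):
            total += i
        return total
    return sum_to_n


cache = ToyTraceCache()
first = cache.run("loop@line10", build_sum_to_n, 100)   # compiles, then runs
second = cache.run("loop@line10", build_sum_to_n, 100)  # cache hit, no compile
print(first, second, cache.compile_count, len(cache.compiled))
```

Note that every distinct trace adds another entry to `compiled`, so the cache only ever grows in this toy; a real implementation has to manage that memory somehow.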

The nanojit was developed on my quad G5, and the timings were done on it in the expectation that the G5's clock speed, out-of-order execution hardware (being a POWER4 disguised as a PowerPC) and fat memory bandwidth should enable it to be the fastest performer. Full-tilt, running at Highest power, the native, non-accelerated JavaScript interpreter in TenFourFox b8 and b9 chugs through SunSpider in around 3500ms. This is about 200ms faster than Firefox 3.6, and is clearly the fastest pure-interpreter score of any Power Mac in Firefox. When the nanojit is turned on for the G5, however, this number actually gets worse -- to around 5000ms. On Dromaeo, the nanojit is slightly slower overall, dragged down by the SunSpider score. Given that the G5 has all this awesome hardware in-core to execute code as fast as possible, I concluded that the nanojit would not be as valuable on PowerPC as it has been on other platforms, since the G3 and G4 don't have that extra silicon. I left the code in there for people to play with and turn the pref on or off (see the Release Notes for how to do so), since Dromaeo showed it to be valuable in certain cases and perhaps I could detect those cases predictively.

Well, the tables have turned. User PoLiYa wrote in to tell me about his fabulous score on SunSpider with his PowerBook G4/1.5GHz and the nanojit turned on: a whopping 2866ms, down from around 6400ms. That's not a typo; his 1.5GHz G4 with the nanojit benched faster on SunSpider than a 2.5GHz G5. I thought this had to be a mistake, so I got out my iBook/1.33GHz and repeated the test. It was no mistake. On the G4, the nanojit not only worked, it easily cut SunSpider scores in half, and Dromaeo was also about double the speed. User agg23 over at 68KMLA tried it on his G3, and also saw a similar speed improvement, from 16000ms down to around 8900ms. Try it yourself and post your stats in the comments.
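For anyone who wants the ratios rather than the raw times, the arithmetic is just "time with the nanojit off" divided by "time with it on". Here is a quick Python sketch using the approximate SunSpider figures quoted in this post; the machine labels and the `speedup` helper are mine, and since most of these numbers are "around" values, treat the results as rough.

```python
def speedup(off_ms, on_ms):
    """Ratio of interpreter-only time to nanojit time; above 1.0 means the nanojit wins."""
    return off_ms / on_ms


# (nanojit off, nanojit on) SunSpider totals in milliseconds, as quoted in this post.
scores = {
    "PowerBook G4/1.5GHz": (6400, 2866),
    "G3": (16000, 8900),
    "Quad G5": (3500, 5000),
}

for machine, (off_ms, on_ms) in scores.items():
    print(f"{machine}: {speedup(off_ms, on_ms):.2f}x")
```

That works out to roughly 2.2x on the G4, 1.8x on the G3, and 0.7x (that is, a slowdown) on the G5.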

This does come with a penalty: since the trace only pays off when it's cached, the browser does indeed cache it to take advantage of it as much as possible, and this in turn causes greater memory pressure. Benchmarks really make the increased memory usage noticeable, but even stock Firefox can bloat a little with regular use despite its aggressive garbage collection algorithms. This was never an issue for PowerPC Firefox before, because it never had a nanojit, but now it does. (This does not happen when the nanojit is disabled, which is the default state in b9 right now.)

There is at least one bug in the nanojit I am aware of that can cause crashes or weird behaviour, but it is infrequently triggered and I have it fixed for beta 10 already.

So, now that we know it is highly beneficial for all but the G5 processor, the plan for beta 10 is to ship the G3 and G4 versions (7400 testing is pending) with the nanojit turned on for content (web pages). If this works, it may also be enabled for chrome (i.e., the browser's own JavaScript, and add-on JavaScript), but we'll start small to make sure there are no surprises.

Where did the G5 go wrong? The most reasonable explanation is that the G5 pays a heavier penalty for stalls and/or memory access than the other processors, and the PowerPC nanojit's built-in optimizer negates the 970's out-of-order advantage because it is already putting the instructions in as optimal an execution order as possible (given the relatively poor optimizability of the code to start with; see the blog entry two back about that). Indeed, there are quite a few stalls in Shark when you load the generated code into it and analyze it, but these stalls occur for the G3 and G4 too; therefore, they simply must hurt more on the G5, and the longer pipeline is likely to blame. It is worth noting that the G5's pipeline is even longer than that of the POWER4 from which it descends, purely for purposes of faster clock speed. Coincidentally enough, I recently acquired a POWER6 as a household server and it will be interesting to see how well it does with this code when I get Firefox 4 compiled on it. As it stands right now, however, the nanojit will be disabled for the G5 by default: you can still turn it on, but that specific build will ship with the prefs off. In the near future I will experiment with modifications to the JavaScript tracer algorithm to see if I can predict those specific code situations where the nanojit pays off for the G5 too.

Assuming Mozilla does not come out with a beta 11, beta 10 will be the final TenFourFox beta. I'm not planning to release a release candidate; we don't have the space, and I'm not all that certain b10 will be the last Firefox beta either. If it is, we will follow up b10 with a formal release and welcome the masses to our fold. Tell your friends: we're bringing our beloved old Macs into the next browser generation. More innovations to come.