Nitro-Extreme on ARM. What happens to Nitro under extreme conditions?

Recently, a new branch called Nitro-Extreme has appeared in the WebKit Trac. The work is not finished yet, but it never hurts to take a look at the current revision.

What is the big deal about this branch? The engineers at Apple changed the format of the core structure of JavaScriptCore: the JSValue. All JavaScript-level objects (from booleans to arrays) can be represented as JSValues, and many functions take JSValues as input and output arguments. Previously, a JSValue was an aligned 32-bit pointer that could also encode atomic data types when its low-order bits were non-zero. Unfortunately, double-precision floating-point numbers cannot be atomic types on 32-bit machines, because pointers are only 4 bytes long. In this branch, JSValue is extended by another 4 bytes, so it is no longer necessary to allocate memory for doubles. This should (hopefully) reduce the amount of work performed by the garbage collector; garbage collection is an expensive operation that may take 20-30% of the total runtime. The downside of the new JSValue is that loading and storing it requires two CPU instructions instead of one. Despite cache improvements, memory accesses are still the bottleneck of CPU performance, especially on embedded systems.
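To make the trade-off concrete, here is a minimal C sketch of the two representations. The tag value, field names, and helper functions are my own illustration, not the actual JavaScriptCore code:

```c
#include <stdint.h>

/* Old-style 32-bit JSValue: a single machine word. An aligned pointer
 * has zero low-order bits, so non-zero low bits can mark an immediate
 * value (a hypothetical "31-bit integer" tag is shown here). A double
 * does not fit, so it must live in a garbage-collected heap cell. */
typedef uintptr_t OldJSValue;

#define TAG_INT 1u  /* illustrative tag: low bit set means integer */

static OldJSValue old_from_int(int32_t i) {
    return ((uintptr_t)i << 1) | TAG_INT;
}
static int old_is_int(OldJSValue v)     { return (int)(v & TAG_INT); }
static int32_t old_to_int(OldJSValue v) { return (int32_t)v >> 1; }

/* New-style 8-byte JSValue: a tag word plus a payload word, overlaid
 * with a double, so a double is stored inline with no heap allocation. */
typedef union {
    double asDouble;      /* stored inline: nothing for the GC to free */
    struct {
        uint32_t payload; /* integer value or cell pointer */
        uint32_t tag;     /* distinguishes int / cell / double */
    } u;
} NewJSValue;
```

Note that on a 32-bit machine a `NewJSValue` is two machine words, so moving one costs two load or store instructions instead of one, which is exactly the downside mentioned above.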

I made measurements with Nitro-Extreme on a Nokia N810 internet tablet equipped with an OMAP-2420 ARM CPU. The results were produced with the interpreter, since the branch does not support the JIT yet. First, we start with SunSpider, as it is the default benchmark of WebKit:

As you can see, the speed is greatly increased for both the math and 3d groups, but the others (especially bitops and controlflow) suffer a considerable performance loss. Overall, the performance on SunSpider is slightly increased.

WindScorpion takes much more time to complete than the other benchmarks, so we usually run it only once. To give the Nitro-Extreme branch a real chance, we repeated the whole measurement multiple times and selected the best result. Since a standard deviation cannot be calculated from a single sample, the harness reports the performance change as "not conclusive".

Can we look behind the raw runtimes? Yes we can, since we have a cycle-accurate Intel XScale simulator called XEEMU. The simulated CPU is configured for 600 MHz with 32K instruction and data caches. The system calls are handled by the Linux kernel (version 2.6.21.5). We selected two candidates from SunSpider: 3d-cube and controlflow-recursive. The former became much faster, the latter much slower.

The number of extra instructions only partly explains the longer runtime of the trunk, so we need to look further. A memory stall cycle means that the CPU cannot execute the next instruction, because either the instruction has not yet been delivered by the instruction fetch stage, or some required input data has not yet been loaded from memory.

Number of stall cycles caused by memory: 257012628 (Trunk)
Number of stall cycles caused by memory: 86277920 (Nitro-Extreme)

In this case the cache access pattern is much better for Nitro-Extreme (nearly a threefold reduction in stall cycles), thus it runs much faster. Perhaps the memory scan performed by the garbage collector causes extra cache line evictions for the trunk.

IPC (instructions per cycle):
Trunk IPC: 0.68
Nitro-Extreme IPC: 0.65
The workload for the core is nearly the same. This is a case similar to controlflow-recursive: the extra executed instructions caused the runtime increase here as well.
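The reasoning above can be written down as a one-line formula: since IPC = instructions / cycles, the runtime in cycles is instructions / IPC, so with nearly equal IPC values the runtime ratio simply tracks the instruction-count ratio. The instruction counts in the usage note are illustrative placeholders, not measured data; only the IPC figures come from the text:

```c
/* IPC = instructions / cycles, therefore cycles = instructions / IPC.
 * With nearly equal IPC (0.68 vs 0.65 above), whichever build executes
 * more instructions needs roughly proportionally more cycles. */
static double cycles_for(double instructions, double ipc) {
    return instructions / ipc;
}
```

For example, at an IPC of 0.68, executing a hypothetical 680 million instructions costs about one billion cycles, i.e. roughly 1.7 seconds on the simulated 600 MHz core.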