Tuesday, May 2, 2017

Three winners combine to make Java go fast, start fast, and stay fast.

By now you should be able to find lots of material
online about Azul's new Falcon compiler technology, which was just released as the default JIT optimizer in Zing, our already-very-cool JVM. Falcons and JIT compilers are obviously all about speed. But with Falcon, Zing doesn't just bring more speed to Java. It brings that speed faster. It brings that speed sooner. It brings that speed all the time.

Falcon produces faster code. That's the point. And by bringing an LLVM backend optimizer to the JVM, Falcon lets Zing leverage the optimization work of literally hundreds of other developers who have been (and will continue to be) busy adding more and more optimizations, and making sure the valuable features of each new processor generation actually get used by optimized code. A great example of this benefit in play is vectorization. LLVM's (and hence Falcon's) vectorization engine will now match normal Java loops with modern instructions, making code like this:
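
A representative loop of the kind described below (a sketch based on that description, not the exact jmh benchmark source) sums only the even values in an array:

```java
// Sketch of a predicated-sum loop: the add happens only when the
// value is even, which is the "predicated operation" discussed below.
public class SumIfEven {
    public static int sumIfEven(int[] values) {
        int sum = 0;
        for (int v : values) {
            if ((v & 1) == 0) {   // predicate: only add even numbers
                sum += v;
            }
        }
        return sum;
    }
}
```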

run faster (as in up to 8x faster) on modern servers than the current HotSpot JVM will (see details of jmh microbenchmark here, and try for yourself with the trial version of Zing). This loop has a predicated operation (add only when the number is even), which makes it hard to match with the vector instructions (SSE, AVX, etc.) that have been around for quite a while. But when the same, unmodified classes are executed on a newer server that has AVX2 instructions (which include some cool new vector masking capabilities) code like this will get fully vectorized, with the speed benefits of the more sophisticated instructions exposed. The cool part is not just the fact that such code gets to be vectorized, or that it is fast. It's that Falcon gets to do this sort of thing without Azul engineers putting in tons of engineering effort to optimize for and keep up with new processors. Others (e.g. Intel) have spent the last few years contributing optimizations to LLVM, and we (and your Java code) now get to benefit from that work.

Could other JIT compilers do this? Of course they could. Eventually. With enough work. But they haven't yet. Falcon is ahead in optimization adoption because it gets to leverage an ongoing stream of new optimization contributions made by others. And we expect it to stay ahead that way. With Falcon and LLVM, Zing gets to be fast sooner, and on an ongoing basis.

Of course Falcon is not just about leveraging other people's work and contributions. We like that, but we had to sink in a bunch of our own work and contributions to make it all possible. The Falcon project at Azul has made significant improvements to LLVM's ability to optimize code in managed runtime environments, which include things like JITs, speculative optimization, deoptimization, and Garbage Collection. When we started, LLVM was able to deal with these concepts to various degrees, but having any of them around in actual code ended up disabling or defeating most optimizations. Over the past three years, Azul's LLVM team has improved that situation dramatically, and successfully and repeatedly landed those improvements upstream such that they can benefit others in the LLVM community (in other runtimes, not just in Java).

With Falcon, we also had to build a host of runtime and Java-specific optimizations that are typical of optimizing JIT compilers, but not typical in static compilation environments: implicit null checks, speculative devirtualization, and guarded inlining are just a few examples. Happily, we found that the tooling available with LLVM and its maturity as an optimization-creation platform have made our new optimization development velocity dwarf the speed at which we were able to create new optimizations in the past.
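
To make one of those concrete, here is a conceptual sketch (hypothetical class names, not actual Falcon output) of what speculative devirtualization with guarded inlining amounts to at a call site:

```java
// A virtual call site that profiling shows is dominated by one receiver type.
interface Shape {
    double area();
}

final class Circle implements Shape {
    final double r;
    Circle(double r) { this.r = r; }
    public double area() { return Math.PI * r * r; }
}

public class GuardedInlineSketch {
    // What the programmer writes: a plain virtual call in a loop.
    static double totalAreaSource(Shape[] shapes) {
        double sum = 0;
        for (Shape s : shapes) sum += s.area();
        return sum;
    }

    // What a JIT conceptually emits after speculating that every observed
    // receiver is a Circle: a cheap type guard, the inlined method body on
    // the fast path, and a fallback (in a real JIT, a deoptimization).
    static double totalAreaOptimized(Shape[] shapes) {
        double sum = 0;
        for (Shape s : shapes) {
            if (s instanceof Circle) {                         // speculation guard
                sum += Math.PI * ((Circle) s).r * ((Circle) s).r; // inlined, no virtual dispatch
            } else {
                sum += s.area();                               // guard failed: slow path
            }
        }
        return sum;
    }
}
```

Both versions compute the same result; the optimized shape simply trades a virtual dispatch for a cheap type check that almost always succeeds.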

Great. But how does all this "now we have faster code" stuff make a trifecta? Falcon completes a picture that we've been working towards at Azul for a while now. It has to do with speed, with how speed behaves, and the ways by which speed can be made to behave better.

Albatross takeoff run

Since the late 1990s when JIT compilers were added to the JVM, Java has had a reputation for being "eventually pretty fast". But that speed has been flaky compared to traditional static environments. Like an albatross, the JVM takes time (and many temporary slowdowns) to get up to speed and airborne. And due to the common pauses associated with typical runtime behaviors (like Stop-The-World GC pauses), Java's eventual speed couldn't even be called "eventually consistent". It has been predictably unpredictable. Consistently inconsistent.

For a quick overview of the various aspects of "speed" that JVM designs have been challenged with, let's start by looking at what JIT compilers typically do. Most JIT environments will load code and start executing it in a slow, interpreted form. As "hotter" parts of the code are identified, they are compiled using a relatively cheap Tier 1 compiler that focuses on producing "better than interpreted" code, which is also used to record the detailed profiling needed for eventual optimization. Finally, after a piece of code has labored through enough slow interpreter and Tier 1 (profiling) execution, higher-tier optimizations are applied. This typically happens only after 10,000+ slower executions have passed.
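
One way to watch this lifecycle for yourself on a stock HotSpot JVM (using its standard -XX:+PrintCompilation diagnostic flag) is to run a small hot loop and watch the method get promoted through the tiers:

```java
// Run with: java -XX:+PrintCompilation TierDemo
// The compilation log will show work() moving from interpreted execution
// through profiling-tier compiles to a final optimized-tier compile.
public class TierDemo {
    static long work(long n) {
        long sum = 0;
        for (long i = 0; i < n; i++) {
            sum += i * 31;
        }
        return sum;
    }

    public static void main(String[] args) {
        long total = 0;
        // Enough invocations to cross the (typically ~10,000 call)
        // compilation thresholds.
        for (int i = 0; i < 20_000; i++) {
            total += work(1_000);
        }
        System.out.println(total);
    }
}
```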

The chart to the right depicts this evolution of code execution over time, showing the relative portions of interpreted, Tier 1 (profiling), and optimized code that is dynamically executed. Since many valuable JIT optimizations rely on speculative assumptions, reversions to interpreted code can (and will) happen as the code learns which speculations actually "stick". Those reversion spikes are referred to as "de-optimizations".

Interpreted and Tier 1 compiled code tend to be significantly slower than optimized code. The amount of time it takes an operation to complete (e.g. the response time to a service request) will obviously be dramatically affected by this mix. We can see how response time evolves over time, as the portion of code that is actually optimized grows and eventually stabilizes.

(In this same depiction, we can see the impact of Stop-The-World GC pauses as well).

We can translate this operation completion time behavior to a depiction of speed over time, showing the speed (as opposed to time) contribution that different optimization levels have towards overall speed:

This is how Java "speed" has behaved until now. You can see the albatross takeoff run as it takes time to work up speed. You can see the eventual speed it gains as it stabilizes on running optimized code. And you can also see the all-too-frequent dips to (literally) zero speed that occur when the entire runtime stalls and performs a Stop-The-World Garbage Collection pause.

Zing, with Falcon and friends, now improves speed in three different ways:

Falcon literally raises the bar by applying new optimizations and better leveraging the latest hardware features available in servers, improving the overall speed of optimized code.

ReadyNow and its profile-playback capabilities remove the albatross takeoff run, replacing it with an immediate rise to full optimized speed at the very beginning of execution.

C4, Zing's pauseless garbage collector, eliminates the Stop-The-World GC stalls altogether, removing the periodic dips to zero speed.

These three key features play off each other to improve the overall speed picture, and finally give Java on servers a speed profile, and speed behavior over time, similar to what people have always expected from C/C++ applications. With Zing, Java is now not only faster, it is consistently faster. From the start.

Albatrosses arguing about the future

Falcon is real. Fast-from-the-start Java is here with ReadyNow. And Stop-The-World GC stalls are a thing of the past with C4. We are kicking ass and taking names. Sign up, or just take Zing for a spin with a free trial.

Or, you could just keep taking those albatross takeoff runs every time, ignore the regular dips to zero speed and altitude, and just keep listening to people who explain how that's all ok, and how some day someone will eventually find (and ship?) some other holy graal...