Wednesday, August 10, 2011

This is the first of a set (not a series...there's no particular order) of articles I'll write on how JRuby is using invokedynamic. Hopefully they will show Rubyists how drastically invokedynamic is going to improve JRuby, and show other JVM language folks how to use invokedynamic effectively.

Hello friends!

I figured it's about time for me to start writing a bit on how JRuby is actually using invokedynamic.

As of today, JRuby utilizes invokedynamic far more than any other mainstream JVM language. We have worked very closely with the JSR leads and the OpenJDK developers to make sure invokedynamic runs well. And we have been advocating invokedynamic as a game-changer for the JVM and for JVM languages.

Let's explore one area where JRuby is using invokedynamic: Ruby's "constant" lookup.

Non-constant "Constants"

A constant in Ruby is defined on a class or module, and is subject to Ruby's typical namespacing logic. Constants start with a capital letter.

I often put "constants" in parentheses because constant values can be reassigned. This will usually produce a warning...but not an error. This means we can't simply look up constant values once and never look them up again (without special tricks I'll get into later).

Constant lookup is a also bit more complicated than method lookup. When retrieving a constant, Ruby first scans lexically-enclosing scopes' classes and modules for the constant. If the constant can't be found, the next search walks the current class's inheritance hierarchy. If we still can't find the constant, const_missing is called on the current class.

In order to make constant lookup fast, we want to do some sort of caching. In classic JRuby, Ruby 1.9 (YARV), Rubinius, and probably most other modern Ruby implementations, this is done with a global serial number. Whenever a constant is updated or a module is included (changing the inheritance hierarchy) all cached constants everywhere are forced to lookup again.

I have played with mechanisms for reducing the global impact of constant invalidation, but because constants can be looked up lexically it's simply too complicated to localize (since we need invalidate classes down-hierarchy from the change and we also need to update all lexical scopes that might see the change).

Constant Invalidation in JRuby 1.6

The logic in JRuby 1.6 goes something like this:

If cache is empty or invalid, retrieve the constant value in the usual way (lexical, hierarchical search). Store the value with the current global constant serial number.

On subsequent lookups, check cache for validity against the global constant serial number. If we have a value cached and the cache is still valid, return it.

If any constant in the system is updated, or if a module is included into an existing class hierarchy, flip the serial number and force future constant lookups to re-cache.

This turns out to work fairly well. The same mechanism in Ruby 1.9 produced drastically faster constant lookups, and JRuby's performance is even better than 1.9.

But there's a problem here. Because there's this constant pinging of the global constant serial number, every constant access can potentially produce a new value. So we're paying the cost to check that serial number as well as interfering with optimizations that want to see constant values actually be constant.

Can we do better?

Quick Invokedynamic Primer

The main atom of invokedynamic is the MethodHandle. Method handles are essentially function pointers, which can point at Java methods or fields, constructors, constant values, or other method handles. Invokedynamic also provides the MethodHandles utility class, which lets us juggle method handles in various ways:

You can think of method handles and the chains of adapter handles that stitch them together as a special sort of functional language the JVM knows how to optimize. Given a chain of handles, you should usually get a piece of code that optimizes as well as (or better, in some cases) writing the same logic by hand in Java.

The invokedynamic bytecode simply provides a place to plug a method handle chain into code. When the JVM encounters an invokedynamic bytecode, it calls a "bootstrap method" associated with that bytecode for further instructions.

The bootstrap method returns a CallSite object, provided in java.lang.invoke. There are constant call sites for constant values, mutable call sites for when the target handle chain may have to change, and volatile call sites for when those changes must immediately be reflected across threads.

Once a CallSite has been installed for a given invokedynamic, subsequent hits skip the bootstrapping process, and we're off to the races.

SwitchPoint

I mentioned that the MethodHandles class provides a "guardWithTest" method for combining a test, a target (the "then" branch), and a fallback (the "else" branch). SwitchPoint, also in java.lang.invoke, acts like an on/off guardWithTest that once turned off can never be turned on again. You provide a target and fallback, and until the "switch" is thrown the target will be invoked. After the switch is thrown the fallback will be called.

What's the difference between this and a guardWithTest where the test just pings some global value? The difference is that SwitchPoint doesn't need to check anything.

Optimization and Deoptimization in the JVM

When the JVM decides to optimize a piece of code, it does so in an optimistic way. In very broad terms, this means it assumes its information up to this point is perfect: no new methods or classes will be introduced, profiling information is accurate, etc. Based on this "perfect" view of the world, it aggressively optimizes code.

Of course, the world isn't perfect. The JVM has to give up profiling and monitoring at some point, so it always has an imperfect view of the system. In order to avoid its aggressive optimizations triggering a fatal error later on, JVMs like OpenJDK (Hotspot) do something called deoptimization.

Deoptimization is the process by which running, optimized code can adapt on-the-fly to a changing system. In OpenJDK, there's several ways this is accomplished:

Branches out of compiled code back into the interpreter, when compiled code is determined to be invalid.

Guards around inlined virtual method accesses, to ensure we're still calling against the same class.

On-stack replacement, for fixing up a running method already on the native call stack

...

Because of this ability to deoptimize, it's possible to support zero-cost guards at the JVM level. Returning to SwitchPoint, we can see how this new form of "guardWithTest" can be basically free: we're explicitly telling the JVM this switch is a rare occurrence it can optimize aggressively.

SwitchPoint for Constant Lookup

JRuby on invokedynamic uses SwitchPoint for constant lookup, as you'd expect. Instead of actively pinging that global constant serial number, we instead use a global SwitchPoint object to guard all cached constant accesses. When it comes time to invalidate the system's constants, we just flip the SwitchPoint off and create a new one. All SwitchPoint-guarded constant accesses in the system must then recache and use the new SwitchPoint.

In a well-behaved system, we should reach a steady state where no new constants are being defined and no new modules are being introduced. Because we're using SwitchPoint, the stable state means all constant accesses are treated as truly constant by the JVM, allowing optimizations that were impossible before. And of course this also means that we've achieved constant lookup performance very near a theoretical maximum.

Numbers

First, a caveat: SwitchPoint is implemented in a fairly naïve way in the released OpenJDK 7, using a volatile field as the switch value. As a result, SwitchPoint guardWithTest is very slow currently, and JRuby's SwitchPoint-based constant logic must be enabled. I show numbers below based on leading-edge Hotspot compiler patches that will go into the first update release (numbers provided by one of the Hotspot devs, Christian Thalinger...thanks Christian!)

The benchmark we're running is a modified version of bench_const_lookup in JRuby's benchmark suite. The modification here runs more iterations (10M instead of 1M) with more constant lookups (50 instead of 10) to get a better idea of optimized performance.

As I said before, this is pretty good. JRuby's existing constant lookup performance is roughly 2x faster than Ruby 1.9.2.

Next, we'll try JRuby with SwitchPoint constants on Java 7 (released version, so we expect this to be slow):

The perf hit of purely volatile SwitchPoint is apparent.

And finally, JRuby with SwitchPoint constants on a dev build of Hotspot, which uses deoptimization rather than a volatile field:

This is basically the performance of the 10M iteration loop alone. In fact, if you look at the resulting optimized assembly, the constant accesses have been eliminated entirely since they're optimistically inlined and never used. Of course this would normally not happen in real code, but it shows how much better the JVM can optimized Ruby's behavior using invokedynamic.

Tuesday, August 2, 2011

Java 7 has landed, with a modest set of new features and a few major improvements as well. What can you expect from JRuby running on Java 7?

What's In Java 7

The biggest changes in Java 7 are not related to the Java language at all. Sure, there's the "project coin" enhancements to the Java language, which add some exception-handling shortcuts, new literals for numbers, arrays, hashes, the oft-requested "strings in switch" support, and a few other things. But they're modest incremental changes; the real revolution is at the JVM and JDK level.

Invokedynamic

The most important change in Java 7 is the incorporation of a new bytecode -- invokedynamic -- and an API for building chains of "method handles" to back that bytecode up.

You can look at invokedynamic as a way for JVM users to communicate directly with the optimizing backend of the JVM. Method handles act as both function pointers and as function combinators, allowing a built-in way to construct a call protocol flow from a caller to a callee. You can move arguments around, insert new arguments, process existing arguments and return values, catch exceptions, and perform fast guarded branches between two (or more) paths. The invokedynamic bytecode itself provides a bytecode-level hook to which you attach your method handle chain, with the assumption that the JVM can optimize that chain directly into the invokedynamic caller.

The tl;dr is that invokedynamic makes it possible for the JVM to see through complicated method call logic, such as that found in dynamic languages, and optimize that logic like it would for regular "static" calls.

JRuby's master branch already takes heavy advantage of invokedynamic, by routing most Ruby calls through invokedynamic operations. For simple paths and those that have been optimized by the Hotspot guys (Hotspot is the VM at the core of OpenJDK), invokedynamic often provides performance improvements of 150-200%, with work ongoing to make it even faster. Other paths may not be as well-optimized by the "dot zero" version of OpenJDK 7, so there's opportunity to improve them.

Because JRuby is already well along the road to utilizing invokedynamic, you can try it out today.

Build your own JRuby from master or grab a snapshot from our CI server.

We're looking for small benchmarks that show the performance of invokedynamic (good or bad), so please contact me, the JRuby team, or the JRuby users mailing list with your reports from the field. Also, feel free to open performance bugs on the JRuby bug tracker if invokedynamic performs worse than non-invokedynamic. Pass -Xcompile.invokedynamic=false to JRuby to revert to the old non-invokedynamic logic.

NIO.2

NIO is Java's "New IO" APIs, a set of wrappers around low-level file-descriptor logic and memory buffers. NIO has been around since Java 1.4, but the recent update -- dubbed NIO.2 -- brings a sorely-needed update to the functionality provided:

Filesystem operations (like symlinks, permissions, etc) are now almost all available through NIO.2's filesystem APIs. This also includes standard, cross-platform support for filesystem events, such as watching a directory for changes (using efficient OS-level operations, rather than polling).

File and directory walking now comes with considerably less overhead and more options for filtering directory lists before handing filenames off to user code. There's also support for opening a directory directly and walking its contents as you would a file.

Most IO channel types now have asynchronous versions. Asynchronous in this case means "punt my IO operation to a built-in thread pool", with subsequent code checking on the status of those operations and getting results from a "future" handle.

For JRuby, the new IO APIs will mean we can support more filesystem operations across platforms without resorting to native code. It will also provide JRuby users a means of handling filesystem events and asynchronous IO operations without using a platform-specific library. We have not yet started adding NIO.2 support to JRuby's core classes, but that will come soon.

General Improvements

There's lots of smaller, less flashy changes in OpenJDK that also appear to help JRuby.

Even without invokedynamic, the latest OpenJDK 7 builds usually perform better than OpenJDK 6. Some benchmarks have proven to be as much as 2x faster, just by upgrading the JVM! General perf improvements will be more modest, but in almost every case we've tested OpenJDK 7 definitely performs better.

The release of OpenJDK 7 also brings improvements to the "tiered" compilation mode. Tiered compilation aims to merge the benefits of the "client" mode (fast startup) with those of the "server" mode (maximum peak performance). You can turn on tiered compilation using -XX:+TieredCompilation (in JAVA_OPTS or at the "java" command line, or prefixed with -J when passed to JRuby). We're looking for user reports about how well "tiered" mode works, too.

This general improvement means that even JRuby 1.6.x users can take advantage of OpenJDK 7 today, with the promise of even bigger improvements in JRuby 1.7 (our target release for pervasive invokedynamic support).

Consistency

As with previous Java releases, a great deal of care has been taken to ensure existing applications work properly. That applies as well to Java 7. We have been testing against Java 7 for over a year, on and off, and recently started running tests "green" with even heavy invokedynamic use.

We have made no major Java 7-specific fixes in JRuby...it should generally "just work".

Let Us Know!

As always, we really want to hear from you bleeding-edge users that are playing around with JRuby on Java 7. Please don't be shy...let us know how it works for you!

Update: The Hotspot guys have been helping me find invokedynamic bottlenecks in a few JRuby microbenchmarks, and discovered that a flaw in invokedynamic was causing too much code to inline, forcing out more important optimizations. The details belong in another post, but they offered me a long Hotspot flag to accomplish basically what their fix does: -XX:CompileCommand=dontinline,org.jruby.runtime.invokedynamic.InvokeDynamicSupport::invocationFallback ... With this flag, performance on e.g. "tak" easily beats stock JRuby (see the third benchmark run here: https://gist.github.com/1121880).

I would recommend trying this flag if you are finding invokedynamic slowdowns in JRuby.