Generating BMH$Species classes with a jlink plugin
==================================================
The Indify String Concat (ISC) JEP (http://openjdk.java.net/jeps/280)
brings various strategies for generating very efficient string-concatenation
code in Java.
However, the strategy with the best throughput potential also has a bit of a
startup issue, since it pulls in more of java.lang.invoke and depends more
heavily on indy to function.
For example, this little program:
public class HelloConcat {
    public static String value = "Concat!";
    public static int i = 17;
    public static float f = 17.0f/5.0f;
    public static double d = 17.0/5.0;
    public static boolean b = true;
    public static String value2 = " Still here?";

    public static void main(String ... args) throws Exception {
        System.out.println("Hello " + value);
        System.out.println("int: " + i);
        System.out.println("bool: " + b);
        System.out.println("float: " + f);
        System.out.println("double: " + d);
        System.out.println("Hello " + value + " " + value2);
    }
}
... ran with:
java -Djava.lang.invoke.stringConcat=MH_INLINE_SIZED_EXACT -XX:+UseParallelGC HelloConcat
... takes about 298ms on my machine. Ouch! (Granted: this is a dual-socket
machine where even a simple Hello World barely gets anywhere near 100ms without
heavy startup tuning, e.g., pinning to one socket, and jigsaw alone adds about
60ms at the time of writing.)
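Under ISC, each `+` expression in a program like this compiles to an
invokedynamic call bootstrapped by java.lang.invoke.StringConcatFactory. As a
rough sketch of what happens behind the scenes, we can call the bootstrap
directly ourselves (the recipe string and constant below are illustrative, not
what javac emits for any particular expression):

```java
import java.lang.invoke.CallSite;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.lang.invoke.StringConcatFactory;

public class ConcatBootstrap {
    public static void main(String[] args) throws Throwable {
        // In the recipe, '\2' marks a constant and '\1' a dynamic argument,
        // so "\2\1" means: the constant "Hello " followed by one argument.
        CallSite cs = StringConcatFactory.makeConcatWithConstants(
                MethodHandles.lookup(),
                "concat",                                          // invoked name
                MethodType.methodType(String.class, String.class), // (String) -> String
                "\u0002\u0001",
                "Hello ");
        // The call site's target method handle performs the concatenation;
        // which strategy built it depends on java.lang.invoke.stringConcat.
        String result = (String) cs.dynamicInvoker().invokeExact("Concat!");
        System.out.println(result); // prints "Hello Concat!"
    }
}
```

With the MH_INLINE_SIZED_EXACT strategy, the handle returned here is built
from method-handle combinators, which is where the class spinning below comes
in.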
Well, if we run with -Xlog:classload, we notice entries like these:
[0,188s][info][classload] java.lang.invoke.BoundMethodHandle$Species_L source: jrt:/java.base
[0,213s][info][classload] java.lang.invoke.BoundMethodHandle$Species_LL source: __JVM_DefineClass__
[0,233s][info][classload] java.lang.invoke.BoundMethodHandle$Species_L3 source: __JVM_DefineClass__
[0,238s][info][classload] java.lang.invoke.BoundMethodHandle$Species_L4 source: __JVM_DefineClass__
[0,245s][info][classload] java.lang.invoke.BoundMethodHandle$Species_L5 source: __JVM_DefineClass__
[0,249s][info][classload] java.lang.invoke.BoundMethodHandle$Species_L6 source: __JVM_DefineClass__
[0,252s][info][classload] java.lang.invoke.BoundMethodHandle$Species_L6I source: __JVM_DefineClass__
[0,257s][info][classload] java.lang.invoke.BoundMethodHandle$Species_L6II source: __JVM_DefineClass__
[0,260s][info][classload] java.lang.invoke.BoundMethodHandle$Species_L6IIL source: __JVM_DefineClass__
[0,290s][info][classload] java.lang.invoke.BoundMethodHandle$Species_L7 source: __JVM_DefineClass__
[0,292s][info][classload] java.lang.invoke.BoundMethodHandle$Species_L8 source: __JVM_DefineClass__
[0,295s][info][classload] java.lang.invoke.BoundMethodHandle$Species_L9 source: __JVM_DefineClass__
[0,297s][info][classload] java.lang.invoke.BoundMethodHandle$Species_L10 source: __JVM_DefineClass__
[0,300s][info][classload] java.lang.invoke.BoundMethodHandle$Species_L10I source: __JVM_DefineClass__
[0,304s][info][classload] java.lang.invoke.BoundMethodHandle$Species_L10II source: __JVM_DefineClass__
[0,306s][info][classload] java.lang.invoke.BoundMethodHandle$Species_L10IIL source: __JVM_DefineClass__
All except $Species_L are generated at runtime, and spinning such classes is a
known startup cost we have to pay to get applications using indy up and
running. It also seems MH_INLINE_SIZED_EXACT is particularly hungry for those
Species classes.
Anyhow, these classes are generated dynamically because we don't know
beforehand which ones a given program will need, and for some purposes, like
embedded deployments, we don't want to generate a lot of classes statically
for footprint reasons.
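To see why these classes exist at all, here's a minimal sketch using nothing
but the public method-handle API: binding a value into a method handle is
represented internally by a BoundMethodHandle whose species matches the types
of the bound fields, and a new species class is spun the first time that shape
is needed:

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

public class SpeciesDemo {
    public static void main(String[] args) throws Throwable {
        // A handle for String.concat: (String, String) -> String
        MethodHandle concat = MethodHandles.lookup().findVirtual(
                String.class, "concat",
                MethodType.methodType(String.class, String.class));
        // Binding the receiver yields a BoundMethodHandle holding one
        // reference ("L") field, i.e., a Species_L-shaped instance.
        MethodHandle bound = concat.bindTo("Hello ");
        System.out.println((String) bound.invokeExact("Concat!")); // prints "Hello Concat!"
    }
}
```

Running something like this with -Xlog:classload shows whether the matching
Species class comes from jrt:/java.base or has to be defined at runtime via
__JVM_DefineClass__.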
Now, jigsaw brings many things, among them a way to generate code at link
time using jlink plugins.
Here's an experimental patch to add such a plugin to allow us to generate those
BoundMethodHandle$Species classes at link time, with some degree of flexibility:
http://cr.openjdk.java.net/~redestad/scratch/bmh_species_gen.01/
jlink plugins like this run when generating the runnable image, e.g., the JDK
itself, and can be turned on, turned off, or configured depending on what
works best for a specific deployment.
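As a configuration sketch, using such a plugin could look something like the
following; the plugin option name here is hypothetical (the real name is
whatever the experimental patch above settles on), while --module-path,
--add-modules and --output are standard jlink options:

```shell
# Hypothetical invocation: --generate-bmh-species is a placeholder for
# the plugin option added by the experimental patch.
jlink --module-path $JAVA_HOME/jmods \
      --add-modules java.base \
      --generate-bmh-species=L3,L4,L5,L6 \
      --output concat-image
```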
Well, does it help? Using our little HelloConcat program to "benchmark" the
difference:
time for i in {1..100}; do java -Djava.lang.invoke.stringConcat=MH_INLINE_SIZED_EXACT -XX:+UseParallelGC HelloConcat > /dev/null; done
Before:
real 0m29.823s
user 1m25.412s
sys 0m10.836s
After:
real 0m28.492s
user 1m14.540s
sys 0m9.996s
So this simple plugin gave us about a 13ms improvement in wall-clock time per
run, and we spend roughly 13% fewer cycles overall (going by user+sys time).
Not bad, I guess, but there's still room for improvement. I think continuing
to expand this plugin approach to cover more of java.lang.invoke might get us
much closer to the goal of delivering a lot of cool new high-performance
features in JDK 9 while at the same time improving startup.