Wednesday, 30 March 2016

Continuing on from my last post, here we'll be looking at flags used to control the C2 or server compiler of the Hotspot JVM.

Configuration

In order to reduce the noise created in the compilation logs, we'll be disabling tiered compilation so that only the server compiler will be used. This is done using the following flag:

-XX:-TieredCompilation

We'll also be producing more detailed compiler logs using:

-XX:+UnlockDiagnosticVMOptions
-XX:+LogCompilation

These flags will cause the JVM to generate a file called hotspot_<pid>.log in the current working directory, containing detailed information on the operation of the compiler.

C2 Compile Thresholds

The server compiler has the same threshold flags as the profile-guided client compiler.
Looking at the flags related to Tier4 compilation options (Tier4 is the server compiler), we can see a similar set to those for the Tier3 thresholds described in the last post:

So we might expect to be able to run the same experiments regarding the triggering of invocation, back-edge and compile thresholds.
However, since we are disabling tiered compilation to reduce noise, these thresholds will not affect compilation. In order to control the operation of the server compiler when tiered mode is disabled, we need to use the following flags:

Note that we are no longer seeing information about the tier in the PrintCompilation output. In order to confirm that the server compiler is operating here, we can look at the more detailed LogCompilation output for compile task 75:

We can see that the compiler being used in this compile task is C2 and that the interpreter invocation count iicount is 200.
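The original test program isn't reproduced here, but a minimal sketch of this kind of experiment might look like the following. The class and method names are my own invention; run with -XX:-TieredCompilation -XX:CompileThreshold=200 -XX:+PrintCompilation:

```java
// Sketch: a method invoked past -XX:CompileThreshold=200 with tiered
// compilation disabled, so that the C2 compiler picks it up.
public class C2ThresholdDemo {

    // A small amount of work so the method is not trivially empty
    static int compileMe(int input) {
        return input * 31 + 7;
    }

    public static void main(String[] args) {
        int accumulator = 0;
        // Invoke the candidate method comfortably past the threshold of 200
        for (int i = 0; i < 250; i++) {
            accumulator += compileMe(i);
        }
        System.out.println(accumulator);
    }
}
```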

C2 BackEdge Threshold

The server compiler's handling of loop back-edge thresholds again differs from the tiered C1 flags. Using this example program we can see that an on-stack replacement is triggered when the back-edge count is 14563.
This is despite the BackEdgeThreshold flag being set to a lower value.

What is interesting is that the nmethod node contains a count that is equal to the value of -XX:CompileThreshold. If we reduce this threshold to 5000, we can see that the on-stack replacement happens sooner:

Here, OSR occurs after a back-edge count of 7793, while the nmethod node has count='5000'.
From these observations, we can infer that loop back-edge compilation triggers are related to the CompileThreshold flag, and that if we wish to control when the server compiler kicks in, we need to alter only the CompileThreshold flag.
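The example program itself isn't shown above, but a minimal sketch of a loop-heavy method that provokes on-stack replacement might look like this (names and iteration counts are my own invention):

```java
// Sketch: a single invocation containing many loop back-edges, so that
// the interpreter requests an OSR compilation mid-invocation.
// Run with: java -XX:-TieredCompilation -XX:CompileThreshold=5000
//           -XX:+PrintCompilation OsrDemo
public class OsrDemo {

    static long loopHeavy(int iterations) {
        long sum = 0;
        // Each iteration contributes one loop back-edge
        for (int i = 0; i < iterations; i++) {
            sum += i;
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(loopHeavy(1_000_000));
    }
}
```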

Inlining

When a method is compiled to native code, the compiler has the option to perform a further optimisation: inlining.
Inlining callee methods reduces method-dispatch overhead, and can give the compiler a broader scope for further optimisation, e.g. dead-code elimination or escape analysis.
Inlining decisions are based on the size of the method to be inlined. There are two thresholds that we need to be concerned with:

-XX:MaxInlineSize
-XX:FreqInlineSize

These thresholds are specified in byte-codes. Let's start with an example of a method that is small enough for inlining:
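A hypothetical reconstruction of such a method follows. The shouldInline name is taken from the log output discussed below, but the method body and caller are my own, and the exact byte-code size and call-site indices will depend on how the code is written:

```java
public class InlineDemo {

    // Around ten byte-codes of simple arithmetic: small enough to inline
    // under the default MaxInlineSize, but not with -XX:MaxInlineSize=9
    static int shouldInline(int a, int b) {
        return (a + b) * (a - b) + a;
    }

    static int caller(int x) {
        // This is the call-site that PrintInlining reports on
        return shouldInline(x, 1);
    }

    public static void main(String[] args) {
        long total = 0;
        for (int i = 0; i < 100_000; i++) {
            total += caller(i);
        }
        System.out.println(total);
    }
}
```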

Here we can see that the invocation of the shouldInline method is at byte-code 1, so the output of PrintInlining is referring to the call-site that is inlined (the @ 1 part of the log entry).
If we reduce the MaxInlineSize parameter to be less than 10 byte-codes using -XX:MaxInlineSize=9, then inlining will fail:

Note the message callee is too large - this is something to look out for if you expect methods in hot code-paths to be inlined; it means that the compiler did not inline this method due to its size.
Now, the default value of MaxInlineSize is 35 byte-codes, which is not a lot of code. The compilation process is a trade-off between achieving good performance, and the space overhead of compiled code, among other things.

The compiler will still inline a larger method, if it is called often enough. For methods called frequently enough, the size threshold that determines inlining is FreqInlineSize.
Let's re-run our experiment, and increase the number of invocations:
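A sketch of what the re-run might look like: the same shape of callee as before, but now invoked enough times that the runtime treats the call-site as hot and applies FreqInlineSize rather than MaxInlineSize. Names and iteration counts are illustrative; run with -XX:MaxInlineSize=9 -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining:

```java
public class FreqInlineDemo {

    // Over the reduced MaxInlineSize=9 limit, so cold inlining fails
    static int shouldInline(int a, int b) {
        return (a + b) * (a - b) + a;
    }

    public static void main(String[] args) {
        long total = 0;
        // Many more invocations than before, to mark the call-site as hot
        for (int i = 0; i < 5_000_000; i++) {
            total += shouldInline(i, 1);
        }
        System.out.println(total);
    }
}
```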

First, we see the same message declaring the callee method to be too large, but later on in the compilation process, the callee method is inlined. This corresponds with the message inline (hot), meaning that the runtime has decided this method is called frequently enough to inline.

If we reduce the FreqInlineSize to be less than 10 byte-codes using -XX:FreqInlineSize=9, then inlining will once again fail:

Summary

We have seen that further to the Tier3 client compiler thresholds, there are Tier4 thresholds for the longer-running C2 compiler. When tiered compilation is disabled, other threshold flags come into play.

Inlining decisions are based on the size of the callee method, and the frequency with which it is called. The Hotspot compiler will attempt to aggressively inline hot methods, so it is important to understand whether the design of our code is hindering the ability of the compiler to perform available optimisations.

In my next post, I'll be looking at some of the tooling available to help analyse and understand the operation of the JVM Hotspot compiler.

Saturday, 5 March 2016

In this post, we will explore some of the various flags that can affect the operation of the JVM's JIT compiler.
Anything demonstrated in this post should come with a public health warning - these options are explored for reference only, and modifying them without being able to observe and reason about their effects should be avoided.
You have been warned.

The two compilers

The JVM that ships with OpenJDK contains two compiler back-ends:

C1, also known as 'client'

C2, also known as 'server'

The C1 compiler has a number of different modes, and will alter its response to a compilation request given a number of system factors, including, but not limited to, the current workload of the C1 & C2 compiler thread pool.

Given these different modes, the JDK refers to different tiers, which can be broken down as follows:

Tier1 - client compiler with no profiling information

Tier2 - client compiler with basic counters

Tier3 - client compiler with profiling information

Tier4 - server compiler

From this point on, when referring to the C1 compiler, I'm talking about Tier3.

Thresholds

At a very high level, the JVM bytecode interpreter uses method invocation and loop back-edge counting in order to decide when a method should be compiled.
Since it would be wasteful and expensive to compile methods that are only ever called a small number of times, the interpreter will wait until a method's invocation count exceeds a particular threshold before requesting compilation.

Thresholds for various levels of compilation can be modified using flags passed to the JVM on the command line.

The first such threshold that is likely to be triggered is the C1 Compilation Threshold.

Flags side-note

To view all the available flags that can be passed to the JVM, run the following command:

java -XX:+PrintFlagsFinal

Running this on my local install of JDK 1.8.0_60-b27 shows that there are 772 flags available:

[pricem@metal ~]$ java -XX:+PrintFlagsFinal 2>&1 | wc -l
772

For the truly intrepid, there are even more tunables available if we unlock diagnostic options (more on this later):

The Tier3InvocationThreshold setting informs the interpreter that it should emit a compile task to the C1 compiler when an interpreted method has been executed 200 times.

Observing this should be simple - all we need to do is write a method, call it 200 times and watch the compiler doing its work.
Enabling logging of compiler operation is a simple matter of supplying another JVM argument on start-up:

-XX:+PrintCompilation

Without further ado, let us try to observe our method being compiled after 200 invocations. The script being called will log any statements from the program, and also any other output to stdout that is relevant to compilations for this project.
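The actual script and program aren't reproduced here, but a minimal stand-in for the test might look like this (names are my own invention; run with java -XX:+PrintCompilation ThresholdDemo):

```java
public class ThresholdDemo {

    // The method we expect to see appear in the PrintCompilation output
    static int doWork(int value) {
        return value * 2 + 1;
    }

    public static void main(String[] args) {
        int result = 0;
        // Just past the 200-invocation threshold
        for (int i = 0; i < 210; i++) {
            result += doWork(i);
        }
        System.out.println(result);
    }
}
```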

C1 Loop Back-edge Threshold

As mentioned earlier, the JVM bytecode interpreter will also monitor loop counts within a method. This mechanism allows the runtime to spot that a method is hot despite it not being invoked many times.
For example, if we have a method that contains a loop executing many thousands of times, we would want that method to be compiled, even if it was only invoked relatively infrequently.

Once again, there seems to be a slight difference in the required number of loop iterations and the specified threshold. In this case, we need to execute the loop 60416 times in order for the interpreter to recognise this method as hot. 60416 just happens to be 1024 * 59, it's almost as though there's a pattern here...

PrintCompilation format

In order to understand what is happening here, we need to take a brief foray into understanding the output from the PrintCompilation command. Rather than draw my own fancy graphic, I'm going to reference a slide from Doug Hawkins' excellent talk JVM Mechanics.

PrintCompilation log format

Using this reference, we can break down the information in the log output from our test program:

With a little bit of reasoning, we can figure out that bytecode 5 is the point at which we load the loop counter variable i in order to do the comparison to the loopCount parameter.

This bytecode index then, is at the start of the loop, and would be an ideal place to jump to executing the newly compiled method.

On-Stack Replacement

On-Stack replacement is a mechanism that allows the interpreter to take advantage of compiled code, even when it is still executing a loop for that method in interpreted mode.
If we imagine a hypothetical workflow for our JVM to be:

Start executing a method loopyMethod in the interpreter

Within loopyMethod, we execute an expensive loop body 1,000,000 times

The interpreter will see that the loop count has exceeded the Tier3BackedgeThreshold setting

The interpreter will request compilation of loopyMethod

The method body is expensive and slow, and we want to start using the compiled version immediately. Without OSR, the interpreter would have to complete the 1,000,000 iterations of slow interpreted code, dispatching to the compiled method only on the next call to loopyMethod()

With OSR, the interpreter can dispatch to the compiled frame at the start of the next loop iteration

Execution will now continue in the compiled method body

C1 Compilation Threshold

There is one other threshold that we need to concern ourselves with, and that is the Tier3CompileThreshold. This particular setting is used to catch a method containing a warm loop: one whose back-edge count is not, on its own, high enough to trigger on-stack replacement.

The heuristic for determining whether a method should be compiled, described here, looks something like this:

We need to make sure that the method is called fewer than Tier3InvocationThreshold times and more than Tier3MinInvocationThreshold times, while the combined invocation and back-edge count grows to greater than Tier3CompileThreshold. On the next invocation of the method, compilation should occur.

So, if we invoke a method 100 times, and it generates a loop back-edge count of 21 per invocation, then we should exceed the Tier3CompileThreshold:

100 + (100 * 21) == 2200 > Tier3CompileThreshold

On the 101st invocation, the interpreter should trigger a compilation.
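A hypothetical shape for such a method, where each call contributes one invocation count and roughly 21 loop back-edges (the body is my own invention):

```java
public class CombinedCountDemo {

    static int smallLoop(int seed) {
        int total = seed;
        // 21 iterations, so roughly 21 back-edges per invocation
        for (int i = 0; i < 21; i++) {
            total += i;
        }
        return total;
    }

    public static void main(String[] args) {
        int result = 0;
        // Enough calls for the combined count to cross Tier3CompileThreshold
        for (int call = 0; call < 150; call++) {
            result += smallLoop(call);
        }
        System.out.println(result);
    }
}
```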

Of course, given that so far each threshold seems to have had some power-of-two-based wiggle room as far as the interpreter is concerned, this magic formula doesn't work out exactly. In fact, in this example, the method must be executed 147 times in order for compilation to occur!

It can be seen that in this scenario, we have not triggered the invocation threshold (i.e. invocation count < 200), nor have we triggered the back-edge threshold. The interpreter has correctly identified the method as being worthy of compilation, so the runtime is able to provide an optimised version for future invocations.

Summary

We have seen that for the C1 compiler when operating in tiered mode, there are 3 flags that control when a method is considered for compilation.

In my next post, I'll be looking at the corresponding flags for the C2 compiler, and how they are affected by tiered and non-tiered mode.