This post is very long, but please try to read it without skipping. The text naturally builds on observations made earlier in the same text. That’s also why the first part is rather boring and slow. Buckle up, and read. If you are running out of steam, take a break, and pick up where you left off. Please read it in its entirety before asking questions!

Preface

Programming languages like Java provide the facilities for subtyping/polymorphism as one of the ways to construct modular and reusable software. This language choice naturally comes at a price, since there is no hardware support for virtual calls, and therefore runtimes have to emulate this behavior. In many, many cases the performance of method dispatch is not important. Actually, in a vast majority of cases, the low-level performance concerns are not the real concerns.

However, there are cases when method dispatch performance is important, and there you need to understand how dispatch works, what runtimes optimize for you, and what you can do to cheat and/or emulate similar behavior in your code. For example, in the course of the String Compression work, we were faced with the problem of selecting the coder for a given String. The obvious and highly maintainable approach of creating a Coder interface, a few implementations, and dispatching virtual calls over it, ran into performance problems on very tiny benchmarks. Therefore, we needed to contemplate something better. After a few experiments, this post was born as a reference for others who might try to do the same. This post also tangentially touches on the inlining of virtual calls, as a natural part of the optimization.

As is the good tradition, we will take some diversions into benchmarking methodology and general low-level performance engineering, so that even though the post itself is targeted at platform people, the general crowd can still learn a few tricks. As usual, if you still haven’t learned about JMH and/or haven’t looked through the JMH samples, then I suggest you do that first before reading the rest of this post, for the best experience.

This post also assumes a good understanding of Java, Java bytecode, compilers, runtimes, and x86 assembly. Many readers complain my posts lack completeness, in the sense that I omit the details of how to arrive at a particular conclusion. This time I decided to use lots of notes to highlight how one can decipher the compiler decisions, how to read the generated assembly, etc. It should be fun and educational, but you may also skip the parts formatted like callouts, if you don’t have time.

1. Problem

We can formalize the problem as follows. Suppose we have a class Data:

public class Data {
    byte[] data;
}

…​and we want to do different things based on that data. Suppose we have N versions of code, each of which provides some sort of meaning for the data. In String Compression, for example, data can be either a 1-byte-encoded array, or a 2-byte-encoded array. Something somewhere should provide a meaning for that data; in other words, demangle it and/or do useful work. Suppose, for example, we have a "coder" abstraction that does something along the lines of:
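The original listing is not preserved in this chunk; a minimal sketch of such a coder abstraction might look like this (names and payload logic are illustrative, not the actual String Compression code):

```java
// Illustrative sketch only: the real coders do actual encoding work.
interface Coder {
    int work(byte[] data);
}

class Coder0 implements Coder {
    @Override
    public int work(byte[] data) {
        return data.length;       // e.g. treat data as a 1-byte-encoded array
    }
}

class Coder1 implements Coder {
    @Override
    public int work(byte[] data) {
        return data.length / 2;   // e.g. treat data as a 2-byte-encoded array
    }
}
```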

The question is, what is the best way to implement different coders and dispatch over them?

2. Experimental Setup

2.1. Hardware

While the exact code generation details surely differ among platforms, most of the behaviors we are about to show are platform-agnostic, being done in the high-level optimizers. Because of that, we simplify our lives and run the tests on a single 1x4x2 i7-4790K 4.0 GHz machine, running Linux x86_64 and JDK 8u40 EA. Suspicious readers are welcome to reproduce the results on their respective platforms.

2.2. Cases To Try

Granted, you may just inline the coder implementations straight into Data, but that’s a questionable practice, especially if you want to have multiple coder implementations. Therefore, we consider these ways to implement coders:

Static: Make static implementations of Coders, all with static methods.

Dynamic_Interface: Make Coder a proper interface, and provide the implementations.

Dynamic_Abstract: Same as above, but make Coder the abstract superclass, and provide the implementations.

All three of these methods of implementing the behavior need some way to encode the selection in the Data itself. We can come up with four schemes for such an encoding:

ID_Switch: Store a byte ID, and select the coder by switching over it.

ID_IfElse: Store a byte ID, and select the coder by if-else-ing over it.

Bool_IfElse: Store a boolean, and select the coder by if-else-ing over it. (Works only for N=2)

Ref: Store a Coder reference, and do a virtual call.

There are also other ways to encode, e.g. storing the java.lang.reflect.Method or java.lang.invoke.MethodHandle, but we don’t care about them at this point, because both are somewhat similar to Ref, and also require some advanced compiler magic to work in a performant way. We will cover these in the future.

The choice of the encoding scheme also has to take the footprint into consideration. An additional field in the Data object can increase the instance size. However, since Java objects are usually aligned at 8 bytes, there is a sizeable "alignment shadow" in the object, where one can cram a field without increasing the instance size. Hiding a boolean/byte field there could be easier than fitting an entire reference field. JOL can be used to look into the field/object layout.

Combining these behaviors and selection schemes, we have quite a few variants already. Thankfully, not all of them are required for a concrete N. However, it would be interesting to dissect each of the relevant ones to understand how runtimes deal with code like this.

2.3. Benchmarks

The source code for the benchmarks is available here. Since we are dealing with a very narrow VM effect, let me explain a few things about the benchmark. This is what a single benchmark case looks like:
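The original listing is not reproduced in this chunk; stripped down to plain Java (with the JMH annotations noted in comments, since this sketch is meant to be self-contained), a single case might look roughly like this — class and field names here are assumptions:

```java
// Rough reconstruction of one benchmark case; the real code uses JMH.
class Data {
    byte[] data;
    Coder coder;          // the "Ref" encoding: store the coder itself
}

interface Coder {
    int work(byte[] data);
}

class DispatchBench {
    Data[] datas;         // pre-populated with coders of ID=0 and ID=1

    // @Benchmark in the real JMH code; the loop is hand-rolled, see the notes
    public void dynamic_Interface_Ref() {
        Data[] l = datas;
        for (int i = 0; i < l.length; i++) {
            do_Dynamic_Interface_Ref(l[i]);  // result deliberately not consumed
        }
    }

    // @CompilerControl(CompilerControl.Mode.DONT_INLINE) in the real JMH code
    int do_Dynamic_Interface_Ref(Data d) {
        return d.coder.work(d.data);         // invokeinterface dispatch
    }
}
```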

We do a loop over a few Data instances. That will be important when we start looking into dealing with different coders: the runtime should be "poisoned" with the different types of coders at the call site. In the example above, the test has two types of coders, with ID=0 and ID=1.

We hand-optimize the loop in @Benchmark method, because its performance is important when we deal with the interpreter.

The benchmark loop does not consume the payload result. This is done to make the nanobenchmark amplify the costs of the payload itself, not the infrastructure costs. This practice is DANGEROUS and ignores the JMH recommendations. This trick should only be done by trained professionals, who would then verify it does not obliterate the benchmark, due to e.g. dead code elimination.

The payload method (do_Dynamic_Interface_Ref in this example) is annotated with a special JMH hint that prevents inlining of payload into the benchmark loop. This is an important prerequisite for avoiding dead code elimination, at least on HotSpot today. It also helps to separate the payload body from the remaining code in the assembly dump.

2.4. VM Modes

By default, HotSpot VM these days provides the tiered compilation: first, the code is interpreted, then it’s compiled with a baseline compiler (also known as "client" compiler, or C1), and finally it is compiled with an aggressively optimizing compiler (also known as "server" compiler, or C2).

The actual pipeline is more complicated. VM constants for tiers of compilation shine some light on what each level does. level=1 is pure C1, level=4 is pure C2. Levels in-between are C1 with profiling. The transitions between levels are governed by an advanced policy, and you can talk about them forever.

While most people care about the performance in C2, since the code that gets aggressively optimized is by construction the same code that consumes lots of time, we also care about the performance in pathological modes where the code hasn’t yet made it to C2 compilation.

3. Single Type

If you are inexperienced with walls of assembly, it is a good idea to pay attention to this part.

Let us start with a scenario with only a single coder. While this scenario is superficial, because you might as well inline the coder implementation into the data class itself, it still serves as an interesting base case that we want to untangle before diving into more complicated ones. Here, we only concern ourselves with the Ref style of encoding, since the other styles have almost no meaning here.

The JMH benchmarks consider a single Coder0 that extends AbstractCoder, and that’s the only subclass of this abstract class. Coder0 also implements Coder interface, and there is only a single interface implementer. Even though we have a singular Coder0 that implements/extends both the abstract class and interface, we try to call for its work either via abstract class (invokevirtual), interface (invokeinterface), or the static method (invokestatic).

We can immediately spot the difference: static case is faster than dynamic, and interface case is slower than abstract class. While the difference seems minuscule, in some cases, like ours, it might matter a great deal. Let’s figure out why the performance is different. To do that, we employ JMH’s -prof perfasm profiler, which maps the Linux perf event sampling onto disassembly.

There are other ways to get low-level profiling and disassembly, including but not limited to dealing with the VM yourself, using Solaris Studio Performance Analyzer, JITWatch, and others. Every approach has its advantages and its limits; learn how to employ the right tool for a particular case. JMH perfasm is ideal for analyzing the JMH nanobenchmarks we will deal with in this post.

Editorial note. The assembly listings you will see in this post are slightly different from what you will see in the actual assembly dump from the VM. It would be much larger, but thanks to perfasm you will have to only care about the hottest blocks. It will also have lots of other info, including the machine addresses, Java bytecode instructions, Java stack traces, and other interesting VM comments. We would like to keep you oblivious of those comments at this point, in order to focus on things that do matter now. You can always repeat the tests in your environment and get the full disassembly.

This method yields simple and straight-forward code, and that is not surprising: we just call the static method. Static methods are also statically bound, which means we don’t need to resolve which method to call at runtime. We know exactly which method we are about to call statically, at compile time. The same thing applies to our own private methods: they are also statically resolvable.

How did we figure out the thing about the Coder0.staticWork() body? Well, the original assembly dump will say that mov 0xc(%r12,%r11,8),%eax is "arraylength" in disguise, but what does it actually mean? What you are looking at is the "decoding" of the compressed reference, merged with the actual read. That instruction is the equivalent of loading %r12 + %r11*8 + 0xc, where %r12 is the base address, %r11 is the compressed address of data, 8 is the compressed references multiplier, and 0xc is the offset of the length field within the byte[] array. The presence of these rich addressing modes in current CPUs is one of the reasons why compressed references implementations provide a bearable generated code overhead. When in doubt, you can always dump the assembly with compressed references disabled (-XX:-UseCompressedOops).

Another interesting question is about null checks. The Java spec requires checking the byte[] instance for null before polling its length. We know the length is stored within the byte[] instance anyway, so we need a non-null instance in the generated code. Can you spot the null check in the assembly above? You can’t, because there is no explicit null check there. So, how would the runtime fulfill its specification obligations, or at least not crash? The hint is in the full assembly dump, which will say "implicit exception, dispatches to $addr" against the mov instruction. De-referencing the null pointer will cause the CPU to raise a SEGV signal to the process. Luckily, the VM has a SEGV handler armed that can handle this, throw an appropriate NullPointerException, transfer the control back, etc. Since the SEGV event has a return address, the VM knows at which point in the generated code the null check had fired. This optimization greatly improves the performance of the ubiquitous null checks guaranteed by Java.
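At the Java level, the behavior being guaranteed is nothing more than the following; the implicit-exception machinery lets the VM implement it without any explicit compare in the hot path:

```java
// The specified behavior the implicit null check implements:
class Payload {
    static int arrayLength(byte[] data) {
        return data.length;  // throws NullPointerException when data == null
    }
}
```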

The only remaining part, other than the %rbp (frame pointer) handling, is the test %eax,0x15f7c4e0(%rip). This seems cryptic, since the status flags this instruction sets are not used. Why is this instruction even there? If you look into the raw assembly dump, you will see a "# poll" comment against that line, and this should give you a hint. In short, this is a safepoint poll, a part of the VM-Code interface. The VM maintains a special memory page that has the READ permissions, and when it needs the generated code to report back as soon as possible, it drops the privileges to NONE, causing a trap when the test instruction tries to access the page, thus transferring the control to the trap handler, and then to the VM. You can read more about safepoint mechanics elsewhere, e.g. in Nitsan’s "Where is my safepoint?".

In human words: it loads the data field, does the null check for abstractCoder (because we are about to call a virtual method on it), and then we have the inlined body of Coder0.abstractWork(). The cost of that null check is what we see as the difference against a plain static call.

How did we figure out the thing about null checking? Well, we already know %r12 is a compressed references base address, see the note above. We also know that %rsi is the reference to Data, and we are pulling the field at offset 0x18 from there. You may want to cross-check the object layout with JOL to make sure. Combining these two observations, we figure this compares the object reference to the base address, in other words, checks it for nullity.

One can wonder: what happens if we pulled a different coder from the field and not Coder0? Would we then proceed to the inlined body of Coder0, even though our original program is definitely saying otherwise? The answer is: you are looking at the result of a speculative optimization. VM knows there is only a single subclass of AbstractCoder that could possibly be here (knowledge coming from the Class Hierarchy Analysis, or CHA), and therefore we may skip checking the type. If our tests had a more complex hierarchy down the AbstractCoder, the VM would have to assume something else could enter here. In fact, we will see that once we start to deal with multiple types of coders.

Another puzzling thing about this: Java supports dynamically loaded classes! What happens if we compile the current code speculatively, but then some bastards receive another subclass of AbstractCoder over the network and load it? Everything breaks? The answer is: runtime can detect when the conditions under which the code was compiled are invalidated, and discard the compiled code. This is where runtimes shine: they control the code, they control the environment where the code is run, and therefore can optimize speculatively. Remember the note about the safepoints above? This is one of the things that help VM to quickly get the control back from the generated code when needed.

Notice here that the CHA is not playing on our side. Granted, we could have done the same analysis for interfaces, figured out the interface has only a single implementer, etc., but the reality is that real interfaces usually have lots of implementers, and so implementing and supporting this in the VM does not pass a reasonable cost-benefit analysis.

Therefore, the compiled code has to check the coder type, and this explains the performance difference against the simple static case. It is also costlier than null-checking the coder instance, since we need to pull and compare the actual classword.

How did we figure out the classword is being loaded? Remember the note above: we know that we are reading something at offset 0x8 from a compressed address in %r8. %r8 is a Coder reference, and so we pull the classword from offset 0x8 within the object itself. We know that is where the classword lies.

VM does the type checks by comparing the class word with a pre-compiled constant. That constant is actually the canonical address of native VM structure that mirrors the class. Since it’s canonical, and also because VM takes care of not moving it, we can compile the needed address right into the generated code, reducing these checks to just pointer comparison.
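In Java terms, the speculative monomorphic dispatch the compiler produces behaves like the following sketch (names are assumed; the real generated code compares the classword against the compiled-in constant, which is exactly one pointer comparison):

```java
// Conceptual shape of a monomorphic speculative dispatch.
interface Coder {
    int work(byte[] data);
}

class Coder0 implements Coder {
    public int work(byte[] data) { return data.length; }
}

class Dispatch {
    static int call(Coder c, byte[] data) {
        if (c.getClass() == Coder0.class) {   // single pointer comparison
            return data.length;               // inlined Coder0.work body
        }
        // "uncommon trap": deoptimize and fall back to the generic path
        return c.work(data);
    }
}
```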

3.2. C1

As noted above, the generated code passes through a few tiers of compilation before it reaches the heavily-optimized version. Sometimes what you transiently have before reaching the optimal performance also matters; this is why we should also look into C1, telling the VM to stop the tiered pipeline at level 1, with -XX:TieredStopAtLevel=1.

Ideally, we would want to limit the tiered compilation level only for a specific benchmark method, but it is simpler to run the entire code on lower level. This is valid since we don’t compare the performance of C1 vs C2, but merely study how a concrete compiler behaves.

These cases are again remarkably simple, because we know the call target exactly, as we did in the same C2 case.

The compressed references packing/unpacking code may look like this as well. Notice it is almost the same as in mov 0x10(%rsi),%eax; mov 0xc(%r12,%eax,8) when %r12 is 0. shl $0x3, <anything> instructions are the markers of compressed references decoding going on, if there are no other reasons to do the shifts.

The minute differences in code generation between the compilers are not a surprise. What’s surprising is that the C2-ish instruction choice is marginally slower than the C1-ish instruction choice:

C2: 29.575 ±(99.9%) 1.116 us/op
C1: 26.347 ±(99.9%) 0.178 us/op

Following up on this difference will be interesting, but outside the scope of this work. This observation is another example of the importance of very careful comparisons for the code coming through different toolchains. Read "Java vs. Scala: Divided We Fail" for a funnier example of the same effect.

First of all, you have to notice the abstractWork method is not inlined. This is because C1 is not able to figure out the static target for the call: the call target is not trivial, and the CHA is not working for abstract classes here.

One can untangle that as follows: run with -XX:+PrintCompilation -XX:+PrintInlining. For this method in a C1 compile, we will see this "cryptic" message:

Now, if you search the HotSpot source for "no static binding", you will find a relevant part of C1 GraphBuilder, which implies the inlining of this method will happen if it’s static/private/dynamic, or otherwise final. Placing the final modifier over Coder0.abstractWork would not help here, because compiler tries to verify if AbstractCoder.abstractWork is final or not. CHA might have saved us here, but the block above explicitly rules it out (the reason for that is unknown to me).

The interface call is inlined more or less nicely, but there is a significant dance for type checking. The largest part of this method seems to wait on cmp 0x38(%rdx),%rcx and the associated jumps, although it’s hard to tell what the problem is exactly: it may be a cache miss on getting 0x38(%rdx) (though unlikely in a nanobenchmark), or some weird pipeline effect with two jumps following each other. Either way, the jmpq there seems suspicious, since we might as well fall through. This could be fixed with another pass of a peephole optimizer that could at least nop out the jmpq, but remember we are talking about C1, whose goal is to produce code fast, even at the expense of its quality.

How did we figure cmp 0x38(%rdx),%rcx is a type check? First, we know that before that instruction the %rdx register contained the classword, because we loaded it from offset 0x8 from the object itself, see the notes above. We also know %rdx is then the address of some native data structure in the VM. Now, figuring out the layout of that native structure is somewhat hard. As a bullet-proof solution, you would employ the Serviceability Agent for a task like this. For our study, it is enough to figure out that we compare the actual class with an expected class, hence this is a type check.

3.3. Interpreter

The interpreter tests are interesting because you don’t expect an optimizing compiler to come and save you. That’s certainly true, and if you run the tests with -Xint, you will see something like this:

It would seem that the interpreter is so slow that the cost of the different calls is drowned out there. However, if you profile the interpreter with something like JMH perfasm, you will see that we spend most of the time doing stack banging. Working around that with -Xint -XX:StackShadowPages=1 gives a nice performance boost:

Meddling with -XX:StackShadowPages may undermine the JVM ability to deal with stack overflow errors. Do not use this blindly in production. Read more about stack overflows here. We are reducing the number of shadow pages here to magnify the effect we are after.

As in the C2 case, the static cases are faster, and the interface case is slower than the abstract one. Let us look into…​ the generated code! We won’t spend much time here, just outline a few key things.

As unusual as it sounds, our interpreter generates code. For each bytecode instruction it generates one or more platform-specific stubs that together implement the abstract bytecode-executing machine. With Futamura projections, one can easily argue this is actually a template compiler. The established term for this implementation is "template interpreter" though, because it does interpret the bytecode in a (weird, but) straight-forward way. The debate about what to call such an implementation further underlines that the line separating an interpreter and a compiler is very thin, if it exists at all.

The assembly for the stubs is "fun" reading, and we will not go there. Instead, we will do a differential analysis, since the stubs are roughly the same for all the cases, and only their time distributions differ.

It is different from the previous one in the sense that invokestatic is gone, replaced by invokevirtual for our abstract call, and we also have the throw exception stub, which is there to handle possible NPEs. This difference explains why static_Ref is faster, for the same reason it is faster in the compiled versions. If the call receiver is known, everything gets easier.

dynamic_Interface_Ref is roughly the same, but with an invokeinterface stub:

3.4. Discussion I

Looking at the case of a single coder, we can already spot a few things:

If you know a call receiver exactly at development/compile time, it is a good idea to call it statically. This practice actually coincides with good engineering: if the method behavior does not depend on the receiver state, it might as well be static to begin with.

C2 will try to guess the receiver based on profile, but it has to do the defensive type checks and/or null checks, as mandated by the language specification. While this is more than enough in a vast majority of cases, the inherent costs of these mandatory checks are sometimes visible on a very high magnification.

C1 is sloppy when it comes to exploiting the static type information, as well as profile. This may affect warmup times and time-to-performance. It’s an open question if you can implement the C2-like smart moves, without compromising the compilation time.

To reiterate, the method invocation performance is not a concern most of the time, and the VM optimizations are actually closing the gap very tightly. We will see what happens when they can’t, as we go on.

4. Two Types

Now that we learned some basic stuff about the single type calls, let’s see how runtimes handle the calls with different targets. The simplest way to achieve that in benchmark would be to modify our single coder benchmark to segregate two Coders.

This further complicates benchmarking, because now we have to decide what distribution of coders to measure. Should it be 50/50? Should it be 1/42? To that end, we are going to introduce the bias parameter, which says what share Coder0 takes in the distribution, and measure the benchmarks at the 0.0, 0.1, 0.5, 0.9, and 1.0 ratios.
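A sketch of how such a biased population could be built (names assumed; the actual benchmark setup may differ):

```java
import java.util.Random;

class BiasedSetup {
    // Assign coder IDs so that Coder0 takes roughly `bias` share,
    // in randomized (unsorted) order; see the note about sorted arrays.
    static byte[] makeIds(int count, double bias, long seed) {
        Random r = new Random(seed);
        byte[] ids = new byte[count];
        for (int i = 0; i < count; i++) {
            ids[i] = (byte) ((r.nextDouble() < bias) ? 0 : 1);
        }
        return ids;
    }
}
```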

Measuring only a single bias is a benchmarking error. It has a chance of putting the benchmark in some specific point condition, which might not be what the real code is experiencing. You should try to poke at different settings to see how the benchmark responds. See more discussion in "Nanotrusting the Nanotime". Our first slab of experiments did 0.0, 0.5, and 1.0, but we figured all these cases are degenerate in one way or the other, see below. That’s why we added the somewhat more realistic 0.1 and 0.9.

4.1. Reference Tests

Again, let us "eat the elephant one bite at a time", and start from the tests that we are already familiar with, the Ref family. static_Ref is not applicable here anymore, because there is no way to select two coder implementations with static methods and nothing else (we will see how auxiliary selectors perform later). That leaves dynamic_Interface_Ref and dynamic_Abstract_Ref.

Notice a few things about these results already. First, the performance depends on the bias; we can already speculate why, but we will dive into the exact reasons a bit later. Second, the performance difference is symmetrical around 1/2 bias, which already gives you a few hints about what’s happening. Third, the abstract case and the interface case seem to perform the same.

All right, let us disassemble some stuff! Take dynamic_Interface_Ref with bias = 0.0, that is, only Coder0 instances are present in the code:

Notice anything? The code looks similar to what we had before, but this time we check the actual type of coder, and then call into runtime if we see something else, not Coder0. The VM was actually smart enough to figure the only type here is Coder0.

It is sometimes enlightening to view the type profile per call site. It’s doable with -XX:+PrintCompilation -XX:+PrintInlining (and the additional -XX:+TraceTypeProfile for older VMs). For this test, it will indeed say that only one type (Coder0) was observed:

What’s an "uncommon trap"? That’s another part of the VM-Code interface. When the compiler knows some branch/case is unlikely, it can emit a simple call back to the VM instead of generating a whole lot of never-to-be-used code. This can greatly cut compile times, as well as the generated code footprint, and provide denser hot paths, e.g. in loops. In this case, stepping on that uncommon trap may also mean our profile information about the coder being only Coder0 is outdated, and we need to re-profile and recompile.

Now we actually have the two real coder implementations. Notice, however, that the codepath leading to Coder1 is pushed out of the straight path, and the more frequent case of Coder0 is taking its place. If we were to run with bias = 0.9, the places of Coder0 and Coder1 would reverse here, because the profile says so, and the code layouter consults it.

In other words, compilers use frequency-based basic block layout, that puts the basic blocks on a straight path based on the transition frequencies between the basic blocks. That has a handy effect for machines: the most frequent code is linear, which plays nicely with instruction caches and decoders. This also has a handy effect on humans: if you look at the compare-and-branch sequence in the generated code, then the unlikely branch will probably be fanned out, and the likely branch will follow through.

Now, because the runtime profile affects the block layout, this arguably is a degenerate case, where a tiny swing towards either the Coder0 or the Coder1 case can affect the layout decision, potentially leading to a great run-to-run variance. From a performance standpoint, however, such a distribution means lots of mispredicted branches, which will affect the performance on modern CPUs.

For example, if we run under perf (JMH conveniently provides the support with -prof perf), then we will see progressively worse branch miss ratios. 8% branch miss ratio is rather bad, and the hit to IPC (Instructions Per Cycle) is enormous.

One might wonder: why wouldn’t we sort the datas array before doing the experiments, wouldn’t that help? Sure, it would help branch prediction in these tests, but this is bad benchmarking, for three reasons. First, this "lucky" coder distribution is unlikely in real life. We can live with that as long as we are doing synthetic benchmarks anyway. Second, we need to take the hardware effects into account, at least in order to estimate how the effect we are after compares with something we can’t really control.

Third, if you have a large sorted array, then chances are you will trip the profiler half-way through processing the array, and therefore skew the type profile. Imagine the array of 10K elements, first half populated with the elements of one type, and the other half with the elements of another. If compiler trips after traversing the first 7.5K, then the "observed" type profile would be 66.(6)% of type 1, and 33.(3)% of type 2, therefore obliterating the experimental setup.

dynamic_Abstract_Ref behaves almost exactly the same, and generates almost the same code, which explains why the performance is similar to dynamic_Interface_Ref.

C1 experiences the same symmetry around bias = 0.5, but now the Interface and Abstract cases are clearly different. If you disassemble both, you will see that in both cases the calls are not inlined, and look similarly to C1 version in single type case. However, there are also additional pieces of hot code, attributed to the VM stubs that do the actual vtable (virtual calls table) and itable (interface calls table) dispatches.

Note that interface selection is generally larger, because dispatching the interface call takes much more work, and this explains the performance difference between dynamic_Interface_Ref and dynamic_Abstract_Ref cases.

The interface calls are interesting in the way you have to deal with the dispatch. Virtual calls are easy in the sense that we can construct the vtables for the entire class hierarchy on the spot, and have just a single indirection off the vtable to resolve the call target. Interface calls, however, are more complicated: the call site only observes the interface klass, not the implementation class, and therefore we can’t predict at which offset in the concrete class vtable to look for the interface call target.

Therefore, we have to consult the interface table for the given object reference, figure out the vtable for that particular interface, and then call off the resolved vtable. The interface table lies at a generic location within each class, and therefore the itable (interface table) stub can just look it up there. You may learn more about the runtime representation of vtables and itables in the HotSpot source.

The difference between the invokevirtual and invokeinterface dispatch also explains why the interpreter tests for dynamic_Abstract_Ref are faster than dynamic_Interface_Ref. We will not dive further into this. Interested readers are invited to profile these scenarios if they feel these are important.

4.1.4. Discussion II.1

We have seen that the cost of dispatching across two types varies greatly between the compilers and interpreters.

The most common observable effect is the difference between invokevirtual and invokeinterface. While good engineering practice tells us to use interfaces where possible (especially functional interfaces for lambdas), the plain virtual calls can be much cheaper, since they don’t involve the complicated call target lookups. It is not an immediate concern for most of the code that would be compiled with C2, though. But both C1 and the interpreter demonstrate a measurable difference between calling via invokevirtual and invokeinterface.

C2 does an interesting profile-guided optimization based on the observed type profile. If there is only a single receiver type (that is, the call site is monomorphic), it can simply check for the predicted type, and inline the target directly. The same optimization can and will be applied if there are two receiver types observed (that is, the call site is bimorphic), at the cost of two branches.
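Conceptually, the transformation looks like this hand-written sketch. To be clear, this is Java-level pseudocode of what the compiled result behaves like, not what C2 literally emits: the real code compares class words rather than Java types, and the fallback branch is an uncommon trap that deoptimizes rather than throwing.

```java
// Hand-written sketch of what C2's bimorphic inlining conceptually does.
// Not actual compiler output: C2 compares class words, and the "else"
// branch is an uncommon trap (deoptimization), modeled here as a throw.
public class BimorphicSketch {
    public static abstract class Coder { public abstract int work(int x); }
    public static class Coder0 extends Coder { public int work(int x) { return x * 2; } }
    public static class Coder1 extends Coder { public int work(int x) { return x + 3; } }

    // The virtual call "c.work(x)" ...
    public static int virtualCall(Coder c, int x) { return c.work(x); }

    // ... conceptually becomes: check the two profiled types, and inline
    // both bodies directly, at the cost of two branches.
    public static int bimorphicInlined(Coder c, int x) {
        if (c.getClass() == Coder0.class) {
            return x * 2;                       // inlined Coder0.work
        } else if (c.getClass() == Coder1.class) {
            return x + 3;                       // inlined Coder1.work
        } else {
            throw new IllegalStateException();  // stand-in for the uncommon trap
        }
    }
}
```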

Even with aggressive bimorphic inlining, the hardware has problems with the mispredicted branches that will inevitably be present if the types are non-uniformly distributed. However, this is still better than going through the vtable/itable stubs and experiencing the same mispredicts there, as evidenced by comparing C2 vs. C1.

4.2. Cheating The Runtime

All right, now that we know the problems with generic dispatch over virtual and interface calls, can we cheat? We already know that static calls are perfectly resolved by most compilers, and even by the interpreter. The caveat is that we can’t dispatch multiple types over a static method.

Can we implement the auxiliary selectors though? That is, can we store some field in the Data class itself, and dispatch across static calls by ourselves? Let’s implement these:
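The idea looks along these lines (a reconstruction; the exact field and method names are my guesses at the shape of the actual benchmark code):

```java
// A reconstruction of the manual-dispatch idea: store the selector in the
// Data object itself, and dispatch over static calls by hand.
// Names are guesses, not the actual benchmark source.
public class ManualDispatch {
    public static class Data {
        public final boolean flag;  // auxiliary selector stored in the data itself
        public Data(boolean flag) { this.flag = flag; }
    }

    public static class Coder0 { public static int work(Data d) { return 0; } }
    public static class Coder1 { public static int work(Data d) { return 1; } }

    // No vtable/itable involved: a plain field test selects a static target.
    public static int static_Bool_ifElse(Data d) {
        if (d.flag) {
            return Coder1.work(d);
        } else {
            return Coder0.work(d);
        }
    }
}
```

The same idea works with a byte ID instead of a boolean once there are more than two coders.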

Notice how similar it is to monomorphic inlining! We even have the uncommon trap on the cold branch. The only difference against the dynamic_Abstract_Ref case is that we don’t mess with comparing the classword, which saves some instructions. This explains the minuscule improvement.

…​but this is the real cheat: instead of comparing the classword twice, we compare the flag once, and select either alternative. This explains the significant improvement over the C2 bimorphic inlining. But even in this case, the branch misprediction costs produce an asymmetry across different biases; bias = 0.5, obviously, has the largest chance of mispredicted branches.

C1 does not emit the uncommon traps, and generates both branches. The frequency-based basic block layout still moves the most frequent code path higher. Note how both Coder0 and Coder1 calls are inlined.

This code looks more aesthetically pleasing to my eyes: there is only a single test, and two complete branches. It would be even better if global code motion moved the getfield of $data before the branch, but again, the goal for C1 is to produce the code fast.

Argh, but this trick does not work for the interpreter. The reason is actually quite simple: the additional accesses to the static coder instances cost something.

4.2.4. Discussion II.2

As we predicted, cheating the VM by dispatching the work manually works:

In the C2 case, we save a few bytes in the instruction stream by not comparing the classword, and generally produce denser code. It is cheating, since the compiler has to provide the capabilities for de-optimization, handle other types if any come in the future, etc., while our own code ignores all those concerns. That is, it is not the compiler's fault that it produces less optimal code; it just has to be more generic.

In the C1 case, we enable the inlining of target methods by using static targets. The dispatch code compiles into something similar to what the C2 version is doing. Therefore, performance across C1 and C2 is more consistent here.

Once again, these optimizations should be employed only where the peak performance is needed. The code otherwise produced by C2 is rather nice.

4.3. Cheated By The Runtime

All right, but what about combining the auxiliary selector and dynamic cases? Surely we will set ourselves up for better inlining decisions if we do dynamic calls off the static final constants? This is where it starts to get funny. Let’s take C2 again, and run one of the dynamic_…​ tests with a trick:
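The trick is to select between static final instances first, and only then make the dynamic call. A sketch of the idea (assuming this shape for the benchmark; the names are mine):

```java
// Sketch of the "dynamic call off static final constants" trick.
// The hope: since CODER0/CODER1 are compile-time constants, the JIT
// should see through them and resolve the call targets exactly.
public class StaticFinalTrick {
    public interface Coder { int work(); }
    public static class Coder0 implements Coder { public int work() { return 0; } }
    public static class Coder1 implements Coder { public int work() { return 1; } }

    public static final Coder CODER0 = new Coder0();
    public static final Coder CODER1 = new Coder1();

    public static int dynamic_Interface_Ref(boolean flag) {
        Coder c = flag ? CODER0 : CODER1;  // select the constant by flag...
        return c.work();                   // ...then make the dynamic call
    }
}
```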

In fact, it is almost the same code that static_Bool_ifElse generates! Dumb performance tests like this will delude you into believing this trick costs you nothing when a sufficiently smart compiler is around. It becomes obvious that you shouldn’t rely on this once you try a non-trivial bias, like bias = 0.5:

See what happened? We do double work: first, we select the static field based on the flag, as our Java code does; then, we do the bimorphic inlining based on the actual type. This happens regardless of the fact that the actual type could be inferred from the static final constant itself. The trick does not work, and it actually degrades performance compared to the dynamic_Ref case.

4.4. Emulating The Inlining

An impatient reader may wonder what happens if we emulate the type checks with explicit instanceof checks. We can easily add a few more tests that dispatch like that, for example:
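A sketch along the lines of the actual benchmarks (names are mine; since only two coder types exist in the benchmark, the else branch may safely assume the other type):

```java
// Emulating the compiler's typecheck with explicit instanceof.
// Only two coder types exist, so the else branch can assume Coder1.
public class InstanceOfDispatch {
    public static abstract class Coder { public abstract int work(); }
    public static class Coder0 extends Coder {
        public int work() { return 0; }
        public static int staticWork() { return 0; }
    }
    public static class Coder1 extends Coder {
        public int work() { return 1; }
        public static int staticWork() { return 1; }
    }

    // Check the type, then make the (now provably monomorphic) dynamic call...
    public static int dynamic_Abstract_instanceOf(Coder c) {
        if (c instanceof Coder0) {
            return ((Coder0) c).work();
        } else {
            return ((Coder1) c).work();  // only two types in the benchmark
        }
    }

    // ...or sidestep the dynamic call entirely with static targets.
    public static int static_Abstract_instanceOf(Coder c) {
        if (c instanceof Coder0) {
            return Coder0.staticWork();
        } else {
            return Coder1.staticWork();
        }
    }
}
```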

Of course, we don’t know beforehand whether the performance is affected by instanceof-ing against the abstract class or the interface, and therefore we need to check both. Similarly, we don’t know if calling the dynamic method after the instanceof check will be nicely optimized, so we have to try calling the static method as well.

Not only are these pieces of code exactly the same, they are also equivalent to letting C2 do the bimorphic inlining on its own! Let us remember this low-level trick, as we will reuse it later when we deal with several types.

This is possible since instanceof is routinely handled by compilers, and is parsed into the same typecheck the compiler would emit itself. It is not surprising, then, to see the VM generate exactly the same code.

C1 responds to this hack less candidly. The dynamic cases are generally slower, because there is still a virtual/interface call involved that C1 is not able to inline. The static cases are generally faster, because they avoid the virtual/interface calls. Still, the static cases are not on par with the clean "by ID" selectors, since the generated code is much less optimized. Interested readers are invited to disassemble and study it.

4.5. Discussion II

Looking at the case of two coders, we can spot a few things:

The usual coding practice of separating distinct behavior into distinct (sub)classes or interface implementations is nicely optimized by C2. C1 has some problems with efficiently inlining the bimorphic cases, but thankfully users will only observe this behavior transiently, until C2 kicks in.

In some cases, when you know the call targets exactly, it might be a good idea to dispatch statically across a few implementations, in order to avoid dealing with compiler optimizations that still have some overhead left in the generated code. Again, doing so will probably yield a non-generic solution, and that’s one of the reasons it could be faster. The compilers should provide generic implementations, and hence can help only that much.

The performance costs we observed, especially in an optimized case, are related to hardware branch (mis)predictions. Even the most optimized code still ought to have branches. It may look like there is no reason to inline the virtual calls then, but that’s a shortsighted view. The inlining actually broadens the scope of other optimizations, and that alone is, in many cases, enough reason to inline. With very thin nanobenchmarks, that advantage cannot be properly quantified.

5. Three Types And Beyond

With three coder types, the coder distribution makes for an even more challenging task. The easiest solution would seem to be running with different "parts" per coder type.
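For reference, a sketch of the three-coder "by ID" selector we will keep referring to below (the shape is extrapolated from the two-coder case; names are mine):

```java
// Three-way manual selector over static targets, extrapolated from the
// two-coder static_Bool_ifElse case; a byte ID replaces the boolean flag.
public class ThreeWayDispatch {
    public static class Data {
        public final byte id;  // 0, 1, or 2: selects the coder
        public Data(byte id) { this.id = id; }
    }
    public static class Coder0 { public static int work(Data d) { return 0; } }
    public static class Coder1 { public static int work(Data d) { return 1; } }
    public static class Coder2 { public static int work(Data d) { return 2; } }

    public static int static_ID_ifElse(Data d) {
        if (d.id == 0) {
            return Coder0.work(d);
        } else if (d.id == 1) {
            return Coder1.work(d);
        } else {
            return Coder2.work(d);
        }
    }
}
```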

5.1. Monomorphic cases

Remember from the previous parts that compilers make inlining decisions based on the profile. When there is only a single observed type per call site (i.e. the call site is "monomorphic"), or two observed types per call site (i.e. the call site is "bimorphic"), at least C2 is able to optimize the dispatch. Before going further, it makes sense to verify this still holds true for our three-type benchmarks; this way we confirm that our knowledge from the previous part carries over to this test.

Everything adds up so far. We have already covered why the static cases are slightly faster than the dynamic cases.

It is interesting to note that even though the …​ifElse cases produce a static branch layout, whose performance should theoretically be affected by which coder dominates, it is not affected in practice. For example, static_ID_ifElse with p1/p2/p3 = 0/0/1 looks like this:

The same picture is painted in the C1 case. We already know why the static cases are slightly faster than the dynamic cases; see the cases for single and two types. Here, however, the position at which the type is tested seems to matter a bit.

The instruction selection is a bit different, as one could expect from different compilers. This may explain the performance difference between the compiled code versions, and why the C1 version reacts to how many branches the code takes.

My own speculation is this: cmp $0x0, %eax is actually slightly slower than test %eax, %eax. These tiny micro-architectural differences are sometimes what nano-benchmarks are all about.

5.2. Bimorphic cases

Again, nothing new here. We already know that C1 is not able to do proper bimorphic inlining, and we are stuck with non-optimal code. Our custom dispatch enables the inlining, and avoids the VM dispatch.

5.3. Megamorphic cases

Now, when there are more than two types, we are stepping into something new. To make matters interesting, we measure three configurations: types evenly distributed (1/1/1), the first type taking 90% of all instances (18/1/1), and the first type taking 95% of all instances (38/1/1).

static cases get better when the runs are biased towards one type, due to better branch prediction. This difference provides an estimate of what one should expect from an optimized virtual call. (We additionally verified that exactly the same result is produced when we bias the distribution towards any other type, not only the first.)

When the types are evenly distributed, we get a severe performance hit in dynamic_…​ cases. This is because HotSpot thinks the call site now has too many receiver types; in other words, the call site is megamorphic. Current C2 does not do megamorphic inlining at all.

…​with the exception of one case, when there is a clear winner, claiming >90% of type profile.

Why have we tried 38/1/1? This distribution allocates 95% of all targets to a single coder, and we know that the C2 implementation sets the threshold for this optimistic inlining via TypeProfileMajorReceiverPercent at 90%.

Briefly following this up, disassembling dynamic_Abstract_Ref at 18/1/1 yields:

static cases behave the same as in the C2 case. dynamic cases are not inlined regardless of the profile information: we have seen that C1 does not inline even the bimorphic calls, and naturally it does not inline the megamorphic ones either. The difference between the distributions is explained by branch prediction in either the VM stubs, or our own static dispatchers.

5.4. Cheating The Runtime

We have already cheated with static cases in this section, but remember the instanceof trick. We know that doing manual instanceof is akin to doing the VM work. Can we play God here, and peel the first coder under the instanceof check? In other words, do something like this:
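Something like this sketch (assuming Coder0 is the hottest type; names are mine, extrapolated from the earlier instanceof examples):

```java
// Peel the hottest type off with a manual instanceof check. The remaining
// call site then only ever observes Coder1 and Coder2, i.e. it becomes
// bimorphic again and eligible for profile-guided inlining.
public class PeelFirstCoder {
    public static abstract class Coder { public abstract int work(); }
    public static class Coder0 extends Coder { public int work() { return 0; } }
    public static class Coder1 extends Coder { public int work() { return 1; } }
    public static class Coder2 extends Coder { public int work() { return 2; } }

    public static int peeled(Coder c) {
        if (c instanceof Coder0) {
            return ((Coder0) c).work();  // monomorphic: can be inlined directly
        } else {
            return c.work();             // effectively bimorphic: two-type profile
        }
    }
}
```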

What do we see here? Well, it turns out that in the truly megamorphic case of 1/1/1, peeling the first coder is clearly profitable. It is profitable with the skewed distribution of 18/1/1 as well. And it stops being profitable at the very skewed distribution of 38/1/1. Can you already guess why? You should have learned enough by now to provide a sound hypothesis. If you can, then my work here is done.

See the magic here? The call on the else branch inlined both coders. Note this works even if we leave the virtual/interface call under the check, because now we have two call sites, each with a distinct type profile. The first call site is inlined as monomorphic, and its type check is merged with the original instanceof. The second call site is now effectively bimorphic, and enjoys the bimorphic inlining.

Unfortunately, this trick only helps so much on C1. In the absence of profile-assisted inlining, there is no relief from reducing the type profile to two types. Again, interested readers are welcome to disassemble this case if they want.

5.5. Discussion III

Looking at the case of three coders, we can spot a few things:

Even if the static analysis shows there are three possible call targets, compilers can use the dynamic type profile to speculatively optimize for the exact receivers. C2 does this routinely in monomorphic and bimorphic cases, and sometimes in megamorphic cases, when there is a clear winner. C1 does not inline megamorphic calls.

The (absence of) megamorphic inlining hurts at least the nano-benchmarks. There seem to be three ways to avoid it: either peel off the hottest type with manual checks, or decrease TypeProfileMajorReceiverPercent to let the VM figure out and inline the most frequent target, or do static dispatch over the known targets. Megamorphic inlining would need to be addressed one way or another on the VM side, since this potentially affects lots of modern workloads.

Conclusion

DO NOT TRY TO OPTIMIZE LOW-LEVEL STUFF UNLESS YOU REALLY HAVE TO.

PERFORMANCE ADVICE MAY BE HAZARDOUS FOR GENERAL PERFORMANCE, MAINTAINABILITY, AND SANITY. IT IS ALSO KNOWN TO HAVE A VERY LIMITED SHELF LIFE, AND CAUSE HEADACHES WHEN NOT USED CAREFULLY. SEEK YOUR PERFORMANCE ENGINEER’S APPROVAL BEFORE USING.

You should not normally bother with method dispatch performance. The optimizing compiler that produces the final code for the hottest methods is able to inline most virtual calls. The remaining points apply only once you have identified method dispatch/inlining as your actual performance problem.

You should actually care about the inlineability of the target method, e.g. its size, modifiers, etc. In this post, we have ignored this aspect, since we were using tiny, trivial methods. However, if the target method cannot be successfully devirtualized, the inlining will not happen at all.

Class Hierarchy Analysis is able to statically figure out that there is only a single subclass of a given class, even in the absence of profile information. Take care when you are adding more subclasses to an otherwise lone super-class in the hierarchy: CHA can fail then, and invalidate your previous performance assumptions.

When CHA fails, C1 inlining for virtual and interface calls also fails, since the type profile is not available.

Even when CHA fails, monomorphic and bimorphic calls are routinely inlined by C2. Morphicity is derived from the runtime type profile collected by the interpreter or C1.

Megamorphic calls are very bad; neither C1 nor C2 can inline those. At this point, there seem to be three ways to avoid them: either peel off the hottest type with manual checks, or decrease TypeProfileMajorReceiverPercent to let the VM figure out and inline the most frequent target, or do static dispatch over the known targets. The VM might do better for these cases in the future, though.

If you have a simple method implementation that does not depend on instance state, it is a good idea to make it static, for both maintainability and performance reasons. Carrying around the class instance just to make a virtual call off it makes life harder for the runtime.

When you need peak performance for method dispatch, you may choose to manually dispatch over the static implementations. This goes against the usual development practice, but then quite a few low-level performance hacks take the same diversion route. This is what we are going to do in the String Compression work, especially given that String is already final, and adding a reference field to String is worse for footprint than adding a byte ID.

If you are choosing between interfaces and abstract classes, interfaces should not be your choice. Unoptimized interface calls are a burden. But, if you have to care about this difference, it means the profile-based de-virtualization and inlining did not happen, and you are probably already screwed.

Mispredicted branches are the killer. If you want to rigorously quantify the effects of low-level hacks like this, you have to consider trying different source data mixes to exploit different branch prediction behaviors, or choose the most unfavorable one pessimistically.