Post by Dr. Cliff Click

04/04/2011

Function inlining in JVMs is a solved problem, right? It’s a key performance optimization routinely done by JITs everywhere (some might say: THE key optimization). Inlining has more than a decade of fine tuning in the Java context and over 40 years of production experience in the compilers and systems before that. JITs routinely inline large functions (thousands of bytecodes), inline nested functions many layers deep, and even inline functions which are only known dynamically – based on type profiling information. JITs may use Class Hierarchy Analysis or aggressive type analysis to decide the possible targets of a function call, they might inline multiple targets and use a runtime test to decide the correct flavor, and/or have multiple fall-back positions if the actual runtime types don’t match the profiled types. Non-final functions are routinely inlined, and the JVM Does The Right Thing if the function is later overridden. With all this going on, what’s “The Problem” with inlining?

“The Problem” is simply this: new languages on the JVM (e.g. JRuby) and new programming paradigms (e.g. Fork Join) have exposed a weakness in the current crop of inlining heuristics. Inlining is not happening at a crucial point in hot code exposed by these languages, and the lack of inlining is hurting performance in a major way. AND the inlining isn’t happening because The Problem is a hard one to solve (i.e. it’s not the case that we’ll wave our wands, do a quick fix & rebuild of HotSpot, and the problem will go away). John Rose, Doug Lea and I all agree that it’s a Major Problem facing JVMs, with long-term and far-reaching implications.

The Problem is getting the right amount of context in hot inner loops – which also contain a megamorphic virtual call in the loop and not much else. Let me give you an example. Here’s a simple loop doing “bitblit”: mashing two rectangles of bits (an image) together using some function chosen before the loop, in this case OR’ing two rectangles together, essentially merging the images. This is a function done constantly by all graphics libraries. Even as you read this in your browser, the GUI is busy merging rectangles of bits as you scroll the blog around.
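A minimal sketch of such a loop, assuming the image is a plain int[] and that or_word is a small static helper. The post names or_word, but the signatures, the class name SimpleBlit and the blit method are guesses made for illustration:

```java
public class SimpleBlit {
    // Hypothetical helper: the post names or_word, the signature is assumed.
    static int or_word(int a, int b) {
        return a | b;
    }

    // OR every word of src into dst. Assumes src is at least as long as dst.
    static void blit(int[] dst, int[] src) {
        for (int i = 0; i < dst.length; i++) {
            dst[i] = or_word(dst[i], src[i]);
        }
    }

    public static void main(String[] args) {
        int[] dst = {0b0011, 0b0101};
        int[] src = {0b0100, 0b0101};
        blit(dst, src);
        System.out.println(dst[0] + " " + dst[1]); // prints "7 5"
    }
}
```

Because or_word is a static call with a single known target, a JIT can inline it and treat the body as a plain loop over two arrays.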

Inlining the function “or_word” is crucial to performance here. Without inlining the compiler doesn’t know what the loop body does (because function calls can in general do anything), and with inlining it can understand the entire function completely – and then see it’s a simple loop around a stream of array references. At this point the JIT can do range-check elimination, loop unrolling, and prefetching – leading to an easy 10x speedup over the not-inlined loop. We are so used to the performance level, we’re not realizing that lots of optimizations have to happen “just right” to make this go fast.

At this point it’s a no-brainer: inline, optimize and performance is good! Baseball, apple pie, Mom, what’s the problem? The Problem is that there are multiple variations on “or_word” AND the wrapping iterator gets complex. It’s the product of these two parts getting complicated that makes The Problem.

Suppose the wrapping blit iterator gets complex – it’s walking all the bits in a canvas of partially obscured rectangular images, and each image is a collection of 24 bits worth of pixels and is possibly boxed or bounded by some other rectangles.

Or imagine this function is really the iterator over a Fork-Join style parallel operator, doing a recursive-descent breakdown of the Images – with decisions being made about how many CPUs, whether to split the current work pile, or do it all on one Thread, or join finished work bits together. Such iterators might get really complex.

Now on top of the complex iterator we want to do something other than “or_word”. We might also want to “and_word” (good for masking out images), “xor_word” (merging images), “scale_word”, “filter” and so on. What we really want is a way to pass in the function to apply on the basic data bits in the innermost loop of our complicated iterator. In Java we often do this with either a Callable or a Runnable:
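A hedged sketch of the function-passing version. The post refers to fcn2arg.call and the comments mention a CallTwoArg class; the interface shape below, the class name GenericBlit and the trivial loop standing in for the complex iterator are all assumptions:

```java
public class GenericBlit {
    // Assumed shape of the function object behind "fcn2arg.call".
    interface CallTwoArg {
        int call(int a, int b);
    }

    // Each lambda becomes its own implementation class at run time.
    static final CallTwoArg OR_WORD  = (a, b) -> a | b;
    static final CallTwoArg AND_WORD = (a, b) -> a & b;
    static final CallTwoArg XOR_WORD = (a, b) -> a ^ b;

    // Stand-in for the large, complicated iterator: one copy serves all functions.
    static void blit(CallTwoArg fcn2arg, int[] dst, int[] src) {
        for (int i = 0; i < dst.length; i++) {
            dst[i] = fcn2arg.call(dst[i], src[i]); // call site sees many receiver types
        }
    }

    public static void main(String[] args) {
        int[] src = {0b0110, 0b0101};
        int[] image = {0b0011, 0b1100};
        blit(OR_WORD, image, src);
        blit(AND_WORD, image, src);
        blit(XOR_WORD, image, src);
        System.out.println(image[0] + " " + image[1]);
    }
}
```

Once the program has pushed several different implementations through blit, the single fcn2arg.call site has many observed receiver types.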

Great use of Abstraction! We need only 1 copy of our large complicated iterator, and we can do all kinds of things with Images. Alas, that inner loop now contains a function call that needs inlining… and there are dozens of different functions for “fcn2arg.call”. The JIT does not know which one to inline here – because all of those dozens of different functions are called at various times. Typically then the JIT does not inline any of them, and instead opts for a virtual call. Alas, while the virtual call itself isn’t too bad, the lack of knowledge of what goes on inside the virtual call prevents all kinds of crucial optimizations: loop unrolling, range-check elimination, all kinds of prefetching and alias analyses. In short, we lose our 10x speedup.

How do we get it back? One way might be to make the inner function call a static (final) call – then the JIT can inline and voila! apple pie again! Of course, if we do that we need an iterator for the “or_word” version, and one for the “and_word” version and one for the “xor_word” version and… we need a lot of them. Worse: we need a new one for each new function we can think of; we can’t just name them all up front. So we’ll end up with a ton of these iterators each with a custom call in the inner loop, plus some way to generically make more of them. Too many iterators and we start blowing out the i-cache on our CPU (and that will cost us 10x by itself), and besides it’s a pain to maintain dozens of cloned complex iterators. Blanket inlining is not the answer.

We don’t really need to clone all those complex iterators – we really only need to clone the hottest inner loop parts, not all the complex wrappers around the lowest levels of looping. And we don’t really need to clone for all possible functions, just the ones that are getting called by the current program. After all, that’s one of the reasons to JIT: you should only end up compiling the parts of your program that are really using the CPU cycles.

What we’d really like is some sort of more controlled code cloning – one that allows inlining of megamorphic function calls in hot inner loops, and does so under control of the JIT and JVM proper – which can profile and pick the loops that need cloning. John Rose, Doug Lea and I have been going ’round and ’round on the right way to do this. I have my favorite solution (coming up). I’m not going to try and explain the alternatives here – see http://groups.google.com/group/jvm-languages/browse_thread/thread/72a5f4b16eba3505 and JSR335 for some details. I’m blogging here to raise awareness and to educate people on what’s going on – because there are serious long term implications for the Java community at work here!

My take on the right way to Solve This: ask programmers to write their programs in a “megamorphic inlining friendly” coding style, and the JITs can then optimize the code.
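One way such a style might look, as a sketch only. The post names inner_blit and says the function is passed as its first argument; the CallTwoArg interface, the chunked outer loop standing in for the complex iterator, and the class name FriendlyBlit are assumptions:

```java
public class FriendlyBlit {
    // Assumed shape of the function object behind "fcn2arg.call".
    interface CallTwoArg {
        int call(int a, int b);
    }

    // Stand-in for the big complicated iterator: it only decides which
    // ranges to process and hands each range to the small inner loop.
    static void blit(CallTwoArg fcn2arg, int[] dst, int[] src, int chunk) {
        if (chunk <= 0) throw new IllegalArgumentException("chunk must be positive");
        for (int start = 0; start < dst.length; start += chunk) {
            int end = Math.min(start + chunk, dst.length);
            inner_blit(fcn2arg, dst, src, start, end);
        }
    }

    // Small hot method: a loop, array references, and a call whose receiver
    // is loop-invariant and arrives as the first argument.
    static void inner_blit(CallTwoArg fcn2arg, int[] dst, int[] src, int start, int end) {
        for (int i = start; i < end; i++) {
            dst[i] = fcn2arg.call(dst[i], src[i]);
        }
    }

    public static void main(String[] args) {
        int[] dst = {1, 2, 3, 4, 5};
        int[] src = {8, 8, 8, 8, 8};
        blit((a, b) -> a | b, dst, src, 2);
        System.out.println(java.util.Arrays.toString(dst));
    }
}
```

Nothing in this sketch makes a current JVM specialize inner_blit; it only arranges the code so that a JVM with the proposed support would have a small, cheap-to-clone method to work with.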

Basically the JIT can see a hot loop around a megamorphic call, the entire method (not counting what is getting called) is small and mostly consists of array references and a loop. The megamorphic call is loop-invariant: the target is unchanged by any loop iteration. Furthermore it’s passed in as an argument to the function. Now all the JIT has to do is compile versions of inner_blit specialized by the first argument, plus some sort of 1st-argument dispatching logic. Like all such compilations it’s driven by frequency and profiling. Since the inner_blit code is small it can be cloned without risking i-cache blow-out.

In short, we’d get a controlled amount of code cloning plus all the right levels of inlining to get our 10x performance boost back. And we can do it without changing either the JVM spec or the Java language. JVMs that do not support such specialized cloning will run the code with the same performance they always have. JVMs WITH such support will run this code much much faster… encouraging JVMs everywhere to be improved to speed up this coding style (or pass by the historical wayside).

31 Responses to “Fixing The Inlining “Problem””

But isn’t it simpler to perform guarded devirtualization for fcn2arg.call(…), and then do loop unswitching to propagate the type guard outside the loops? That will effectively specialize blit(), and no fuss with inner_blit() that way. If I$ is the concern, then limit the scope of loop unswitching up until the method size threshold is reached.
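Written out by hand, the result of guarded devirtualization followed by loop unswitching might look roughly like the sketch below. In the suggestion above the JIT would do this transformation itself; the OrWord and AndWord classes, the CallTwoArg interface and the class name UnswitchedBlit are hypothetical:

```java
public class UnswitchedBlit {
    // Assumed shape of the function object behind "fcn2arg.call".
    interface CallTwoArg {
        int call(int a, int b);
    }

    static final class OrWord implements CallTwoArg {
        public int call(int a, int b) { return a | b; }
    }

    static final class AndWord implements CallTwoArg {
        public int call(int a, int b) { return a & b; }
    }

    static void blit(CallTwoArg fcn2arg, int[] dst, int[] src) {
        // The type guard is tested once, outside the loop (unswitching).
        if (fcn2arg instanceof OrWord) {
            for (int i = 0; i < dst.length; i++) {
                dst[i] = dst[i] | src[i];   // OrWord.call inlined by hand
            }
        } else if (fcn2arg instanceof AndWord) {
            for (int i = 0; i < dst.length; i++) {
                dst[i] = dst[i] & src[i];   // AndWord.call inlined by hand
            }
        } else {
            // Fallback: unknown receiver, keep the virtual call.
            for (int i = 0; i < dst.length; i++) {
                dst[i] = fcn2arg.call(dst[i], src[i]);
            }
        }
    }

    public static void main(String[] args) {
        int[] dst = {0b0011, 0b1100};
        blit(new OrWord(), dst, new int[]{0b0110, 0b0101});
        System.out.println(dst[0] + " " + dst[1]);
    }
}
```

Note that every newly hot function adds another loop body to the same method, so the method keeps growing.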

Good call, but it doesn’t work: you have to re-compile the entire method every time a new hot fcn2arg.call appears. Since the entire method now has loop bodies for each fcn2arg, you’ll get giant functions that need recompiling ever-more-gianter as you add more fcn2arg.call’s. Each loop+fcn2arg needs to be compiled on its own, ideally, with some kind of way to switch between versions. Your “guarded devirtualization” needs to pick these functions. It might be a simple if-tree of subtype checks, but you can be more clever if you involve the runtime. A hashtable for instance.
Cliff

void foo(….){
…do something pre loop;
bar_in_the_loop(…);
…do something post loop;
}
bar_in_the_loop – the call is still virtual, but not so expensive.
And the JIT creates as many bar_in_the_loop’s as required. Each bar_in_the_loop is a loop specialization for one exact bar implementation (bar is devirtualized and inlined).
When a new bar appears, the JIT has to compile a new bar_in_the_loop, but needs to touch neither foo() nor the other bar_in_the_loop’s.
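To make the idea concrete, here is a hand-written approximation. In the proposal the JIT would generate the per-implementation loops, not the programmer. The mapping used is foo → blit, bar → call, bar_in_the_loop → callInLoop; those Java names, the CallTwoArg interface and the OrWord class are assumptions:

```java
public class LoopPerFunction {
    interface CallTwoArg {
        int call(int a, int b);

        // Plays the role of bar_in_the_loop: the loop itself is a virtual
        // method. This generic version still makes a virtual call per element.
        default void callInLoop(int[] dst, int[] src, int start, int end) {
            for (int i = start; i < end; i++) {
                dst[i] = call(dst[i], src[i]);
            }
        }
    }

    static final class OrWord implements CallTwoArg {
        public int call(int a, int b) { return a | b; }

        // Specialized loop for this exact implementation, with call() inlined.
        @Override
        public void callInLoop(int[] dst, int[] src, int start, int end) {
            for (int i = start; i < end; i++) {
                dst[i] |= src[i];
            }
        }
    }

    // Plays the role of foo(): one virtual call per loop, not per element.
    static void blit(CallTwoArg fcn2arg, int[] dst, int[] src) {
        // ...pre-loop work would go here
        fcn2arg.callInLoop(dst, src, 0, dst.length);
        // ...post-loop work would go here
    }

    public static void main(String[] args) {
        int[] dst = {0b0011, 0b1100};
        blit(new OrWord(), dst, new int[]{0b0110, 0b0101});
        System.out.println(dst[0] + " " + dst[1]);
    }
}
```

Adding a new function with its own callInLoop leaves blit and the existing specializations untouched, which is the property the comment above is after.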

Oooh – nice one. The JIT needs to decide to make bar_in_the_loop some internal hidden virtual function of the CallTwoArg klass, and make as many implementations as then appear. You only get to devirtualize on one value, but that should do it (currying would allow further devirtualizations on other arguments). There are a couple of tricky bits here: adding a virtual call late in the game messes with the selection of v-table indexes… or perhaps these v-call indexes are selected from a different name-space of indexes over the normal JVM calls. The JIT needs to decide when it’s profitable to make an “invokespecialized” call. The funny internal bar_in_the_loop methods would need to be executable without the JIT doing all the hard work – to handle various bailout corner cases. i.e., bar_in_the_loop would need some sort of bytecode executable format that the interpreter would grok. These methods would have to be hidden from Reflection & JVMTI. I’m sure I’m missing other issues. On the up-side, the dispatch rules and new-bar-extension-logic are straightforward: they follow the general path of virtual calls. I like this idea, I’m gonna have to think on this some more.

Agreed, vtable indexing may be a problem if we can’t design a scheme that uses some synthetic class hierarchy where the only method in the vtable is bar_in_the_loop. And why should we specialize on only a single method? Sometimes we want more – bar1_and_bar2_and…_barN_in_the_loop – and that case will be a mess.

About the interpreter: maybe the interpreter shouldn’t work with bar_in_the_loop at all. If we need interpretation, let’s interpret the original foo(). But in that case, when a new bar appears we have two choices: change the compilation scheme and compile the new bar_in_the_loop immediately without interpreting it first, or deoptimize foo in order to collect a new profile.

You want to profile in the interpreter and in lower compilation tiers, to help with the final compile. If using the vtable indexing scheme a new bar_in_loop doesn’t invalidate any of the older ones.
Cliff

The point of my “coding style solution” is to try and make the hot loop inside a small function, so that inlining might happen. But how much inlining is enough? In order to make the call static, it must be inlined until “fcn2arg” becomes known-to-the-compiler. If the complex iterator is really huge, this might entail vast amounts of inlining.

I always wondered if Java could make use of compiler pragmas, for such “safe” uses as optimization hints. Faith in increasingly smart JITs usually stops me from asking for that feature more loudly, but now it seems we’re hitting some walls. So, if you are willing to tell programmers to code in a way that helps the compiler, why not do it right? Patterns/idioms like the one you suggest are fragile because you make the big assumptions that a) programmers will be familiar with JIT wizardry and know all the rules of “inlining-friendly code”; b) this optimization will be portable across JVMs (what happens if HotSpot and J9 require slightly different code patterns to inline loops with megamorphic calls?); and c) performance won’t go away after simple maintenance to the code – some other programmer who didn’t know the intention of inlining adds an extra if() to the inner loop to fix some bug and bang! no more inlining.

It’s arguably better to just capitulate and offer an @Inlining annotation that the programmer puts in methods, or maybe code blocks, that he knows to be critical. Or maybe some higher-level annotation that just requests maximum optimizations… exactly which ones, like inlining or loop unrolling, are a JIT’s decision. With this annotation you also save some warmup time because it can bias a mixed-mode VM to optimize that code immediately, or at least much faster, instead of waiting thousands of interpreted or fast-compiled executions.

This reduces the problems I mention above considerably. The only required programmer skill is knowing which code is critical to optimize – and it’s reasonable to expect that, especially for high-perf code written by professionals, like a dynamic language runtime or a graphics kernel. This assumption works well in languages like C++.

And while we’re doing that, why not suggest other “smart” annotations that could really boost performance above what JITs are now capable of doing? For example, a @Packed annotation to request a class to be as “light” as possible: tighter packing of fields (including aggressive packing like stuffing many boolean/byte/short fields in a single word); smaller object header without monitor (VM can use an external monitor table if program ever tries to sync/wait/notify on object – just to not break the rule that “annotations cannot change language semantics”). This may be great for some class that will have millions of instances clogging the heap, so the shaving of a few bytes per instance may be critical. I can think of a dozen other interesting optimization annotations.

@Inlining – NOT possible to inline all those megamorphic calls, and often not profitable. Knowing *where* to put that @inlining token AND structuring your code such that it will help is at least as much a burden as what I am proposing.

J9-vs-JRockit-vs-HotSpot – I purposely picked a style which is very simple and direct. Really all I need/want is: hot megamorphic call receiver is passed in as an argument, and the body of the function isn’t “too big”. This is generic enough of a requirement to work for all JVMs, although the limits of “not too big” will probably vary.

(you mention that as a possible benefit it’ll save on warmup time. It won’t to any measurable degree)

@Packed – this is perhaps more interesting, as changing all instances of an (extremely common) class to some denser format is doable but certainly harder than just manifesting that class in the dense format from the get-go… but that’s a topic for another blog.

Sounds like the JVM needs to learn that when a loop is hot, complex and megamorphic – it should pull the loop out into a function specialised by the megamorphic function.
I.e. it should learn to convert inlining-unfriendly code into inlining-friendly code.

In fact, the JVM could simply do that unconditionally as it loads bytecode if it sees a complex iterator with a megamorphic call. The extra branch would go into the noise most of the time, and for hot loops, the JVM would have nice friendly code to optimise.

This is somewhat more difficult than having the programmer break out the hot small inner loop into a private function. It’s certainly do-able, but would definitely require more engineering. I’m asking the programmer (via a “coding style”) to do some work that’s relatively easier for him and harder for me.

But in a graphics library you won’t have more than a very few of these huge functions, and they’ll be used extensively.

Yes everyone wants the flexibility of custom shaders (or whatever), but in practice they only use a few of these functions.

I mean this is what’s done in C++ graphics libraries using templates, which essentially leads to exactly what you’re describing: specialized functions, one for each call parameter. It’s not a problem.

The problem is: you have to be able to analyze the code structure enough to figure out the fcn2arg call target at compile time. With C++ templates you (the programmer) are telling the compiler explicitly what the target is. In Java, I have to figure it out myself from context. More inlining *sometimes* gives me enough context, but often not. Usually, I would just have to insert some sort of type-specialization test after the point where fcn2arg gets loaded from some opaque location – and begin cloning from that point on. Hence I might end up cloning in the middle of a function.

Thanks for posting this. I’m not sure if it’s a coincidence, but we’re discussing The Problem on the JVM Languages list here: http://groups.google.com/group/jvm-languages/browse_thread/thread/72a5f4b16eba3505

The question I posed to the group is how to start solving this now on current JVMs. As you point out, JRuby and other JVM languages are using that function pattern more and more. In JRuby’s case, there’s both our call site logic (ping through a generic piece of code to dispatch to the right method) and closures (pass a bit of code into an iterator). Both cases would be helped by specialized cloning.

In the short term, I’ll probably be taking the approach of manually specializing in hot cases at the bytecode level. For example…I have a mode in JRuby where instead of all code calling through CachingCallSite, they instead call through small stub methods, one per call site, that lookup and dispatch to the method directly. This makes it possible to inline across a dynamic call boundary. It also has the unfortunate side effect of consuming part of the inlining budgets, so it’s not a perfect solution.

For closures, we will likely duplicate the hot closure-receiving calls near the call site, allowing the JVM to treat unique method + closure combinations as monomorphic paths. This should allow inlining of method plus closure back into the caller. I do worry about the code size issue, however; cloning an entire body of Ruby code is nontrivial.

At the end of the day, though, this is really something that JVMs should be doing. There’s more and more of this functional decomposition going on, we must fix The Problem in a suitable way for these newer languages.

Yes, but that’s the JIT compiler’s trouble, right? If the basic method is so large that megamorphic devirtualization blows up the method and the JIT suffers, that might indicate a problem with the JIT. I wonder if the JIT can actually proactively detect this condition and generate hidden “stubs” for inner loops, i.e. “fold” your CFG to make compilation easier?

Though I’m for the suggestion on general code style for Java: use smaller methods and let JIT arrange the rest for you. I’m reasoning against doing performance hacks in the code where it does not make sense (from the maintainability perspective) to do so.

I don’t think you need to force developers to write their code in a ‘megamorphic inlining friendly’ manner.
I see that as an OSR VM implementation failure.
You should have more than one OSR per method, and you should have a dispatch table/hash map that dispatches to a specialized version of the OSR code.

One thing that trace-based JITs have shown is that a loop and a function are both valid optimization entry points. So, just as you can have an inlining cache for a function at a call site, you should have a kind of inlining cache at the start of a loop.

Making loop headers a separate optimization entry point would be a Big Change for HotSpot. Certainly do-able… but certainly more JVM engineer hours spent than doing the dispatch table/hashmap just at function entry.

Cliff: Yeah, that’s what I had in mind. eachCommon would be cloned at a bytecode level on a per-call-site basis. The Block passed to it will be a unique subtype, but with a virtual method (all blocks implement “yield” to call their unique bodies of code). This should allow both eachCommon and the block itself to inline into the caller.

As for JVM gods…I meant the nameless, faceless entities that live within Hotspot or JRockit or J9. I don’t want to make Lord Hotspot angry if I start duplicating a lot of code to work around his shortcomings, but I don’t see another path forward right now…

Cliff: My intention was just giving hints to the compiler (think SQL hints); and like I said, @Inlining is probably not a great way to do it. I just want a way to explicitly tell the VM to be extra aggressive, maybe forcing higher limits for inlining budgets and other space/speed tradeoffs. I agree that good results will still depend on the method having adequate structure to allow certain opts, but even in that case, the fact that the programmer explicitly requested optimization is useful because if the JIT fails to obey that command – for example, the method contains a megamorphic call inside a hot loop and the JIT can’t inline that – then the JIT could issue a warning, explaining the reason for that failure. Today, JVMs have extensive tracing options and I can pinpoint inlining failures, but this requires analyzing extremely verbose and low-level traces.

Let’s separate out the issues here. I think that nearly always the inlining heuristics work – except when they can’t. So it’s more an issue of explaining to the power-user why performance is not what was expected. *THIS* problem, of explaining why certain optimizations did not kick in, in a way that’s useful, has remained unsolved for the past 20 years. You can turn on +PrintCompilation and +PrintInlining to see what gets compiled and inlined – and a hint on why something does not get inlined. But in general it’s a hard problem to concisely explain how a complicated thing failed to function as expected.
Cliff

On 4/6/2011 6:30 AM, Doug Lea wrote:
> On 03/28/11 10:53, Cliff Click wrote:
>> Notice my weasel-words: “specializing on a passed-in argument” instead of “specializing on a field which is loaded just prior to use”.
>
> Yes, this is by far the most common case anyway, and even when not, is easy to arrange.
>
> But even after reading your expanded account on blog, I can’t make myself stop thinking that it would be more productive to split very early (never resplitting), but then common-up shared code that doesn’t need any specialized virtual dispatch out-of-line. Which library writers could help arrange by calling out to @noInline methods or the like, but would be nicer still if automated with the help of morphicity summaries for methods and/or flow graphs nodes.
>
HotSpot has something akin to “morphicity summaries” it uses to guide inlining – but this sounds more like you are proposing “outlining”.

“Morphicity summaries” – hotspot has per-call-site receiver-klass profiling and per-method invocation summaries. We can tell if different classes are used to reach a given method at each call site, and we can tell if a given method is called from many call sites or just one.

The pattern of observing a hot megamorphic call site, in a loop with a loop-invariant receiver is easy enough to spot. The trick is: What Now? I’m still waffling between Remi’s suggestion (lean on the existing OSR mechanism) vs Sergey’s (make a new hidden virtual call).

That seems to be a perfect use case for trace based compilation. Since the trace recorder looks only at a part of a method anyway, it is not necessary for the developer to factor out the loop in a separate method. I see trace based compilation as the generalization of OSR – instead of doing special OSR handling for some hot loops in some methods, switching between unoptimized and optimized code in the middle of a method is the default with a trace based compiler.

Gal et al. described a similar switching between loop iterations that have different types in JavaScript (http://dx.doi.org/10.1145/1542476.1542528)

Trace based compilation *exactly* does this right. HotSpot-style method-based compilation has other advantages typically (easier more natural compilation boundaries – programmers typically “tidy up” their state as they cross a function boundary), but not here. So now the question in my mind becomes: do we add trace-style (well, generalized OSR) to HotSpot, or use some other technique?

As someone who’s been developing in Java full time for over a decade and knows a thing or two about compiling optimizers, I have to say that asking developers to write in a particular style is completely the wrong approach.

The “right” way to write fast code seems to change more often than most of the code I write, so I only optimize the sections where I know I’ll get real speed-up. Since my programs tend to be database-driven, micro-optimizations aren’t even measurable most of the time. When I do need micro-optimizations, I try to stick to simple heuristics like “write like it’s C” so that I’m not relying on deep JVM magic that might be obsolete next year. How was I to know (other than reading your blog) that HotSpot doesn’t currently do trace based compilation? And as an application developer, why should I need to know?

I can’t tell by quickly scanning your sample code that it’s written to be inlined, except by the comments. And I distrust comments. Eclipse’s compiler can’t read the comments, so it can’t warn me if there’s a subtlety in the code that prevents inlining. Nor is there an easy unit test that I can write that will tell me if it’s getting inlined properly. When I change the code next year and it gets 10x slower, how can I tell it was that code change and not something else?

Compiler friendly code has the same problem I have with locks: it’s impossible to tell if it’s working unless it fails catastrophically, and even then it can be really difficult to track down. So it ends up being voodoo that you take on faith and don’t test.

(If I sound jaded maybe it’s because just last week I found a serialized function in my code that needed to be static but wasn’t. It’s been buggy for years, but never got caught. I removed the lock, since it was part of a workaround for a bug that Sun fixed sometime around Java 1.3. This sort of thing happens far too often.)