Post by Dr. Cliff Click

11/22/2011

What the heck is OSR? It’s HotSpot jargon for On-Stack-Replacement – and it’s used to convert a running function’s interpreter frame into a JIT’d frame – in the middle of that method. It crops up in micro-benchmarks all the friggin’ time, but only rarely in other apps (or at least only rarely in a performance critical way). What happens is this:

The JVM starts executing some method for the first time ever, in the interpreter, e.g. main().

That method has a long-running loop that is now being executed in the interpreter

The interpreter figures out that the method is hot and triggers a normal compilation

That compilation will be used the NEXT time this method is called, but e.g. it’s main() and there is no next time

Eventually the interpreter triggers an OSR compilation. The OSR compilation is specialized to the particular bytecode at which it will be entered – typically the loop’s back-edge branch.

Eventually the OSR compilation completes.

At this point the interpreter jumps to the OSR code when it crosses the specialized entry bytecode – in the middle of the method.

Do more & shorter warmup runs. 100K iterations is fine, but loop around the ‘runTest’ 10 times during warmup. Otherwise you risk ending up testing an ‘OSR’ compile instead of a normal one.

On Mon, Nov 21, 2011 at 12:29 PM, Martin Thompson wrote:

Standard JIT’ing and On-Stack Replacement are not equal. I have been operating under the assumption that they were. Just to ensure my understanding is correct: the JVM implements a counter for each method that gets incremented when the method returns and also on branch-back within a loop. If this counter exceeds 10K on the server VM then it will trigger a compilation of the method. If the method is called again the compiled code is used. If the method continues looping beyond the 10K increments then at approximately 14K increments the method will be swapped via OSR. I do not understand why OSR can be a less efficient result but will take your word for it. So I should re-organise the code to avoid OSR and do a number of shorter warm-up runs. This is very interesting feedback for main-loop design. Can you point me at more detail on how the main loops should best be structured?

OK, Martin, here goes…

A number of shorter warm-up loops will do better, or calling the warm-up loop itself in a loop. Do not nest the warm-up loops in the same function; that defeats the purpose. It probably helps to understand what’s going on in an OSR compile.

In an OSR the code has entered some Java method IN THE INTERPRETER. It’s now spinning in some hot loop IN THE INTERPRETER. The interpreter is busy gathering backedge counts (which will soon hit 14000) and function-entry counts (will only ever be 1).

When the OSR compilation starts, we start parsing bytecodes IN THE MIDDLE OF THE HOT LOOP. For singly-nested loops the effect is to partially peel one loop iteration, then enter the loop proper. To make it clear what the problem(s) are, my examples are written closer to the bytecodes actually experienced by the JVM, as opposed to clean Java. Single-nested-loop example:
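A sketch of the single-nested-loop case, in goto-flavored pseudo-Java mirroring the bytecode (the names ‘P’, ‘A’, and ‘i’ are illustrative, not from any real program):

```
void some_function() {
  // ...setup code, executed once on method entry...
  int i = 0;
loop:
  if( P ) goto done;   // loop exit test
  ...A[i++]...;        // hot loop body
  goto loop;           // back-edge: counted by the interpreter;
                       // this bytecode is the OSR entry point
done:
  // ...post-loop code...
}
```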

Suppose the interpreter has already triggered a normal compile (after 10,000 back-edge branches) and now decides that, since this loop has run 14,000 back-edges already and we might never exit this function (and thus re-enter it via the normal compilation), we need an OSR compilation. This OSR compilation can only be validly entered at the backwards goto bytecode (and it also needs to do some horrible stack-mangling to remove the interpreter’s frame and insert its own before entering the main function proper). What program semantics does the OSR compilation have to deal with? The OSR compile “sees” a function that looks like we started executing bytecodes at that backwards goto; this function might look something like this:
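A sketch of that view, again in goto-flavored pseudo-Java (the parameter list stands in for the live interpreter state hoisted out of the interpreter frame; names are illustrative):

```
// Entered only at the back-edge bytecode; 'i' arrives with whatever
// value the interpreter had, so the compiler knows nothing about it.
void some_function_OSR( int i /*, ...other live state... */ ) {
loop:
  if( P ) goto done;
  ...A[i++]...;
  goto loop;
done:
  // ...rest of the original method, compiled as straight-line code...
}
```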

This typically optimizes fairly well. Variable ‘i’ is a well-formed array loop index (whose starting value is unknown, but it is a stride-1 increasing value bounded by ‘A.length’ and the exit test ‘P’). However, for a doubly nested loop things get ugly.

Here we (again) decide to OSR at the backedge – although there are two backedges to choose from. In practice, the inner one is typically much more frequent than the outer one, so that’s the typical OSR choice. Not that the interpreter can tell; it only tracks backward branches and has no clue about any possible looping structure. In any case, our example OSR compilation starts parsing bytecodes at the ‘goto loop2‘ … leading to this structure:
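A sketch of that parse, in the same goto-flavored pseudo-Java (‘P1’ and ‘P2’ are the original outer and inner exit tests; names are illustrative):

```
// Parsing starts at 'goto loop2', the inner back-edge. The old
// two-loop nest flattens into one big loop with 'i' reset inside it.
void some_function_OSR( int i, int j /*, ...other live state... */ ) {
loop2:
  if( P2 ) goto done2;  // original inner-loop exit: rarely taken,
  ...A[i++]...;         // but with non-zero frequency
  goto loop2;
done2:
  j++;
  if( P1 ) goto done1;  // original outer-loop exit
  i = 0;                // 'i' is reset inside the (single) OSR loop
  goto loop2;
done1:
  // ...post-loop code...
}
```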

Predicate ‘P2’ probably reports a very low but non-zero frequency (since it’s really the original inner-loop exit test). Since P2 is non-zero, variable i gets reset during the loop. Now variable i is no longer a proper array loop index. There’s no simple affine function describing i, and we can no longer range-check-eliminate A[i++] from the original inner loop. Performance of the OSR version can be as bad as 1/2 that of a normal compilation (although it’s not usually that bad… and 1/2 normal speed is still 5x interpreted speed!).

Recovering nicely nested loops from here is tough. And what happens above is the GOOD case. There’s another bytecode parsing pattern/technique HotSpot used to use here, and that one leads to IRREDUCIBLE loops… which pretty much blows all chance of any loop optimizations in any case.

I hope this little chat has given you some idea of what goes on in an OSR compile… and when OSR’s kick in. If you want to test some little function or other in a tight loop… then the Right Thing To Do is to wrap it in a loop in a function in a loop in a function. The outermost loop might OSR, but the inner-most loop(s) will be normal function entries so the inner-loop code-gen will Do The Right Thing.
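A minimal harness in that shape might look like the following sketch (the name ‘runTest’, the work inside it, and the iteration counts are placeholders, not from the post):

```java
public class Warmup {
    // The code under test lives in its own method, so each call is a
    // normal method entry and the JIT compiles it via the standard path.
    static long runTest(int iters) {
        long sum = 0;
        for (int i = 0; i < iters; i++) sum += i;   // hot inner loop
        return sum;
    }

    public static void main(String[] args) {
        // Several short warmup runs: the outer loop here may itself OSR,
        // but runTest() is entered ten times as a normal call, so its
        // inner loop gets a normal (non-OSR) compile.
        long sink = 0;
        for (int w = 0; w < 10; w++) sink += runTest(100_000);

        long t0 = System.nanoTime();
        long result = runTest(100_000);
        long t1 = System.nanoTime();
        System.out.println("result=" + result + " ns=" + (t1 - t0) + " sink=" + sink);
    }
}
```

The outermost loop in main() might OSR, but that is harmless: it is the inner loop inside runTest() whose code-gen matters, and that one is reached through normal method entries.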

3 Responses to “What the heck is OSR and why is it Bad (or Good)?”

I thought they ran into this problem in tracemonkey and found a nice way around it.

The real sticking point here seems to be the insistence on whole method compilation. OSR is much better done at the level of hot code blocks.

In particular, when the backbranches crest a certain value you simply JIT the block that begins at the target of your backbranch and ends when it hits the backbranch. Exiting the loop in any fashion is compiled as an exit to the interpreter. Supposing your loop has only a single entry point in the bytecode, it would be irrelevant whether it was nested or not. Either you compile at the inner backbranch and end up with just the inner loop JIT’d, or you compile at the outer backbranch and end up with a JIT compile that reflects the nesting.

You might end up leaving the outer loop interpreted at first, but this is no problem. Should the interpreted backbranches reach the OSR threshold again, you end up JIT’ing from the outer backbranch. Proper design lets you reuse the inner JIT’d code by removing the interpreter entry/exit code and splicing in parameter passing from the outer block. This won’t be quite as efficient as the code you might generate by JIT’ing the outer loop from scratch, as it won’t perform inter-block optimizations. For example, if the outer block performs P = P*32 and the inner block sets K = P/32, a good SSA compiler with appropriate optimizations and assumptions would move the multiplication to the end of the outer loop and eliminate the divide; but as P is basically being treated as a method parameter, no such optimization will be made.

However, a decent peephole optimizer should be able to exorcise the overhead of passing parameters to the inner block by removing store/load pairs. And ideally, intra-block optimizations (aside from register allocation) like strength reduction or arithmetic simplification would be done in the bytecode when it was produced from source. Also, for short inner loops you have the option of simply recompiling the whole thing.

—

What puzzles me is why this isn’t already done. It’s essentially equivalent to just pulling the inner block out into a separate tail recursive method. If the compiler has decent support for inlining JIT methods and associated opportunity for intraprocedural optimizations this approach comes basically for free. Yet the OSR compilation you describe seems to generate substantially less efficient code. Sure entering/exiting the interpreter is expensive but either it’s quite rare as only the inner loop is hot or it gets eliminated when the outer method gets JITed.

I guess the trouble is that Java bytecode makes the business of converting a block to a procedure very inconvenient. In SSA form it’s trivial, but if the inner loop is accessing parts of the stack, parameters and instance variables, then the method-call code would be unnecessarily ugly.

Hi Cliff,
From my understanding of HotSpot, it seems that you have sufficient information to map the execution context from a normally JIT’ed function to an interpreted version of the same code for deoptimization purposes. What makes it difficult to apply the mapping the other way to transition from the interpreted code directly to the ‘normal’ JIT compilation at a safepoint?

For “simple” loops with fixed, known termination conditions (e.g. most scientific loops; String.hashCode, etc.), there’s NO mapping kept in the loop body. Removing the maintenance of the mapping speeds up these kinds of loops a lot, typically 2x or so (for small loops) or maybe 20-30% (for larger loops). Since there’s no mapping, you can’t transit into the middle of the loop. Also, the compiler will track other values than what the interpreter directly needs or produces; jumping into the middle of such code means computing those values from the interpreter’s state, and such computations get arbitrarily complex.

Example from String.hashCode: “hash = hash*31 + ary[i];”. The interpreter has tracked ‘hash’ and ‘ary’ and ‘i’, but the JIT will often track some ‘p = A+12+i*2’ and do a ‘p += 2’ in the loop… except the loop is unrolled, and any entry point will come around only every 4th or 8th value of ‘i’. So we’d have to JIT the code; then tell the interpreter to keep executing until ‘i mod 8 == 0’; then compute ‘p = A+12+i*2’ and place it in the right register (and ‘this’, ‘hash’, ‘ary’ and ‘i’ probably in other registers) before we could start up the optimized loop.

In short, the mapping in this direction is complicated and isn’t something I’d like to record as a ‘mapping’ per se. Instead I’d just have the optimizer JIT code for the mapping… which is how the problem is solved now. Except you’d better have only 1 such loop-entry or else your loop is now ‘irreducible’ and all the classic loop optimizations no longer apply.
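To make the unrolling point concrete, here is a hand-written analogue of the two loop shapes (an illustration only, not HotSpot’s actual output; the name ‘hashJitLike’ and the 4x unroll factor are assumptions for the sketch):

```java
public class OsrMapping {
    // The loop as the interpreter sees it: 'hash', 'ary', and 'i'
    // are the only tracked state.
    static int hashInterp(char[] ary) {
        int hash = 0;
        for (int i = 0; i < ary.length; i++)
            hash = hash * 31 + ary[i];
        return hash;
    }

    // A sketch of a JIT-style 4x-unrolled form. Entering the bulk loop
    // mid-stream is only valid when i % 4 == 0, which is why mapping
    // interpreter state into this loop is hard.
    static int hashJitLike(char[] ary) {
        int hash = 0, i = 0;
        int n = ary.length & ~3;            // bulk loop: 4 chars per trip
        for (; i < n; i += 4) {
            hash = hash * 31 + ary[i];
            hash = hash * 31 + ary[i + 1];
            hash = hash * 31 + ary[i + 2];
            hash = hash * 31 + ary[i + 3];
        }
        for (; i < ary.length; i++)         // cleanup for the remainder
            hash = hash * 31 + ary[i];
        return hash;
    }
}
```

Both versions compute the same result, but only the first has a state the interpreter can hand over at an arbitrary iteration.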