May 7, 2009

Fun with micro benchmarks and optimizers

I’ve been doing some micro benchmarks comparing the apparent cost of Java vs C++ virtual function calls. While the comparison is interesting, what really threw me for a bit was the results I was getting in Java 1.6. Have a look at the first version of my test:
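A minimal sketch of what such a test looks like, assuming a simple base/derived pair; aside from getVirtual(), all names here are illustrative:

    public class VirtualCallBench {

        static class Base {
            int getVirtual() { return 1; }
        }

        static class Derived extends Base {
            int getVirtual() { return 2; }   // overrides Base.getVirtual()
        }

        public static void main(String[] args) {
            Base obj = new Derived();
            for (int run = 0; run < 10; run++) {
                long start = System.currentTimeMillis();
                for (long i = 0; i < 1000000000L; i++) {
                    obj.getVirtual();   // result discarded
                }
                System.out.println("iteration " + run + ": "
                        + (System.currentTimeMillis() - start) + "ms");
            }
        }
    }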

Ok, so on 1.6 the first iteration of a billion calls took 8ms, and each subsequent iteration took nearly 200 times as long. How could that be?

Q> Did I screw up my measurements?

A> Nope, looks ok; double-checked, still looks ok

Q> Was it the garbage collector?

A> Shouldn’t be; other than printing the results the test doesn’t generate any garbage. I reran with GC logging just to be sure, and no GCs were logged (not surprising).
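For reference, that sort of GC logging can be enabled with the standard -verbose:gc switch (the class name below is illustrative):

    java -server -verbose:gc VirtualCallBench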

Q> Did I need to let the test run longer so HotSpot could do its thing?

A> Nope, ran for much longer, results held steady at ~1350ms

Q> Is the optimizer broken or deoptimizing after the first iteration?

A> Sure looks like it

Ok, so time to start thinking about how the optimizer is going to change my test. The first thing to notice is that it could decide there is no reason to actually call my getVirtual() method: it can see there are no side effects from calling it, and it can see that the result is discarded. So let’s modify the test to do something with the result, looking at just the test loop now.
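A sketch of the modified loop, continuing with the names from the sketch above and accumulating the result so the call can’t be discarded as dead code (the accumulator is illustrative):

    long start = System.currentTimeMillis();
    long sum = 0;
    for (long i = 0; i < 1000000000L; i++) {
        sum += obj.getVirtual();   // result is now used
    }
    System.out.println("took " + (System.currentTimeMillis() - start)
            + "ms, sum=" + sum);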

Running the modified test, we can see that:

- the getVirtual() method (or at least its logic) is actually getting run
- the cost of both the fast iteration and the slow iterations increased
- the performance difference between fast and slow runs is down to a factor of ~40

Still, what’s the deal? Running the test for longer doesn’t yield any additional fast iterations. So at this point I get a few other people involved, and they work through many of the same suggestions and assumptions I’d listed above. We also try the following:

- pull the logic out of main() and put it in a non-static method
- try running the test loop in parallel on multiple threads
- try first warming up the JVM by running some random but heavy code prior to running the test
- try recording the results into an array rather than printing during the test

All of these yield essentially the same results as above. We do randomly trigger the loss of the fast iteration, but never trigger multiple fast iterations. So yippee, we figured out how to make things go slower. As a side note, it was interesting which change would trigger the loss of the fast iteration: recording the results into an array.
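A sketch of that variant, again continuing the names from above (the array handling is illustrative):

    long[] results = new long[10];
    for (int run = 0; run < results.length; run++) {
        long start = System.currentTimeMillis();
        long sum = 0;
        for (long i = 0; i < 1000000000L; i++) {
            sum += obj.getVirtual();
        }
        results[run] = System.currentTimeMillis() - start;   // record instead of printing
    }
    for (long ms : results) {
        System.out.println(ms + "ms");   // print only after all timing is done
    }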

You might want to read http://wikis.sun.com/display/HotSpotInternals/MicroBenchmarks. One specific issue with HotSpot is that the code in main() itself will not be optimized heavily, so pulling your test code out into a method called by main() makes sense. Also, you might find it interesting to test with Japex, which handles VM warm-up and timing for you.

Thanks for the feedback. It was interesting that simply pulling the test out into a helper method did not change the behavior, i.e. I still received an initial very fast iteration, followed by endless very slow iterations. It was only when the test was modified to repeatedly call the helper method that the optimizer was able to apply a good optimization. All in all it makes sense, but it does go to show some of the surprises you can get when doing micro benchmarks. Generally I’m used to them providing overly positive performance estimates; in this case it went just the opposite. I’ll plan on checking out Japex to see what it can do.
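For illustration, a sketch of that restructuring, with the timing loop hoisted into a helper that main() calls repeatedly (the helper name is illustrative):

    // main() just drives the helper; because runTest() is re-entered,
    // HotSpot gets the chance to compile and optimize it between calls
    public static void main(String[] args) {
        Base obj = new Derived();
        for (int run = 0; run < 10; run++) {
            long start = System.currentTimeMillis();
            long sum = runTest(obj);
            System.out.println("took "
                    + (System.currentTimeMillis() - start) + "ms, sum=" + sum);
        }
    }

    static long runTest(Base obj) {
        long sum = 0;
        for (long i = 0; i < 1000000000L; i++) {
            sum += obj.getVirtual();
        }
        return sum;
    }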

First guess: did you try with the -Xint param to get HotSpot out of the picture? Perhaps in this simple case the interpreter is faster, since you are really not doing anything, and the interpreter, with the state it holds, discovers this quickly and returns early.

Second guess: the first time HotSpot sees your loop it completely unrolls it into a thousand or so no-op instructions, which then get further optimized into nothingness; then when the loop is seen again, perhaps it says “I do indeed need to run this as a loop”.

Either way you should send this to Cliff Click. He’d figure it out in two ticks.

Yep, I’d given both -Xint and -Xbatch a try. -Xint resulted in consistent runs, but all were about 80x slower than my prior slow runs. -Xbatch resulted in the same initial fast iteration followed by slow iterations. I’d left out a number of the intermediate steps I’d taken in trying to track this down. Here are some of the other switches I’d tried, in an attempt to persuade the optimizer to yield additional fast results, all of them about as (in)effective as -Xbatch:

-Xcomp
-XX:+AggressiveOpts
-XX:+BackgroundCompilation
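For reference, a sample invocation combining such switches might look like this (the class name is illustrative):

    java -server -Xbatch -XX:+AggressiveOpts VirtualCallBench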

In hindsight it makes sense that the optimizations I was looking for could not be applied until the method was re-executed, which of course I was not allowing by keeping everything within main(). Though I’m still curious about the drop in performance after the first iteration; apparently the optimizer was willing to take an initial stab at things, and then de-optimized my second (and subsequent) iterations.
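One way to watch for this, assuming a HotSpot JVM (the output format varies by version), is the -XX:+PrintCompilation switch, which logs methods as they are compiled and flags those that get deoptimized (“made not entrant”):

    java -server -XX:+PrintCompilation VirtualCallBench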

Be sure to run with -server mode to get reasonable compilation. I would assume that in most of the cases above the code should be optimized down to zero iterations. You can also try compiling with -Xbatch to see how it changes the results.

Myself, when running somebody’s microbenchmarks, I rename their main(…) to mainX(…) and then add my own main(…) that calls theirs in a loop. Normally I get stable results after 2-3 iterations.
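A sketch of that wrapper pattern, assuming the original entry point has been renamed to mainX():

    public static void main(String[] args) throws Exception {
        // loop over the original benchmark so HotSpot treats it as a
        // regular, recompilable method rather than a one-shot main()
        for (int i = 0; i < 5; i++) {
            mainX(args);
        }
    }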

I suppose I’d left out some important details. The tests were in fact run with -server (my default execution mode); running with -client always produced “slow” results. The intent of my investigation had been to identify why “fast” runs appeared initially and then vanished. In the course of trial and error, I found many changes, either to the code or to JVM switches, that would give me all “slow” runs, but it was the elusive “fast” runs I was searching for.

I definitely agree with your point regarding having main() loop over the real test code, and not itself be part of the test. This little exercise has taught me to always add that going forward.