Tuesday, 22 November 2011

Biased Locking, OSR, and Benchmarking Fun

After my last post on Java Lock Implementations, I got a lot of good feedback about my results and micro-benchmark design approach. As a result I now understand JVM warmup, On Stack Replacement (OSR) and Biased Locking somewhat better than before. Special thanks to Dave Dice from Oracle, and Cliff Click & Gil Tene from Azul, for their very useful feedback.

In the last post I concluded, based on my experiments, that biased locking was no longer necessary on modern CPUs. While this conclusion is understandable given the data gathered in the experiment, it was not valid because the experiment did not take account of some JVM warm-up behaviour that I was unaware of.

In this post I will re-run the experiment taking into account the feedback and present some new results. I shall also expand on the changes I've made to the test and why it is important to consider the JVM warm-up behaviour when writing micro-benchmarks, or even very lean Java applications with quick start up time.

On Stack Replacement (OSR)

Java virtual machines compile code to achieve greater performance based on runtime profiling. Some VMs run an interpreter for the majority of code and replace hot areas with compiled code following the 80/20 rule. Other VMs compile all code with a simple compiler at first, then replace that simple code with more optimised code based on profiling. Oracle Hotspot and Azul are examples of the first type and Oracle JRockit is an example of the second.

Oracle Hotspot counts the invocations of a method plus the backward branches taken by loops in that method, and if the total exceeds 10K in server mode the method will be compiled. With normal JIT'ing, the compiled code can be used the next time the method is called. However, if a loop is still iterating it may make sense to replace the method before the loop completes, especially if it has many iterations to go. OSR is the means by which a method gets replaced with a compiled version part way through iterating a loop.
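To make the distinction concrete, here is a minimal sketch contrasting the two situations. The class and method names are made up for illustration. The first call runs one very long loop inside a single invocation, so compiled code can only take over part way through via OSR. The second enters a short method many times, so a normally compiled version can be picked up on a later call. If you run it with -XX:+PrintCompilation, OSR compilations are marked with a % in the output, though exactly what gets compiled and when will vary by JVM version and settings.

```java
// Hypothetical demo class, not part of the original benchmark.
// Try: java -XX:+PrintCompilation OsrDemo
final class OsrDemo {
    // One invocation, many loop iterations: the method is entered only once,
    // so compiled code can only be reached by OSR on a loop back-branch.
    static long sumInOneCall(int iterations) {
        long sum = 0;
        for (int i = 0; i < iterations; i++) {
            sum += i & 7;
        }
        return sum;
    }

    // Many invocations with few iterations each: once the method has been
    // compiled, later calls can enter the compiled version from the top.
    static long sumInManyCalls(int calls, int iterationsPerCall) {
        long sum = 0;
        for (int c = 0; c < calls; c++) {
            sum += sumInOneCall(iterationsPerCall);
        }
        return sum;
    }

    public static void main(String[] args) {
        // Loop sizes are arbitrary; they only need to exceed the compile threshold.
        System.out.println(sumInOneCall(100 * 1000 * 1000));
        System.out.println(sumInManyCalls(1000, 100 * 1000));
    }
}
```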

I was under the impression that normal JIT'ing and OSR would result in similar code. Cliff Click pointed out that it is much harder for a runtime to optimise a loop part way through, and especially difficult if the loops are nested. For example, it may not be possible to eliminate bounds checking within the loop. Cliff will blog in more detail on this shortly.

What this means is that you are likely to get better optimised code by doing a small number of shorter warm ups than a single large one. You can see in the code below how I do 10 shorter runs in a loop before the main large run compared to the last article where I did a single large warm-up run.
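A minimal sketch of that warm-up structure follows. It is not the original test harness: the class name, the counter, and the use of a plain synchronized block are assumptions for illustration. It uses 10 warm-up runs of 100K iterations before a 500M iteration main run, the figures discussed in the comments.

```java
// Illustrative sketch only; names and structure are assumptions.
final class WarmupSketch {
    private static final Object LOCK = new Object();
    private static long counter;

    static long counter() {
        return counter;
    }

    // The measured operation: an increment guarded by an intrinsic lock.
    static void runIterations(long iterations) {
        for (long i = 0; i < iterations; i++) {
            synchronized (LOCK) {
                counter++;
            }
        }
    }

    // Several short runs, so that runIterations is re-entered after the JIT
    // has had a chance to finish and install the compiled method.
    static void warmUp(int runs, long iterationsPerRun) {
        for (int r = 0; r < runs; r++) {
            runIterations(iterationsPerRun);
        }
    }

    // Times a single run and converts the result to operations per second.
    static long measureOpsPerSec(long iterations) {
        long start = System.nanoTime();
        runIterations(iterations);
        long elapsedNanos = Math.max(1L, System.nanoTime() - start);
        return (long) (iterations * 1e9 / elapsedNanos);
    }

    public static void main(String[] args) {
        warmUp(10, 100000L);
        System.out.println(measureOpsPerSec(500L * 1000 * 1000) + " ops/sec");
    }
}
```

The point of splitting the warm-up is that each new call to runIterations is a chance to enter the fully compiled method from the top, rather than relying on OSR inside one long loop.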

Biased Locking

Dave Dice pointed out that Hotspot does not enable objects for biased locking in the first few seconds (4s at present) of JVM startup. This is because some benchmarks, and NetBeans, have a lot of thread contention on start up and the revocation cost is significant. Other VMs such as Azul use biased locking right from the start which is not an issue because their revocation is cheap.

Once this start-up delay has passed, Oracle Hotspot creates all objects with biased locking enabled by default. The delay is configurable and can be removed with -XX:BiasedLockingStartupDelay=0.
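When running a benchmark it is easy to forget which flags a given run was started with. As a small sketch, the code below inspects the arguments the JVM was launched with, using the standard RuntimeMXBean, and reports whether the start-up delay was set explicitly. The class and helper names are hypothetical. It only sees flags passed on the command line, not the VM's effective defaults, and these flags belong to the Hotspot versions discussed here; later JDK releases deprecated biased locking, so they may be rejected or ignored there.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

// Hypothetical helper for logging the biased locking flags of the current run.
final class BiasedLockingFlags {
    private static final String DELAY_FLAG = "-XX:BiasedLockingStartupDelay=";

    // Returns the explicitly configured delay in milliseconds,
    // or -1 if the flag was not passed on the command line.
    static long startupDelayMillis(List<String> jvmArgs) {
        for (String arg : jvmArgs) {
            if (arg.startsWith(DELAY_FLAG)) {
                return Long.parseLong(arg.substring(DELAY_FLAG.length()));
            }
        }
        return -1L;
    }

    // True only if biased locking was explicitly switched off.
    static boolean explicitlyDisabled(List<String> jvmArgs) {
        return jvmArgs.contains("-XX:-UseBiasedLocking");
    }

    public static void main(String[] args) {
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        System.out.println("startup delay flag (ms): " + startupDelayMillis(jvmArgs));
        System.out.println("biased locking explicitly disabled: " + explicitlyDisabled(jvmArgs));
    }
}
```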

This point, combined with knowing more about OSR, is important for micro-benchmarks. It is also important to be aware of these points if you have a lean Java application that starts in a few seconds.

The tests are carried out with 64-bit Linux (Fedora Core 15) and Oracle JDK 1.6.0_29.

Nehalem 2.8GHz - Ops/Sec

Threads  -UseBiasedLocking  +UseBiasedLocking  ReentrantLock
1               53,283,461        450,950,969     62,876,566
2               18,519,295         18,108,615     10,217,186
3               13,349,605         13,416,198     14,108,622
4                8,120,172          8,040,773     14,207,310
5                4,725,114          4,551,766     14,302,683
6                5,133,706          5,246,548     14,676,616
7                5,473,652          5,585,666     18,145,525
8                5,514,056          5,414,171     19,010,725

Sandy Bridge 2.0GHz - Ops/Sec

Threads  -UseBiasedLocking  +UseBiasedLocking  ReentrantLock
1               34,500,407        396,511,324     43,148,808
2               20,899,076         19,742,639      6,038,923
3                9,288,039         11,957,032     24,147,807
4                5,618,862          5,589,289      9,082,961
5                5,609,932          5,592,574      9,389,243
6                5,742,907          5,760,558     12,518,728
7                6,699,201          6,641,886     13,684,475
8                6,957,824          6,925,410     14,819,005

Observations

Biased locking has a huge benefit in the un-contended single threaded case.

Biased locking when un-contended, and not revoked, only adds 4-5 cycles of cost. This is the cost when having a cache hit for the lock structures, on top of the code protected in the critical section.

-XX:BiasedLockingStartupDelay=0 needs to be set for lean applications and micro-benchmarks.

Avoiding OSR does not make a material difference to this set of test results. This is likely to be because the loop is so simple or other costs are dominating.

For the current implementations, ReentrantLocks scale better than synchronised locks under contention, except in the case of 2 contending threads.
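As a rough sanity check on the cycle figures, the single-threaded throughput in the tables above can be converted into cycles per operation by dividing the clock rate by the ops/sec. This sketch assumes the cores ran at the nominal frequency given in each table heading, which ignores turbo boost and frequency scaling, so treat the results as approximate. The class name is made up for illustration.

```java
// Back-of-envelope conversion; assumes nominal clock speed with no turbo boost.
final class CyclesPerOp {
    static double cyclesPerOp(double clockHz, long opsPerSec) {
        return clockHz / opsPerSec;
    }

    public static void main(String[] args) {
        // Single-threaded rows from the two tables.
        System.out.printf("Nehalem      -UseBiasedLocking: %.1f cycles/op%n",
                cyclesPerOp(2.8e9, 53283461L));
        System.out.printf("Nehalem      +UseBiasedLocking: %.1f cycles/op%n",
                cyclesPerOp(2.8e9, 450950969L));
        System.out.printf("Sandy Bridge -UseBiasedLocking: %.1f cycles/op%n",
                cyclesPerOp(2.0e9, 34500407L));
        System.out.printf("Sandy Bridge +UseBiasedLocking: %.1f cycles/op%n",
                cyclesPerOp(2.0e9, 396511324L));
    }
}
```

With biased locking on, the whole operation, including the guarded work and loop overhead, comes out at roughly 5 to 6 cycles, compared with over 50 cycles with it off. That is in line with a biased lock adding only a handful of cycles.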

Conclusion

My tests in the last post are invalid for the testing of an un-contended biased lock, because the lock was not actually biased. If you are designing code following the single writer principle, and therefore have un-contended locks when using third-party libraries, then having biased locking enabled is a significant performance boost.

11 comments:

I'd like to ask a couple of questions. You mention that sometime after 10K iterations, compilation of the method happens, and that if you repeat this warm-up phase more times, better optimisation of the method's code is likely to occur. In your code you do 10 iterations, 100K each. Is there any neat trick behind that? :P

Regarding the values on the tables, are they from a single run or are they the average/min/max values of a number of runs?

The neat trick is that Cliff suggests this is the best way to warm it up :-) He would know best. The compilation starts, but the method will only be swapped in some time after that. Taking a few iterations ensures the opportunity to have the swap occur. We want to ensure everything has stabilised before the main run.

The values are from a typical run I picked in the middle of the range of many runs.

Thanks for the answer! My point was that usually 15K-20K iterations are enough for compilation to occur. Of course 100K iterations is nothing when compared to 500M, and even if you do it 10 times it will be 1/500th of the running time. I was just curious if there was a specific reason for choosing 100K and not 20K. I think I'll ask Cliff :)

You need to allow time for the compilation to complete and then have the new implementation available. By 20K this may not have happened. Compilation may start after 10K, but the compiled code will not be available until some time later, depending on complexity and available resources.

OSR does result in JIT'ed code. It is however not the optimal result in many cases. It is best to have the entire method replaced with a JIT'ed version which the 10 * 100K iteration tends to result in.

It is good practice to do a warm up. I usually do it like above or by running the whole test 3-5 times. Cliff recommends many smaller runs rather than a few larger runs. I'm changing my approach to reflect this.