Plumbr now also monitors java.util.concurrent locks

August 4, 2016 by
Ivo Mägi
Filed under:
Locked Threads, Plumbr

The java.util.concurrent package was originally introduced in Java 5 via JSR 166. This means that the potential performance issues related to the concurrent data structures and locks in the library have been out there for 12 years. Detecting such issues has been a pain throughout these years. No more – Plumbr root cause detection is now able to detect performance issues arising from the use of the java.util.concurrent library.

Background

For readers unfamiliar with the java.util.concurrent package – it is a set of high-performance, thread-safe building blocks for developing concurrent applications. The framework includes thread pools, thread-safe collections, atomic variables, locks and many other goodies.

If you are familiar with the synchronization mechanism built into the JVM and are wondering about the potential benefits of the library, here are just some examples of the powerful abstractions embedded into the framework:

java.util.concurrent.locks.Lock implementations allow you to implement critical sections much like the synchronized keyword. However, with the Lock implementations you are not limited to containing the critical section within a single method: the calls to lock() and unlock() can span multiple methods. On top of that, a lock such as ReentrantLock can be made fair, meaning that when multiple threads are waiting on the same lock, the longest-waiting thread is favored, so access is granted in roughly FIFO order. The synchronized keyword gives no such guarantee, and you may end up with thread starvation.
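As a minimal sketch of the two properties described here, the class below acquires a fair ReentrantLock in one method and releases it in another. The class and method names (FairCounter, begin, end, runDemo) are made up for illustration and do not come from any library.

```java
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical example class, not part of java.util.concurrent.
class FairCounter {
    // Passing 'true' requests a fair lock: under contention the
    // longest-waiting thread is favored when the lock is handed over.
    private final Lock lock = new ReentrantLock(true);
    private int value;

    // The lock is acquired in one method ...
    void begin() {
        lock.lock();
    }

    // ... and released in another, which a synchronized block cannot express.
    void end() {
        lock.unlock();
    }

    int increment() {
        begin();
        try {
            return ++value;
        } finally {
            end(); // always release, even if the critical section throws
        }
    }

    int read() {
        begin();
        try {
            return value;
        } finally {
            end();
        }
    }

    // Starts 'threads' threads that each increment the counter 'perThread' times.
    static int runDemo(int threads, int perThread) {
        FairCounter counter = new FairCounter();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) {
                    counter.increment();
                }
            });
            workers[i].start();
        }
        try {
            for (Thread worker : workers) {
                worker.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return counter.read();
    }
}
```

Calling FairCounter.runDemo(4, 1000) should return 4000, as no increment is lost. Keep in mind that fair locks usually have lower throughput than the default non-fair ones, so fairness is a trade-off rather than a free improvement.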

java.util.concurrent.CountDownLatch is a construct allowing one or more threads to wait until a particular set of operations has completed. For example, you can initialize a CountDownLatch with a count of your choice and decrement it via calls to the countDown() method whenever a particular operation has completed. Threads waiting for this count to reach zero call one of the await() methods, which block the thread until the count reaches zero. This comes in handy in many situations, for instance when you need multiple worker threads to start simultaneously.
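The "start simultaneously" case can be sketched with two latches: one acting as a start gate with a count of one, and one counting finished workers. The class name StartGateDemo and the method runWorkers are illustrative assumptions only.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical example class, not part of java.util.concurrent.
class StartGateDemo {
    // Starts 'workers' threads, releases them together, and returns
    // how many of them completed their (trivial) unit of work.
    static int runWorkers(int workers) {
        CountDownLatch startGate = new CountDownLatch(1);    // opened once by the caller
        CountDownLatch done = new CountDownLatch(workers);   // one countDown() per worker
        AtomicInteger completed = new AtomicInteger();

        for (int i = 0; i < workers; i++) {
            new Thread(() -> {
                try {
                    startGate.await();           // block until the gate opens
                    completed.incrementAndGet(); // the "work"
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    done.countDown();            // report completion either way
                }
            }).start();
        }

        startGate.countDown();                   // count reaches zero: release all workers
        try {
            done.await();                        // wait until every worker has finished
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return completed.get();
    }
}
```

Note that a CountDownLatch cannot be reset: once the count reaches zero, every later await() returns immediately.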

java.util.concurrent.BlockingQueue gives you a special queue for decoupling producers and consumers. A producer thread can keep inserting objects into the queue until a predefined upper bound on the queue capacity is reached. Once the queue is full, the producing thread blocks while trying to insert a new object until consumer threads take objects out of the queue. The same applies to consumer threads accessing an empty BlockingQueue – they block until producers have added new objects to the queue.
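A minimal producer/consumer sketch of this behaviour, assuming a bounded ArrayBlockingQueue and the blocking put() and take() methods, could look as follows. The class name ProducerConsumerDemo and the method transfer are illustrative only.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical example class, not part of java.util.concurrent.
class ProducerConsumerDemo {
    // Sends the numbers 1..items through a queue holding at most
    // 'capacity' elements and returns the sum seen by the consumer.
    static int transfer(int items, int capacity) {
        BlockingQueue<Integer> queue = new ArrayBlockingQueue<>(capacity);

        Thread producer = new Thread(() -> {
            try {
                for (int i = 1; i <= items; i++) {
                    queue.put(i);        // blocks while the queue is full
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.start();

        int sum = 0;
        try {
            for (int i = 0; i < items; i++) {
                sum += queue.take();     // blocks while the queue is empty
            }
            producer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return sum;
    }
}
```

With a capacity much smaller than the number of items, for example transfer(100, 2), the producer is repeatedly blocked in put() until the consumer catches up. The non-blocking alternatives offer() and poll() return immediately instead of waiting.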

There is a lot more in the framework, so I can only encourage you to familiarize yourself with the rest of the goodies in the package.

Motivation

Plumbr is all about monitoring the end user experience for potential performance issues and exposing the root causes of such issues. We suspected the elements of the java.util.concurrent package to be a frequent source of performance-related problems. Similar to the standard synchronized blocks, the blocking nature of many of the components is likely to force threads to wait until access is granted to a particular data structure or code snippet. This in turn forces users to wait until such access is granted.

As we already had support for synchronization detection, it was only natural to investigate how frequently such blockages impact end users. To no surprise, our initial research covering 300 different deployments showed that 10% of them regularly experienced performance issues caused by the use of the java.util.concurrent framework.

The only surprise was that the ratio was not higher. The very same data set contained 29% of deployments where the standard synchronized access was the source of problems. It is likely that the uptake of the improved library has not been very fast, even 12 years after its initial release.

Solution

Our solution to synchronized access detection was based on JVMTI callbacks in situations where threads were waiting on contended monitors for longer than expected. For java.util.concurrent data structures we did not need to dig that deep and were able to stick with bytecode instrumentation instead.

This instrumentation will detect the situations where threads are forced to wait for events originating from the use of various java.util.concurrent classes, ranging from ReentrantLock to ArrayBlockingQueue.
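To make the idea of "threads forced to wait" concrete, here is a small sketch that measures how long a thread waits to acquire a ReentrantLock. It is not how Plumbr's bytecode instrumentation works; it is a hand-written wrapper, and the names TimedLock, lockAndMeasureNanos and demoContendedWaitMillis are invented for this illustration.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical wrapper for illustration; not Plumbr's implementation.
class TimedLock {
    private final ReentrantLock delegate = new ReentrantLock();

    // Acquires the lock and returns the time spent waiting for it, in nanoseconds.
    long lockAndMeasureNanos() {
        long start = System.nanoTime();
        delegate.lock();
        return System.nanoTime() - start;
    }

    void unlock() {
        delegate.unlock();
    }

    // One thread holds the lock for 'holdMillis'; the calling thread then
    // tries to acquire it and reports how long it was blocked, in milliseconds.
    static long demoContendedWaitMillis(long holdMillis) {
        TimedLock lock = new TimedLock();
        CountDownLatch held = new CountDownLatch(1);

        Thread holder = new Thread(() -> {
            lock.lockAndMeasureNanos();
            try {
                held.countDown();          // signal that the lock is now held
                Thread.sleep(holdMillis);  // simulate slow work inside the critical section
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                lock.unlock();
            }
        });
        holder.start();

        try {
            held.await();                  // make sure the holder got the lock first
            long waitedNanos = lock.lockAndMeasureNanos();
            lock.unlock();
            holder.join();
            return waitedNanos / 1_000_000;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }
}
```

A call such as TimedLock.demoContendedWaitMillis(200) should report a wait close to the hold time. A monitoring agent has to collect this kind of measurement without requiring changes to the application source, which is where instrumentation comes in.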

So when end users face performance issues and you are using Plumbr agent 16.07.07 or later, Plumbr will expose the root causes originating from the java.util.concurrent library, similar to the following example:

As seen from the above, processing a payment has taken 17 seconds instead of the expected five seconds. Plumbr has flagged this transaction as slow and has exposed multiple root causes for it. The biggest contributor is a locking issue, adding six seconds to the transaction completion time.

Opening up the particular root cause details, we get access to the following information:

Plumbr has exposed the java.util.concurrent.ReentrantReadWriteLock being held at com.sun.enterprise.resource.pool.datastructure.RWLockDataStructure.getResources() on line 116 as the line in the source code that is the root cause. It is also visible that this is not a single occurrence – the very same root cause has affected 17 other transactions, making them slower than expected.

Equipped with this information, fixing the performance issues becomes easy. You are immediately zoomed in to the root cause in source code. Better yet, all the root causes are ranked by the impact measured in number of end users suffering, so you can first fix issues impacting your end users most.

So I can only encourage you to go ahead and grab the free trial for the fully functional Plumbr to see both how your end users are experiencing the application and whether or not java.util.concurrent locks are among the many root causes that can and will impact the user’s well-being.

Thank you. You are doing a great job.
However,
first of all, performance enthusiasts do not understand synchronization issues,
secondly, making sense of the synchronization detection requires access to the source file to figure out whether the code is oversynchronized,
thirdly, under a different workload model the synchronization issues might not be exposed.
And finally, it was my observation that developers, in trying to avoid thread safety issues, are oversynchronizing.

I will cover your questions in a single answer; hopefully I have all the answers.

Regarding the functionality demo – we do have a test application that we use internally to test the detection of all the root causes. It is based on the Spring Pet Clinic demo app, and we crawl it using Selenium tests. We do plan to release it for public use as well, but at the moment this is not possible. What I can offer is a screen-sharing demo, so if you are interested, contact us via support@plumbr.eu and let us schedule it.

The lock contention issue originates from the same demo, and we did not actually solve it, as the demo revolves around being broken in different ways. The problem boils down to connection acquisitions from the connection pool. In the application at hand these stem from an N+1 issue, which results in hundreds of connections being acquired from the pool during a single transaction. Had we attempted to solve it, we would have batched the calls into a single call, reducing the number of connections acquired from the pool by a factor of 100 or more.