This article is also a great opportunity for you to improve your thread dump analysis skills and I highly recommend that you study and properly understand the following analysis approach. It will also demonstrate the importance of proper data gathering as opposed to premature middleware (Weblogic) restarts.

Environment specifications

Java EE server: Oracle Service Bus 11g

OS: AIX 6.1

JDK: IBM JRE 1.6.0 (64-bit)

RDBMS: Oracle 10g

Platform type: Enterprise Service Bus

Troubleshooting tools

Quest Software Foglight for Java (monitoring and alerting)

Java VM Thread Dump (IBM JRE javacore format)

Problem overview

Major performance degradation was observed in our Oracle Service Bus Weblogic environment. Alerts were also sent by the Foglight agents, indicating a significant surge in Weblogic thread utilization.

Gathering and validation of facts

As usual, a Java EE problem investigation requires gathering technical and non-technical facts so we can either derive other facts and/or conclude on the root cause. Before applying a corrective measure, the facts below were verified in order to conclude on the root cause:

What is the client impact? HIGH

Recent change of the affected platform? Yes, logging level changed in OSB console for a few business services prior to outage report

Any recent traffic increase to the affected platform? No

How long has this problem been observed? It is a new problem, observed following the logging level changes

Did a restart of the Weblogic server resolve the problem? Yes

Conclusion #1: The logging level changes applied earlier on some OSB business services appear to have triggered this stuck thread problem. However, the root cause remains unknown at this point.

Weblogic threads monitoring: Foglight for Java

Foglight for Java (from Quest Software) is a great monitoring tool allowing you to fully monitor any Java EE environment, along with complete alerting capabilities. This tool is used in our production environment to monitor the middleware (Weblogic) data, including threads, for each of the Oracle Service Bus managed servers. The data showed a consistent increase in thread usage along with a growing pending request queue.

For your reference, Weblogic slow-running threads are identified as “Hogging Threads” and can eventually be promoted to “STUCK” status if they keep running for several minutes (as per the configured StuckThreadMaxTime threshold, 600 seconds by default).

Now what should be your next course of action? Weblogic restart? Definitely not…

Your first “reflex” for this type of problem should be to capture a JVM thread dump. Such data is critical to perform proper root cause analysis and understand the potential hanging condition. Once that data is captured, you can then proceed with Weblogic server recovery actions such as a full managed server (JVM) restart.
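On the IBM JVM the thread dump is normally captured with kill -3, but for completeness, here is a minimal sketch (hypothetical ThreadDumper class, using the standard java.lang.management API) showing how a similar snapshot can also be taken programmatically from within the JVM:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class ThreadDumper {

    // Capture a snapshot of every live thread, similar in spirit to kill -3.
    // Note: ThreadInfo.toString() may truncate very deep stack traces.
    public static String dumpAllThreads() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        StringBuilder sb = new StringBuilder();
        for (ThreadInfo info : mx.dumpAllThreads(false, false)) {
            sb.append(info.toString());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(dumpAllThreads());
    }
}
```

This does not replace the native javacore format (which includes the lock monitor pool section used later in this analysis), but it can be handy for lightweight in-process diagnostics.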

Stuck Threads: Thread Dump to the rescue!

The next course of action in this outage scenario was to quickly generate a few thread dump snapshots from the IBM JVM before attempting to recover the affected Weblogic instances. The thread dumps were generated using kill -3 <Java PID>, which produced a few javacore files at the root of the Weblogic domain.

javacore.20120610.122028.15149052.0001.txt

Once the production environment was back up and running, the team quickly proceeded with the analysis of the captured thread dump files as per the steps below.

Thread Dump analysis step #1 – identify a thread execution pattern

The first analysis step is to quickly go through all the Weblogic threads and attempt to identify a common problematic pattern, such as threads waiting for remote external systems, threads in a deadlock state, threads waiting for other threads to complete their tasks, etc.

The analysis quickly revealed many threads involved in the same blocking situation, as per the sample below. In this sample, we can see an Oracle Service Bus thread in blocked state within the TransactionManager Java class (OSB kernel code).

at sun/reflect/DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:37(Compiled Code))

at java/lang/reflect/Method.invoke(Method.java:589(Compiled Code))

at com/bea/wli/sb/transports/Util$1.invoke(Util.java:83(Compiled Code))

at $Proxy111.sendMessageAsync(Bytecode PC:26(Compiled Code))

……………………………

Thread Dump analysis step #2 – review the blocked threads chain

The next step was to review the affected and blocked thread chains involved in our identified pattern. As we saw in the thread dump analysis part 4, the IBM JVM thread dump format contains a separate section that provides a full breakdown of all blocked thread chains, i.e. the Java object monitor pool locks.

A quick analysis revealed the culprit, as per below: Weblogic thread #16 is the actual culprit, with 300+ threads waiting to acquire a lock on the shared object monitor TransactionManager@0x0700000001A51610/0x0700000001A51628.
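The question the javacore lock section answers here (which thread owns the monitor that everyone else is queued on) can also be asked at runtime through the java.lang.management API. Below is a minimal sketch with hypothetical class and thread names; the real-world equivalent is thread #16 holding the TransactionManager monitor with 300+ waiters:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class LockOwnerFinder {

    private static final Object MONITOR = new Object();

    // For each BLOCKED thread, report the monitor it is waiting on and the
    // thread currently holding it.
    public static String blockedReport() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        StringBuilder sb = new StringBuilder();
        for (ThreadInfo info : mx.getThreadInfo(mx.getAllThreadIds())) {
            if (info != null && info.getThreadState() == Thread.State.BLOCKED) {
                sb.append(info.getThreadName())
                  .append(" waits on ").append(info.getLockName())
                  .append(" held by ").append(info.getLockOwnerName())
                  .append('\n');
            }
        }
        return sb.toString();
    }

    // Small demo: one "culprit" thread holds the monitor while a second
    // thread blocks on it, mirroring thread #16 and its many waiters.
    public static String demo() {
        try {
            Thread holder = new Thread(() -> {
                synchronized (MONITOR) {
                    try { Thread.sleep(1000); } catch (InterruptedException ignored) { }
                }
            }, "culprit");
            holder.start();
            Thread.sleep(100); // let the culprit grab the monitor first
            Thread waiter = new Thread(() -> { synchronized (MONITOR) { } }, "waiter");
            waiter.start();
            Thread.sleep(200); // give the waiter time to reach BLOCKED state
            String report = blockedReport();
            holder.join();
            waiter.join();
            return report;
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.print(demo());
    }
}
```

In a real outage you would still rely on the javacore files, since they capture the state of threads you cannot re-create after the fact.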

Once you identify a primary culprit thread, the next step is to perform a deeper review of the computing task this thread is currently performing. Simply go back to the raw thread dump data and start analyzing the culprit thread stack trace from the bottom up.

As you can see below, the thread stack trace for our problem case was quite revealing: thread #16 was attempting to commit a change made at the Weblogic / Oracle Service Bus level. The problem is that the commit operation was hanging and taking too much time, causing thread #16 to retain the shared TransactionManager object monitor lock for too long and “starving” the other Oracle Service Bus Weblogic threads.

Weblogic runtime threads executing client requests quickly started to queue up and wait for a lock on the shared object monitor (TransactionManager).

The Weblogic instances ran out of threads, generating alerts and forcing the production support team to shut down and restart the affected JVM processes.

Our team is planning to open an Oracle SR shortly to share this OSB deployment behaviour, along with the hard dependency between the client requests (threads) and the OSB logging layer. In the meantime, no OSB logging level change will be attempted outside the maintenance window period until further notice.

Conclusion

I hope this article has helped you understand and appreciate how powerful thread dump analysis can be in pinpointing the root cause of stuck thread problems, and how important it is for any Java EE production support team to capture such crucial data in order to prevent future re-occurrences.

No worries, our team is still busy with other high-priority issues. However, I can confirm that we had a second occurrence of this problem, again triggered by attempting to apply a simple change from the OSB console. I'm expecting our team to open an SR this week (after resolution of the current production outages).

In the meantime, my recommendation to you and your team is to freeze any OSB console change during peak load, e.g. wait until after your business hours/SLA period, in order to reduce the risk level. Also, add extra thread monitoring after any OSB console change so you can detect any early thread lock following the commit of an OSB console/domain change. A preventive Weblogic restart may be required if you see thread waits similar to those in the article analysis.
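As a lightweight complement to a full monitoring product such as Foglight, a watchdog along these lines (hypothetical class name; the alert threshold is an assumption to be tuned per environment) could be run after a console change to watch for early monitor contention:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class BlockedThreadMonitor {

    // Count threads currently BLOCKED waiting on an object monitor.
    public static int blockedThreadCount() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        int blocked = 0;
        for (ThreadInfo info : mx.getThreadInfo(mx.getAllThreadIds())) {
            if (info != null && info.getThreadState() == Thread.State.BLOCKED) {
                blocked++;
            }
        }
        return blocked;
    }

    public static void main(String[] args) {
        int blocked = blockedThreadCount();
        System.out.println("BLOCKED threads: " + blocked);
        // Threshold is an assumption; a sustained spike after a console
        // change would justify capturing a thread dump right away.
        if (blocked > 50) {
            System.out.println("WARNING: possible monitor contention, capture a thread dump");
        }
    }
}
```

A sustained increase in this count right after a console change commit is exactly the early warning sign described in this article.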

I will post a fresh comment as soon as I hear details from Oracle support on this situation.

Quick update: we are still continuing the investigation with Oracle. An IBM VM bug is also suspected. A deeper analysis of the thread dump is also possibly pointing to problems with a ReentrantReadWriteLock implementation (com/bea/wli/config/derivedcache/DerivedCache).
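To illustrate why a ReentrantReadWriteLock implementation is a plausible suspect: a write lock held during a slow operation parks every reader until it is released. Below is a minimal sketch with hypothetical names, not the actual OSB DerivedCache code:

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class RwLockStarvationDemo {

    private static final ReentrantReadWriteLock LOCK = new ReentrantReadWriteLock();

    // Start a writer that holds the write lock (simulating a slow config
    // commit), then observe the state of a reader trying to take the read lock.
    public static Thread.State observeBlockedReaderState() {
        try {
            Thread writer = new Thread(() -> {
                LOCK.writeLock().lock();
                try { Thread.sleep(2000); } catch (InterruptedException ignored) { }
                finally { LOCK.writeLock().unlock(); }
            }, "config-commit");
            writer.start();
            Thread.sleep(200); // let the writer take the lock first

            Thread reader = new Thread(() -> {
                LOCK.readLock().lock();   // parks until the writer is done
                LOCK.readLock().unlock();
            }, "request-thread");
            reader.start();
            Thread.sleep(200); // give the reader time to park

            Thread.State state = reader.getState(); // expected: WAITING (parked)
            writer.join();
            reader.join();
            return state;
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("reader state: " + observeBlockedReaderState());
    }
}
```

Note that a thread parked on a ReentrantReadWriteLock shows up as WAITING (parked on an AbstractQueuedSynchronizer node) rather than BLOCKED, which is worth remembering when scanning a thread dump for this pattern.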

We just got a patch for this problem after further review by the Oracle OSB DEV team. You can ask your Oracle consultant to refer to SR # 3-6342083251. The problem was confirmed and related to a known issue with the DerivedCache OSB data structure.

Can you please share your thread dump (phcharbonneau@hotmail.com) so I can confirm whether you are dealing with the same problem?