This article is also a great opportunity for you to improve your thread dump analysis skills and I highly recommend that you study and properly understand the following analysis approach. It will also demonstrate the importance of proper data gathering as opposed to premature middleware (Weblogic) restarts.

Environment specifications

Java EE server: Oracle Service Bus 11g

OS: AIX 6.1

JDK: IBM JRE 1.6.0 @64-bit

RDBMS: Oracle 10g

Platform type: Enterprise Service Bus

Troubleshooting tools

Quest Software Foglight for Java (monitoring and alerting)

Java VM Thread Dump (IBM JRE javacore format)

Problem overview

Major performance degradation was observed from our Oracle Service Bus Weblogic environment. Alerts were also sent from the Foglight agents indicating a significant surge in Weblogic threads utilization.

Gathering and validation of facts

As usual, a Java EE problem investigation requires gathering technical and non-technical facts so we can either derive other facts and/or conclude on the root cause. Before applying a corrective measure, the facts below were verified in order to conclude on the root cause:

What is the client impact? HIGH

Recent change of the affected platform? Yes, logging level changed in OSB console for a few business services prior to outage report

Any recent traffic increase to the affected platform? No

How long has this problem been observed? New problem observed following logging level changes

Did a restart of the Weblogic server resolve the problem? Yes

Conclusion #1: The logging level changes applied earlier on some OSB business services appear to have triggered this stuck thread problem. However, the root cause remains unknown at this point.

Weblogic threads monitoring: Foglight for Java

Foglight for Java (from Quest Software) is a great monitoring tool allowing you to completely monitor any Java EE environment along with full alerting capabilities. This tool is used in our production environment to monitor the middleware (Weblogic) data, including threads, for each of the Oracle Service Bus managed servers. You can see below a consistent increase of the threads along with a pending request queue.

For your reference, Weblogic slow running threads are identified as “Hogging Threads” and can eventually be promoted to “STUCK” status if running for several minutes (as per your configured threshold).

Now what should be your next course of action? Weblogic restart? Definitely not…

Your first “reflex” for this type of problems is to capture a JVM Thread Dump. Such data is critical for you to perform proper root cause analysis and understand the potential hanging condition. Once such data is captured, you can then proceed with Weblogic server recovery actions such as a full managed server (JVM) restart.

Stuck Threads: Thread Dump to the rescue!

The next course of action in this outage scenario was to quickly generate a few thread dump snapshots from the IBM JVM before attempting to recover the affected Weblogic instances. The thread dumps were generated using kill -3 <Java PID>, which produced a few javacore files at the root of the Weblogic domain.
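When sending kill -3 to the process is not an option, a comparable snapshot can be captured from inside the JVM through the standard java.lang.management API. The sketch below is only an illustration and not the procedure used during this outage; the class name ThreadDumpSketch is hypothetical and the output is a simplified text format, not the IBM javacore format.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical helper, not part of Weblogic or OSB.
class ThreadDumpSketch {

    /** Builds a plain-text snapshot of all live threads, including lock owner details. */
    static String capture() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        // Only request monitor / synchronizer details when the JVM supports them.
        boolean monitors = threads.isObjectMonitorUsageSupported();
        boolean synchronizers = threads.isSynchronizerUsageSupported();

        StringBuilder out = new StringBuilder();
        for (ThreadInfo info : threads.dumpAllThreads(monitors, synchronizers)) {
            out.append('"').append(info.getThreadName()).append("\" state=")
               .append(info.getThreadState());
            if (info.getLockName() != null) {
                out.append(" waiting on ").append(info.getLockName());
            }
            if (info.getLockOwnerName() != null) {
                out.append(" owned by \"").append(info.getLockOwnerName()).append('"');
            }
            out.append(System.lineSeparator());
            for (StackTraceElement frame : info.getStackTrace()) {
                out.append("    at ").append(frame).append(System.lineSeparator());
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(capture());
    }
}
```

As with kill -3, it is worth taking a few snapshots some seconds apart so that threads that never make progress stand out.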

javacore.20120610.122028.15149052.0001.txt

Once the production environment was back up and running, the team quickly proceeded with the analysis of the captured thread dump files as per below steps.

Thread Dump analysis step #1 – identify a thread execution pattern

The first analysis step is to quickly go through all the Weblogic threads and attempt to identify a common problematic pattern such as threads waiting on remote external systems, threads in deadlock state, threads waiting for other threads to complete their tasks etc.

The analysis did quickly reveal many threads involved in the same blocking situation as per below. In this sample, we can see an Oracle Service Bus thread in blocked state within the TransactionManager Java class (OSB kernel code).

at sun/reflect/DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:37(Compiled Code))

at java/lang/reflect/Method.invoke(Method.java:589(Compiled Code))

at com/bea/wli/sb/transports/Util$1.invoke(Util.java:83(Compiled Code))

at $Proxy111.sendMessageAsync(Bytecode PC:26(Compiled Code))

……………………………

Thread Dump analysis step #2 – review the blocked threads chain

The next step was to review the affected and blocked threads chain involved in our identified pattern. As we saw in the thread dump analysis part 4, the IBM JVM thread dump format contains a separate section that provides a full breakdown of all thread blocked chains e.g. the Java object monitor pool locks.

A quick analysis did reveal the following thread culprit as per below. As you can see, Weblogic thread #16 is the actual culprit with 300+ threads waiting to acquire a lock on a shared object monitor TransactionManager@0x0700000001A51610/0x0700000001A51628.

Once you identify a primary culprit thread, the next step is to perform a deeper review of the computing task this thread is currently performing. Simply go back to the raw thread dump data and start analyzing the culprit thread stack trace from bottom-up.

As you can see below, the thread stack trace for our problem case was quite revealing. It did reveal that thread #16 is currently attempting to commit a change made at the Weblogic / Oracle Service Bus level. The problem is that the commit operation is hanging and taking too much time, causing thread #16 to retain the shared object monitor lock from TransactionManager for too long and “starving” the other Oracle Service Bus Weblogic threads.

Weblogic runtime threads executing client requests quickly started to queue up and wait for a lock on the shared object monitor (TransactionManager)

The Weblogic instances ran out of threads, generating alerts and forcing the production support team to shut down and restart the affected JVM processes
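The blocking pattern described above (one thread retaining a shared object monitor while request threads pile up behind it) can be reproduced in isolation. The following is a minimal sketch, assuming Java 8+ syntax; SHARED_MONITOR, the thread names and the class name are hypothetical stand-ins and have nothing to do with the real OSB TransactionManager code.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;

class MonitorContentionSketch {

    // Hypothetical stand-in for the shared object monitor.
    private static final Object SHARED_MONITOR = new Object();

    /** Returns the names of threads currently BLOCKED on a monitor owned by ownerName. */
    static List<String> blockedBehind(String ownerName) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        List<String> blocked = new ArrayList<>();
        for (ThreadInfo info : mx.getThreadInfo(mx.getAllThreadIds())) {
            if (info != null && info.getThreadState() == Thread.State.BLOCKED
                    && ownerName.equals(info.getLockOwnerName())) {
                blocked.add(info.getThreadName());
            }
        }
        return blocked;
    }

    /** Starts one culprit thread holding the monitor and N request threads waiting for it. */
    static List<String> simulate(int requestThreads) {
        CountDownLatch lockHeld = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);
        Thread culprit = new Thread(() -> {
            synchronized (SHARED_MONITOR) {
                lockHeld.countDown();
                awaitQuietly(release); // simulates the slow, hanging operation
            }
        }, "culprit-thread");

        List<Thread> waiters = new ArrayList<>();
        List<String> blocked = new ArrayList<>();
        try {
            culprit.start();
            lockHeld.await();
            for (int i = 0; i < requestThreads; i++) {
                Thread t = new Thread(() -> {
                    synchronized (SHARED_MONITOR) { /* would process the request here */ }
                }, "request-thread-" + i);
                t.start();
                waiters.add(t);
            }
            // Poll (bounded) until every request thread is blocked behind the culprit.
            for (int i = 0; i < 500 && blocked.size() < requestThreads; i++) {
                Thread.sleep(10);
                blocked = blockedBehind("culprit-thread");
            }
            release.countDown();
            culprit.join();
            for (Thread t : waiters) {
                t.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            release.countDown();
        }
        return blocked;
    }

    private static void awaitQuietly(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        System.out.println("Blocked behind culprit-thread: " + simulate(5));
    }
}
```

A thread dump taken while the culprit holds the monitor would show the same shape as the javacore data: a single owner and a chain of waiters.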

Our team is planning to open an Oracle SR shortly to share this OSB deployment behaviour along with the hard dependency between the client requests (threads) and the OSB logging layer. In the meantime, no OSB logging level change will be attempted outside the maintenance window period until further notice.

Conclusion

I hope this article has helped you understand and appreciate how powerful thread dump analysis can be to pinpoint root cause of stuck thread problems and the importance for any Java EE production support team to capture such crucial data in order to prevent future re-occurrences.

7.17.2012

Determination of proper Java Heap size for a production system is not a straightforward exercise. In my Java EE enterprise experience, I have seen multiple performance problem cases due to inadequate Java Heap capacity and tuning.

This article will provide you with 5 tips that can help you determine optimal Java heap size, as a starting point, for your current or new production environment. Some of these tips are also very useful regarding the prevention and resolution of java.lang.OutOfMemoryError problems; including memory leaks.

Please note that these tips are intended to “help you” determine proper Java Heap size. Since each IT environment is unique, you are actually in the best position to determine precisely the required Java Heap specifications of your client’s environment. Some of these tips may also not be applicable in the context of a very small Java standalone application but I still recommend that you read the entire article.

Future articles will include tips on how to choose the proper Java VM garbage collector type for your environment and applications.

#1 – JVM: you always fear what you don't understand

How can you expect to configure, tune and troubleshoot something that you don’t understand? You may never have the chance to write and improve Java VM specifications but you are still free to learn its foundation in order to improve your knowledge and troubleshooting skills. Some may disagree, but from my perspective, the thinking that Java programmers are not required to know the internal JVM memory management is an illusion.

-You then proceed and implement the same tuning to your environment. 2 days later you realize problem is still happening (even worse or little better)…the struggle continues…

What went wrong?

-You failed to first acquire proper understanding of the root cause of your problem

-You may also have failed to properly understand your production environment at a deeper level (specifications, load situation etc.). Web searches are a great way to learn and share knowledge but you have to perform your own due diligence and root cause analysis

-You may also be lacking some basic knowledge of the JVM and its internal memory management, preventing you from connecting all the dots together

My #1 tip and recommendation to you is to learn and understand the basic JVM principles along with its different memory spaces. Such knowledge is critical as it will allow you to make valid recommendations to your clients and properly understand the possible impact and risk associated with future tuning considerations. Now find below a quick high level reference guide for the Java VM:

As you can see, the Java VM memory management is more complex than just setting up the biggest value possible via –Xmx. You have to look at all angles, including your native and PermGen space requirement along with physical memory availability (and # of CPU cores) from your physical host(s).

It can get especially tricky for a 32-bit JVM since the Java Heap and native Heap are in a race: the bigger your Java Heap, the smaller the native Heap. Attempting to set up a large Heap for a 32-bit VM, e.g. 2.5 GB+, increases the risk of native OutOfMemoryError depending on your application(s) footprint, number of Threads etc. A 64-bit JVM resolves this problem but you are still limited by physical resource availability and garbage collection overhead (the cost of major GC collections goes up with size). The bottom line is that bigger is not always better so please do not assume that you can run all your 20 Java EE applications on a single 16 GB 64-bit JVM process.
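To get a feel for the different memory spaces on your own JVM, you can list them at runtime through the standard java.lang.management API. This is only a rough sketch: the class name MemorySpacesSketch is hypothetical, and the pool names printed depend entirely on the JVM vendor, version and garbage collector in use.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;

class MemorySpacesSketch {

    /** Lists every memory pool exposed by the running JVM with its type and usage. */
    static String describe() {
        StringBuilder out = new StringBuilder();
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage usage = pool.getUsage();
            if (usage == null) {
                continue; // the pool is no longer valid
            }
            String kind = pool.getType() == MemoryType.HEAP ? "heap" : "non-heap";
            long max = usage.getMax(); // -1 means the maximum is undefined
            String maxText = max < 0 ? "undefined" : (max >> 20) + " MB";
            out.append(String.format("%-9s %-35s used=%d MB max=%s%n",
                    kind, pool.getName(), usage.getUsed() >> 20, maxText));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(describe());
    }
}
```

Note that the native Heap used by thread stacks and JNI allocations does not appear in this list, which is one reason it is so easy to overlook.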

Your application(s) along with its associated data will dictate the Java Heap footprint requirement. By static memory, I mean “predictable” memory requirements as per below.

-Determine how many different applications you are planning to deploy to a single JVM process e.g. number of EAR files, WAR files, jar files etc. The more applications you deploy to a single JVM, the higher the demand on the native Heap

-Determine how many Java classes will be potentially loaded at runtime, including third party APIs. The more class loaders and classes that you load at runtime, the higher the demand on the HotSpot VM PermGen space and internal JIT related optimization objects

-Determine data cache footprint e.g. internal cache data structures loaded by your application (and third party APIs) such as cached data from a database, data read from a file etc. The more data caching that you use, the higher the demand on the Java Heap OldGen space

-Determine the number of Threads that your middleware is allowed to create. This is very important since Java threads require enough native memory or OutOfMemoryError will be thrown

For example, you will need much more native memory and PermGen space if you are planning to deploy 10 separate EAR applications on a single JVM process vs. only 2 or 3. Data caching not serialized to a disk or database will require extra memory from the OldGen space.

Try to come up with reasonable estimates of the static memory footprint requirement. This will be very useful to set up some starting point JVM capacity figures before your true measurement exercise (e.g. tip #4). For a 32-bit JVM, I usually do not recommend a Java Heap size higher than 2 GB (-Xms2048m, -Xmx2048m) since you need enough memory for PermGen and native Heap for your Java EE applications and threads.

This assessment is especially important since too many applications deployed in a single 32-bit JVM process can easily lead to native Heap depletion; especially in a multi threads environment.
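The trade-off between Java Heap and native Heap on a 32-bit JVM can be illustrated with some back-of-the-envelope arithmetic. Every figure in the sketch below is an assumption chosen for illustration (a 4096 MB theoretical address space, 512 MB of PermGen, 400 threads with 512 KB stacks); the usable address space of a real 32-bit process depends on the OS and is often lower.

```java
class StaticFootprintSketch {

    /**
     * Rough native memory (in MB) left in the process address space once the
     * Java Heap, PermGen and thread stacks are subtracted. All inputs are estimates.
     */
    static long nativeHeadroomMb(long addressSpaceMb, long heapMb, long permGenMb,
                                 int threads, long stackKbPerThread) {
        long threadStacksMb = (threads * stackKbPerThread) / 1024;
        return addressSpaceMb - heapMb - permGenMb - threadStacksMb;
    }

    public static void main(String[] args) {
        // Assumed figures, for illustration only.
        System.out.println("2 GB heap   -> " + nativeHeadroomMb(4096, 2048, 512, 400, 512) + " MB left");
        System.out.println("2.5 GB heap -> " + nativeHeadroomMb(4096, 2560, 512, 400, 512) + " MB left");
    }
}
```

With these assumed inputs, raising the heap from 2 GB to 2.5 GB shrinks the remaining native memory from 1336 MB to 824 MB, before counting JIT, JNI and class metadata needs.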

For a 64-bit JVM, a Java Heap size of 3 GB or 4 GB per JVM process is usually my recommended starting point.

Your business traffic will typically dictate your dynamic memory footprint. Concurrent users & requests generate the JVM GC “heartbeat” that you can observe from various monitoring tools due to very frequent creation and garbage collections of short & long lived objects. As you saw from the above JVM diagram, a typical ratio of YoungGen vs. OldGen is 1:3 or 33%.

Minimizing the frequency of major GC collections is a key aspect for optimal performance so it is very important that you understand and estimate how much memory you need during your peak volume.

Again, your type of application and data will dictate how much memory you need. Shopping cart type of applications (long lived objects) involving large and non-serialized session data typically need large Java Heap and lot of OldGen space. Stateless and XML processing heavy applications (lot of short lived objects) require proper YoungGen space in order to minimize frequency of major collections.

As you can see, with such requirement, there is no way you can have all this traffic sent to a single JVM 32-bit process. A typical solution involves splitting (tip #5) traffic across a few JVM processes and / or physical host (assuming you have enough hardware and CPU cores available).

However, for this example, given the high demand on static memory and to ensure a scalable environment in the long run, I would also recommend 64-bit VM but with a smaller Java Heap as a starting point such as 3 GB to minimize the GC cost. You definitely want to have extra buffer for the OldGen space so I typically recommend up to 50% memory footprint post major collection in order to keep the frequency of Full GC low and enough buffer for fail-over scenarios.
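One way to check the "up to 50% footprint after a major collection" guideline is to read the post-collection usage that the JVM exposes for each heap pool. The sketch below is an assumption-laden illustration: the class name is hypothetical, the 50% target is simply the figure quoted above, and since OldGen pool names differ between JVMs it prints every heap pool instead of guessing a name.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;

class OldGenFootprintSketch {

    /** Occupancy of a pool as a percentage of its maximum size. */
    static double occupancyPercent(long usedBytes, long maxBytes) {
        if (maxBytes <= 0) {
            throw new IllegalArgumentException("maximum size must be defined and positive");
        }
        return 100.0 * usedBytes / maxBytes;
    }

    /** True when the footprint measured after a collection stays within the target. */
    static boolean withinTarget(long usedAfterGcBytes, long maxBytes, double targetPercent) {
        return occupancyPercent(usedAfterGcBytes, maxBytes) <= targetPercent;
    }

    public static void main(String[] args) {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            // getCollectionUsage() reports usage measured after the most recent collection
            // of this pool; it is null when the JVM does not support it for the pool.
            MemoryUsage afterGc = pool.getCollectionUsage();
            if (pool.getType() != MemoryType.HEAP || afterGc == null || afterGc.getMax() <= 0) {
                continue;
            }
            System.out.printf("%s: %.1f%% used after last GC (within 50%% target: %b)%n",
                    pool.getName(),
                    occupancyPercent(afterGc.getUsed(), afterGc.getMax()),
                    withinTarget(afterGc.getUsed(), afterGc.getMax(), 50.0));
        }
    }
}
```

The same numbers can of course be read from verbose GC logs or a monitoring tool; the point is to look at the footprint after a major collection, not at the instantaneous usage.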

Most of the time, your business traffic will drive most of your memory footprint, unless you need significant amount of data caching to achieve proper performance which is typical for portal (media) heavy applications. Too much data caching should raise a yellow flag that you may need to revisit some design elements sooner than later.

-Have a very good view or forecast on the business traffic (# of concurrent users etc.) and for each application

-Some ideas if you need a 64-bit VM or not and which JVM settings to start with

-Some ideas if you need more than one JVM (middleware) processes

But wait, your work is not done yet. While this above information is crucial and great for you to come up with “best guess” Java Heap settings, it is always best and recommended to simulate your application(s) behaviour and validate the Java Heap memory requirement via proper profiling, load & performance testing.

You can learn and take advantage of tools such as JProfiler (future articles will include tutorials on JProfiler). From my perspective, learning how to use a profiler is the best way to properly understand your application memory footprint. Another approach I use for existing production environments is heap dump analysis using the Eclipse MAT tool. Heap Dump analysis is very powerful and allows you to view and understand the entire memory footprint of the Java Heap, including class loader related data. It is a must-do exercise in any memory footprint analysis, especially for memory leaks.

Java profilers and heap dump analysis tools allow you to understand and validate your application memory footprint, including detection and resolution of memory leaks. Load and performance testing is also a must since this will allow you to validate your earlier estimates by simulating your forecast concurrent users. It will also expose your application bottlenecks and allow you to further fine tune your JVM settings. You can use tools such as Apache JMeter which is very easy to learn and use or explore other commercial products.

Finally, I have quite often seen Java EE environments running perfectly fine until the day where one piece of the infrastructure starts to fail e.g. hardware failure. Suddenly the environment is running at reduced capacity (reduced # of JVM processes) and the whole environment goes down. What happened?

There are many scenarios that can lead to domino effects but lack of JVM tuning and capacity to handle fail-over (short term extra load) is very common. If your JVM processes are running at 80%+ OldGen space capacity with frequent garbage collections, how can you expect to handle any fail-over scenario?

Your load and performance testing exercise performed earlier should simulate such scenario and you should adjust your tuning settings properly so your Java Heap has enough buffer to handle extra load (extra objects) at short term. This is mainly applicable for the dynamic memory footprint since fail-over means redirecting a certain % of your concurrent users to the available JVM processes (middleware instances).
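The fail-over reasoning above can be turned into a quick estimate. The sketch below makes a deliberately simplistic assumption: the OldGen footprint is treated as scaling linearly with the share of traffic each surviving JVM receives, which overstates the effect when a large part of the footprint is static. The class and method names are hypothetical.

```java
class FailoverHeadroomSketch {

    /**
     * Projected occupancy (%) of each surviving JVM once the load of the failed
     * JVMs is redistributed evenly. Assumes the footprint scales linearly with load.
     */
    static double occupancyAfterFailover(double currentPercent, int totalJvms, int failedJvms) {
        int survivors = totalJvms - failedJvms;
        if (survivors <= 0) {
            throw new IllegalArgumentException("at least one JVM must survive");
        }
        return currentPercent * totalJvms / survivors;
    }

    public static void main(String[] args) {
        // 4 JVM processes, 1 lost to a hardware failure.
        System.out.printf("At 80%%: %.1f%% after fail-over%n", occupancyAfterFailover(80.0, 4, 1));
        System.out.printf("At 50%%: %.1f%% after fail-over%n", occupancyAfterFailover(50.0, 4, 1));
    }
}
```

Under this assumption, JVMs running at 80% would need more than 100% of their OldGen capacity after losing one process out of four, while JVMs running at 50% would land at roughly 67%.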

#5 – Divide and conquer

At this point you have performed dozens of load testing iterations. You know that your JVM is not leaking memory. Your application memory footprint cannot be reduced any further. You tried several tuning strategies such as using a large 64-bit Java Heap space of 10 GB+ and multiple GC policies, but you still do not find your performance level acceptable?

In my experience I found that, with current JVM specifications, proper vertical and horizontal scaling, which involves creating a few JVM processes per physical host and across several hosts, will give you the throughput and capacity that you are looking for. Your IT environment will also be more fault tolerant if you break your application list into a few logical silos, each with its own JVM process, Threads and tuning values.

This “divide and conquer” strategy involves splitting your application(s) traffic to multiple JVM processes and will provide you with:

The bottom line is that when you find yourself spending too much time in tuning that single elephant 64-bit JVM process, it is time to revisit your middleware and JVM deployment strategy and take advantage of vertical & horizontal scaling. This implementation strategy is more taxing for the hardware but will really pay off in the long run.

7.10.2012

This is part 3 of our java.lang.NoClassDefFoundError troubleshooting series. As I mentioned in my first article, there are many possible problems that can lead to java.lang.NoClassDefFoundError such as a wrong Java runtime classpath. This article will describe one of the most common causes of this problem: failure of Java class static initializer blocks or static variables.

A sample Java program will be provided and I encourage you to compile and run this example from your workstation in order to properly replicate and understand this type of problem.

Java static initializer revisited

The Java programming language provides you with the capability to “statically” initialize variables or a block of code. This is achieved via the “static” variable identifier or the usage of a static {} block at the header of a Java class. Static initializers are guaranteed to be executed only once in the JVM life cycle and are Thread safe by design which make their usage quite appealing for static data initialization such as internal object caches, loggers etc.

What is the problem? I will repeat again, static initializers are guaranteed to be executed only once in the JVM life cycle…This means that such code is executed at the class loading time and never executed again until you restart your JVM. Now what happens if the code executed at that time (@Class loading time) terminates with an unhandled Exception?

Welcome to the java.lang.NoClassDefFoundError problem case #2!

NoClassDefFoundError problem case 2 – static initializer failure

This type of problem is occurring following the failure of static initializer code combined with successive attempts to create a new instance of the affected (non-loaded) class.

-ClassA provides you with an ON/OFF switch allowing you to replicate the type of problem that you want to study

This program is simply attempting to create a new instance of ClassA 3 times (one after the other). It will demonstrate that an initial failure of either a static variable or static block initializer combined with successive attempts to create a new instance of the affected class triggers java.lang.NoClassDefFoundError.

In order to replicate the problem, we will simply “voluntarily” trigger a failure of the static initializer code. Please simply enable the problem type that you want to study e.g. either static variable or static block initializer failure:

java.lang.NoClassDefFoundError: Could not initialize class org.ph.javaee.tools.jdk7.training2.ClassA

at org.ph.javaee.tools.jdk7.training2.NoClassDefFoundErrorSimulator.main(NoClassDefFoundErrorSimulator.java:30)

THIRD attempt to create a new instance of ClassA...

java.lang.NoClassDefFoundError: Could not initialize class org.ph.javaee.tools.jdk7.training2.ClassA

at org.ph.javaee.tools.jdk7.training2.NoClassDefFoundErrorSimulator.main(NoClassDefFoundErrorSimulator.java:39)

done!

What happened? As you can see, the first attempt to create a new instance of ClassA did trigger a java.lang.ExceptionInInitializerError. This exception indicates the failure of our static initializer for our static variable & block which is exactly what we wanted to achieve.

The key point to understand at this point is that this failure did prevent ClassA from being initialized. As you can see, attempt #2 and attempt #3 both generated a java.lang.NoClassDefFoundError, why? Well, since the first attempt failed, the JVM marked ClassA as being in an erroneous state within the current ClassLoader and will not try to initialize it again. Successive attempts to create a new instance of ClassA within the current ClassLoader did generate java.lang.NoClassDefFoundError (Could not initialize class) over and over since ClassA is no longer usable from that ClassLoader.

As you can see, in this problem context, the NoClassDefFoundError is just a symptom or consequence of another problem. The original problem is the ExceptionInInitializerError triggered following the failure of the static initializer code. This clearly demonstrates the importance of proper error handling and logging when using Java static initializers.
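The sequence described above (one ExceptionInInitializerError followed by repeated NoClassDefFoundError) can be reproduced with a few lines of code. The following is a minimal sketch and not the original sample program from this series; FaultyClass and its loadConfig() method are hypothetical names, and the failure is simulated on purpose.

```java
class StaticInitFailureSketch {

    // Hypothetical class whose static variable initializer always fails.
    static class FaultyClass {
        static final String CONFIG = loadConfig();

        private static String loadConfig() {
            throw new IllegalStateException("simulated static initializer failure");
        }
    }

    /** Tries to create a FaultyClass instance and reports what happened. */
    static String attempt() {
        try {
            new FaultyClass();
            return "created";
        } catch (LinkageError e) {
            // Both ExceptionInInitializerError and NoClassDefFoundError are LinkageErrors.
            return e.getClass().getName();
        }
    }

    public static void main(String[] args) {
        for (int i = 1; i <= 3; i++) {
            System.out.println("Attempt #" + i + ": " + attempt());
        }
    }
}
```

Run on its own, the first attempt reports java.lang.ExceptionInInitializerError and the following ones report java.lang.NoClassDefFoundError, because the class stays in its failed state for the lifetime of its class loader.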

-Review the java.lang.NoClassDefFoundError error and identify the missing Java class

-Perform a code walkthrough of the affected class and determine if it contains static initializer code (variables & static block)

-Review your server and application logs and determine if any error (e.g. ExceptionInInitializerError) originates from the static initializer code

-Once confirmed, analyze the code further and determine the root cause of the initializer code failure. You may need to add some extra logging along with proper error handling to prevent and better handle future failures of your static initializer code going forward
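As an illustration of the last recommendation, a static initializer can catch and log its own failure so that the original error is visible in the logs and the class remains usable in a degraded mode. This is just one possible approach under stated assumptions: CacheHolder, loadCache() and the empty-cache fallback are hypothetical, and whether degrading is preferable to failing fast depends on your application.

```java
import java.util.Collections;
import java.util.Map;
import java.util.logging.Level;
import java.util.logging.Logger;

class SafeStaticInitSketch {

    private static final Logger LOG = Logger.getLogger(SafeStaticInitSketch.class.getName());

    // Hypothetical holder of an internal cache loaded at class initialization time.
    static class CacheHolder {
        static final Map<String, String> CACHE;
        static final Throwable INIT_FAILURE; // null when initialization succeeded

        static {
            Map<String, String> cache = Collections.emptyMap();
            Throwable failure = null;
            try {
                cache = loadCache();
            } catch (RuntimeException e) {
                // Log the root cause once, at the place where it actually happened.
                LOG.log(Level.SEVERE, "Static cache initialization failed, using empty cache", e);
                failure = e;
            }
            CACHE = cache;
            INIT_FAILURE = failure;
        }

        // Simulated failure; a real implementation would read a database, file etc.
        private static Map<String, String> loadCache() {
            throw new IllegalStateException("simulated cache loading failure");
        }
    }

    public static void main(String[] args) {
        System.out.println("Cache entries: " + CacheHolder.CACHE.size());
        System.out.println("Initialization failure: " + CacheHolder.INIT_FAILURE);
    }
}
```

Because the exception never escapes the static block, no ExceptionInInitializerError is raised and later uses of the class do not end in NoClassDefFoundError; callers can inspect INIT_FAILURE to decide how to react.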

Please feel free to post any question or comment. Part 4 is now available.