2.29.2012

This article will describe the complete root cause analysis and solution of a HashMap High CPU problem (infinite looping) affecting a Weblogic 10.0 environment running on the Java HotSpot VM 1.5.

This case study will again demonstrate this importance of mastering Thread Dump analysis skill and CPU correlation techniques such as Solaris prstat.

Environment specifications

-Java EE server: Oracle Weblogic Portal 10.0

-Middleware OS: Solaris 10

-Java VM: Java HotSpot VM 1.5

-Platform type: Portal application

Monitoring and troubleshooting tools

-JVM Thread Dump (HotSpot format)

-Solaris prstat (CPU contributors analysis)

Problem overview

-Problem type: High CPUobserved from our Weblogic production environment

A high CPU problem was observed from our Solaris physical servers hosting a Weblogic Portal 10 environment. Users also reporting major slowdown of the portal application.

Gathering and validation of facts

As usual, a Java EE problem investigation requires gathering of technical and non-technical facts so we can either derived other facts and/or conclude on the root cause. Before applying a corrective measure, the facts below were verified in order to conclude on the root cause:

·What is the client impact? HIGH

·Recent change of the affected platform? No

·Any recent traffic increase to the affected platform? Yes

·How does this high CPU manifest itself? A sudden CPU increase was observed and is not going down; even after load goes down e.g. near zero level.

·Did an Oracle OSB recycle resolve the problem? Yes, but problem is returning after few hours or few days (unpredictable pattern)

-Conclusion #1: The high CPU problem appears to be intermittent vs. pure correlation with load

-Conclusion #2: Since high CPU remains after load goes down, this typically indicates either the presence of some infinite looping or heavy processing Threads

Solaris CPU analysis using prstat

Solaris prstat is a powerful OS command allowing you to obtain the CPU per process but more importantly CPU per Thread within a process. As you can see below from our case study, the CPU utilization was confirmed to go up as high as 100% utilization (saturation level).

## PRSTAT (CPU per Java Thread analysis)

prstat -L -p 8223 1 1

PID USERNAME SIZE RSS STATE PRI NICE TIME CPUPROCESS/LWPID

8223 bea10 2809M 2592M sleep 59 0 14:52:59 38.6%java/494

8223 bea10 2809M 2592M sleep 57 0 12:28:05 22.3% java/325

8223 bea10 2809M 2592M sleep 59 0 11:52:02 28.3% java/412

8223 bea10 2809M 2592M sleep 59 0 5:50:00 0.3% java/84

8223 bea10 2809M 2592M sleep 58 0 2:27:20 0.2% java/43

8223 bea10 2809M 2592M sleep 59 0 1:39:42 0.2% java/41287

8223 bea10 2809M 2592M sleep 59 0 4:41:44 0.2% java/30503

8223 bea10 2809M 2592M sleep 59 0 5:58:32 0.2% java/36116

……………………………………………………………………………………

As you can see from above data, 3 Java Threads were found using together close to 100% of the CPU utilization.

at org.apache.axis.encoding.TypeMappingImpl.getClassForQName(TypeMappingImpl.java:715)

at org.apache.axis.encoding.TypeMappingDelegate.getClassForQName(TypeMappingDelegate.java:170)

at org.apache.axis.encoding.TypeMappingDelegate.getClassForQName(TypeMappingDelegate.java:160)

at org.apache.axis.encoding.TypeMappingImpl.getDeserializer(TypeMappingImpl.java:454)

at org.apache.axis.encoding.TypeMappingDelegate.getDeserializer(TypeMappingDelegate.java:108)

at org.apache.axis.encoding.TypeMappingDelegate.getDeserializer(TypeMappingDelegate.java:102)

at org.apache.axis.encoding.DeserializationContext.getDeserializer(DeserializationContext.java:457)

at org.apache.axis.encoding.DeserializationContext.getDeserializerForType(DeserializationContext.java:547)

at org.apache.axis.encoding.ser.BeanDeserializer.getDeserializer(BeanDeserializer.java:514)

at org.apache.axis.encoding.ser.BeanDeserializer.onStartChild(BeanDeserializer.java:286)

at org.apache.axis.encoding.DeserializationContext.startElement(DeserializationContext.java:1035)

at org.apache.axis.message.SAX2EventRecorder.replay(SAX2EventRecorder.java:165)

at org.apache.axis.message.MessageElement.publishToHandler(MessageElement.java:1141)

at org.apache.axis.message.RPCElement.deserialize(RPCElement.java:236)

at org.apache.axis.message.RPCElement.getParams(RPCElement.java:384)

at org.apache.axis.client.Call.invoke(Call.java:2467)

at org.apache.axis.client.Call.invoke(Call.java:2366)

at org.apache.axis.client.Call.invoke(Call.java:1812)

…………………………………………………………………………………………………………………………………………

Thread Dump analysis – HashMap.get() infinite loop condition!

As you can see from the above Thread Stack Trace, the Thread is currently stuck in an infinite loop over a java.util.HashMap that originates from the Apache Axis TypeMappingImpl Java class.

This finding was quite revealing. The 2 others Threads using high CPU also did reveal infinite looping condition within the same Apache Axis HashMap Object.

Root cause: non Thread safe HashMap in Apache Axis 1.4

Additional research did reveal this known defect affecting Apache Axis 1.4; which is the version that our application was using.

As you may already know, usage of non Thread safe / non synchronized HashMap under concurrent Threads condition is very dangerous and can easily lead to internal HashMap index corruption and / or infinite looping. This is also a golden rule for any middleware software such as Oracle Weblogic, IBM WAS, Red Hat JBoss which rely heavily on HashMap data structures from various Java EE and caching services.

Such best practice is also applicable for any Open Source third party API such as Apache Axis.

The most common solution is to use the ConcurrentHashMap data structure which is designed for that type of concurrent Thread execution context.

Solution

Our team did apply the proposed patch from Apache (synchronize the non Thread safe HashMap) which did resolve the problem. We are also currently looking at upgrading our application to a newer version of Apache Axis.

Conclusion

I hope this case study has helped you understand how to pinpoint the root cause of high CPU Threads and the importance of proper Thread safe data structure for high concurrent Thread / processing applications.

The JRockit Native memory space is storing the Java Classes metadata, Threads and objects such as library files, other JVM and third party native code objects.

Where is the PermGen space?

Similar to the IBM VM, there is no PermGen space for the JRockit VM. The PermGen space is only applicable to the HotSpot VM. The JRockit VM is using the Native Heap for Class metadata related data. Also, as you probably saw from my other article, Oracle Sun is also starting to remove the PermGen space for the HotSpot VM.

Why is the JRockit VM Java process using more memory than HotSpot VM?

The JRockit VM tend to uses more native memory in exchange for better performance. JRockit does not have an interpretation mode, compilation only, so due to its additional native memory needs the process size tends to use a couple of hundred MB larger than the equivalent Sun JVM size. This should not be a big problem unless you are using a 32-bit JRockit with a large Java Heap requirement; in this scenario, the risk of OutOfMemoryError due to Native Heap depletion is higher for a JRockit VM (e.g. for a 32-bit VM, bigger is the Java Heap, smaller is memory left for the Native Heap).

What is Oracle’s plan for JRockit?

Current Oracle JVM strategy is to merge both HotSpot and JRockit product lines to a single JVM project that will include the best features of each VM. This will also simplify JVM tuning since right now failure to understand the differences between these 2 VM’s can lead to bad tuning recommendations and performance problems.

2.22.2012

This post is to inform you that I provide free Weblogic consulting services via this Blog root cause analysis forum & email. I will do my very best to provide you with guidance and share with you my tips on how resolve common Weblogic problems along with best practices.

My expertise background is as per below:

·10 years of experience with development, support and hardening of Java EE systems and currently working as a full time employee at CGI Inc. Canada

Why is my IT consulting service from this Blog free? My goal is to create a solid knowledge base of Java EE support patterns on this Blog and share with Java EE production support individuals across the world. You can simply reward me by sharing this Blog with your other work colleagues and friends. With your approval, I might also use your problem case and create an article.

This is part 3 of our Thread Dump analysis series which will provide you with an overview of what is a JVM Thread Dump for the HotSpot VM and the different Threads that you will find. Detail for the IBM VM Thread Dump format will be provided in the part 4.

** Allows to quickly learn about Thread state and its potential current blocking condition **

-Java Thread Stack Trace; this is by far the most important data that you will find from the Thread Dump. This is also where you will spent most of your analysis time since the Java Stack Trace provides you with 90% of the information that you need in order to pinpoint root cause of many problem pattern types as you will learn later in the training sessions

-Java Heap breakdown; starting with HotSpot VM 1.6, you will also find at the bottom of the Thread Dump snapshot a breakdown of the HotSpot memory spaces utilization such as your Java Heap (YoungGen, OldGen) & PermGen space. This is quite useful when excessive GC is suspected as a possible root cause so you can do out-of-the-box correlation with Thread data / patterns found

In order for you to better understand, find below a diagram showing you a visual breakdown of a HotSpot VM Thread Dump and its common Thread Pools found:

As you can there are several pieces of information that you can find from a HotSpot VM Thread Dump. Some of these pieces will be more important than others depending of your problem pattern (problem patterns will be simulated and explained in future articles).

For now, find below a detailed explanation for each Thread Dump section as per our sample HotSpot Thread Dump:

# Full thread dump identifier

This is basically the unique keyword that you will find in your middleware / standalong Java standard output log once you generate a Thread Dump (ex: via kill -3 <PID> for UNIX). This is the beginning of the Thread Dump snapshot data.

This portion is the core of the Thread Dump and where you will typically spend most of your analysis time. The number of Threads found will depend on your middleware software that you use, third party libraries (that might have its own Threads) and your application (if creating any custom Thread, which is generally not a best practice).

In our sample Thread Dump, Weblogic is the middleware used. Starting with Weblogic 9.2, a self-tuning Thread Pool is used with unique identifier “'weblogic.kernel.Default (self-tuning)”

This is an internal Thread managed by the HotSpot VM in order to perform internal native operations. Typically you should not worry about this one unless you see high CPU (via Thread Dump & prstat / native Thread id correlation).

When using HotSpot parallel GC (quite common these days when using multi physical cores hardware), the HotSpot VM create by default or as per your JVM tuning a certain # of GC Threads. These GC Threads allow the VM to perform its periodic GC cleanups in a parallel manner, leading to an overall reduction of the GC time; at the expense of increased CPU utilization.

This is crucial data as well since when facing GC related problems such as excessive GC, memory leaks etc, you will be able to correlate any high CPU observed from the OS / Java process(es) with these Threads using their native id value (nid=0x3). You will learn how to identify and confirm this problem is future articles.

# JNI global references count

JNI (Java Native Interface) global references are basically Object references from the native code to a Java object managed by the Java garbage collector. Its role is to prevent collection of an object that is still in use by native code but technically with no "live" references in the Java code.

It is also important to keep an eye on JNI references in order to detect JNI related leaks. This can happen if you program use JNI directly or using third party tools like monitoring tools which are prone to native memory leaks.

JNI global references: 1925

# Java Heap utilization view

This data was added back to JDK 1 .6 and provides you with a short and fast view of your HotSpot Heap. I find it quite useful when troubleshooting GC related problems along with HIGH CPU since you get both Thread Dump & Java Heap in a single snapshot allowing you to determine (or to rule out) any pressure point in a particular Java Heap memory space along with current Thread computing currently being done at that time. As you can see in our sample Thread Dump, the Java Heap OldGen is maxed out!

The C-Heap is storing class metadata objects including library files, other JVM and third party native code objects.

Where is the PermGen space?

This is by far the most typical question I get from Java EE support individuals supporting an IBM VM environment for this first time. The answer: there is no PermGen space for the IBM VM. The PermGen space is only applicable to the HotSpot VM. The IBM VM is using the Native Heap for Class metadata related data. Also, as you probably saw from my other article, Oracle / Sun is also starting to remove the PermGen space for the HotSpot VM.

The next article will provide you a tutorial on how to enable and analyze verbose GC for an IBM VM. Please feel free to post any comment or question on the IBM VM.