10.30.2011

This case study describes the complete steps from root cause analysis to resolution of a sudden OutOfMemoryError problem experienced with Oracle Service Bus 2.6 running on Weblogic Server 9.2.

This case study will also demonstrate how to combine Thread Dump analysis and Solaris PRSTAT to pinpoint the culprit Thread(s) consuming a large amount of memory on your application's Java Heap.

Environment specifications

-Java EE server: Oracle Service Bus 2.6 & Weblogic Server 9.2

-OS: Solaris 10

-JDK: Sun HotSpot 1.5 SR7

-RDBMS: Oracle 10g

-Platform type: Enterprise Service Bus

Troubleshooting tools

-Quest Foglight for Java EE

-Java VM Thread Dump (for stuck Thread analysis)

-Solaris PRSTAT (for Thread CPU contributors breakdown)
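The way these two data sources combine is worth spelling out: prstat -L reports CPU usage per LWP (in decimal), while a HotSpot Thread Dump (generated via kill -3 <pid>) tags each Thread with a hexadecimal nid=0x… native id. Matching the two requires only a decimal-to-hex conversion, sketched below with an illustrative helper (not code from the case study; the example LWP id is hypothetical):

```java
// Illustrative helper (not from the case study): correlate a Solaris
// "prstat -L" LWP id with the hexadecimal "nid" field printed in a
// HotSpot Thread Dump.
public class LwpToNid {

    // prstat prints LWP ids in decimal; HotSpot prints nid in hex.
    static String toNid(int lwpId) {
        return "0x" + Integer.toHexString(lwpId);
    }

    public static void main(String[] args) {
        // Example: prstat -L -p <pid> reports LWP 215 as the top CPU
        // consumer; search the Thread Dump for the matching native id.
        System.out.println("grep for: nid=" + toNid(215)); // nid=0xd7
    }
}
```

Once the hot LWP is translated, a simple grep of the Thread Dump for that nid value identifies the exact Java Thread consuming the CPU (and, as in this case study, the memory).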

Problem overview

-Problem type: OutOfMemoryError

An OutOfMemoryError problem was observed in our Oracle Service Bus Weblogic logs along with major performance degradation.

A preventive restart was then implemented, since a Java Heap memory leak was initially suspected, but it did not prove to be effective.

Gathering and validation of facts

As usual, a Java EE problem investigation requires gathering technical and non-technical facts so we can either derive other facts and/or conclude on the root cause. Before applying a corrective measure, the facts below were verified in order to conclude on the root cause:

·What is the client impact? HIGH

·Recent change of the affected platform? No

·Any recent traffic increase to the affected platform? No

·How long has this problem been observed? For several months, with no identified root cause

·Is the OutOfMemoryError happening suddenly or over time? It was observed that the Java Heap increases suddenly and can then take several hours before failing with an OutOfMemoryError

·Did a restart of the Weblogic server resolve the problem? No; a preventive restart was attempted but did not prove to be effective

-Conclusion #1: The sudden increase of the Java Heap could indicate long-running Thread(s) consuming a large amount of memory in a short amount of time

-Conclusion #2: A Java Heap leak cannot be ruled out at this point

Java Heap monitoring using Foglight

The Quest Foglight monitoring tool was used to monitor the trend of the Oracle Service Bus Java Heap for each of the managed servers. Below, a sudden increase of the Java Heap OldGen space can be observed.

This type of memory problem normally does not indicate a linear or long-term memory leak, but rather rogue Thread(s) consuming a large amount of memory in a short amount of time. In this scenario, my recommendation is to first perform Thread Dump analysis before jumping too quickly to JVM Heap Dump analysis.

Thread Dump analysis

Thread Dump snapshots were generated during the sudden Java Heap increase, and analysis revealed a primary culprit, as you can see below.

Analysis of the different Thread Dump snapshots revealed a very long elapsed time for this Thread, along with a perfect correlation with our Java Heap increase. As you can see, this Thread is triggered by the ALSB alerting layer and is involved in read/write operations against the Weblogic internal diagnostic File Store.

This correlation and finding confirmed that the rogue Thread was actively processing (reading from and writing to the File Store) and allocating a large amount of memory on the Java Heap in a short amount of time.

Oracle Service Bus allows Alert actions to be configured within the message flow (pipeline alerts). These pipeline alert actions generate alerts, based on the message context in a pipeline, to send to an alert destination. Such an alert destination is the actual Weblogic diagnostic File Store, which means this structure will grow over time depending on the volume of Alerts that your OSB application is generating.

It is located under: //domain_name/servers/server_name/data/store/diagnostics/

In our case, the File Store size was around 800 MB.

Such an increase of the diagnostic File Store size over time leads to an increase in the elapsed time of the Thread involved in read/write operations, which allocates a large amount of memory on the Java Heap. Such memory cannot be garbage collected until Thread completion, which leads to OutOfMemoryError conditions and performance degradation.
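The retention mechanism described above can be illustrated with a minimal, hypothetical Java snippet (not OSB code; names and sizes are for illustration only): objects strongly referenced by a long-running task remain ineligible for garbage collection until the task completes.

```java
import java.util.ArrayList;
import java.util.List;

public class LongRunningTaskDemo {

    // Simulates a Thread processing a large File Store: each "record"
    // read stays strongly reachable from the local list for the whole
    // loop, so none of it can be garbage collected mid-task.
    static int process(int records, int recordSizeKb) {
        List<byte[]> inFlight = new ArrayList<byte[]>();
        for (int i = 0; i < records; i++) {
            inFlight.add(new byte[recordSizeKb * 1024]);
        }
        // Only when inFlight goes out of scope (task completion) does
        // the accumulated memory become eligible for collection.
        return inFlight.size();
    }

    public static void main(String[] args) {
        // e.g. 100 records of 1 MB each, all live at the same time
        System.out.println("Records held: " + process(100, 1024));
    }
}
```

The larger the File Store grows, the longer this "in-flight" window lasts and the more memory is pinned at once, which matches the sudden OldGen growth observed in Foglight.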

Solution

Two actions were required to resolve this problem:

-Reset the diagnostic File Store by renaming the existing data file and forcing Weblogic to create a fresh one

-Review and reduce the Oracle Service Bus alerting level to the minimum acceptable level

The reset of the diagnostic File Store brought some immediate relief by ensuring short and optimal diagnostic File Store operations, without causing too much impact on the Java Heap.

The level of OSB alerting is still in review and will be reduced shortly in order to prevent this problem at the source.

Regular monitoring and purging of the diagnostic File Store will also be implemented, as per Oracle recommendations, in order to prevent such performance degradation going forward.

Conclusion and recommendations

I hope this article has helped you understand how to identify and resolve this Oracle Service Bus problem, and appreciate how powerful Thread Dump analysis and the Solaris PRSTAT tool can be to pinpoint this type of OutOfMemoryError problem.

As a Java Thread Dump analysis expert, my first natural question when approaching this problem was: how can I generate a .NET Thread Dump in order to pinpoint our slowdown condition? The answer was found with Windows dump files and Microsoft's latest Debug Diagnostic Tool 1.2.

This article will demonstrate how you can take advantage of the new Microsoft Debug Diagnostic Tool 1.2 to investigate and troubleshoot hung Thread problems. For people familiar with Java EE, you will definitely notice similarities with Java VM Thread Dump and Heap Dump analysis.

Environment specifications (case study)

-Web and App server: ASP.NET framework 4 & IIS 7

-OS: Windows Server 2008 Service Pack 2 - 64-bit

-Platform type: Web application

Monitoring and Troubleshooting tools

-Compuware Server Vantage (IIS connection monitoring)

-Debug Diagnostic Tool 1.2 (DebugDiag 1.2 can be downloaded for free from Microsoft)

Problem overview

-Problem type: Major slowdown and hung Threads observed on the .NET Framework 4 servers, affecting the overall Web application response time

This problem was observed especially during periods of high load, along with an increase of the active IIS connection count.

Gathering and validation of facts

Similar to Java EE problems, ASP.NET Framework 4 and IIS performance problems require gathering technical and non-technical facts so we can either derive other facts and/or conclude on the root cause. Before applying a corrective measure, the facts below were verified in order to conclude on the root cause:

·What is the client impact? HIGH; major Web application slowdown

·Recent change of the affected platform? No

·Any recent traffic increase to the affected platform? No

·What is the health of the IIS server? Overall IIS health is OK, but an increase of active IIS connections is observed, ultimately causing performance degradation

·Did a restart of the IIS and application pool resolve the problem? No; this was used as a mitigation strategy only, and the problem re-occurred quickly following each IIS and application pool recycle

-Conclusion #1: The problem appears to be isolated at the .NET framework 4 application layer due to an unidentified .NET Thread contention

Debug Diagnostic Tool overview

The Debug Diagnostic Tool (DebugDiag) is designed to assist in troubleshooting issues such as hangs, slow performance, memory leaks or fragmentation, and crashes in any user-mode process. The tool includes additional debugging scripts focused on Internet Information Services (IIS) applications, web data access components, COM+ and related Microsoft technologies.

This is the tool that our team used to successfully pinpoint the source of the slowdown and Thread hang condition.

Step #1 – Create a dump file of your affected w3wp process

In order to troubleshoot .NET Thread hang problems, you first need to generate a dump file of your affected .NET application pool process(es). This can be done by following the steps below:

-Start Windows Task Manager by right-clicking an empty area of the taskbar, and then clicking Start Task Manager.

-Select the Processes tab.

-Identify and right-click the name of the process that you want, and then click Create Dump File. If you are prompted for an administrator password or confirmation, type your password or click Continue.

-A dump file for the process is created in the following folder: Drive:\Users\UserName\AppData\Local\Temp

When you receive a message that states that the dump file was successfully created, click OK.

As you can see below from our case study, the blocked Threads originate mainly from our remote SOSS service. SOSS is a distributed cache service that our application uses and invokes remotely via a WinSock (Socket).

Step #7 – Blocked Threads deep dive analysis

At this point, it is now time to review a few samples of your blocked Threads in order to pinpoint the source of contention. Simply select a Thread id from the summary or scroll down the report to look at the raw .NET Thread Stack Trace data.

This is quite similar to Java Thread Dump analysis, but the Debug Diagnostic Tool includes more detail, such as a description of the hanging call. Always read your Stack Trace (.NET call stack) from the bottom up. The first line will always give you the current Thread hang operation and should give you a better indication of the culprit of your problem (hanging Web Service provider, hanging database call, etc.).

As you can see from our case study, the source of Thread contention was identified: a slow response time from our SOSS service (.Sosslib_invoke(UInt32, Void*, SByte*, UInt32, UInt32*)).

Conclusion

I hope this simple tutorial has helped you understand and appreciate the powerful Debug Diagnostic Tool and how you can leverage it to analyze .NET Framework 4 hung Thread patterns.

We created a sample Java program in order to load the Java Heap and trigger explicit GCs so as to generate some interesting verbose GC output. This program simply loads about 3 million instances of java.lang.String into a static Map data structure and triggers an explicit GC (via System.gc()), followed by the removal of 2 million instances and a second explicit GC before exiting.
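The original source is not shown here, but the described program can be reconstructed approximately as follows (class and method names are illustrative; an optional scale argument is added for convenience). Run it with -verbose:gc to produce the output discussed below.

```java
import java.util.HashMap;
import java.util.Map;

public class GCVerboseDemo {

    // Static Map keeps all loaded Strings strongly reachable.
    private static final Map<String, String> heap = new HashMap<String, String>();

    static int retained() { return heap.size(); }

    public static void main(String[] args) {
        // Defaults match the article: load ~3 M entries, remove ~2 M.
        int load = args.length > 0 ? Integer.parseInt(args[0]) : 3000000;
        int remove = (load / 3) * 2;

        for (int i = 0; i < load; i++) {
            heap.put("key" + i, "value" + i);
        }
        System.gc(); // first explicit GC: OldGen retains all entries

        for (int i = 0; i < remove; i++) {
            heap.remove("key" + i);
        }
        System.gc(); // second explicit GC: retained set shrinks by ~2/3
    }
}
```

With this structure, the first Full GC shows the OldGen footprint of the full 3 M entries, and the second shows the footprint after two thirds have been removed, matching the 340 MB to 126 MB drop reported below.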

As you can see from the verbose GC output, the OldGen space was at 340 MB after the initial loading of 3 M String instances into our HashMap. It went down to 126 MB following the removal of 2 M String instances.

Below you will find explanations and snapshots showing how to read the GC output data in more detail for each Java Heap space.

YoungGen space analysis

OldGen space analysis

PermGen space analysis

Java Heap breakdown analysis

Conclusion

I hope this sample Java program and verbose GC output analysis have helped you understand how to read and interpret this critical data. Please do not hesitate to post any comment or question. My next article will cover verbose GC for the IBM Java JRE.