9.28.2012

This case
study describes the complete root cause analysis and resolution of an Apache
Log4j thread race problem affecting a Weblogic Portal 10.0 production
environment. It will also demonstrate the importance of proper Java classloader
knowledge when developing and supporting Java EE applications.

This article
is also an opportunity for you to improve your thread dump analysis skills and
your understanding of thread race conditions.

Environment specifications

Java EE server: Oracle Weblogic Portal 10.0

OS: Solaris 10

JDK: Oracle/Sun HotSpot JVM 1.5

Logging API: Apache Log4j 1.2.15

RDBMS: Oracle 10g

Platform type: Web Portal

Troubleshooting tools

Quest Foglight
for Java (monitoring and alerting)

Java VM Thread
Dump (thread race analysis)

Problem overview

Major
performance degradation was observed in one of our Weblogic Portal production
environments. Alerts were also sent from the Foglight agents indicating a
significant surge in Weblogic thread utilization, up to the upper default limit
of 400.

Gathering and validation of facts

As usual,
a Java EE problem investigation requires gathering of technical and non-technical
facts so we can either derive other facts and/or conclude on the
root cause. Before applying a corrective measure, the facts below were verified
in order to conclude on the root cause:

What is the client
impact? HIGH

Recent change of the affected
platform? Yes, a recent deployment was performed
involving minor content changes and some Java library changes & refactoring

Any recent traffic
increase to the affected platform? No

How long has this
problem been observed? It is a new problem, observed following the deployment

Did a restart of the Weblogic
server resolve the problem? No, any restart attempt
resulted in an immediate surge of threads

Did a rollback of the
deployment changes resolve the problem? Yes

Conclusion #1: The problem appears
to be related to the recent changes. However, the team was initially unable to
pinpoint the root cause. This is what we will discuss for the rest of the
article.

Weblogic hogging thread report

The
initial thread surge problem was reported by Foglight. As you can see below,
thread utilization was significant (up to 400), leading to a high volume of
pending client requests and ultimately major performance degradation.

As usual,
thread problems require proper thread dump analysis in order to pinpoint the
source of thread contention. Lack of this critical analysis skill will prevent
you from going any further in the root cause analysis.

at org.apache.commons.logging.impl.Log4JLogger.debug(Log4JLogger.java:110)
at org.apache.beehive.netui.util.logging.Logger.debug(Logger.java:119)
at org.apache.beehive.netui.pageflow.DefaultPageFlowEventReporter.beginPageRequest(DefaultPageFlowEventReporter.java:164)
at com.bea.wlw.netui.pageflow.internal.WeblogicPageFlowEventReporter.beginPageRequest(WeblogicPageFlowEventReporter.java:248)
at org.apache.beehive.netui.pageflow.PageFlowPageFilter.doFilter(PageFlowPageFilter.java:154)
at weblogic.servlet.internal.FilterChainImpl.doFilter(FilterChainImpl.java:42)
at com.bea.p13n.servlets.PortalServletFilter.doFilter(PortalServletFilter.java:336)
at weblogic.servlet.internal.FilterChainImpl.doFilter(FilterChainImpl.java:42)
at weblogic.servlet.internal.RequestDispatcherImpl.invokeServlet(RequestDispatcherImpl.java:526)
at weblogic.servlet.internal.RequestDispatcherImpl.forward(RequestDispatcherImpl.java:261)
at <App>.AppRedirectFilter.doFilter(RedirectFilter.java:83)
at weblogic.servlet.internal.FilterChainImpl.doFilter(FilterChainImpl.java:42)
at <App>.AppServletFilter.doFilter(PortalServletFilter.java:336)
at weblogic.servlet.internal.FilterChainImpl.doFilter(FilterChainImpl.java:42)
at weblogic.servlet.internal.WebAppServletContext$ServletInvocationAction.run(WebAppServletContext.java:3393)
at weblogic.security.acl.internal.AuthenticatedSubject.doAs(AuthenticatedSubject.java:321)
at weblogic.security.service.SecurityManager.runAs(Unknown Source)
at weblogic.servlet.internal.WebAppServletContext.securedExecute(WebAppServletContext.java:2140)
at weblogic.servlet.internal.WebAppServletContext.execute(WebAppServletContext.java:2046)
at weblogic.servlet.internal.ServletRequestImpl.run(Unknown Source)
at weblogic.work.ExecuteThread.execute(ExecuteThread.java:200)
at weblogic.work.ExecuteThread.run(ExecuteThread.java:172)

As you can
see, it appears that all the threads are waiting to acquire a lock on an Apache
Log4j object monitor (org.apache.log4j.spi.RootCategory) when attempting to log
debug information to the configured appender and log file. How did we figure
that out from this thread stack trace? Each thread enters the logging API via
Log4JLogger.debug(), which ultimately reaches org.apache.log4j.Category.callAppenders();
as the code review below will show, that method synchronizes on the Category
instance, so under heavy concurrent load, e.g. 250 threads attempting to log at
the same time, the threads all queue up on the same object monitor.

At this
point the main question was: why are we seeing this problem suddenly? An increase
of the logging level or load was ruled out at this point after proper
verification. The fact that the rollback of the recent changes did fix the
problem naturally led us to perform a deeper review of the promoted
changes. Before we go to the final root cause section, we will perform a code
review of the affected Log4j code, e.g. the code exposed to thread race conditions.

Apache Log4j 1.2.15 code review

## org.apache.log4j.Category

/**
 * Call the appenders in the hierarchy starting at <code>this</code>. If no
 * appenders could be found, emit a warning.
 *
 * <p>
 * This method calls all the appenders inherited from the hierarchy
 * circumventing any evaluation of whether to log or not to log the
 * particular log request.
 *
 * @param event
 *          the event to log.
 */
public void callAppenders(LoggingEvent event) {
    int writes = 0;

    for (Category c = this; c != null; c = c.parent) {
        // Protected against simultaneous call to addAppender,
        // removeAppender,...
        synchronized (c) {
            if (c.aai != null) {
                writes += c.aai.appendLoopOnAppenders(event);
            }
            if (!c.additive) {
                break;
            }
        }
    }

    if (writes == 0) {
        repository.emitNoAppenderWarning(this);
    }
}

As you can
see, Category.callAppenders() uses a synchronized block at
the Category level, which can lead to a severe thread race condition under heavy
concurrent load. In this scenario, the usage of a re-entrant read/write lock
would have been more appropriate (e.g. such a lock strategy allows concurrent "reads"
but only a single "writer"). You can find a reference to this known Apache Log4j
limitation below along with some possible solutions.
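For illustration only, below is a minimal sketch, not the actual Log4j implementation, of how such a hot path could be guarded with java.util.concurrent.locks.ReentrantReadWriteLock: logging threads share the read lock, while the rare appender add/remove operations take the exclusive write lock. The class and interface names are hypothetical.

## ReadWriteLockCategorySketch (hypothetical)

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class ReadWriteLockCategorySketch {

    // Hypothetical appender contract, standing in for org.apache.log4j.Appender
    public interface Appender {
        void doAppend(Object event);
    }

    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private final List<Appender> appenders = new ArrayList<Appender>();

    public void callAppenders(Object event) {
        lock.readLock().lock(); // shared: concurrent logging threads do not serialize on each other
        try {
            for (Appender a : appenders) {
                a.doAppend(event);
            }
        } finally {
            lock.readLock().unlock();
        }
    }

    public void addAppender(Appender a) {
        lock.writeLock().lock(); // exclusive: configuration changes are rare
        try {
            appenders.add(a);
        } finally {
            lock.writeLock().unlock();
        }
    }
}

The trade-off is that the rare configuration changes pay the full synchronization cost while the hot logging path proceeds concurrently, which is exactly the ratio you want from a logging API.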

Is the
above Log4j behaviour the actual root cause of our problem? Not so fast…

Let's
remember that this problem got exposed only following a recent deployment. The
real question is: which application change triggered this problem & side
effect from the Apache Log4j logging API?

Root cause: a perfect storm!

Deep dive
analysis of the recent changes deployed revealed that some Log4j libraries at
the child classloader level were removed, along with the associated "child first"
policy. This refactoring exercise ended up moving the delegation of both Commons
Logging and Log4j to the parent
classloader level. What is the problem?

Before
this change, the logging events were split between Weblogic Beehive Log4j calls
at the parent classloader and web application logging events at the child
classloader. Since each classloader had its own copy of the Log4j classes and
objects, the thread race condition problem was split in half and not exposed
(masked) under the current load conditions. Following the refactoring, all Log4j
calls were moved to the parent classloader (Java EE app), adding a significant
concurrency level to the Log4j components such as Category. This increased
concurrency level, along with the known Category.java thread race / deadlock
behaviour, was a perfect storm for our production environment.
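To make this masking effect concrete, here is a small hedged demo (the jar path is hypothetical; adjust to your environment) showing that two sibling classloaders loading the same library each get a distinct Class object, and therefore distinct static state and distinct object monitors:

## ClassLoaderIsolationDemo (hypothetical)

import java.net.URL;
import java.net.URLClassLoader;

public class ClassLoaderIsolationDemo {
    public static void main(String[] args) throws Exception {
        // Hypothetical path to the library jar; adjust to your environment
        URL[] jar = { new URL("file:lib/log4j-1.2.15.jar") };

        // parent = null: neither loader delegates to a common parent that can
        // see the jar, so each one loads its own copy ("child first" effect)
        ClassLoader cl1 = new URLClassLoader(jar, null);
        ClassLoader cl2 = new URLClassLoader(jar, null);

        Class<?> c1 = cl1.loadClass("org.apache.log4j.Category");
        Class<?> c2 = cl2.loadClass("org.apache.log4j.Category");

        // Prints false: same bytecode, two distinct classes, hence two
        // independent sets of static state and two independent lock monitors
        System.out.println("Same Class object? " + (c1 == c2));
    }
}

With two copies, each copy's root category monitor only serializes part of the logging traffic; collapsing to a single copy at the parent classloader concentrates all the logging threads on a single monitor.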

In order
to mitigate this problem, 2 immediate solutions were applied to the
environment:

Rollback of the refactoring, splitting the Log4j calls back
between the parent and child classloaders

Reduction of the logging level for some appenders from DEBUG
to WARNING

This
problem case again reinforces the importance of performing proper testing and
impact assessment when applying changes such as library and classloader
related changes. Such changes can appear simple at the "surface" but can trigger
some deep execution pattern changes, exposing your application(s) to
known thread race conditions.

A future
upgrade to Apache Log4j 2 (or another logging API) will also be explored,
as it is expected to bring some performance enhancements which may address some
of these thread race & scalability concerns.

9.21.2012

As you may
have seen from my previous tutorials and case studies, Java Heap Space OutOfMemoryError problems can be complex to pinpoint and resolve. One of the common problems I
have observed in Java EE production systems is OutOfMemoryError: unable to
create new native thread, an error thrown when the HotSpot JVM is unable to
create a new Java thread.

This
article will revisit this HotSpot VM error and provide you with recommendations
and resolution strategies.

If you are
not familiar with the HotSpot JVM, I first recommend that you look at a high
level view of its internal memory spaces. This knowledge is important in
order for you to understand OutOfMemoryError problems related to the native (C-Heap)
memory space.

OutOfMemoryError: unable to create new native
thread – what is it?

Let's
start with a basic explanation. This HotSpot JVM error is thrown when the
internal JVM native code is unable to create a new Java thread. More precisely,
it means that the JVM native code was unable to create a new "native" thread
from the OS (Solaris, Linux, Mac OS, Windows...).

We can
clearly see this logic in the OpenJDK 1.6 and 1.7 implementations
(JVM_StartThread in HotSpot's jvm.cpp): when the VM fails to allocate the
underlying OS thread, it throws java.lang.OutOfMemoryError with the message
"unable to create new native thread".

Unfortunately,
at this point you won't get more detail than this error, with no indication of
why the JVM is unable to create a new thread from the OS…
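To see the error first-hand, here is a minimal hedged Java sketch (class name hypothetical; do not run this on a production host) that creates parked threads until the OS refuses to allocate another native thread:

## NativeThreadExhaustion (hypothetical, for testing only)

import java.util.concurrent.CountDownLatch;

public class NativeThreadExhaustion {
    public static void main(String[] args) {
        // Latch that is never released: each thread parks forever,
        // holding on to its native stack allocation
        final CountDownLatch never = new CountDownLatch(1);
        int count = 0;
        try {
            while (true) {
                new Thread(new Runnable() {
                    public void run() {
                        try {
                            never.await();
                        } catch (InterruptedException ignored) {
                        }
                    }
                }).start();
                count++;
            }
        } catch (OutOfMemoryError oome) {
            // Typically: java.lang.OutOfMemoryError: unable to create new native thread
            System.out.println("Failed after " + count + " threads: " + oome);
            throw oome;
        }
    }
}

The final thread count reached before the failure is itself a useful data point: it approximates your environment's native thread capacity under the current JVM and OS settings.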

HotSpot JVM: 32-bit or 64-bit?

Before you
go any further in the analysis, one fundamental fact that you must determine about
your Java or Java EE environment is which version of the HotSpot VM you are using,
e.g. 32-bit or 64-bit.

Why is it so
important? What you will learn shortly is that this JVM problem is very often related
to native memory depletion, either at the JVM process level or at the OS level. For now
please keep in mind that:

A 32-bit JVM process is in theory allowed to grow up
to 4 GB (much lower on some older 32-bit Windows versions).

A 64-bit JVM process is in theory allowed to use
most of the OS virtual memory available, or up to 16 EB (16 million TB).

As you can
see, if you allocate a large Java Heap (2 GB+) for a 32-bit JVM process, the
native memory space capacity is reduced automatically, opening the door to
JVM native memory allocation failures.
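For illustration, assume a 32-bit JVM process with a ~4 GB address space: a 2.5 GB Java heap plus 512 MB of PermGen leaves roughly 1 GB for everything else, including the C-Heap and the thread stacks. At an illustrative thread stack size of 512 KB per thread (e.g. -Xss512k), that remaining 1 GB can in theory hold no more than 1 GB / 512 KB = ~2,000 thread stacks, and in practice far fewer once the other native allocations are accounted for.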

For a
64-bit JVM process, your main concern, from a JVM C-Heap perspective, is the
capacity and availability of the OS physical, virtual and swap memory.

OK great but how does native memory affect Java
threads creation?

Now back
to our primary problem. Another fundamental JVM aspect to understand is that
Java threads created from the JVM require native
memory from the OS. You should now start to understand the source of your
problem…

The high
level thread creation process is as per below:

A new Java thread is requested from the Java program
& JDK

The JVM native code then attempts to create a new
native thread from the OS

The OS then attempts to create a new native thread
as per the requested attributes, which include the thread stack size (see
the sketch after this list). Native memory is
then allocated (reserved) from the OS to the Java process native memory
space, assuming the process has enough address space (e.g. a 32-bit process)
to honour the request

The OS will also refuse any further thread & native
memory allocation if the virtual memory of the OS is depleted (including Solaris
swap space depletion, since thread access to the stack can generate a SIGBUS
error, crashing the JVM * http://bugs.sun.com/view_bug.do?bug_id=6302804)

For a 32-bit JVM, Java thread creation also requires memory available from the C-Heap or process address space
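For reference, the JDK even exposes the requested stack size at Java thread creation time via the four-argument Thread constructor; as the Javadoc notes, the value is only a hint which the VM or OS is free to round or ignore. A minimal sketch (class name hypothetical):

## StackSizeHint (hypothetical)

public class StackSizeHint {
    public static void main(String[] args) throws InterruptedException {
        Runnable task = new Runnable() {
            public void run() {
                System.out.println("Running with a requested 256 KB stack");
            }
        };
        // 4-arg constructor: group, target, name, requested stack size in bytes
        Thread t = new Thread(null, task, "small-stack-thread", 256 * 1024);
        t.start();
        t.join();
    }
}

In practice, the -Xss JVM argument is the usual way to control the default thread stack size, and it directly drives how much native memory each new thread reserves.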

Problem diagnostic

Now that
you understand native memory and JVM thread creation a little better, it is now
time to look at your problem. As a starting point, I suggest that you follow
the analysis approach below:

Determine if you are using HotSpot 32-bit or 64-bit
JVM

When the problem is observed, take a JVM Thread Dump and
determine how many threads are active (see the sketch after this list)

Monitor closely the Java process size utilization before
and during the OOM problem replication

Monitor closely the OS virtual memory utilization before
and during the OOM problem replication; including the swap memory space
utilization if using Solaris OS
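As a complement to thread dumps, the following hedged sketch (class name hypothetical) uses the standard java.lang.management API, available since Java 5, to report live and peak thread counts from inside the JVM:

## ThreadCountProbe (hypothetical)

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

public class ThreadCountProbe {
    public static void main(String[] args) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        // Comparing live vs. peak counts against your baseline helps
        // correlate thread surges with the OOM events
        System.out.println("Live threads: " + mx.getThreadCount());
        System.out.println("Peak threads: " + mx.getPeakThreadCount());
        System.out.println("Total started: " + mx.getTotalStartedThreadCount());
    }
}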

Proper
data gathering as per above will allow you to collect the data points needed
for the first level of investigation. The next step will be
to look at the possible problem patterns and determine which one is applicable
to your problem case.

Problem pattern #1 – C-Heap depletion (32-bit
JVM)

From my
experience, OutOfMemoryError: unable to create new native thread is quite
common for 32-bit JVM processes. This problem is often observed when too many
threads are created relative to the C-Heap capacity.

JVM Thread
Dump analysis and Java process size monitoring will allow you to determine if
this is the cause.

Problem pattern #2 – OS virtual memory
depletion (64-bit JVM)

In this
scenario, the OS virtual memory is fully depleted. This could be due to a few
64-bit JVM processes taking a lot of memory, e.g. 10 GB+ each, and / or other
rogue processes with a high memory footprint. Again, Java process size & OS
virtual memory monitoring will allow you to determine if this is the cause.

Also, please verify that you are not hitting OS-related thresholds such as ulimit -u or NPROC (max user processes). Default limits are often low and may prevent you from creating, say, more than 1024 threads per Java process.

Problem pattern #3 – OS virtual memory
depletion (32-bit JVM)

The third
scenario is less frequent but can still be observed. The diagnostic can be a
bit more complex, but the key analysis point will be to determine which
processes are causing a full OS virtual memory depletion. Your 32-bit JVM
processes could be either the source or the victim, e.g. rogue processes
using most of the OS virtual memory and preventing your 32-bit JVM processes
from reserving more native memory for their thread creation.

Please
note that this problem can also manifest itself as a full JVM crash when running
out of OS virtual memory or swap space on Solaris.

You now
understand your problem and know which problem pattern you are dealing with.
You are now ready to provide recommendations to address the problem…are you?

Your work is
not done yet, please keep in mind that this JVM OOM event is often just a “symptom”
of the actual root cause of the problem. The root cause is typically much
deeper so before providing recommendations to your client I recommend that you
really perform deeper analysis. The last thing you want to do is to simply
address and mask the symptoms. Solutions such as increasing OS physical /
virtual memory or upgrading all your JVM processes to 64-bit should only be
considered once you have a good view on the root cause and production environment
capacity requirements.

The next
fundamental question to answer is: how many threads were active at the time of
the OutOfMemoryError? In my experience with Java EE production systems, the
most common root cause is actually the application and / or Java EE container
attempting to create too many threads at a given time when facing non-happy
paths, such as threads stuck in remote IO calls, thread race conditions etc. In
this scenario, the Java EE container can start creating too many threads while
attempting to honour incoming client requests, leading to an increased pressure
point on the C-Heap and native memory allocation. Bottom line: before blaming
the JVM, please perform your due diligence and determine if you are dealing
with an application or Java EE container thread tuning problem as the root
cause.

Once you
understand and address the root cause (source of thread creations), you can
then work on tuning your JVM and OS memory capacity in order to make it more
fault tolerant and better “survive” these sudden thread surge scenarios.

Perform a JVM Thread Dump analysis and determine
the source of all the active threads vs. an established baseline.
Determine what is causing your Java application or Java EE container to
create so many threads at the time of the failure

Determine if the Java Heap size of your 32-bit JVM
is too large, preventing the JVM from creating enough threads to fulfill your
client requests. In this scenario, you will have to consider reducing your
Java Heap size (if possible), vertical scaling, or an upgrade to a 64-bit
JVM (an illustrative start line follows this list)
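As a purely illustrative sketch (all values are hypothetical and must be derived from your own capacity analysis), a 32-bit JVM start line can trade heap, PermGen and stack size for C-Heap headroom:

java -Xms1536m -Xmx1536m -XX:MaxPermSize=256m -Xss256k ...

Every MB shaved from the Java heap, PermGen and per-thread stack sizes becomes address space available for native allocations such as new thread stacks.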

Capacity planning analysis to the rescue

As you may
have seen from my past article on the Top 10 Causes of Java EE Enterprise Performance Problems, lack of capacity planning analysis is often the source of
the problem. Any comprehensive load and performance testing exercise should also properly
determine the Java EE container thread, JVM & OS native memory requirements
for your production environment, including impact measurements of "non-happy" paths. This approach
will allow your production environment to stay away from this type of problem
and will lead to better system scalability and stability in the long run.