I ran into an interesting problem at a major shipping/logistics company. Their application ran on Java 6 in a JBoss cluster on the Red Hat Linux platform.

PROBLEM STATEMENT

Frequently, JBoss instances were dropping out of the JBoss cluster. In JBoss terminology this is called “shunning” (good word :-). After several seconds of “shunning”, the JBoss instance automatically rejoined the cluster without any manual intervention. (Note: JBoss didn’t crash here; it just stopped responding.) Any user sessions established on that JBoss instance wouldn’t progress further, so the user had to sign out and log in again (so that a session could be established with another JBoss instance in the cluster). Since the problem started to happen quite frequently, users had to sign out and sign in again multiple times. This was annoying behaviour, and it started to get high visibility within the organization.

Below is the (shunning) error message reported in the log file:

When I started to review the Garbage Collection log file, I observed that the JBoss instances were suffering from long pauses whenever a Full GC (Garbage Collection) ran. (Note: Explaining the details of Full GC is outside the scope of this article.) However, the point to note here is: when a Full GC runs, the entire JVM freezes (all application threads are paused). The JVM can’t process any new transactions, and any transactions in flight stall until the collection finishes.

Below are excerpts from the Garbage Collection log file, which show the Full GC durations (refer to the bold font):

As shown above, sometimes (not always) a JBoss Full GC takes anywhere between 80 and 130 seconds. Since the JVM is frozen during this period, it can’t respond to heartbeat checks, so the JBoss cluster evicts (aka shuns) the instance. Thus all the customer transactions handled by this JBoss instance are jeopardized.
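To make the link between a frozen JVM and eviction concrete, here is a minimal sketch of a heartbeat check. It is a simplified model, not JBoss's actual failure detection code: the class name, the method name, and the timeout value are all hypothetical, and the only idea it illustrates is that a node whose heartbeats stop for longer than the timeout looks dead to its peers.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Simplified model of heartbeat-based failure detection.
 * Hypothetical names and values; this is NOT the JBoss implementation.
 */
public class HeartbeatGapCheck {

    /**
     * Returns every gap (in milliseconds) between consecutive heartbeats
     * that is longer than the timeout. A peer watching this node would
     * suspect it during each of those gaps.
     *
     * @param heartbeatMillis times at which heartbeats were seen, ascending
     * @param timeoutMillis   assumed limit before a silent node is suspected
     */
    public static List<Long> gapsExceedingTimeout(long[] heartbeatMillis, long timeoutMillis) {
        List<Long> gaps = new ArrayList<Long>();
        for (int i = 1; i < heartbeatMillis.length; i++) {
            long gap = heartbeatMillis[i] - heartbeatMillis[i - 1];
            if (gap > timeoutMillis) {
                gaps.add(gap);
            }
        }
        return gaps;
    }
}
```

With an assumed 10-second timeout, a node that heartbeats once per second and then goes silent for 90 seconds during a Full GC produces exactly one gap over the limit, which is the window in which it would be shunned.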

Whenever a JBoss instance is evicted from the cluster, an alert notification is sent out to the operations team. The times at which Full GC duration exceeded several seconds matched exactly the times at which alert notifications were sent out. This was clear confirmation that the long pauses caused by GC were the root cause of the problem.
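This kind of correlation can be done by hand, but a small scanner makes it repeatable. The sketch below pulls the JVM uptime and duration of each long Full GC out of a GC log so they can be lined up against alert timestamps. It assumes each Full GC entry sits on one line shaped like `<uptime>: [Full GC ..., <seconds> secs]`, which is a common HotSpot layout when GC details and timestamps are enabled; the actual log format in this incident may differ, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch: find Full GC entries whose pause exceeds a threshold. */
public class FullGcPauseScanner {

    // ASSUMED line shape: "<uptime>: [Full GC ... , <seconds> secs]".
    // The greedy ".*" makes the capture land on the LAST ", N secs]" in the
    // line, which is the total pause when inner phases report their own times.
    private static final Pattern FULL_GC = Pattern.compile(
            "^(\\d+\\.\\d+): \\[Full GC.*, (\\d+\\.\\d+) secs\\]");

    /** One Full GC event: when it started (JVM uptime) and how long it took. */
    public static final class Pause {
        public final double uptimeSeconds;
        public final double durationSeconds;

        Pause(double uptimeSeconds, double durationSeconds) {
            this.uptimeSeconds = uptimeSeconds;
            this.durationSeconds = durationSeconds;
        }
    }

    /** Returns the Full GC events that lasted longer than thresholdSeconds. */
    public static List<Pause> longFullGcPauses(List<String> lines, double thresholdSeconds) {
        List<Pause> result = new ArrayList<Pause>();
        for (String line : lines) {
            Matcher m = FULL_GC.matcher(line);
            if (!m.find()) {
                continue; // not a Full GC line in the assumed format
            }
            double uptime = Double.parseDouble(m.group(1));
            double duration = Double.parseDouble(m.group(2));
            if (duration > thresholdSeconds) {
                result.add(new Pause(uptime, duration));
            }
        }
        return result;
    }
}
```

Adding each reported uptime to the JVM start time gives a wall-clock timestamp that can be compared directly with the time of each eviction alert.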

SOLUTION

By studying the garbage collection log file, it was evident that the Permanent Generation space in the JVM memory was being fully utilized. Besides that, the following statements were printed in the application log file. They indicate that the GC process is trying to unload classes from the Permanent Generation, which is not a good sign either.