By default, WebSphere Application Server has a global log level of `info` (`*=info`). This means the following messages are logged: info, audit, warning, severe, and fatal. It is almost always malpractice to use `*=severe`, `*=fatal`, or `*=off`, because warnings and errors generally occur infrequently and are critical to understanding why problems occurred. It is often malpractice to use `*=warning`, because there are many informational messages that are very useful for understanding why problems occurred. If repeating messages are flooding your logs, broadly disabling logging should be the last resort; instead, consider the following:

1. Open a support ticket with the owner of the message to understand why the message occurs so frequently.

2. Change the log level of the particular logger for those messages (after understanding what they mean in #1). Any log level specified after the global log level overrides the level for that particular logger. For example, if the log configuration is `*=info:com.test.Logger=warning`, then the threshold is only changed for com.test.Logger messages.

Intermittent performance problems are one of the most difficult types of performance problems. If the issues can't be reproduced easily, then it may be difficult to capture much data about them. Running the necessary tracing all the time is often too expensive (even with optimization technologies such as WAS High Performance Extensible Logging). If you've exhausted all other performance techniques (tracing to RAM, thread sampling, request metrics, PMI, and the like), then you may apply the classic software engineering practice of brute force.

The first stage in finding intermittent performance problems using brute force is to isolate the location(s) of the problems. This is the most iterative and time-consuming stage. One way to do this is to start with the outermost component that demonstrates the symptom and then slowly isolate down. For example, if all that's known is that HTTP requests are intermittently slow, then you may start with the servlet (or, if it's proven to be outside that, the container itself). If lightweight tracing doesn't exist, then it must be added. Below is a sketch of a Java approach that speculatively monitors how long a method takes; if the duration exceeds a configurable threshold, the speculative traces are dumped, allowing one iteration of isolation:
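This is a minimal, hypothetical sketch; the class, method, and property names are illustrative, and in a real application the tracer would be wired into the component under suspicion (for example, a servlet's service method):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical speculative tracer: trace points are buffered cheaply in memory
// and only written out when the overall duration exceeds a configurable threshold.
public class SpeculativeTracer {
    // Threshold in milliseconds, overridable with -Dtrace.threshold.ms=N (illustrative property name).
    private static final long THRESHOLD_MS = Long.getLong("trace.threshold.ms", 500L);

    private final List<String> traces = new ArrayList<>();
    private final long start = System.nanoTime();

    public void trace(String message) {
        // Cheap: record only the elapsed time and a message in memory.
        long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        traces.add(elapsedMs + "ms: " + message);
    }

    public void finish(String requestDescription) {
        long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        if (elapsedMs > THRESHOLD_MS) {
            // Only pay the I/O cost for the slow requests we care about.
            System.err.println("Slow request (" + elapsedMs + "ms): " + requestDescription);
            for (String trace : traces) {
                System.err.println("  " + trace);
            }
        }
    }

    // Example usage inside a suspect method:
    public static void main(String[] args) throws InterruptedException {
        SpeculativeTracer tracer = new SpeculativeTracer();
        tracer.trace("before backend call");
        Thread.sleep(600); // stand-in for the work being isolated
        tracer.trace("after backend call");
        tracer.finish("GET /example");
    }
}
```

Each trace point that turns out to fall before the slow portion narrows the search for the next iteration.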

For those aware of TCP retransmits, the following advice may seem obvious, but for others who don't know (as I didn't until earlier this week), this will be very simple but potentially very useful performance advice: monitor the number of TCP retransmits in your operating system and be aware of the timeout values. The reason: they may explain random response time fluctuations or maximums of up to a few seconds.

The concept of TCP retransmission is one of the fundamental reasons why TCP is reliable. After a packet is sent, if it's not ACKed within the retransmission timeout, then the sender assumes there was a problem (e.g. packet loss, OS saturation, etc.) and retransmits the packet. From TCP RFC 793:

When the TCP transmits a segment containing data, it puts a copy on a retransmission queue and starts a timer; when the acknowledgment for that data is received, the segment is deleted from the queue. If the acknowledgment is not received before the timer runs out, the segment is retransmitted.

For applications such as Java/WAS, retransmissions occur transparently in the operating system. Retransmission is not considered an error condition, and it is not reported up through libc. So a Java application may do a socket write, and every once in a while a packet is lost and there is a delay during the retransmission. Unless you've gathered a network trace, this is difficult to prove.

Correlating increases in TCP retransmits with the times of response time increases is much easier than capturing and analyzing an end-to-end network trace with TCP port correlation (which often isn't available with low-overhead tracing).

Here is an example of an accumulating TCP retransmission counter on Linux:

$ netstat -s | grep -i retrans
283 segments retransmited
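To line these counters up with response time spikes over time, a simple periodic sampler can help. The following is a minimal, hypothetical Java sketch that reads the RetransSegs counter from /proc/net/snmp every 10 seconds and prints it with a timestamp (the interval and output format are arbitrary):

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.time.Instant;
import java.util.Arrays;
import java.util.List;

// Hypothetical sampler: periodically print the Linux TCP RetransSegs counter
// with a timestamp so that jumps can be correlated with response time spikes.
public class RetransSampler {
    public static void main(String[] args) throws Exception {
        while (true) {
            System.out.println(Instant.now() + " RetransSegs=" + readRetransSegs());
            Thread.sleep(10_000);
        }
    }

    private static long readRetransSegs() throws Exception {
        List<String> lines = Files.readAllLines(Paths.get("/proc/net/snmp"));
        String header = null;
        for (String line : lines) {
            if (line.startsWith("Tcp:")) {
                if (header == null) {
                    header = line; // first "Tcp:" line holds the column names
                } else {
                    // second "Tcp:" line holds the values in the same order
                    List<String> names = Arrays.asList(header.split("\\s+"));
                    String[] values = line.split("\\s+");
                    return Long.parseLong(values[names.indexOf("RetransSegs")]);
                }
            }
        }
        throw new IllegalStateException("RetransSegs not found in /proc/net/snmp");
    }
}
```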

If a TCP implementation follows RFC 6298, then the minimum RTO is recommended to be 1 second:

Whenever RTO is computed, if it is less than 1 second, then the RTO SHOULD be rounded up to 1 second. Traditionally, TCP implementations use coarse grain clocks to measure the RTT and trigger the RTO, which imposes a large minimum value on the RTO. Research suggests that a large minimum RTO is needed to keep TCP conservative and avoid spurious retransmissions [AP99]. Therefore, this specification requires a large minimum RTO as a conservative approach, while at the same time acknowledging that at some future point, research may show that a smaller minimum RTO is acceptable or superior.

However, this is not a "MUST" and Linux, for example, uses a default minimum value of 200ms, although it may be dynamically adjusted upwards.

The current timeout (called retransmission timeout or "rto") can be queried on Linux using `ss`:
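For example (a hypothetical, abbreviated excerpt; addresses and values are illustrative, and the rto value is shown in milliseconds):

```
$ ss -ti
ESTAB      0      0      192.0.2.10:52122      192.0.2.20:443
     cubic rto:204 rtt:2.5/1.25 ato:40 cwnd:10
```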

Another potential issue is that the default DumpLocation is /var/spool/abrt. It has been observed that many customer configurations mount this directory on a small filesystem, which will truncate core dumps. You may change DumpLocation or increase the filesystem space.

When ABRT encounters issues such as the above, it may write warnings or errors into /var/log/messages.

You may also disable the ABRT service if you don't expect to use its features (e.g. stopping the abrtd service and disabling it on system reboot with configuration tools such as chkconfig). You can check whether stopping the service has taken effect by printing the value of /proc/sys/kernel/core_pattern.
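For example, on a RHEL 6 style system (a hypothetical session; service and tool names vary by version, and if the printed value still pipes to an abrt hook, ABRT is still intercepting core dumps, whereas the kernel default is simply `core`):

```
$ service abrtd stop
$ chkconfig abrtd off
$ cat /proc/sys/kernel/core_pattern
core
```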

When the JVMs are restarted after abrtd has stopped, they will go back to the default behavior of creating dumps in the current working directory of the process, and the JVMJ9VM135W warning should no longer appear. If you are not having difficulty collecting core files, or are not interested in collecting core files, then the message can be safely ignored.

The WAS NCSA Access Log is an optional feature of traditional WAS and WAS Liberty that prints one line into an IHS/httpd-style access.log for each HTTP response handled by WAS. I've covered this before, including the addition of %D in WAS 8.0.0.2 to record the response time.

Starting with WAS 8.0.0.11 and 8.5.5.6, the WAS access log supports %{R}W, which is the HTTP service time. The difference between the HTTP response time and the HTTP service time is that the former includes the time to send back the entire response, whereas the latter only measures up to the first byte sent back. The reason for this distinction is that one very common issue is a slow or bad network, a slow client, or a slow intermediary proxy (e.g. IHS). With %D, there is no distinction between the time spent in WAS and the time spent in the network, the end-user, and intermediary proxies. %{R}W is a subset of %D and helps isolate where the slowdown may be. This is a heuristic, and it doesn't help with servlets that stream responses (and do complex work in between) or otherwise call flush. It also doesn't help if WAS (or the operating system it sits on) has an issue while sending back the rest of the bytes. With those caveats, %{R}W is a great addition to help find where HTTP responses may be slow, and you should enable both %D and %{R}W if your version of WAS includes them.
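For example, assuming an access log format that ends with %D followed by %{R}W (the entry below is hypothetical, and the field values and time units are illustrative only):

```
192.0.2.5 - - [01/Jan/2020:10:00:00 -0500] "GET /app/report HTTP/1.1" 200 1048576 5000000 40000
```

Here the last field (%{R}W) is small relative to the second-to-last field (%D), suggesting that WAS produced the first response byte quickly and most of the time was spent sending the response back, which points away from the application and toward the network, the end-user, or an intermediary.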

The WebSphere Application Server Performance Cookbook covers performance tuning for WebSphere Application Server (WAS), although there is also a very strong focus on Java, Operating Systems, and theory, which can be applied to other products and environments. The cookbook is designed to be read in a few different ways:

On the go: Readers short on time should skip to the Recipes chapter at the end of the book. In the spirit of a cookbook, there are recipes that provide step-by-step instructions of how to gather and analyze data for particular classes of problems.

General areas: For readers interested in tuning a general area such as WAS or Java, each major chapter provides a recipe at the top of the chapter that summarizes the key tuning knobs to investigate.

Deep dive: Readers interested in end-to-end tuning are encouraged to skim the entire book for areas relevant to their product usage.

A few customer experiences this year have made me start to avoid the phrase "root cause analysis" (RCA). I recommend eliminating this phrase in favor of "causal chain," and here's why:

Root cause analysis is the search for the primary, sufficient condition that causes a problem. The first danger of "root" cause analysis is that you may think you're done when you're not. The word "root" suggests finality, but how do you know you're done? For example, in a situation earlier this year, the problem was high user response times. The proximate cause was that the processors were saturated. The processors were being driven so heavily because System.gc was being called frequently, forcing garbage collections. This was thought to be the "root cause," so somebody suggested using the option -Xdisableexplicitgc to make calls to System.gc do nothing. Everyone sighed in relief; the root cause was found! Not so. The System.gc calls were being forced due to native OutOfMemoryErrors when trying to load classes (and -Xdisableexplicitgc doesn't affect forced GCs from the JVM handling certain native OOMs). After much more investigation, we arrived at a very complex causal chain in which there wasn't even a single cause.

The second danger of root "cause" analysis is that it suggests a single cause, which obviously isn't always the case.

Properly understood and with all the right caveats, RCA is fine, but it is rarely properly understood and rarely comes with caveats. Once someone declares that "root cause" has been found, most people are satisfied, especially if removing that cause seems to avoid the problem.

I find it interesting that the term "root" has gained such a strong hold, when it is clearly too strong a term. My guess is that "root" was added to "cause analysis" because, without "root," some people might stop at the first cause; but perversely, the phrase has caused the exact same sloppiness, laziness, and false sense of accomplishment that it was probably designed to avoid. However, given that both suffer from the same problem, I think "root cause analysis" is worse than "cause analysis" because at least the latter is more open-ended. Instead, I prefer the term "causal chain" because it seems to define the investigation in terms of a chain of causes and effects and is more suggestive of the open-endedness of this chain (actually, it's probably more like a causal graph, but I think chain is good enough).

Example Scripts

In all the example scripts below, we basically iterate over a list of hosts and execute commands on all of those hosts. Remember that if the orchestrator machine is also one of these hosts, it should be included in the list (it will be connecting to "itself"). You will need to modify these scripts to match your environment.

In a previous post, I showed how to break a heapdump down into retained sets by class as an alternative way to explore what's consuming the heap. I thought it would be worth elaborating on retained sets a bit more, as they are one of the most important concepts in heapdump analysis.

The MAT documentation gives a good definition of retained sets and a useful diagram:

Retained set of X is the set of objects which would be removed by GC when X is garbage collected.

When most people talk about the "size" of a set of objects X, they are really talking about the retained set of X, i.e. if nothing referenced X, then those objects could be garbage collected, and the number of bytes in the retained set of X would be freed.
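As a minimal, hypothetical Java illustration (the Node class and variable names are invented for this sketch), consider an object graph where x references y and z, and w also references y:

```java
// Object graph: x -> y, x -> z, w -> y. Only the static fields keep it reachable.
class Node {
    Node left, right;
}

public class RetainedSetExample {
    static Node x = new Node();
    static Node w = new Node();

    public static void main(String[] args) {
        Node y = new Node();
        Node z = new Node();
        x.left = y;   // x -> y
        x.right = z;  // x -> z
        w.left = y;   // w -> y: a second path to y from outside x
        y = null;     // drop the local references so that only the static
        z = null;     // fields x and w keep the graph reachable
        // The retained set of x is {x, z}: if x became unreachable, a GC could
        // reclaim x and z, but not y, because w still references it. The
        // retained size of x is therefore shallow(x) + shallow(z).
    }
}
```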

Therefore, if you want to know how much memory a set of objects retains, click Open Query Browser > Java Basics > Show as Histogram, specify the objects, and then click "Calculate retained size" and select either of the two options. For example, one common cause of excessive heap usage is org.apache.jasper.runtime.BodyContentImpl objects due to the default behavior of com.ibm.ws.jsp.limitBuffer=false. If we want to see how much these buffers are retaining, we can show a histogram for BodyContentImpl and calculate a precise retained size, in this example 291MB.

Another way to do this is to simply open the class histogram, filter in the Class Name column, and calculate the retained size.

One useful technique when first analyzing a heapdump is to open the class histogram, calculate minimum retained sizes (you probably don't want to do precise as there may be many classes), and then sort by the "Retained Heap" column. It's important to note that these retained heap values may overlap (one class's retained set can include another class's objects), so don't add this column up. For example, we may see that char[] objects retain hundreds of MB and BodyContentImpl objects retain hundreds of MB, but in this example, the BodyContentImpl objects retain the char[] objects.

It's nice to know how "big" a set of objects is but it's even better to get a class histogram of what is in that retained set. To do that, either right click on a set of objects and select "Show Retained Set," or use Open Query Browser > Show Retained Set and specify the objects. One tip is that you can use wildcards, so if you want to know how much memory is retained by some set of classes (e.g. everything made by one vendor), simply do com.example.* and review the sum of shallow heaps (in this example, we can say org.apache classes retain 321MB).

It's important to understand the limitations of retained sets. Complex object graphs often complicate retained sets. For example, WAS classes such as com.ibm.ws.webcontainer.httpsession.MemorySessionContext hold all HTTP sessions, so you may think that you can get the size of all HTTP sessions by simply looking at the retained set of this class. However, let's say WebContainer threads are currently working on some set of HTTP sessions at the time of the heapdump. In that case, those sessions are not part of the retained set of MemorySessionContext because there are references to those objects from outside MemorySessionContext. For specific situations, MAT has a Customized Retained Set query where you can explicitly say which objects to exclude from the set of incoming references (in this example, you would specify MemorySessionContext and specify the set of application objects that reference these sessions as the exclude list). An alternative way to answer the question of how big all the sessions are is to calculate the retained set of all of the actual session objects instead of the map that contains them.

Excessive native memory usage by java.nio.DirectByteBuffers is a classic problem with any generational garbage collector such as gencon (which is the default starting in IBM Java 6.26/WAS 8), particularly on 64-bit. DirectByteBuffers (DBBs) are Java objects that allocate and free native memory. DBBs use a PhantomReference, which is essentially a more flexible finalizer, and this allows the native memory of the DBB to be freed once there are no longer any live Java references. Finalizers and their ilk are generally not recommended because their cleanup time by the garbage collector is non-deterministic.
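As a minimal, hypothetical illustration (the class name and buffer sizes are arbitrary), each buffer below is a tiny Java object on the heap but reserves a comparatively large block of native memory that is only released after the buffer becomes unreachable and the garbage collector processes its reference:

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

public class DirectBufferExample {
    public static void main(String[] args) {
        List<ByteBuffer> buffers = new ArrayList<>();
        // 8 x 8MB = 64MB of native memory held by objects that barely register
        // on the Java heap. In a long-running application, buffers like these
        // can be tenured, and their native memory is then not released until a
        // tenured (global) collection notices they are unreachable.
        for (int i = 0; i < 8; i++) {
            ByteBuffer buffer = ByteBuffer.allocateDirect(8 * 1024 * 1024);
            buffer.put(0, (byte) 1); // touch the buffer so the allocation is used
            buffers.add(buffer);
        }
        System.out.println("Allocated " + buffers.size() + " direct buffers");
    }
}
```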

This type of problem is particularly bad with generational collectors because the whole purpose of a generational collector is to minimize the collection of the tenured space (ideally never needing to collect it). If a DBB is tenured, it puts little pressure on the tenured heap because the Java object itself is very small. Even if the DBB is ready to be garbage collected, its PhantomReference can only become ready during a tenured collection. Here is a description of this problem (which also talks about native classloader objects, but the principle is the same):

"If an application relies heavily on short-lived class loaders, and nursery collections can keep up with any other allocated objects, then tenure collections might not happen very frequently. This means that the number of classes and class loaders will continue increasing, which can increase the pressure on native memory... A similar issue can arise with reference objects (for example, subclasses of java.lang.ref.Reference) and objects with finalize() methods. If one of these objects survives long enough to be moved into tenure space before becoming unreachable, it could be a long time before a tenure collection runs and "realizes" that the object is dead. This can become a problem if these objects are holding on to large or scarce native resources. We've dubbed this an "iceberg" object: it takes up a small amount of Java heap, but below the surface lurks a large native resource invisible to the garbage collector. As with real icebergs, the best tactic is to steer clear of the problem wherever possible. Even with one of the other GC policies, there is no guarantee that a finalizable object will be detected as unreachable and have its finalizer run in a timely fashion. If scarce resources are being managed, manually releasing them wherever possible is always the best strategy."

Essentially the problem boils down to either:

1. There are too many DBBs being allocated (or they are too large), and/or

2. The DBBs are not being cleared up quickly enough.

It is very important to verify that the volume and rate of DBB allocations are expected or optimal. If you would like to determine who is allocating DBBs (problem #1), of what size, and when, you can run a DirectByteBuffer trace. Test the overhead of this trace in a test environment before running in production.

One common cause of excessive DBB allocations is the default WAS WebContainer channelwritetype value of async. In this mode, all writes to servlet response OutputStreams (e.g. static file downloads from the application or servlet/JSP responses) are sent to the network asynchronously. If the network and/or the end-user do not keep up with the rate of network writes, the response bytes are buffered in DBB native memory without limit. Even if the network and end-user do keep up, this behavior may simply create a large volume of DBBs that can build up in the tenured area. You may change channelwritetype to sync to avoid this behavior, although servlet performance may suffer, particularly for end-users on WANs. A feature request has been opened to change the default behavior.

If you would like to clear up DBBs more often (problem #2), there are a few options:

1. Use -XX:MaxDirectMemorySize=$bytes. Specifying MaxDirectMemorySize will force the DBB code to run System.gc() when the sum of outstanding DBB native memory would exceed $bytes. This option may have performance implications. When using this option, ensure that -Xdisableexplicitgc is not used. The optimal value of $bytes should be determined through testing: the larger the value, the less frequent the System.gc()s will be, but the longer each tenured collection will be. For example, start with -XX:MaxDirectMemorySize=1024m, gather throughput, response time, and verbosegc garbage collection overhead numbers, and compare to a baseline. Double and halve this value, determine which direction is better, and then do a binary search for the optimal value.

2. Use -Xgc:maxScavengeBeforeGlobal=N to force a System.gc() after every N scavenges. This option may have performance implications. Find the optimal value of N through the same testing approach as #1.

3. Use a non-generational garbage collection policy such as -Xgcpolicy:optthruput or -Xgcpolicy:optavgpause. This option may have performance implications. As the article quoted above mentions, this may not completely solve the issue.

In most cases, something like -XX:MaxDirectMemorySize=1024m (and ensuring -Xdisableexplicitgc is not set) is a reasonable solution to the problem.

On recent versions of IBM Java, a javacore (kill -3 ${PID}) shows how much native memory is currently in use by DirectByteBuffers. For example:
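An illustrative (not verbatim) fragment of the javacore NATIVEMEMINFO section; the exact tree structure, prefixes, and values vary by Java level:

```
3MEMUSER       |  +--Direct Byte Buffers: 134,217,728 bytes / 128 allocations
```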