WebSphere Performance - Alexandre Polozoff's Point of View

Meeting with various WebSphere Application Server-based customers over the past few weeks reminded me of the importance of the basic, fundamental performance tuning tasks. The InfoCenter provides information on tuning at the OS level, TCP/IP, the JVM, etc. I have visited no fewer than three different environments running WebSphere Application Server without these base tunings. Just by applying the base tunings to the OS and JVM we saw as much as 99% less garbage collection, improved response time and throughput, and lower CPU utilization with the same production loads. The best part of following these instructions is that the administrator does not need to be a performance guru to realize these gains. These improvements also help save money because less capacity is required going forward.

In the "Tuning the JVM" section I have never been disappointed with the "Option 1" settings. Options 2 and 3 require the ability to place the application under load/stress test. If you do not have a load/stress test environment (i.e. you have to test in production) then stick with "Option 1".

Notice that "Tuning Performance" has several sections for both application developers and WebSphere administrators. This is because we all know that to realize the best performance gains one has to optimize the application code. Runtime tuning can realize 5-15% but application code improvements can see 300% and higher performance improvements.

I am often asked specifically what metrics should be monitored in WebSphere Application Server. This list can be considered a starting point. The WAS admin will frequently use the "Custom" PMI setting and disable any metrics not in this list.

IBM provides a tool to help with analysis of client side performance. One of the benefits of doing this is that it helps identify static objects that are not being cached at the browser. In addition, for JSF based applications, this tool helps identify large client side caches.

As I travel the world working performance problems I never see Microsoft Windows environments used outside the developer's desktop. Surprisingly these past couple of weeks I've been working in an environment where Microsoft Windows is used for the IBM HTTP Server tier with the WebSphere Application Server plug-in. Under normal operating conditions everything seems to work nominally.

However, much to my surprise, if we took down any of the application servers in the cluster of this very large cell I saw an anomaly. When the plug-in was attempting to route traffic to the downed application servers there seemed to be a really long lag on the connection refused processing. In fact, I was seeing at least a second to get through the TCP/IP roundtrip. This made no sense to me. One of my colleagues, Keys Botzum, took a Java application and ran it on both Windows and Solaris. The application simply tried to connect to localhost (to eliminate any DNS lookups or network latency from the test) on a port no one was listening to and looped around 20 times. On Windows the test took slightly over 20 seconds. On Solaris it took less than a second (which was the behaviour I was expecting on Windows).
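The original test program is not reproduced here, so what follows is only a minimal sketch of that kind of test. The class name ConnectRefusedTimer, the 5 second connect timeout and the way a closed port is chosen are all my own assumptions. It uses 127.0.0.1 rather than a host name so that no name resolution is involved.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;

// Hypothetical test harness: times repeated connects to a port nobody listens on.
class ConnectRefusedTimer {

    // Picks a port that is very likely closed: bind an ephemeral port, then release it.
    static int findClosedPort() {
        try (ServerSocket probe = new ServerSocket(0)) {
            return probe.getLocalPort();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Tries to connect 'attempts' times and returns the total elapsed milliseconds.
    static long timeRefusedConnects(int port, int attempts) {
        long start = System.nanoTime();
        for (int i = 0; i < attempts; i++) {
            try (Socket socket = new Socket()) {
                // 5000 ms is an arbitrary upper bound chosen for this sketch
                socket.connect(new InetSocketAddress("127.0.0.1", port), 5000);
            } catch (IOException expected) {
                // "Connection refused" is the expected outcome; keep looping
            }
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        int port = args.length > 0 ? Integer.parseInt(args[0]) : findClosedPort();
        long elapsedMs = timeRefusedConnects(port, 20);
        System.out.println("20 connects to closed port " + port + " took " + elapsedMs + " ms");
    }
}
```

Running the same class on each operating system in the IHS tier gives a quick side-by-side comparison of how long connection refused processing takes.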

If you are using, or planning to use, Microsoft Windows on the IHS tier be aware of this strange failure scenario on Windows. I'll try to investigate and see if there are any Windows settings to help tune this, though the plan is to move off Windows to Red Hat Linux, which right now sounds like the right move to me.

The above string (...) is the rest of the Log Format line. To also print out the JSESSIONID cookie in the Apache access log add the above JSESSIONID string to the end of the Log Format directive. This is helpful because the JSESSIONID string contains the ID of the clone on which the user established their session. This way if a user is having problems the administrator will know which clone the user was pinned to. This helps immensely with troubleshooting because the administrator knows which log file they need to look at when the error occurs. Test this out in the test environment first. Then in production make sure disk space is monitored to ensure that the disk does not run out of space because of the additional logging data.
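Once the cookie is in the access log, pulling the clone out of it is easy to script. The sketch below assumes the usual WebSphere layout where the clone ID follows the first colon in the cookie value, so verify that against your own logs. The class name and the sample value are made up for illustration.

```java
// Hypothetical helper: extracts the clone portion of a logged JSESSIONID value.
class CloneIdExtractor {

    // Returns everything after the first ':' (assumed to be the clone ID or IDs),
    // or an empty string when the value carries no clone information.
    static String cloneIdOf(String jsessionId) {
        int colon = jsessionId.indexOf(':');
        return colon < 0 ? "" : jsessionId.substring(colon + 1);
    }

    public static void main(String[] args) {
        // Made-up sample value, not taken from a real log
        String sample = "0000ExampleSessionId:exampleClone1";
        System.out.println("clone = " + cloneIdOf(sample));
    }
}
```

The returned clone ID can then be matched against the clone IDs in the plug-in configuration to find the application server, and therefore the log file, to look at.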

Report scheduler enhancements in Maximo v7.5. As with any online transaction application most enterprises need to pull reports from their environment. Reports tend to be (a) scheduled to repeat and (b) heavy users of CPU and memory. Therefore having more control on the report scheduler is a good thing to look at in Maximo v7.5.

Upgrades occur for a number of reasons. Hardware or middleware software goes end of life. Going to a new operating system is another common upgrade. Applications need upgrading to provide new functionality. One aspect of upgrading is trying to understand how the "new" environment compares with the "old" environment. Standing up new hardware typically means faster CPUs and more RAM. Additionally, new virtualization decisions (e.g. LPARs, VMs, etc.) may be under consideration as more environments move toward shared and/or cloud infrastructure, along with the question of how those decisions may impact performance.

A couple of years ago I wrote a comment line about using multiple cells in production. The article covers a common method for providing both high availability and/or continuous availability (i.e. no down time due to planned maintenance) in production environments. However, that isn't the only reason for using multiple cells. My next series of posts looks at using a similar strategy in non-production environments specifically around performance comparison for upgrades to hardware, middleware software or the application. In addition, I'll be investigating the ability to also compare infrastructural changes that may occur due to newer/better hardware capabilities.

I'm always on the lookout for interesting reading on the topic of software complexity and failure. Through serendipity I came across this fascinating document from NASA [ http://www.nasa.gov/pdf/418878main_FSWC_Final_Report.pdf ] on just this topic. I think one could easily remove the word "flight" from this document and see immediate applicability to one's own enterprise environment.

After an application outage or an extremely negative performance event one needs to conduct root cause analysis to try and determine the next corrective course of action. Having done this many times let me document some of the steps done in the first/initial phases of trying to figure out just what happened.

1. Inventory

The first task is to inventory what you have, how it is configured and deployed. This includes all software version information, configuration items for the application, pool sizes, etc.

Once that information is gathered, work out what may be missing and start asking a lot of questions. Is the software at the latest version or fixpack level? If not, why not? Is there anything in the patches subsequent to the version in production that may address the problems encountered? Are there any odd configurations (e.g. JDBC pool size is 3x larger than the thread pool size, 300 second timeouts, etc.)? Understand odd configurations and try to determine why they exist. Often this is difficult because the people that initially configured and deployed the environment have moved on to other projects and the team you're dealing with is simply in maintenance mode.

2. Discovery / Data Collection

In order to solve a problem we have to have data about the problem. No data, no resolution because any decision is just a guess. Guesses do not work. My assumption here is we are investigating Java based applications.

a. Were thread dumps collected during the negative event? If not, why not? On Unix based systems thread dumps are collected using 'kill -3 <pid>' (this doesn't "kill" the process; it just sends signal #3 to the JVM, which catches it and dumps all the Java threads at that point in time). Collect thread dumps during all negative events in the future if they were not caught in the past. Thread dumps are a crucial piece of the puzzle to help narrow down what is going wrong.

b. Is verbose GC (garbage collection) enabled? If not, why not? Verbose GC (and the term is unfortunate as it is not that verbose) is another crucial piece of data for understanding what the memory utilization was like during the negative event.

c. If the application was written in house then initiate a code review. Software is written by humans and humans err. It could be a bug in the application that only kicks in during the appropriate planetary alignment event. Reviewing code, on a periodic basis, is a good idea in general even if you are not having any problems.

d. What backends are the applications accessing? Is there any information from the backend that would indicate participating in the negative event (i.e. log files, DB2 snapshots, etc)? It would not be the first time that some negative condition in the backend was causing a front end backlog. It could also be related to bugs in the application (see 2c above).

e. Are any application monitoring tools in place? Java is a robust environment that allows for rather detailed application monitoring of various factors like pool utilization, application response time, SQL response times, etc. Not having an application monitor in place simply limits the ability to understand what happened. Having an application monitor in place also allows for alerts to be issued when a negative event is detected. This allows for proactive actions to be taken by people who can troubleshoot the problem and hopefully fix it before the users ever notice.

f. Look in the application log files. There may be an indication of what is going on in the application logs. This really depends on how well the developers implemented logging in the application and may or may not be of any use. Fingers crossed!
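For items (a) and (b), 'kill -3' and verbose GC remain the primary tools. As a complementary sketch only, a JVM can also report its own thread stacks and garbage collection counters through the standard java.lang.management API, which is handy when shell access to the process is not available. The class name DiagnosticsSnapshot is hypothetical, and the output is much less detailed than a real javacore or verbose GC log.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical helper: captures thread stacks and GC counters from inside the JVM.
class DiagnosticsSnapshot {

    // Builds a simple text dump of every live thread and its stack trace.
    static String threadDump() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        StringBuilder out = new StringBuilder();
        // false, false: do not collect locked monitor or synchronizer details
        for (ThreadInfo info : threads.dumpAllThreads(false, false)) {
            out.append('"').append(info.getThreadName()).append("\" ")
               .append(info.getThreadState()).append('\n');
            for (StackTraceElement frame : info.getStackTrace()) {
                out.append("    at ").append(frame).append('\n');
            }
            out.append('\n');
        }
        return out.toString();
    }

    // Summarizes cumulative collection counts and times for each collector.
    static String gcSummary() {
        StringBuilder out = new StringBuilder();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            out.append(gc.getName())
               .append(": collections=").append(gc.getCollectionCount())
               .append(", timeMs=").append(gc.getCollectionTime()).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(threadDump());
        System.out.print(gcSummary());
    }
}
```

Taking several snapshots a few seconds apart, as one would with thread dumps, shows which threads stay stuck in the same place.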

Get through this initial set of steps and then you can go on to the next phase, which is actually figuring out just what went wrong. I'll write about that in my next installment.

Your application is slow. You get a thread dump, look in the javacore and see lots of threads in ClassLoader.loadClass() with one thread holding the lock. You need to check your FFDC logs and look for "Too many open files." This means you haven't tuned the OS ulimit parameters and probably many others. Look in the InfoCenter for performance tuning and operating systems and pick the page for your OS. This should be the first link in the InfoCenter you access after you install WebSphere Application Server.

Someone takes a javacore during what looks to be a hung app server and notices it contains lots of threads in socketRead. This is symptomatic of a slow back end whether it is a database, Web service, etc. An application is as strong as its weakest link. If the backend the application depends on is unable to respond in a timely manner then there is nothing that can be tuned at the application layer except for aggressive timeouts to protect the application from getting stuck. Hangs like these typically happen under high load/traffic conditions. It is important that the group that maintains the backend is aware of an issue with their tier and they need to fix it.
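To make "aggressive timeouts" concrete, here is a minimal sketch at the plain socket level. Real applications would normally set the equivalent timeouts on their JDBC driver, HTTP client or Web service stack instead. The class name, the timeout values and the silent local server standing in for a hung backend are all assumptions made for the demonstration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Hypothetical demo: a read timeout keeps a thread from sitting in socketRead forever.
class BackendTimeoutDemo {

    // Returns true if the read timed out, false if data (or end of stream) arrived in time.
    static boolean readTimesOut(String host, int port, int connectMs, int readMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), connectMs); // bound the connect
            socket.setSoTimeout(readMs);                                  // bound each read
            try {
                socket.getInputStream().read();
                return false;
            } catch (SocketTimeoutException slowBackend) {
                return true; // give up instead of blocking the thread indefinitely
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Stand-in for a hung backend: the listening socket lets the connect complete
    // through its backlog but never sends a byte.
    static boolean demoAgainstSilentBackend(int readMs) {
        try (ServerSocket silent = new ServerSocket(0)) {
            return readTimesOut("127.0.0.1", silent.getLocalPort(), 1000, readMs);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("read timed out: " + demoAgainstSilentBackend(500));
    }
}
```

Without the setSoTimeout call the read would block until the backend answered, which is exactly the socketRead pile-up seen in the javacore.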

I'm always on the lookout for new tools, especially in the performance and troubleshooting space. This morning I saw a presentation on the IBM Performance Analyst and am pretty impressed by what some of my colleagues have been working on.

Testing requires a tool and for one of my projects I'm using JMeter. I'm testing an https based site and was just having a hard time figuring out what was going on. I kept seeing an error in the JMeter "view results tree" that just said "ensure browser is set to accept the jmeter proxy certificate". I started researching that phrase and got nowhere quickly.

However, in the jmeter/bin subdirectory I found the jmeter.log file. In there I found a java.security.NoSuchAlgorithmException referencing the SunX509 KeyManagerFactory. Ah ha, yes, I'm running the IBM JRE and not the Sun version.
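A quick way to confirm which KeyManagerFactory algorithms a given JRE actually offers is to ask its security providers. This is only a small sketch using the standard java.security API, and the class name KeyManagerAlgorithms is my own. Run it with the same JRE that runs JMeter.

```java
import java.security.Provider;
import java.security.Security;
import java.util.Set;
import java.util.TreeSet;
import javax.net.ssl.KeyManagerFactory;

// Hypothetical helper: lists the KeyManagerFactory algorithms known to this JRE.
class KeyManagerAlgorithms {

    // Collects every algorithm name registered under the KeyManagerFactory service type.
    static Set<String> available() {
        Set<String> names = new TreeSet<>();
        for (Provider provider : Security.getProviders()) {
            for (Provider.Service service : provider.getServices()) {
                if ("KeyManagerFactory".equals(service.getType())) {
                    names.add(service.getAlgorithm());
                }
            }
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println("default   = " + KeyManagerFactory.getDefaultAlgorithm());
        System.out.println("available = " + available());
    }
}
```

If SunX509 is missing from the printed list, any code that asks for it by name will fail with the NoSuchAlgorithmException seen in jmeter.log.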

Unfortunately changing the jmeter.properties proxy.cert.factory=IbmX509 (and of course uncommenting it) had no effect and I got the same SunX509 exception. I decided to try it at the command line as: