What could cause global Tomcat/JVM slowdown?

The system runs just fine, but after a greatly variyng amount of time all instances of the application at the same time suddenly suffer from rising response times. Basically the application still works, but the response times are about three times higher.

This are two diagrams displaying the response time of two certain short workflows/actions (log in, access list of seminars, ajax-refresh this list, log out; the lower line is just the request time for the ajax refresh) of two example instances of the application:

As you can see both instances of the application "explode" at the exact same time and stay slow. After restarting the server everything's back to normal. All the instances of the application "explode" simultaneously.

We're storing the session data to a database and use this for clustering. We checked session size and number and both are rather low (meaning that on other servers with other applications we sometimes have larger and more sessions). The other Tomcat in the cluster usually stays fast for some more hours and after this random-ish amount of time it also "dies". We checked the heap sizes with jconsole and the main heap stays between 2.5 and 1 GB size, db connection pool is basically full of free connections, as well as the thread pools. Max heap size is 5 GB, there's also plenty of perm gen space available. The load is not especially high; there's just about 5% load on the main CPU. The server does not swap. It's also no hardware issue as we additionally deployed the applications to a VM where the problems remain the same.

I don't know where to look anymore, I am out of ideas. Has someone an idea where to look?

2013-02-21 Update: New Data!

I added two more timing traces to the application. As for the measurement: the monitoring system calls a servlet that performs two tasks, measures execution time for each on the server and writes the time taken as response. These values are logged by the monitoring system.

I have several interesting new facts: a hot redeployment of the application causes this single instance on the current Tomcat to go nuts. This also seems to affect raw CPU calculation performance (see below). This individual-context-explosion is different from the overall-context-explosion that occurs randomly.

Now for some data:

First the individual lines:

Light blue is total execution time of a small workflow (details see above), measured on the client

Red is "part" of light blue and is the time taken to perform a special step of that workflow, measured on the client

Dark blue is measured in the application and consists of reading a list of entities from the DB through Hibernate and iterating over that list, fetching lazy collections and lazy entities.

Green is a small CPU benchmark using floating point and integer operations. As far as I see no object allocation, so no garbage.

Now for the individual stages of explosion: I marked each image with three black dots. The first one is a "small" explostion in more or less only one application instance - in Inst1 it jumps (especially visible in the red line), while Inst2 below more or less stays calm.

After this small explosion the "big bang" occurs and all application instances on that Tomcat explode (2nd dot). Note that this explosion affects all high level operations (request processing, DB access), but not the CPU benchmark. It stays low in both systems.

After that I hot-redeployed Inst1 by touching the context.xml file. As I said earlier this instance goes from exploded to completely devestated now (the light blue line is out of the chart - it is at about 18 secs). Note how a) this redeployment does not affect Inst2 at all and b) how the raw DB access of Inst1 is also not affected - but how the CPU suddenly seems to have become slower!. This is crazy, I say.

Update of update
The leak prevention listener of Tomcat does not whine about stale ThreadLocals or Threads when the application is undeployed. There obviously seems to be some cleanup problem (which is I assume not directly related to the Big Bang), but Tomcat doesn't have a hint for me.

2013-02-25 Update: Application Environment and Quartz Schedule

The application environment is not very sophisticated. Network components aside (I don't know enough about those) there's basically one application server (Linux) and two database servers (MySQL 5 and MSSQL 2008). The main load is on the MSSQL server, the other one merely serves as a place to store the sessions.

The application server runs an Apache as a load balancer between two Tomcats. So we have two JVMs running on the same hardware (two Tomcat instances). We use this configuration not to actually balance load as the application server is capable of running the application just fine (which it did for years now) but to enable small application updates without downtime. The web application in question is deployed as separate contexts for different customers, about 15 contexts per Tomcat. (I seemm to have mixed up "instances" and "contexts" in my posting - here in the office they're often used synonymously and we usually magically know what the colleague is talking about. My bad, I'm really sorry.)

To clarify the situation with better wording: the diagrams I posted show response times of two different contexts of the same application on the same JVM. The Big Bang affects all contexts on one JVM but doesn't happen on the other one (the order in which the Tomcats explode is random btw). After hot-redeployment one context on one Tomcat instance goes nuts (with all the funny side effects, like seemingly slower CPU for that context).

The overall load on the system is rather low. It's an internal core business related software with about 30 active users simultaneously. Application specific requests (server touches) are currently at about 130 per minute. The number of single requests are low but the requests itself often require several hundred selects to the database, so they're rather expensive. But usually everything's perfectly acceptable. The application also does not create large infinite caches - some lookup data is cached, but only for a short amount of time.

Above I wrote that the servers where capable of running the application just fine for several years. I know that the best way to find the problem would be to find out exactly when things went wrong for the first time and see what has been changed in this timeframe (in the application itself, the associated libraries or infrastructure), however the problem is that we don't know when the problems first occured. Just let's call that suboptimal (in the sense of absent) application monitoring... :-/

We ruled out some aspects, but the application has been updated several times during the last months and thus we e.g. cannot simply deploy an older version. The largest update that wasn't feature change was a switch from JSP to Facelets. But still, "something" must be the cause of all the problems, yet I have no idea why Facelets for instance should influence pure DB query times.

Quartz

As for the Quartz schedule: there's a total of 8 jobs. Most of them run only once per day and have to do with large volume data synchronization (absolutely not "large" as in "big data large"; it's just more than the averate user sees through his usual daily work). However, those jobs of course run at night and the problems occur during daytime. I omit a detailled job listing here (if beneficial I can provide more details of course). The jobs' source code has not been altered during the last months. I already checked whether the explosions align with the jobs - yet the results are inconclusive at best. I'd actually say that they don't align, but as there are several jobs that run every minute I can't rule it out just yet. The acutal jobs that run every minute are pretty low-weight in my opinion, they usually check if data is available (in different sources, DB, external systems, email account) and if so write it to the DB or push it to another system.

However I'm currently enabling logging of indivdual job execution so that I can exactly see start and end timestamp of each single job execution. Perhaps this provides more insight.

2013-02-28 Update: JSF Phases and Timing

I manually added a JSF phae listener to the application. I executed a sample call (the ajax refresh) and this is what I've got (left: normal running Tomcat instance, right: Tomcat instance after Big Bang - the numbers have been taken almost simultaneously from both Tomcats and are in milliseconds):

RESTORE_VIEW: 17 vs 46

APPLY_REQUEST_VALUES: 170 vs 486

PROCESS_VALIDATIONS: 78 vs 321

UPDATE_MODEL_VALUES: 75 vs 307

RENDER_RESPONSE: 1059 vs 4162

The ajax refresh itself belongs to a search form and its search result. There's also another delay between the application's outmost request filter and web flow starts its work: there's a

FlowExecutionListenerAdapter

that measures time taken in certain phases of web flow. This listener reports 1405 ms for "Request submitted" (which is as far as I know the first web flow event) out of a total of 1632 ms for the complete request on an un-exploded Tomcat, thus I estimate about 200ms overhead.

But on the exploded Tomcat it reports 5332 ms for request submitted (meaning all JSF phases happen in those 5 seconds) out of a total request duration of 7105ms, thus we're up to almost 2 seconds overhead for everything outside of web flow's request submitted.

Below my measurement filter the filter chain contains a

org.ajax4jsf.webapp.BaseFilter

, then the Spring servlet is called.

2013-06-05 Update: All the stuff going on in the last weeks

A small and rather late update... the application performance still sucks after some time and the behaviour remains erratic. Profiling did not help much yet, it just generated an enormous amount of data that's hard to dissect. (Try poking around in performance data on or profile a production system... sigh) We conducted several tests (ripping out certain parts of the software, undeploying other applications etc.) and actually had some improvements that affect the whole application. The default flush mode of our

EntityManager

is

AUTO

and during view rendering lots of fetches and selects are issued, always including the check whether flushing is neccesary.

So we built a JSF phase listener that sets the flush mode to

COMMIT

during

RENDER_RESPONSE

. This improved overall performance a lot and seems to have mitigated the problems somewhat.

Yet, our application monitoring keeps yielding completely insane results and performance on some contexts on some tomcat instances. Like an action that should finish in under a second (and that actually does it after deployment) and that now takes more than four seconds. (These numbers are supported by manual timing in the browsers, so it's not the monitoring that causes the problems).

See the following picture for example:

This diagram shows two tomcat instances running the same context (meaning same db, same configuration, same jar). Again the blue line is the amount of time taken by pure DB read operations (fetch a list of entities, iterate over them, lazily fetch collections and associated data). The turquoise-ish and red line are measured by rendering several views and doing an ajax refresh, respectively. The data rendered by two of the requests in turquoise-ish and red is mostly the same as is queried for the blue line.

Now around 0700 on instance 1 (right) there's this huge increase in pure DB time which seems to affect actual render response times as well, but only on tomcat 1. Tomcat 0 is largely unaffected by this, so it cannot be caused by the DB server or network with both tomcats running on the same physical hardware. It has to be a software problem in the Java domain.

During my last tests I found out something interesting: All responses contain the header "X-Powered-By: JSF/1.2, JSF/1.2". Some (the redirect responses produced by WebFlow) even have "JSF/1.2" three times in there.

I traced down the code parts that set those headers and the first time this header is set it's caused by this stack:

... at org.ajax4jsf.webapp.FilterServletResponseWrapper.addHeader(FilterServletResponseWrapper.java:384)
at com.sun.faces.context.ExternalContextImpl.<init>(ExternalContextImpl.java:131)
at com.sun.faces.context.FacesContextFactoryImpl.getFacesContext(FacesContextFactoryImpl.java:108)
at org.springframework.faces.webflow.FlowFacesContext.newInstance(FlowFacesContext.java:81)
at org.springframework.faces.webflow.FlowFacesContextLifecycleListener.requestSubmitted(FlowFacesContextLifecycleListener.java:37)
at org.springframework.webflow.engine.impl.FlowExecutionListeners.fireRequestSubmitted(FlowExecutionListeners.java:89)
at org.springframework.webflow.engine.impl.FlowExecutionImpl.resume(FlowExecutionImpl.java:255)
at org.springframework.webflow.executor.FlowExecutorImpl.resumeExecution(FlowExecutorImpl.java:169)
at org.springframework.webflow.mvc.servlet.FlowHandlerAdapter.handle(FlowHandlerAdapter.java:183)
at org.springframework.webflow.mvc.servlet.FlowController.handleRequest(FlowController.java:174)
at org.springframework.web.servlet.mvc.SimpleControllerHandlerAdapter.handle(SimpleControllerHandlerAdapter.java:48)
at org.springframework.web.servlet.DispatcherServlet.doDispatch(DispatcherServlet.java:925)
at org.springframework.web.servlet.DispatcherServlet.doService(DispatcherServlet.java:856)
at org.springframework.web.servlet.FrameworkServlet.processRequest(FrameworkServlet.java:920)
at org.springframework.web.servlet.FrameworkServlet.doPost(FrameworkServlet.java:827)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:641)
... several thousands ;) more

The second time this header is set by

at org.ajax4jsf.webapp.FilterServletResponseWrapper.addHeader(FilterServletResponseWrapper.java:384)
at com.sun.faces.context.ExternalContextImpl.<init>(ExternalContextImpl.java:131)
at com.sun.faces.context.FacesContextFactoryImpl.getFacesContext(FacesContextFactoryImpl.java:108)
at org.springframework.faces.webflow.FacesContextHelper.getFacesContext(FacesContextHelper.java:46)
at org.springframework.faces.richfaces.RichFacesAjaxHandler.isAjaxRequestInternal(RichFacesAjaxHandler.java:55)
at org.springframework.js.ajax.AbstractAjaxHandler.isAjaxRequest(AbstractAjaxHandler.java:19)
at org.springframework.webflow.mvc.servlet.FlowHandlerAdapter.createServletExternalContext(FlowHandlerAdapter.java:216)
at org.springframework.webflow.mvc.servlet.FlowHandlerAdapter.handle(FlowHandlerAdapter.java:182)
at org.springframework.webflow.mvc.servlet.FlowController.handleRequest(FlowController.java:174)
at org.springframework.web.servlet.mvc.SimpleControllerHandlerAdapter.handle(SimpleControllerHandlerAdapter.java:48)
at org.springframework.web.servlet.DispatcherServlet.doDispatch(DispatcherServlet.java:925)
at org.springframework.web.servlet.DispatcherServlet.doService(DispatcherServlet.java:856)
at org.springframework.web.servlet.FrameworkServlet.processRequest(FrameworkServlet.java:920)
at org.springframework.web.servlet.FrameworkServlet.doPost(FrameworkServlet.java:827)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:641)

I have no idea if this could indicate a problem, but I did not notice this with other applications that are running on any of our servers, so this might as well provide some hints. I really have no idea what that framework code is doing (admittedly I did not dive into it yet)... perhaps someone has an idea? Or am I running into a dead end?

Appendix

My CPU benchmark code consists of a loop that calculates Math.tan and uses the result value to modify some fields on the servlet instance (no volatile/synchronized there), and secondly performs several raw integer calcualations. This is not severly sophisticated, I know, but well... it seems to show something in the charts, however I am not sure what it shows. I do the field updates to prevent HotSpot from optimizing away all my precious code ;)

Solution

Increase the maximum size of the Code Cache:

-XX:ReservedCodeCacheSize=256m

Background

We are using ColdFusion 10 which runs on Tomcat 7 and Java 1.7.0_15. Our symptoms were similar to yours. Occasionally the response times and the CPU usage on the server would go up by a lot for no apparent reason. It seemed as if the CPU got slower. The only solution was to restart ColdFusion (and Tomcat).

Initial analysis

I started by looking at the memory usage and the garbage collector log. There was nothing there that could explain our problems.

My next step was to schedule a heap dump every hour and to regularly perform sampling using VisualVM. The goal was to get data from before and after a slowdown so that it could be compared. I managed to get accomplish that.

There was one function in the sampling that stood out: get() in coldfusion.runtime.ConcurrentReferenceHashMap. A lot of time was spent in it after the slowdown compared to very little before. I spent some time on understanding how the function worked and developed a theory that maybe there was a problem with the hash function resulting in some huge buckets. Using the heap dumps I was able to see that the largest buckets only contained 6 elements so I discarded that theory.

Code Cache

I finally got on the right track when I read "Java Performance: The Definitive Guide". It has a chapter on the JIT Compiler which talks about the Code Cache which I had not heard of before.

Compiler disabled

When monitoring the number of compilations performed (monitored with jstat) and the size of the Code Cache (monitored with Memory Pools plugin of VisualVM) I saw that the size increased up to the maximum size (which is 48 MB by default in our environment -- the default varies depending on Java version and Java compiler). When the Code Cache became full the JIT Compiler was turned off. I have read that "CodeCache is full. Compiler has been disabled." should be printed when that happens but I did not see that message; maybe the version we are using does not have that message. I know that the compiler was turned off because the number of compilations performed stopped increasing.

Deoptimization continues

The JIT Compiler can deoptimize previously compiled functions which will caues the function to be executed by the interpreter again (unless the function is replaced by an improved compilation). The deoptimized function can be garbage collected to free up space in the Code Cache.

For some reason functions continued to be deoptimized even though nothing was compiled to replace them. More and more memory would become available in the Code Cache but the JIT Compiler was not restarted.

I never had -XX:+PrintCompilation enabled when we experience a slowdown but I am quite sure that I would have seen either ConcurrentReferenceHashMap.get(), or a function that it depends on, be deoptimized at that time.

Result

We have not seen any slowdowns since we increased the maximum size of the Code Cache to 256 MB and we have also seen a general performance improvement. There is currently 110 MB in our Code Cache.