The Hotspot JVM is a Ponzi Scheme

But it’s not the only one!

Every once in a while I run into someone that has a Hotspot JVM that is running back to back garbage collections and yet the heap is still almost full after each attempt! When they discover that their problem is related to the JVM not having enough memory they often ask the question, why didn’t the JVM throw an OutOfMemoryError (OOME)? After all my application is not making any forward progress and the reason is Java heap is almost completely exhausted, right?

First things first, if you’re lucky, you may never realize that your JVM is running a Ponzi scheme on you. It’s telling your application that you have a infinite amount of memory so go ahead and just use it. The JVM’s garbage collectors are responsible for maintaining this illusion that all’s well in memory land.

It’s not the only Ponzi scheme in town mind you! This behaviour is similar to how many O/S’s deal with memory allocations for native applications. You’ll keep on allocating native memory until you hit swap and then you’re in a bad way – not completely paused but fairly close to it due to low disk speed compared to main memory.

In order to maintain this illusion the garbage collector has to magically freeze time using a "safe-pointing" spell so that the application doesn’t even realize what happened. However, as much as time may have stopped for your application, time hasn’t stopped for those using your application. If your application keeps a lot of live data the garbage collector is going to have to work much harder to maintain the illusion. While your application may not notice how often or for how long it’s been frozen for, your users will!

Because your application trusts the JVM and the JVM will work hard, heroically (even foolishly) to maintain the illusion when it’s clear the gig is up leaving people wondering, "why didn’t my app throw an OOME?"

An example using a ParNew Young collection (with CMS)

Lets follow a GC log fragment to see if we can sort things out, here we start at the beginning of about 10 seconds worth of log.

This is a normally completing parallel collection of Young most likely triggered by an allocation failure in Eden. Let’s break out the numbers:

The live bytes left behind total106307K

The live bytes in Survivor space is 2110K

This implies the occupancy of Old is now at 104197K (106307-2110)

Driving into the numbers a little deeper:

However total heap size is 126912K

Of which Young is 19136K

This implies that Tenured is 107776K

With a little more math we can see that Old is 104197K/107776K or just under 97% occupied, which is dangerously close to being completely full!

The CMS that follows

The next set of records show that the previous ParNew was running inside a CMS cycle and that cycle did manage to complete. Yet the next record following the end of the CMS cycle is another CMS cycle.. The reason why, is that the previous cycle recovered 104197K – 101397K = 2800K, which is approximately 2.5% of Tenured leaving the GC just able to continue, but one hiccup from being exposed!

A Concurrent Mode Failure (CMF) almost seems inevitable at this stage and sure enough…

The following Concurrent Mode Failure (CMF)

The next records show that unfortunately for the collector, that one bad thing has just happened, the CMS concurrent-mark barely gets a chance to do it’s thing and now that pesky ParNew wants to promote a lot of data into tenured and can’t as there isn’t enough space to do so.

To make things worse, the ParNew’s attempts to get memory caused the CMS cycle to fail (concurrent mode failure) and rather than let the application know all is not well and to keep the game going, GC decides to call in the big gun, a Full GC. But even with this move, things aren’t looking good as the Full GC ends up leaving 107775K in Old and it’s only 107776K in size! That’s close to 100% full. However we still can move on as Young contains 1095K (108870K-107775K) which is all stuffed into a Survivor space. That was a close call but GC still has to scramble if it is to keep the Ponzi scheme from being exposed…

A following Full GC

Another Full GC now occurs to paper over the problem that the JVM has run out of memory. This one happens at 87.734 seconds after JVM startup. The previous pause time was 0.7589328. Add that to 86.875 and you get 87.634 meaning the application got just about 100ms of run time before it was once again interrupted.

But once again the heroic effort buys the GC a wee bit more time before it attempts yet another CMS cycle, a ParNew that fails and reverts (silently in this case) to yet another Full GC all the while stringing along the application into believing all is well when it isn’t.

And the misery continues

Yet more back to back CMS cycles with concurrent mode failures all indicating that the garbage collector is just able to keep things going but in all honesty you have to start wondering if the cost is too high and if at this point the better thing to do would be to throw some sort of warning or error.

Defining Out Of Memory

The elephant in the room is that Java heap is too small to support this application. A larger heap would help GC perpetuate the Ponzi scheme indefinitely. While the application may not realize that all is not well, end users surely do. We’d really like the application to complain before the users do. Unfortunately we don’t have an AlmostOutOfMemoryError but we can redefine when an OutOfMemoryError is thrown but adjusting the parameters GCTimeLimit and GCHeapFreeLimit.

The default values for GCTimeLimit is 98% meaning that if 98% of recent time is spent in GC than an OutOfMemoryError will be thrown. GCHeapFreeLimit is the amount of free heap available after a collection. The default value for this variable is 2%.

If we run the numbers from the GC log fragment we can see that GC does just well enough that it never exceeds either of these thresholds. Thus the GC is able to keep the Ponzi scheme limping along. And since these values are some what arbitrary, one could redefine them to say; if you have to work so hard to maintain the illusion maybe it’s best to just throw in the towel and let the application know that it’s struggling for memory. In this case setting the GCHeapFreeLimit to 5% and or setting GCTimeLimit to 90% would have triggered an OutOfMemoryError. This would have been the biggest hint as to why this application was not responding well and would have helped the victims of the Ponzi scheme know that they’d been had!