Staying on top of the JVM crashes

November 22, 2016 by
Ago Allikmaa
Filed under:
Java

I am working at Plumbr and am responsible for the native parts of our Agent monitoring Java Virtual Machines. As we have discovered, around one out of every 100 JVMs have crashed at least once during the six months when we have been capturing this data. So one of the solutions I have built during 2016 was related to capturing evidence from crashed JVMs.

This post is describing the reasons why JVMs get terminated and the different approaches you can take to harvest more evidence in such situations.

What is causing the JVM crashes?

JVM crashes tend to originate from two different sources. There are some more exotic cases, but running out of native memory and segmentation faults are the root cause for 99+% or the crashes we have seen so far.

Running out of native memory. In this case, the first line in the crash log will say “There is insufficient memory for the Java Runtime Environment to continue”. In this case, the action plan is to determine why there was not enough memory. This has three common causes:

The Xmx of the process was set too high and did not give any headroom for any off-heap memory the JVM or other processes on the machine require. One important thing to keep in mind is that in addition to the heap memory, the JVM process has separate metaspace and code cache regions and also needs some memory for its other housekeeping tasks. All these combined should definitely not add up to more than 90% of the system memory and to be extra safe, it is best to keep up to half of the system memory free.

There is a leak of native resources, such as native memory allocated by native libraries or direct buffers. This can be difficult to debug as native memory is not as easily explorable as heap memory is via memory dumps. Likely suspects in this case can be classes which allocate native memory or direct buffers. Look out for native resources which are freed in a finalizer, as there are cases when the finalizer thread cannot keep up with the instances that need to be cleaned up, resulting in what looks like a memory leak.

Something else on the system is using up more memory than expected. If the native memory usage of the JVM process is not higher than expected, but the JVM still crashed with this error, this is a likely cause. In this case it is a good idea to keep track of the memory usage of all processes in the system and when this occurs, see if the memory usage of any other process was high at that moment.

Segmentation fault. A In this case the log will start with the message “A fatal error has been detected by the Java Runtime Environment”. In this case, the first thing to look at is the stack trace of the thread the crash occurred in. The first line of this stack trace is right at the top of the log under the line “Problematic frame:”. The rest of the stack trace can be found a bit lower in the crash log. The library name in that stack trace is the first indicator of what might be responsible. This helps to give an initial guess which of the common categories it falls under:

Bug in a native library. In case the crash log mentions a native library which is not part of the JRE, it is likely that it is the cause of the crash, especially so if the first line of the crash log is in that library. In this case the best course of action is to contact the vendor of that library or check if it has a new version, especially if the changelog indicates a fix to a crash. Should any crash log mention a Plumbr library in the stack trace, contact our support immediately as guaranteeing the stability of our product is of the utmost priority for us. There have been some cases where we have discovered that we have been either directly or remotely responsible for a crash and thanks to having these logs available we have been able to resolve these issues quickly.

JVM bug. If the stack trace contains only JVM entries or happens inside a JIT-compiled method, the JVM is the likely culprit. First thing to do in such a case is to search for the first frame of the stack trace, which could lead to an issue on the issue tracker for the JVM. In case of an outdated JVM version, a good course of action is to just update the JVM. Sometimes these crashes are difficult to link to a JVM bug even if it is present in a changelog or in the issue tracker of the JVM. In case updating is not overly complicated, it is easier to update to exclude any bugs that may have been fixed already. Should there be any suspicion that this type of crash occurs only when using Plumbr or significantly more often with it, then do not hesitate to contact us, as we will make sure to do our best to resolve the issue or at least provide a workaround for it.

Unsafe operations called from Java. If you see sun.misc.Unsafe in the stack trace of the crash log, the most likely culprit is the first non-JDK class in the stack trace.

From more exotic cases, failed assertions and stack buffer overflows are two that I have encountered a handful times:

Failed asserts in the JVM. In case the JVM detects that it is in an invalid state, it may terminate itself. In this case, the first line indicating a fatal error is the same as for segmentation faults, but the next one after that indicates an internal error and includes the source file and line number in the JVM source code where the assert is, for example “Internal Error (sharedRuntime.cpp:721)“. This is usually caused by a JVM bug.

There are some other types of crashes which may not produce a crash log at all. One example is stack buffer overflow in either the JVM or some other native library. In this case some details about the overflow may be found in the standard error output of the process. Additionally, while the JVM always creates a crash log in case of a failed assert, when asserts fail in other libraries, they generally terminate the process directly without giving the JVM a chance to do anything.

Where to look for the symptoms of the JVM crash?

Segmentation faults, running out of native memory and assertion errors all result in similar symptoms from the operating system level. You can spot the symptoms from two sources, one of which is the appearance of hs_err_<pid>.log (usually) in the JVM’s working directory. The other location to look for a crash will be embedded in the log while where stderr is streamed to. Such logs would then contain snippets like following:

So you should already notice two possibilities to be alerted in cases when such errors occur:

From system monitoring perspective, you can periodically check the working directories of the JVMs and whenever a file named hs_err<pid>.log appears in any of them, raise an alert. As seen, the location of this file can be platform-dependent, so you need to be careful when making this expectation.

From log monitoring perspective, you can trigger alerts based on the log record format. Segmentation faults end up with errors similar to the example above, so matching a log entry to “A fatal error has been detected by the Java Runtime Environment” could also be what you are looking for.

As Plumbr is neither system nor log monitoring solution, we had to take a different route. In next section we are going to expose what and how we are doing in this regards.

How Plumbr exposes JVM crashes.

In general, Plumbr is all about monitoring end user experience in regards of performance and availability. We do this by an Agent attached to Java Virtual Machines, which is using instrumentation and bytecode modification to get the information from the JVM monitored. The Agent harvests data, packages it and periodically sends it to Server where it is analyzed and results exposed.

Besides monitoring your applications at individual user interaction level, Plumbr also embeds a concept of a downtime. A downtime is detected and alert sent out when all the JVMs servicing the first tier in the application monitored with Plumbr are not responding. So in Plumbr UI the location for such JVM crashes is embedded in downtime events view. If the JVM has crashed, our UI will confirm this in downtime views by stating “Crashed due to a fatal error”. Clicking on it will reveal the full crash log and allow to navigate between all the crashes for the JVM in the selected time period.

So how do we get the information from a JVM which has just crashed? If we would try to do it from within the same process that just got killed we would obviously not succeed. So in order to detect crashes, Plumbr starts a separate native process when the JVM is launched waiting until the JVM process has exited. The process is started in a way which is unaffected by the death of the JVM process. The goal for the process is to periodically check if the JVM process is still alive.

On Linux and OS X this check is done by checking if the parent process ID is still the JVM process. In case the JVM has exited, this process is adopted by another one and the parent process ID changes. In Windows, this check is done by waiting on the handle of the parent process, which times out if the process has not exited yet.

Once this native process has detected that the JVM has exited, it looks for the crash log the JVM would have created if it had crashed. The possible places for the crash log depend on the JVM vendor, operating system, process ID and JVM parameters.

The default location is usually the working directory of the JVM process. For certain OpenJDK distributions or in case the working directory is not writable, it will end up in the temporary directory. Additionally, the location can be manually configured with the -XX:ErrorFile JVM parameter. Regardless of where this log ends up on the machine, it will be accessible from the Plumbr downtime events list.

After looking for the crash log, the process notifies Plumbr Server of the JVM exit, including the crash log if it was found. This also allows Plumbr Server to reliably know that the process is dead and is not coming back. In case sending this notification to the server fails, it will be retried from the JVM process the next time it is launched, to make sure it doesn’t get lost.

Take-away

Being proud of my craft, I do recommend adding Plumbr to your monitoring stack. Besides getting insights into how end users are using your applications and detecting performance bottlenecks, we do expose the nastiest crashes to you in a really convenient way. If you have not yet used Plumbr, give us a test drive by signing up for a free trial.