Out of memory: Kill process or sacrifice child

It is 6 AM. I am awake summarizing the sequence of events leading to my way-too-early wake up call. As those stories start, my phone alarm went off. Sleepy and grumpy me checked the phone to see whether I was really crazy enough to set the wake-up alarm at 5AM. No, it was our monitoring system indicating that one of Plumbr services went down.

As a seasoned veteran in the domain, I made the first correct step towards solution by turning on the espresso machine. With a cup of coffee I was equipped to tackle the problems. First suspect, application itself seemed to have behave completely normal before the crash. No errors, no warning signs, no trace of any suspects in the application logs.

The monitoring we have in place had noticed the death of the process and had already restarted the crashed service. But as I already had caffeine in my bloodstream, I started to gather more evidence. 30 minutes later I found myself staring at the following in the /var/log/kern.log :

Apparently we became victims of the Linux kernel internals. As you all know, Linux is built with a bunch of unholy creatures ( called ‘daemons’). Those daemons are shepherded by several kernel jobs, one of which seems to be especially sinister entity. Apparently all modern Linux kernels have a built-in mechanism called “Out Of Memory killer” which can annihilate your processes under extremely low memory conditions. When such a condition is detected, the killer is activated and picks a process to kill. The target is picked using a set of heuristics scoring all processes and selecting the one with the worst score to kill.

Understanding the “Out Of Memory killer”

By default, Linux kernels allow processes to request more memory than currently available in the system. This makes all the sense in the world, considering that most of the processes never actually use all of the memory they allocate. The easiest comparison to this approach would be with the cable operators. They sell all the consumers a 100Mbit download promise, far exceeding the actual bandwidth present in their network. The bet is again on the fact that the users will not simultaneously all use their allocated download limit. Thus one 10Gbit link can successfully serve way more than the 100 users our simple math would permit.

A side effect of such approach is visible in case some of your programs is on the path of depleting the system’s memory.This can lead to extremely low memory conditions, where no pages can be allocated to process. You might have faced such situation, where not even a root account cannot kill the offending task. To prevent such situations, the killer activates, and identifies the process to be the killed.

What was triggering the Out of memory killer?

Now that we have the context, it is still unclear what was triggering the “killer” and woke me up at 5AM? Some more investigation revealed that:

The configuration in /proc/sys/vm/overcommit_memory allowed overcommitting memory – it was set to 1, indicating that every malloc() should succeed.

The application was running on a EC2 m1.small instance. EC2 instances have disabled swapping by default.

Those two facts combined with the sudden spike in traffic in our services resulted in the application requesting more and more memory to support those extra users. Overcommitting configuration allowed to allocate more and more memory for this greedy process, eventually triggering the “Out of memory killer” who was doing exactly what it is meant to do. Killing our application and waking me up in the middle of the night.

Example

When I described the behaviour to engineers, one of them was interested enough to create a small test case reproducing the error. When you compile and launch the following Java code snippet on Linux (I used the latest stable Ubuntu version):

Solution?

There are several ways to handle such situation. In our example, we just migrated the system to an instance with more memory. I also considered allowing swapping, but after consulting with engineering I was reminded of the fact that garbage collection processes on JVM are not good at operating under swapping, so this option was off the table.

Other possibilities would involve fine-tuning the OOM killer, scaling the load horizontally across several small instances or reducing the memory requirements of the application.

If you found the study interesting – follow Plumbr in Twitter or RSS, we keep publishing our insights about Java internals.

6. Spring Interview Questions

7. Android UI Design

and many more ....

One Response to "Out of memory: Kill process or sacrifice child"

Was the amount of memory allocated for this application (Xmx) higher than amount of memory actually available on this host minus some 500-1000m for the OS itself? Otherwise, I don’t get how the JVM wouldn’t crash with OOM itself when trying to allocate another array.

Newsletter

Join them now to gain exclusive access to the latest news in the Java world, as well as insights about Android, Scala, Groovy and other related technologies.

Email address:

Join Us

With 1,043,221 monthly unique visitors and over 500 authors we are placed among the top Java related sites around. Constantly being on the lookout for partners; we encourage you to join us. So If you have a blog with unique and interesting content then you should check out our JCG partners program. You can also be a guest writer for Java Code Geeks and hone your writing skills!