The OOM killer on Linux wreaks havoc with various applications every so often, and it appears that not much is really done on the kernel development side to improve this. Would it not be better, as a best practice when setting up a new server, to reverse the default on the memory overcommitting, that is, turn it off (vm.overcommit_memory=2) unless you know you want it on for your particular use? And what would those use cases be where you know you want the overcommitting on?

As a bonus, since the behavior in case of vm.overcommit_memory=2 depends on vm.overcommit_ratio and swap space, what would be a good rule of thumb for sizing the latter two so that this whole setup keeps working reasonably?
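For concreteness, the knobs in question look like this (values illustrative, not recommendations; needs root, and the second form persists across reboots):

```shell
# Disable overcommit: the kernel then caps total committed memory at
#   CommitLimit = swap + ram * overcommit_ratio / 100
sysctl -w vm.overcommit_memory=2
sysctl -w vm.overcommit_ratio=50     # 50 is the default

# or persistently, in /etc/sysctl.conf:
#   vm.overcommit_memory = 2
#   vm.overcommit_ratio = 50
```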

3 Answers

An aircraft company discovered that it was cheaper to fly its planes with less fuel on board. The planes would be lighter and use less fuel and money was saved. On rare occasions however the amount of fuel was insufficient, and the plane would crash. This problem was solved by the engineers of the company by the development of a special OOF (out-of-fuel) mechanism. In emergency cases a passenger was selected and thrown out of the plane. (When necessary, the procedure was repeated.) A large body of theory was developed and many publications were devoted to the problem of properly selecting the victim to be ejected. Should the victim be chosen at random? Or should one choose the heaviest person? Or the oldest? Should passengers pay in order not to be ejected, so that the victim would be the poorest on board? And if for example the heaviest person was chosen, should there be a special exception in case that was the pilot? Should first class passengers be exempted? Now that the OOF mechanism existed, it would be activated every now and then, and eject passengers even when there was no fuel shortage. The engineers are still studying precisely how this malfunction is caused.

The OOM killer only wreaks havoc if you have overloaded your system. Give it enough swap, and don't run applications that suddenly decide to eat massive amounts of RAM, and you won't have a problem.

To specifically answer your questions:

I don't think it's a good idea to turn off overcommit in the general case; very few applications are written to properly deal with brk(2) (and the wrappers that use it, such as malloc(3)) returning an error. When I experimented with this at my previous job, we decided it was more of a hassle to make everything capable of handling out-of-memory errors than it was to just deal with the consequences of an OOM (which, in our case, were far worse than having to restart the occasional service if an OOM occurred -- we had to reboot an entire cluster, because GFS is a steaming pile of faeces).

You want overcommitting on for any process that overcommits memory. The two most common culprits here are Apache and the JVM, but plenty of apps do this to some greater or lesser degree. They think they might need a lot of memory at some point in the future, so they grab a big chunk right off. On an overcommit-enabled system, the kernel goes "meh, whatever, come bother me when you actually want to write to those pages" and nothing bad happens. On an overcommit-off system, the kernel says "no, you can't have that much memory; if you do happen to write to it all at some point in the future I'm boned, so no memory for you!" and the allocation fails. Since nothing out there retries with "oh, OK, can I have this smaller data segment instead?", the process either (a) quits with an out-of-memory error, or (b) doesn't check the return code from malloc, assumes it's OK to go, and writes to an invalid memory location, causing a segfault. Thankfully, the JVM does all its preallocation on startup (so your JVM either starts or dies immediately, which you usually notice), but Apache does its funky stuff with each new child, which can have exciting effects in production (unreproducible "not handling connections" types of excitement).
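A cheap way to see this failure mode without flipping the system-wide sysctl is to cap one process's address space with ulimit, which refuses big allocations up front the same way vm.overcommit_memory=2 does system-wide (hypothetical demo; python3 is used only as a convenient way to ask for a large allocation):

```shell
# Simulate "allocation refused up front" with a per-process cap
# instead of vm.overcommit_memory=2 (which would affect the whole box).
result=$(
  ( ulimit -v 131072                                        # cap address space at 128 MiB (kB units)
    python3 -c 'b = bytearray(512 * 1024 * 1024)' 2>/dev/null ) \
  && echo "allocation succeeded" \
  || echo "allocation refused"
)
echo "$result"
```

With the 128 MiB cap in place, the 512 MiB request fails immediately -- this is the path that code which never checks malloc's return value handles by segfaulting.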

I wouldn't want to set my overcommit_ratio any higher than the default of 50%. Again, from my testing: although setting it to around 80 or 90 might sound like a cool idea, the kernel requires big chunks of memory at inconvenient times, and a fully loaded system with a high overcommit ratio is likely to have insufficient spare memory when the kernel needs it (leading to fear, pestilence, and oopses). So playing with overcommit introduces a new, even more fun failure mode: rather than just restarting whatever process got OOMed when you run out of memory, now your machine crashes, leading to an outage of everything on it. AWESOME!

Swap space in an overcommit-free system is dependent on how much requested-but-unused memory your applications need, plus a healthy safety margin. Working out what's needed in a specific case is left as an exercise for the reader.
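As a starting point for that exercise, the hard cap the kernel enforces under vm.overcommit_memory=2 is easy to compute by hand (example figures, not recommendations):

```shell
# Back-of-envelope sizing for an overcommit-free system.  The kernel's cap is:
#   CommitLimit = SwapTotal + MemTotal * overcommit_ratio / 100
mem_kb=16777216      # 16 GiB of RAM (example value)
swap_kb=8388608      # 8 GiB of swap (example value)
ratio=50             # the default vm.overcommit_ratio
commit_limit_kb=$(( swap_kb + mem_kb * ratio / 100 ))
echo "CommitLimit: ${commit_limit_kb} kB"

# On a live system, compare against the kernel's own accounting:
#   grep Commit /proc/meminfo     -> CommitLimit and Committed_AS
```

With these numbers the box can commit 16 GiB total; if your applications' requested-but-unused memory pushes Committed_AS near CommitLimit, allocations start failing, so size swap (or the ratio) with headroom above your observed Committed_AS.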

Basically, my experience is that turning off overcommit is a nice experiment that rarely works as well in practice as it sounds in theory. This nicely corresponds with my experiences with other tunables in the kernel -- the Linux kernel developers are almost always smarter than you, and the defaults work the best for the vast, vast majority of cases. Leave them alone, and instead go find what process has the leak and fix it.

I don't want my backup process to get killed because someone is DoS-ing my web server. Exceptions are fine, but the default should be safety and consistency. Optimizations like overcommit should be turned on manually, IMHO. It's like coding: you code cleanly, then optimize. Overcommit is a nice feature, but it should not be the default.
– Aki, Oct 7 '13 at 15:42

If you don't want your backup process to get killed because someone is DoS-ing your web server, don't configure your webserver in such a way that a DoS can cause the resources on the system to become overwhelmed.
– womble, Oct 10 '13 at 5:55

Hmm, I'm not fully convinced by the arguments in favour of overcommit and the OOM killer... When womble writes,

"The OOM killer only wreaks havoc if you have overloaded your system. Give it enough swap, and don't run applications that suddenly decide to eat massive amounts of RAM, and you won't have a problem."

he is essentially describing a scenario in which overcommit and the OOM killer are not enforced, or never 'really' act: if all applications allocated memory as needed, and there were enough virtual memory to allocate, writes would closely follow allocations without errors, so we could not really speak of an overcommitted system even with an overcommit strategy enabled. That amounts to an implicit admission that overcommit and the OOM killer work best when their intervention is never needed, which, as far as I can tell, is shared by most supporters of this strategy (and I admit I cannot tell much...). Moreover, referring to applications with specific preallocation behaviours makes me think that specific handling could be tuned at the distribution level, instead of having a default, system-wide approach based on heuristics (personally, I believe heuristics are not a very good approach for kernel stuff).

As far as the JVM is concerned: it is a virtual machine, so to some extent it needs to allocate all the resources it needs on startup, so that it can create its 'fake' environment for its applications and keep its resources separated from the host environment as far as possible. Thus, it may be preferable for it to fail on startup, rather than after a while as a consequence of an 'external' OOM condition (caused by overcommit, the OOM killer, or whatever), or to have such a condition interfere with its own internal OOM-handling strategies. In general, a VM should acquire all required resources from the beginning, and the host system should 'ignore' them until the end, the same way any amount of physical RAM shared with a graphics card is never touched -- and cannot be touched -- by the OS.

As for Apache, I doubt that having the whole server occasionally killed and restarted is better than letting a single child, along with a single connection, fail from its beginning (as if it were a whole new JVM instance created after another instance had run for a while). I guess the best 'solution' depends on the specific context. For an e-commerce service, for instance, it might be far preferable to have a few connections to the shopping cart occasionally fail at random than to lose the whole service, with the risk of interrupting an ongoing order finalization or (maybe worse) a payment process, with all the consequences of the case (maybe harmless, but maybe harmful -- and when problems arise, those would certainly be worse for debugging purposes than an unreproducible error condition).

In the same way, on a workstation the process that consumes the most resources, and is therefore likely to be the OOM killer's first choice, could be a memory-intensive application such as a video transcoder or rendering software -- quite possibly the only application the user wants left untouched. This suggests to me that the default OOM killer policy is too aggressive. It uses a 'worst fit' approach, somewhat similar to that of some filesystems: the OOMK tries to free as much memory as it can while minimizing the number of killed subprocesses, in order to prevent any further intervention in the short term, much as a filesystem may allocate more disk space than a file actually needs, to avoid further allocations if the file grows and thus, to some extent, to prevent fragmentation.

However, I think the opposite policy, a 'best fit' approach, could be preferable: free exactly the memory needed at that moment, and don't bother 'big' processes, which might well be wasting memory, but might not be -- and the kernel cannot know which (hmm, I can imagine that keeping track of page access counts and times could hint at whether a process is holding memory it no longer needs, and thus distinguish a process that wastes memory from one that merely uses a lot; but access delays would need to be weighted against CPU cycles to tell a memory-wasting application from a memory- and CPU-intensive one, and such a heuristic, besides being potentially inaccurate, could have excessive overhead).

Moreover, it may not be true that killing the fewest possible processes is always a good choice. In a desktop environment (think of a nettop or a netbook with limited resources, for example), a user might be running a browser with several tabs (thus memory-consuming -- let's assume this is the OOMK's first choice), plus a few other applications (a word processor with unsaved data, a mail client, a PDF reader, a media player, ...), plus a few (system) daemons and a few file manager instances. Now an OOM error occurs, and the OOMK chooses to kill the browser while the user is doing something 'important' over the net... the user would be disappointed. On the other hand, closing the few idle file manager instances could free exactly the memory needed, while keeping the system not only working, but working more reliably.

Anyway, I think the user should be enabled to decide for himself what is to be done. On a desktop (i.e. interactive) system, that should be relatively easy, provided enough resources are reserved to ask the user to close an application (even closing a few tabs could be enough) and to handle his choice (one option could be creating an additional swap file, if there is enough space).

For services (and in general), I'd also consider two further enhancements. First, log OOM killer interventions, as well as process start/fork failures, in such a way that the failure can easily be debugged; for instance, an API could inform the process issuing the creation or fork of the new process, so that a server like Apache, with a proper patch, could provide better logging for certain errors. This could be done independently of overcommit/OOMK being in effect. Second, and no less important, a mechanism could be established to fine-tune the OOMK algorithm. I know it is possible, to some extent, to define a policy on a process-by-process basis, but I'd aim for a 'centralised' configuration mechanism, based on one or more lists of application names (or ids) that identify the relevant processes and give them a certain degree of importance (as per the listed attributes). Such a mechanism should (or at least could) also be layered: a top-level user-defined list, a system- (distribution-) defined list, and (at the bottom) application-defined entries. So, for instance, a DE file manager could instruct the OOMK that any of its instances can safely be killed, since the user can simply reopen it to access the lost file view -- whereas any important operation, such as moving/copying/creating data, could be delegated to a more 'privileged' process.

Moreover, an API could be provided to allow applications to raise or lower their 'importance' level at run time (with respect to memory management, and regardless of execution priority). A word processor, for instance, could start with a low 'importance', raise it while unsaved data is held or a write operation is in progress, and lower it again once the operation completes. Analogously, a file manager could change level when it moves from merely listing files to actually handling data, and vice versa, instead of using separate processes; and Apache could give different levels of importance to different children, or change a child's state according to a policy decided by the sysadmin and exposed through Apache's (or any other server's) settings. Of course such an API could and would be abused/misused, but I think that is a minor concern compared to the kernel arbitrarily killing processes to free memory without any relevant information about what is going on in the system (memory consumption, creation time and the like aren't relevant or 'validating' enough for me). Only users, admins and program writers can really determine whether a process is 'still needed' for some reason, what that reason is, and whether the application is in a state where being killed would lead to data loss or other damage. Some assumptions could still be made, though: for instance, looking at resources of a certain kind (file descriptors, network sockets, etc.) acquired by a process, with pending operations, could tell whether a process should be in a higher 'state' than the one set, or whether its self-established one is higher than needed and can be lowered (an aggressive approach, unless overridden by the user's choices, such as forcing a certain state or asking -- through the lists I mentioned above -- to respect the application's choices).
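A rudimentary form of this per-process importance knob does exist in Linux today: the OOM score adjustment in /proc. A quick demo, using a throwaway `sleep` process as a stand-in for a real daemon (raising the score needs no privileges; lowering it below the current value requires root):

```shell
# Bias the OOM killer for one process via /proc/<pid>/oom_score_adj
# (range -1000 = never kill .. +1000 = kill first).
sleep 60 &
pid=$!
echo 500 > /proc/$pid/oom_score_adj     # mark this process as expendable
adj=$(cat /proc/$pid/oom_score_adj)
echo "oom_score_adj for $pid is $adj"
kill $pid
```

This is per-process and manual, though, rather than the centralised, layered policy described above.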

Or, just avoid overcommitting and let the kernel do exactly what a kernel must do: allocating resources (but not reclaiming them arbitrarily, as the OOM killer does), scheduling processes, preventing starvation and deadlocks (or rescuing from them), ensuring full preemption and memory space separation, and so on...

I'd also spend a few more words on overcommit approaches. From other discussions I've gotten the idea that one of the main concerns about overcommit (both as a reason to want it and as a source of possible trouble) is fork handling: honestly, I don't know exactly how the copy-on-write strategy is implemented, but I think any aggressive (or optimistic) policy could be mitigated by a swap-like locality strategy. That is, instead of just cloning (and adjusting) a forked process's code pages and scheduling structures, a few other data pages could be copied before an actual write, chosen among the pages the parent process has written most frequently (that is, using a counter of write operations).

"Moreover, an API could be provided to allow applications to raise or lower their 'importance' level at run time" -- importance is /proc/$PID/oom_adj.
– Vi., Jul 3 '10 at 11:51

Regarding the JVM, there is a gotcha that makes you want memory overcommit in some cases: if you want to spawn another JVM from your original JVM, it will call fork(). A fork momentarily commits as much memory as the original process uses, until it actually starts the new program. So say you have a 4GB JVM and want to create a new 512KB JVM from it: unless you have overcommit, you will need 8GB of memory to do that...
– alci, Sep 19 '11 at 14:53