Memory size, overcommit, limit

Paging and the OOM-Killer

Due to the nature of experiments our group runs, we often induce heavy paging and complete exhaustion of available memory on certain nodes. Linux has a pair of strategies to deal with heavy memory use. First, is overcommitting. This is where a process is allowed allocate or fork even when there is no more memory available. You can seem some interesting numbers here:[1]. The assumption is that processes may not use all memory that they allocate and failing on allocation is worse than failing at a later date when the memory use is actually required. More processes may be supported by allowing them to allocate memory (provided they do not use it all). The second part of the strategy is the Out-of-Memory killer (OOM Killer). When memory has been exhausted and a process tries to use some 'overcommitted' part of memory, the OOM killer is invoked. It's job is to rank all processes in terms of their memory use, priority, privilege and some other parameters, and then select a process to kill based on the ranks.

The argument for using overcommital+OOM Killer is that rather than failing to allocate memory for some random unlucky process, which as a result would probably terminate, the kernel can instead allow the unlucky process to continue executing and then make a some-what-informed decision on which process to kill. Unfortunately, the behaviour of the OOM-killer sometimes causes problems which grind the machine to a complete halt, particularly when it decides to kill system processes. There is a good discussion on the OOM-killer here: [2]