Saturday, April 18, 2015

Linux load average - the definitive summary

What is the Linux load average?

This is not exactly an orphan question but, like many other questions we have tried to address in this blog, it is surrounded by misconceptions and incorrect information. Every time one starts discussing load averages, either in person or online, confusion steps in... and refuses to leave. We will try to provide an explanation that is "as simple as possible, but not simpler", as Einstein once said, and also short enough to be worth reading.

Definition 1

We will call the instantaneous load of a system the number of tasks (processes and threads) that are willing to run at a given time t.

Tasks willing to run are in either state R or state D. That is, they are either actually running or blocked on some resource (CPU, I/O, ...) waiting for an opportunity to run. The instantaneous number of such tasks can be determined using the following command:

ps -eL h -o state | grep -c "[RD]"

(see footnote [1] for more info on this)
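As a rough Python counterpart to the ps pipeline above, the sketch below walks /proc and counts the tasks (threads included) whose state is R or D. The function names are ours, chosen for illustration, and the approach assumes a Linux system where the state is the first field after the command name in /proc/<pid>/task/<tid>/stat.

```python
import os

def parse_state(stat_line):
    """Return the one-letter task state from a /proc/<pid>/task/<tid>/stat line."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or parentheses, so split after the LAST closing parenthesis.
    return stat_line.rsplit(")", 1)[1].split()[0]

def count_willing_to_run(states):
    """Count the tasks whose state is R (running/runnable) or D (uninterruptible)."""
    return sum(1 for s in states if s in ("R", "D"))

def task_states(proc="/proc"):
    """Yield the state of every task (thread) currently listed under proc."""
    for pid in filter(str.isdigit, os.listdir(proc)):
        task_dir = os.path.join(proc, pid, "task")
        try:
            for tid in os.listdir(task_dir):
                with open(os.path.join(task_dir, tid, "stat")) as f:
                    yield parse_state(f.read())
        except OSError:
            continue  # the task exited while we were scanning
```

On a Linux machine, count_willing_to_run(task_states()) should give a number comparable to the ps command, although the two will rarely match exactly because the set of runnable tasks changes between the two measurements.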

Definition 2

We will call the load average of a system a specific averaging function of the instantaneous load value and all the previous ones.

For historical reasons the Linux kernel adopted the recursive functions

a(t,A) = a(t-1,A) * exp(-5/(60A)) + l(t) * (1 - exp(-5/(60A)))

where the parameter A takes the values 1, 5 and 15, l(t) is the instantaneous load and t-1 denotes the previous sample. We call the resulting three functions, one for each value of A, the 1m, 5m and 15m load averages. In the limit A → 0 the exponential vanishes and we find a(t,0)=l(t), thereby recovering definition 1. In other words, l(t) would be the 0m load average.

The load average values are calculated by the kernel every 5 seconds using a(t,A).
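The recursion can be written down in a few lines of Python. This is only a floating-point sketch of definition 2 with names of our own choosing (decay, update), not the kernel's code. As a sanity check, holding the instantaneous load at 1 for one minute on a previously idle system should bring the 1m average to 1 - exp(-1).

```python
import math

SAMPLE_INTERVAL = 5.0  # seconds between updates, as stated in the text

def decay(minutes):
    """exp(-5/(60A)): the weight kept by the previous average."""
    return math.exp(-SAMPLE_INTERVAL / (60.0 * minutes))

def update(previous, load, minutes):
    """One step of a(t,A) = a(t-1,A)*d + l(t)*(1-d), with d = decay(A)."""
    d = decay(minutes)
    return previous * d + load * (1.0 - d)

# Start from an idle system and hold the instantaneous load at 1
# for one minute (12 samples of 5 seconds each).
avg = 0.0
for _ in range(12):
    avg = update(avg, 1.0, minutes=1)
print(round(avg, 3))  # 0.632, i.e. 1 - exp(-1)
```

So "1m load average" does not mean a plain mean over the last minute: after one minute of constant load the 1m value has covered only about 63% of the distance to that load.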

Discussion

First of all we should stress that the load average from definition 2 is just a generalization of definition 1.

While their values are similar in nature, the larger the value of A, the lower the contribution of the instantaneous load compared to the contribution of the historic load average value. The main purpose of using an "averaging" function is to smooth out fast oscillations that could render human inspection of load values nearly impossible. The timespan of that smoothing effect is controlled by the parameter A.
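To illustrate the smoothing, the sketch below feeds the same artificial input to the three averages: an instantaneous load that flips between 8 and 0 on every 5-second sample for 30 minutes. The input, the starting value (the input's long-run mean of 4) and the helper names are assumptions made purely for illustration.

```python
import math

def load_average_series(samples, minutes, start=0.0):
    """Apply a(t,A) to a sequence of instantaneous loads; return every step."""
    d = math.exp(-5.0 / (60.0 * minutes))
    avg, series = start, []
    for load in samples:
        avg = avg * d + load * (1.0 - d)
        series.append(avg)
    return series

def ripple(series, tail=12):
    """Peak-to-peak swing over the last `tail` samples (12 samples = 1 minute)."""
    last = series[-tail:]
    return max(last) - min(last)

# Hypothetical input: load alternating between 8 and 0 every 5 seconds.
samples = [8 if i % 2 == 0 else 0 for i in range(360)]
ripples = {m: ripple(load_average_series(samples, m, start=4.0))
           for m in (1, 5, 15)}
for minutes, value in ripples.items():
    print(f"{minutes:>2}m average: swing over the last minute = {value:.3f}")
```

The input swings by 8 on every sample, while the printed swings shrink as A grows, which is exactly the effect described above.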

The load average can be calculated from a bash or python script, using definitions 1 and 2, just as the Linux kernel does (see /proc/loadavg and https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/kernel/sched/proc.c). Here is an example output of one such calculation, using ps and a(t,1) to estimate the 1m load average:

[Figure: Kernel vs Script 1m load calculation]
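One reason a script and the kernel can differ slightly is that the kernel does not use floating point. To the best of our knowledge it stores the averages as fixed-point numbers with 11 fractional bits and uses the precomputed constants EXP_1 = 1884, EXP_5 = 2014 and EXP_15 = 2037. The sketch below checks that these constants are simply 2048 * exp(-5/(60A)) rounded to an integer. The update step shown is a simplification that ignores the kernel's rounding adjustment, and the Python names are ours.

```python
import math

FSHIFT = 11            # bits of fractional precision
FIXED_1 = 1 << FSHIFT  # the value 1.0 in fixed point (2048)
KERNEL_EXP = {1: 1884, 5: 2014, 15: 2037}  # EXP_1, EXP_5, EXP_15

def fixed_point_decay(minutes):
    """exp(-5/(60A)) scaled to fixed point and rounded to an integer."""
    return round(FIXED_1 * math.exp(-5.0 / (60.0 * minutes)))

def update_fixed(avg, active, minutes):
    """Simplified fixed-point step: avg is fixed point, active a task count."""
    exp = KERNEL_EXP[minutes]
    return (avg * exp + active * FIXED_1 * (FIXED_1 - exp)) >> FSHIFT

for minutes, constant in KERNEL_EXP.items():
    print(minutes, constant, fixed_point_decay(minutes))
```

Dividing a fixed-point value by 2048 gives the familiar decimal load average, so update_fixed(0, 1, 1) / 2048 is the 1m value after a single sample with one active task.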

Second, we need to argue that there is no such thing as a load average that is too high in absolute terms. In fact, the number of tasks that are willing to run on a given system depends on:

- the architecture of the software that is running (is it mostly monolithic? or prone to spawn many processes? how dependent are such processes on each other?)

- the CPU throughput requested by the software that is running

- the I/O throughput requested by the software that is running

- the CPU performance of that system

- the I/O performance of that system

- the number of available cores

Therefore, we can only say "the load average is too high on that system" if we know "the normal value for that system". The "normal value" is an empirically discovered value under which that system usually runs and is known to perform acceptably. The normal value could well be 2 for a server with a low number of cores that runs an interactive web application, or could be 50 for a server that runs (non-interactive) numeric simulation jobs during the night.

Furthermore...

- for the same requested I/O effort and the same hardware, a software implementation that spreads the computation across many processes or threads will generate a higher load average; even though the actual throughput is the same, 10 processes trying to write 10MB each, on an I/O starved system, generate a higher load average than one process trying to write 100MB on the same system

- given a piece of software that drives all the existing CPU cores to 100% while running on a specific machine, its execution on a system with a smaller number of cores, or slower cores, will generate a higher load average; whether that higher load is a problem or not depends on the use case (if it means your numeric simulation or your file server backup takes 10 more minutes during the night but will still be ready in the morning, then no harm is done)

To finish the article, we should describe the important relationship between load average values and the CPU usage values that can be seen with utilities like top or iostat (%usr, %sys, %wait, %idle). As we have seen, load average values don't have an absolute numerical meaning, unlike CPU usage values, which are expressed in % of CPU time:

%usr

Time spent running non-kernel code (user time, including nice time).

%sys

Time spent running kernel code (system time).

%wait

Time spent waiting for IO. Note: %iowait is not an indication of the amount of IO going on, it is only an indication of the extra %usr time that the system would show if IO transfers weren't delaying code execution.

%idle

Time spent idle.
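These percentages can be derived from two readings of the aggregate "cpu" line of /proc/stat, whose first eight counters are user, nice, system, idle, iowait, irq, softirq and steal. The sketch below is one way to do it. Folding irq and softirq time into %sys is our own simplification, steal time is counted in the total but not reported, and the function name is hypothetical.

```python
def cpu_percentages(before, after):
    """Return %usr, %sys, %wait and %idle between two 'cpu' lines of /proc/stat."""
    # Counters after the "cpu" label: user nice system idle iowait irq softirq steal
    a = [int(x) for x in before.split()[1:9]]
    b = [int(x) for x in after.split()[1:9]]
    user, nice, system, idle, iowait, irq, softirq, steal = (
        y - x for x, y in zip(a, b)
    )
    total = user + nice + system + idle + iowait + irq + softirq + steal
    return {
        "usr": 100.0 * (user + nice) / total,               # user time, including nice
        "sys": 100.0 * (system + irq + softirq) / total,    # assumption: irq counted as sys
        "wait": 100.0 * iowait / total,
        "idle": 100.0 * idle / total,
    }

# Two made-up samples, taken some seconds apart.
before = "cpu 100 0 50 800 50 0 0 0 0 0"
after = "cpu 160 10 70 1100 60 0 0 0 0 0"
print(cpu_percentages(before, after))
```

On a real system the two strings would be the first line of /proc/stat read at two different times. The counters are cumulative since boot, so only their differences are meaningful.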

For systems running below their limits, CPU usage values are much more useful than load average values, since their numeric interpretation is universal. But once limits are hit, i.e., CPU %idle time becomes nearly zero, load average values allow us to see how far beyond its limits the system is running... once we establish a baseline, which is the normal load average for that system (software+hardware combination).

We summarize the load average / CPU usage relationship with a short list of true statements:

1) if all system cores are running at %sys+%usr=100, the instantaneous load is equal to or higher than the number of cores

2) the instantaneous load being higher than the number of cores doesn't mean all cores are running at %sys+%usr=100, since many processes may be I/O waiting (state D)

3) the instantaneous load being higher than the number of cores implies that the system can't be mostly idle; at least some of the cores will be seen for a relevant amount of the time in sys, usr or wait states

4) a system can be slow / unresponsive even with an instantaneous load below the number of cores because a small number of I/O intensive processes may become a bottleneck

5) in a pure CPU intensive scenario (negligible I/O, no processes in state D) where %idle > 0, the instantaneous load is, on average, equal to ((100 - %idle)/100) * NCORES; for example, on a 4 core system at steady %sys+%usr=90 we would have an average instantaneous load of ((100-10)/100)*4 = 3.6

Statement 5) can be easily tested by running

stress -c X

while looking at the output of top on a different terminal, waiting for the 1m load average to stabilize. It is trivial to see that the above formula holds until X=NCORES, which will cause %idle=0.
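The arithmetic of statement 5) is simple enough to check in Python. The sketch below reproduces the 4-core example and then the stress experiment, under the assumption that X busy workers on an otherwise idle machine keep exactly X cores at 100% as long as X does not exceed NCORES. The function names are ours.

```python
def expected_load(idle_percent, ncores):
    """Statement 5): ((100 - %idle) / 100) * NCORES."""
    return (100.0 - idle_percent) / 100.0 * ncores

def idle_with_workers(workers, ncores):
    """%idle when `workers` CPU-bound tasks each occupy one core (workers <= ncores)."""
    return 100.0 * (1.0 - workers / ncores)

print(expected_load(10, 4))  # 3.6, the example from statement 5)

# "stress -c X" on a hypothetical 4-core machine, for X = 1..4
for x in range(1, 5):
    idle = idle_with_workers(x, 4)
    print(x, idle, expected_load(idle, 4))
```

In this idealized model the expected load simply equals X, which is what the 1m load average should settle at once it has had a few minutes to stabilize.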

We haven't discussed Hyperthreading, or the equivalent AMD feature, to avoid complicating the discussion, but wherever we say NCORES above, it could be the number of virtual CPUs, including CPU threads. Of course, each additional % of usage on the second thread of an already busy core doesn't yield a proportional increase in throughput.

Footnotes

[1] - The same result should be obtainable by parsing /proc/loadavg (4th field) or /proc/stat (procs_running, procs_blocked). In our experience, however, multiple processes in state D are shown by ps but not counted in /proc/loadavg, and neither /proc/loadavg (4th field) nor /proc/stat includes threads in its task counters, even though threads are taken into account in the load average numbers exposed by the kernel.