Understanding Linux Load Average – Part 1

A frequently asked question in my classroom is “What is the meaning of load average and when is it too high?”. This may sound like an easy question, and I really thought it was, but recently I discovered that things aren’t always as easy as they seem. In this first post of a three-part series I will explain what the Linux load average means and how to diagnose load averages that may seem too high.

Obtaining the current load average is as simple as issuing the uptime command:

$ uptime
21:49:05 up 11:33, 1 user, load average: 10.52, 6.03, 3.78

But what is the meaning of these 3 numbers? Basically, the load average is the run-queue utilization averaged over the last minute, the last 5 minutes and the last 15 minutes. The run-queue is a list of processes waiting for a resource to become available inside the Linux operating system. The example above indicates that, measured over the last minute, there were on average 10.52 processes waiting to be scheduled on the run-queue.
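By the way, the same three numbers can be read directly from the kernel through /proc/loadavg. The fourth field shows the number of currently runnable scheduling entities over the total number of scheduling entities, and the fifth field shows the PID of the most recently created process:

$ cat /proc/loadavg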

The questions are of course: Which processes are on the run-queue? And what are they waiting for? Why not find the answer to these questions by performing a series of experiments?

CPU utilization and load average

To be able to perform the necessary experiments I wrote a few shell scripts to generate various types of load on my Linux box. The first experiment is to start one CPU load process, on an otherwise idle system, and watch its effect on the load average using the sar command:
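A minimal sketch of such a CPU load script, assuming a plain shell busy-loop, could look like this (the name busy-cpu matches the process name that shows up in the top output later on):

$ cat busy-cpu
#!/bin/sh
# Consume CPU time in a tight loop until killed
while : ; do : ; done

$ ./busy-cpu &
$ sar -q 30 6

The sar -q invocation, with the interval and count matching the description below, reports the run-queue length (runq-sz) together with the 1, 5 and 15 minute load averages (ldavg-1, ldavg-5 and ldavg-15).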

The above sar output reported the load average 6 times with an interval of 30 seconds. It shows that there was 1 process constantly on the run-queue, resulting in the 1 minute load average slowly climbing to a value of 1 and then stabilizing there. The 5 minute load average will continue to climb for a few more minutes before it also stabilizes at a value of 1, and the same is true for the 15 minute load average, assuming the run-queue utilization remains the same.

The next step is to take a look at the CPU utilization to check if there is a correlation between it and the load average. While measuring the load average using sar I also had it report the CPU utilization.
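Presumably this was a second sar session sampling at the same interval, something like:

$ sar -u 30 6

The -u option, which is also sar’s default report, shows the percentage of CPU time spent running user processes (%user), running kernel code (%system), waiting for IO (%iowait) and sitting idle (%idle).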

This shows that overall the system was roughly spending 50% of its time running user processes and the other 50% doing nothing. Thus only half of the machine’s capacity was used to run the CPU load, which caused a load average of 1. Isn’t that strange? Not if you know that the machine is equipped with two processors. While one CPU was busy running the load the other CPU was idle, resulting in an overall CPU utilization of 50%.

Personally I prefer using sar to peek around in a busy Linux system but other people tend to use top for the same thing. This is what top had to report about the situation we are studying using sar:
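The invocation used was:

$ top -bi -d30 -n7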

The -bi command line options given to top tell it to go into batch-mode, instead of full-screen-mode, and to ignore idle processes. The -d30 and -n7 options instruct top to produce 7 sets of output with a delay of 30 seconds between them. The output above is the last of the 7 sets of output top produced.

Besides everything we already discovered by looking at the various sar outputs, top gives us useful information about the processes consuming CPU time, as well as information about physical and virtual memory usage. It is interesting to see that the busy-cpu process consumes 99.8% CPU while the overall CPU utilization is slightly over 50%, leaving about 49% idle time.

The explanation for this is that top reports an averaged CPU utilization in the header section of its output while the per process CPU utilization is not averaged over the total number of processors.
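With two processors the numbers indeed match up: one process keeping a single CPU 99.8% busy translates to 99.8 / 2 ≈ 49.9% of the total capacity shown in the header, which agrees with the slightly-over-50% overall utilization.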

We can verify this statement by using the -P ALL command line option to make sar report the CPU utilization on a per processor basis as well as the averaged values.
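Assuming the same 30 second interval and count, the command looks like this:

$ sar -P ALL 30 6

This prints a block of CPU utilization figures for each individual processor, plus a line with the values averaged over all processors.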

This output confirms that most of the time only one of the two available processors was busy resulting in an overall averaged CPU utilization of 50.2%.

The next experiment is to add a second CPU load process to the still running first CPU load process. This will increase the number of processes on the run-queue from 1 to 2. What effect will this have on the load average?
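Using the sketched load script, this boils down to starting a second instance and watching the run-queue again:

$ ./busy-cpu &
$ sar -q 30 6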

The output above shows that the number of processes on the run-queue is now indeed 2 and that the load average is climbing to a value of 2 as a result of this. Because there are now 2 processes hogging the CPU we can expect that the overall averaged CPU utilization is close to 100%. The top output below confirms this:
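Again captured with the same invocation as before:

$ top -bi -d30 -n7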

The final experiment is to add 3 additional CPU load processes to check if we can force the load average to go up any further now that we are already consuming all available CPU resources on the system.
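With the sketched script this means starting three more instances:

$ ./busy-cpu & ./busy-cpu & ./busy-cpu &
$ sar -q 30 6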

We managed to drive the load average up to 5 ;-) Because there are only 2 processors available in the system and there are 5 processes fighting for CPU time, each process will only get 40% of the available 200% CPU time.
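To spell out the arithmetic: two processors provide 2 × 100% = 200% of CPU capacity, and dividing that equally over 5 runnable processes gives 200 / 5 = 40% per process.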

Conclusion

Based on all these experiments we can conclude that CPU utilization clearly influences the load average of a Linux system. If the load average is above the total number of processors in the system, we could conclude that the system is overloaded, but this assumes that nothing else influences the load average. Is CPU utilization indeed the only factor that drives the Linux load average? Stay tuned for part two!
-Harald


Comments

Amir Hameed said:

It seems that the CPU run-queue is reported differently on Linux than on Solaris. If I am interpreting it correctly, on Linux the run-queue shows the actual number of running processes. On Solaris, the run-queue shows the number of processes that are not running yet but are waiting to be put on the CPU. I ran the same test, as shown above, on my Solaris server and the run-queue only started to show a value greater than zero when the number of load processes exceeded the number of CPUs on the server.

Harald van Breederode said:

Yes, that is correct. On Linux the run queue shows the number of running (and waiting) processes. I haven’t verified your statement about Solaris but I believe this is indeed true. I haven’t looked at Solaris for a very long time ;-) But if my memory serves me correctly, interpreting the load average on Solaris is quite different.
-Harald

Harald van Breederode said:

Thanks for the pointer to another load average article. However, that article states that the load average is only affected by CPU utilization, which is clearly not the case. This is a common misunderstanding, hence my postings on this subject.
-Harald

I don’t understand why people keep using such a broken indicator. It adds values that are counted in units (the number of processes waiting for cores) to values that can run into the tens or hundreds (the number of processes waiting for IO). So a value of 10 can, on the same machine, be a non-issue (10 processes waiting for disks) or a heavy load (10 processes waiting for CPU).
It was designed for computers with one core and one IDE disk, both components being mono-tasked. Those times are long gone.

vmstat and iostat are useful tools, and there are a lot of values in /proc too. There are just two of them that one should never use, because they have never been updated since the days of IDE disks and single-core computers:
-load average.
-svctm in iostat -x.
But they are wrong only on Linux; BSD and Solaris get them right.

Harald van Breederodesaid

I think you misinterpret both articles; I see no difference between the two postings.
The load average will increase with each running process, i.e. 1 proc = load 1, 2 procs = load 2, 6 procs = load 6, but in order to determine whether your CPUs are fully utilized you need to divide the load by the number of CPUs.
-Harald