I used to think that a load average of 2.30 on this server meant that, on average, 2.30 processes were running or ready to run (i.e. using a CPU, waiting for I/O, or waiting to be scheduled). On an 8-CPU server, that does not seem excessive. But a load is also displayed per CPU, and it is also near 2.

So, is the run queue shown as "Load averages" in top or uptime a global figure, or is it divided by the CPU count? Does this load indicate waiting processes or an underloaded system?

Re: load average

Dennis,

I do not plan to use load average as the only server state metric. It's just one metric among many others. I have many servers to watch over. Glance will help find problems and bottlenecks. What I need are metrics to put in a supervision tool so I know which servers need particular care.

A percentage is not a better metric by itself. Look at the CPU usage given in sar output: 100% tells you nothing on its own. It could just as well mean your server is fully loaded or overloaded.

As I understand your answer, the load average shown in uptime or top output is divided by the CPU count. A load of 2 on an 8-CPU server means 16 processes in the run queue.

Re: load average

Hi SEP,

There seems to be a lot of confusion about this. I searched the web extensively, and many sources (from ITRC to Wikipedia, which shows how wide my search was) led me to think that load average is not divided by the CPU count.

Re: load average

The "load" value reported by uptime and top is simply the average size of the runqueue. The runqueue counts every process that is currently running plus every process that is ready to run (nothing is blocking the process; note that I/O is a block) but cannot run because all the processors are busy. You can easily test this with a simple 3-line script:

while :
do :
done &

The above will run in the background and consume 100% of one CPU. top will report 100% for one CPU and the load average will eventually go to 0.50 on a 2-CPU system (assuming no other significantly CPU-bound processes are running). The key is that this is a load average for the system as a whole: 1.00 means that all CPUs were busy during the measurement period. If you run the above script 3 more times on a 2-CPU system, the load will be 2.00, meaning that 2 processes were running and 2 were waiting to run, on average.
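The arithmetic behind those two figures can be sketched in a few lines of shell. This is only an illustration of the per-CPU interpretation described in this post, not anything the kernel exposes: the function name per_cpu_load is made up for the example, and it assumes the only runnable processes are the busy loops themselves.

```shell
#!/bin/sh
# per_cpu_load RUNNABLE NCPU   (hypothetical helper, for illustration only)
# Divides the number of running or ready-to-run processes by the CPU count,
# which is how this post describes the figure shown by uptime and top.
per_cpu_load() {
    awk -v r="$1" -v n="$2" 'BEGIN { printf "%.2f\n", r / n }'
}

per_cpu_load 1 2   # one busy loop on a 2-CPU system: prints 0.50
per_cpu_load 4 2   # four busy loops on a 2-CPU system: prints 2.00
```

The real value from uptime is a moving average, so it drifts toward these numbers over a few minutes rather than jumping to them.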

Now HP-UX treats CPU-bound processes as less important and starts to reduce the priority (increases the C number in ps). So this poor 2-CPU system is apparently overloaded yet logins and disk I/O and vi and other user processes that use disk seem to run without any delays. I/O is treated as high importance because it takes a while to complete so the scheduler will quickly restart a process that has completed an I/O operation.

So the load factor is indeed divided by the number of CPUs. A load factor of 66 for a 64-CPU system means that more than 4000 processes (66 x 64 = 4224) were ready to run on average. Although this might seem excessive, the metric cannot distinguish unique processes. So a very fast process servicing specialized activities might be counted in the runqueue multiple times as each copy runs for a few milliseconds. This is the ambiguity in trying to measure very rapidly changing activities in a multiprocessor, multitasking OS.
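For a supervision tool, the reverse conversion may be the useful one: turning the reported load back into an estimated runqueue size. A minimal sketch, assuming the per-CPU interpretation given in this thread holds on the system being watched; runqueue_estimate is a hypothetical name, and the CPU count has to be supplied by hand.

```shell
#!/bin/sh
# runqueue_estimate LOAD NCPU   (hypothetical helper, for illustration only)
# Multiplies the reported load by the CPU count to estimate the average
# number of running plus ready-to-run processes, rounded to a whole number.
runqueue_estimate() {
    awk -v l="$1" -v n="$2" 'BEGIN { printf "%.0f\n", l * n }'
}

runqueue_estimate 66 64   # the 64-CPU example above: prints 4224
runqueue_estimate 2 8     # the 8-CPU server from the question: prints 16
```

As noted above, the result counts runqueue entries, not distinct processes, so short-lived processes can inflate it.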