Understanding Linux Load Average – Part 3

In part 1 we performed a series of experiments to explore the relationship between CPU utilization and the Linux load average. We concluded that the load average is influenced by processes running on, or waiting for, the CPU. Based on the experiments in part 2 we concluded that processes performing disk I/O also influence the load average on a Linux system. In this posting we will run another experiment to find out whether the Linux load average is also affected by processes performing network I/O.

Network I/O and load average

To check whether a correlation exists between processes performing network I/O and the load average, we will start 10 processes generating network I/O on an otherwise idle system and collect various performance-related statistics using the sar command. Note: my load-gen script uses the ping command to generate network I/O.
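The load-gen script itself isn't published in this posting; a minimal sketch of the idea, assuming plain ping against localhost, might look like this (the start_load function and its arguments are my own invention, not the actual script):

```shell
#!/bin/bash
# Sketch of a ping-based network load generator (not the actual
# load-gen script): start N ping processes against a target and
# print their PIDs so they can be killed again later.
start_load() {
  local count=$1 target=${2:-localhost}
  for ((i = 0; i < count; i++)); do
    # -q keeps ping quiet; the load comes from running many in parallel
    ping -q "$target" >/dev/null 2>&1 &
    echo $!
  done
}
# Usage: pids=$(start_load 10); ... run sar ...; kill $pids
```

While the pings run, statistics can be collected in another terminal with sar.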

The above output shows that the lo interface sent and received almost 90 thousand packets per second, good for a total of 136 million bytes of traffic. The other two interfaces had virtually no traffic at all, because my network load processes are pinging localhost. Let's have a look at the CPU utilization before turning to the run-queue utilization and Load Average.
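The interface statistics above come from sar -n DEV; a small sketch of pulling the lo totals out of that kind of output (the column layout varies between sysstat versions; the sample assumes time, IFACE, rxpck/s and txpck/s come first, and the numbers below are made up for illustration):

```shell
# Sum received+transmitted packets/s for the lo interface from
# sar -n DEV style lines: "HH:MM:SS IFACE rxpck/s txpck/s ..."
lo_pps() {
  awk '$2 == "lo" { printf "lo total pps: %.2f\n", $3 + $4 }'
}
# Usage: sar -n DEV 5 1 | lo_pps
```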

On average the CPU spent 14% of its time running code in user mode and 86% of the CPU time was spent running code in kernel mode. This is because the Linux kernel has to work quite hard to handle the amount of network traffic. The question is of course: What effect does this have on the Load Average?

The above sar output shows that the run-queue was constantly occupied by 10 processes and that the 1-minute Load Average slowly climbed towards 10, as one might expect by now ;-) This could be an indication that the Load Average is influenced by processes performing network I/O. But maybe the ping processes are using large amounts of CPU time and thereby forcing the Load Average up. To find out, we will take a look at the top output.
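The run-queue and load figures sar reports come straight from the kernel; on a live system the same numbers can be read from /proc/loadavg. A small sketch of pulling them apart (field layout as documented in proc(5): three averages, then runnable/total task counts, then the last PID):

```shell
# Print the 1-minute load average plus the runnable/total task counts
# from a line in /proc/loadavg format:
#   "1min 5min 15min runnable/total last_pid"
parse_loadavg() {
  awk '{ split($4, q, "/"); printf "1min=%s runnable=%s total=%s\n", $1, q[1], q[2] }'
}
# Usage on a live system: parse_loadavg < /proc/loadavg
```

Note that the runnable count here reflects currently runnable tasks, which is not exactly the same sample sar's runq-sz averages over.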

It is clear from the above output that the ping processes are not using huge amounts of CPU time, which eliminates CPU utilization as the driving force behind the Load Average. The output also reveals that the high CPU utilization is mainly caused by handling software interrupts: 52% in this case.

Conclusion

Based on this experiment we can conclude that processes performing network I/O have an effect on the Linux Load Average. And based on the experiments in the previous two postings we concluded that processes running on, or waiting for, the CPU and processes performing disk I/O also have an effect on the Linux Load Average. Thus the three factors that drive the Load Average on a Linux system are processes that are on the run-queue because they:

Run on, or are waiting for, the CPU

Perform disk I/O

Perform network I/O

Summary

The Linux Load Average is driven by the three factors mentioned above, but how does one interpret a Load Average that seems too high? The first step is to look at the CPU utilization. If this isn't 100% and the Load Average is above the number of CPUs in the system, the Load Average is primarily driven by processes performing disk I/O, network I/O, or a combination of both. Finding the processes responsible for most of the I/O isn't straightforward because there aren't many tools available to assist you. A very useful tool is iotop, but it doesn't seem to work on Oracle Linux 5. It does work on Oracle Linux 6, however. Another tool is atop, but it requires one or more kernel patches to be useful.
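Where iotop isn't available (as on Oracle Linux 5), one fallback is the kernel's per-process I/O accounting in /proc/<pid>/io, on kernels that support it. A rough sketch (the io_top helper and its output format are my own, and the counters are cumulative since process start, not rates, so this is no iotop replacement):

```shell
# Rough per-process I/O summary from /proc/<pid>/io (requires kernel
# task I/O accounting): print "total_bytes /proc/<pid>/io", biggest first.
io_top() {
  local f
  for f in /proc/[0-9]*/io; do
    [ -r "$f" ] || continue
    awk -v p="$f" '
      /^read_bytes/  { r = $2 }
      /^write_bytes/ { w = $2 }
      END { if (r + w > 0) printf "%d %s\n", r + w, p }
    ' "$f" 2>/dev/null
  done | sort -rn | head
}
# Usage: io_top
```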

If the CPU utilization is 100% and the Load Average is above the number of CPUs in the system, the Load Average is either driven entirely by processes running on, or waiting for, the CPU, or by a combination of such processes and processes performing I/O (which in turn could be a mix of disk and network I/O). Using top is an easy way to verify whether CPU utilization alone is responsible for the current Load Average or whether the other two factors play a role as well. Knowing your system helps a lot when it comes to troubleshooting performance problems, and taking performance baselines using sar is always a good thing to do.

-Harald
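The decision procedure in this summary can be sketched as a tiny helper (the triage function, its wording and the exact thresholds are mine; the inputs would come from /proc/loadavg, nproc and sar -u):

```shell
# Sketch of the triage logic above: given the 1-minute load average,
# the CPU count and the overall CPU utilization (%), suggest where to
# look first.
triage() {
  awk -v load="$1" -v cpus="$2" -v util="$3" 'BEGIN {
    if (load <= cpus)    print "load within CPU capacity"
    else if (util < 100) print "check disk/network I/O (iotop, sar -d, sar -n DEV)"
    else                 print "CPU-bound, possibly mixed with I/O (verify with top)"
  }'
}
# Example: triage 12.4 4 97
```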


This entry was posted on May 28, 2012 at 18:16 and is filed under Linux.


Narendra said:

Thanks for all three articles explaining the load average so nicely. I especially liked that you also provided details of how it should be measured, and actual statistics to support the conclusions. Can you please also share the “load-gen” script? That would be the icing on the cake!!!
Thanks again…

Kris said:


Harald, you might want to add that ‘sar’ is in the ‘sysstat’ package. Upon installation (using the sysstat rpm package on Linux), sar collects data every 10 minutes (via the configuration file /etc/cron.d/sysstat). This data is stored in /var/log/sa.


jason smith said:

I see you’re using the UEK kernel. I’m particularly trying to find out why the UEK kernel shows better Load Averages than its RHEL counterpart. If you install Oracle Linux you get both kernels; and in 6 even the RHEL kernel performs better than it did in 5. However, at least on our production systems, the UEK kernel reports much lower load averages.

Have you done any testing, or do you know exactly why the UEK kernel shows better load average numbers versus the stock RHEL kernels?

Harald van Breederode said:

No, I haven’t looked at the differences between the stock and UEK kernels. I doubt there is a change in how the load averages are calculated. I think the UEK reports lower load averages simply because it is a way more optimized kernel.
-Harald

jason smith said:

“…because it is a way more optimized kernel.” – that’s exactly our explanation :)

but yeah, our load averages with UEK are significantly lower. I’ve run with the RHEL kernel and made hot /proc changes, e.g. to the CPU scheduler and I/O scheduler defaults that Oracle sets, and seen differences in system behavior.

Harald van Breederode said:

Thanx for your comment. I didn’t know that. Can you give examples of when a process enters an uninterruptible sleep state? (It was long ago that I knew exactly what is going on in the kernel ;-) Maybe I am able to demonstrate this behaviour.
-Harald

You already have demonstrated the behaviour. The uninterruptible state is used for a short-term wait, and disk I/O falls into this category. Look for the section “process state codes” in the manpage for “ps” and you will find state “D” for “Uninterruptible sleep (usually IO)”. And surely you have seen LGWR or other I/O-intensive processes in this state. So this has already been part of your analysis.
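The D state discussed here is easy to spot from ps output; a small sketch of a filter (the dstate helper is my own, and the field position assumes the ps -eo pid,stat,comm column order shown in the usage line):

```shell
# Show processes in uninterruptible sleep: with ps -eo pid,stat,comm
# the STAT column is field 2, and a leading "D" marks state
# "Uninterruptible sleep (usually IO)" per the ps(1) manpage.
dstate() { awk 'NR > 1 && $2 ~ /^D/'; }
# Usage: ps -eo pid,stat,comm | dstate
```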


Vasudevan Rao said:

It was a very interesting article, explained in minute detail. The topic is covered very well in the Unix and Linux System Administration Handbook as well. Your article is indeed superb, with command-line output included, and the conclusion and summary parts are excellent. If possible, please post your load-gen script, either as a bash or a Perl script.

Harald van Breederode said:

Thank you for your question. Yes, it is quite possible for the Load Average to be above 500 or even above 1000. It does not matter how many CPUs or how much memory your system has. During my research I managed to drive the Load Average above 1200 on a dual-core system with only 4 GBytes of memory…
-Harald

Carlos Martinez said:

Thank you sir, it was pretty helpful. If you can share the load/CPU scripts it would be awesome. By the way, I had a case where the load average was high (5 5 5) while CPU idle was around 97%: no swapping, physical memory fine, and the disks showed practically no utilization, although I found little peaks of iowait, really very few. Then I found a couple of processes in the D state in the top output. I suspect those D-state processes are what is causing the high load average, but the weirdest thing is that even if I reboot the server those processes come back again :S

And actually those processes belong to the OS, or at least that’s what I think; the owner is root and the command name per process is PAL_EVENT ERROR RETRY PARSE CMPLT IDLE. This is happening on CentOS 6.5.