Linux high I/O load: what to check when troubleshooting?

When you look at the CPU activity of your computer, one of the parameters is iowait. This value shows how much time your CPU spends idle while it waits for I/O operations to complete. These include disk read/write operations, network, IPC, etc. Is this behavior a problem and, if so, what causes it and how do you fix it? On one of the popular Unix-related forums one “genius” wrote:

The iowait “problem” is funny. It’s like when people complain that Linux is “using all my memory”. Yeah, no shit. You should be upset if you are copying files and your computer is /not/ in 100% iowait.

In reality, 100% iowait indicates that there is a problem and in most cases a big one that may even lead to data loss. Essentially, there is a bottleneck somewhere in the system. Maybe one of your disks is getting ready to die; or, perhaps, the NIC firmware is having problems with the latest kernel upgrade you installed. The troubleshooting process starts with the potentially more serious possibility: a bad disk.

Take a quick look at /var/log/messages, /var/log/dmesg, /var/log/boot.log and any other system log files. You are looking for disk I/O errors, failed read/write operations, bad sectors – anything that indicates a hardware problem with a disk. If you don’t find anything, look for IRQ and disk controller errors. Also look for memory errors and kernel panics. The three most likely culprits of high iowait are: a bad disk, faulty memory and network problems.
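
A case-insensitive grep is a quick way to scan the logs for the kinds of messages described above. This is only a sketch: log file locations differ between distributions, the function name scan_logs is made up for this example, and the pattern list is an illustrative assumption rather than a complete catalogue of kernel error strings.

```shell
#!/bin/sh
# Illustrative patterns only; extend them for your hardware and kernel.
PATTERN='i/o error|bad sector|medium error|read error|write error|kernel panic|machine check'

# Print matching lines (prefixed with the file name) from every readable file given.
scan_logs() {
    for f in "$@"; do
        [ -r "$f" ] || continue          # skip logs that do not exist on this system
        grep -iEH "$PATTERN" "$f"
    done
    return 0
}

scan_logs /var/log/messages /var/log/dmesg /var/log/boot.log /var/log/syslog /var/log/kern.log

# The kernel ring buffer itself; on some systems this needs root.
dmesg 2>/dev/null | grep -iE "$PATTERN" || true
```

No output means none of the patterns matched, not that the hardware is healthy.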

If you still see nothing relevant, it is time to test your system. If possible, kick all the users off the box and shut down the web server, database and any other user applications. Log in via the command line and stop XDM.

Open three shell windows: run “top” in one, “iostat -x 1” in the other and “find /etc -type f -print” in the third. Make sure you can see all three windows at the same time. This is a simple test that should generate some I/O activity on the system disk. Repeat this process for other disks. If you see iowait hovering near 100%, chances are you have a problem, but we don’t know what it is yet. However, now we do know that the network is probably not the cause.
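
If you would rather have a single number than watch “top”, iowait can be computed from two samples of the aggregate cpu line in /proc/stat. A minimal sketch, assuming the usual Linux field order (user, nice, system, idle, iowait, irq, softirq, steal); the function name iowait_pct is invented for this example.

```shell
#!/bin/sh
# Print iowait as a percentage of all CPU time elapsed between two
# "cpu ..." lines taken from /proc/stat.
iowait_pct() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        {
            total = 0
            # fields 2..9 are user through steal; guest time is already part of user
            for (i = 2; i <= NF && i <= 9; i++) total += $i
            t[NR] = total
            w[NR] = $6                    # field 6 is iowait
        }
        END {
            d = t[2] - t[1]
            if (d > 0) printf "%.1f\n", 100 * (w[2] - w[1]) / d
            else print "0.0"
        }'
}

if [ -r /proc/stat ]; then
    before=$(grep '^cpu ' /proc/stat)
    sleep 1
    after=$(grep '^cpu ' /proc/stat)
    echo "iowait over the last second: $(iowait_pct "$before" "$after")%"
fi
```

Run it while the find command is working through a disk and compare the figure with what “top” reports.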

Next, let’s stress your CPU but not the disks. The command below compresses an endless stream of zeros and sends the result to /dev/null. This generates no disk activity, but loads the CPU. Continue running “top” and “iostat -x 1” in the other two windows.

cat /dev/zero | bzip2 -c > /dev/null

If you see high CPU load but low iowait, we can eliminate CPU issues, IRQ conflicts, and faulty memory. Just to be on the safe side, let’s test memory anyway:
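
A minimal sketch of such a test, assuming md5sum is installed and /tmp is writable: write a file of random data, then checksum it three times. Repeated reads of a recently written file are normally served from the page cache, so the data passes through RAM each time. TESTFILE and SIZE_MB are illustrative names; raise SIZE_MB so the file covers a good share of your free memory.

```shell
#!/bin/sh
TESTFILE=${TESTFILE:-/tmp/memtest.$$}   # illustrative scratch file
SIZE_MB=${SIZE_MB:-64}                  # illustrative size, increase for a real test

# Fill the file with random data.
dd if=/dev/urandom of="$TESTFILE" bs=1M count="$SIZE_MB" 2>/dev/null

# Checksum the same file three times.
sum1=$(md5sum "$TESTFILE" | cut -d' ' -f1)
sum2=$(md5sum "$TESTFILE" | cut -d' ' -f1)
sum3=$(md5sum "$TESTFILE" | cut -d' ' -f1)
printf '%s\n%s\n%s\n' "$sum1" "$sum2" "$sum3"

rm -f "$TESTFILE"
```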

The three MD5 values above should be identical. If they are not – your system has a faulty RAM chip.

When you have eliminated hardware problems as possible causes of high iowait, the next step is to review firmware and drivers. You are particularly interested in disk controller firmware: unstable performance with no error messages is the sign of a firmware problem. Try really hard to remember if you made any system changes recently, especially something that required a reboot, like a kernel upgrade. If this is the case, roll back the upgrade or search for updated firmware. You should grab a copy of Sysinfo (free 30-day trial) to help you identify makes and models of your disks, controllers, etc.
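
Some of this information is also available without extra tools. The sketch below prints the running kernel version and the model string the kernel exposes for each block device; it assumes a Linux system with sysfs mounted on /sys, and the function name list_disks is made up for this example. Devices without a model file (loop devices, for instance) are reported as unknown.

```shell
#!/bin/sh
# Print "device<TAB>model" for every block device known to the kernel.
list_disks() {
    for dev in /sys/block/*; do
        [ -e "$dev" ] || continue               # glob matched nothing
        name=${dev##*/}
        model=unknown
        if [ -r "$dev/device/model" ]; then
            model=$(cat "$dev/device/model")
        fi
        printf '%s\t%s\n' "$name" "$model"
    done
    return 0
}

echo "Running kernel: $(uname -r)"
list_disks
```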

While your disks and controllers may be tip-top, you may have a problem with a filesystem. Even if you see high iowait when accessing any filesystem, you should still check out the partition where /var is mounted and swap – if there is a problem, it will manifest itself regardless of what your system is doing. But here you will run into a little problem: fsck will not scan a mounted partition and you cannot unmount /var. Let’s say /var is mounted on /dev/hda2 and swap is on /dev/hda1.
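
To find out which device actually holds /var and which swap areas are active on your own system, something along these lines should work. It is a sketch that assumes a POSIX df and the Linux /proc/swaps file; the variable name var_dev is illustrative.

```shell
#!/bin/sh
# Device (or filesystem) that backs /var, taken from the second line of df -P.
var_dev=$(df -P /var | awk 'NR == 2 { print $1 }')
echo "/var is on: $var_dev"

# Active swap areas, if the kernel exposes them.
if [ -r /proc/swaps ]; then
    echo "Active swap areas:"
    cat /proc/swaps
fi
```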

You need to fsck /dev/hda2 because this is where your /var is mounted. Download KNOPPIX or an Ubuntu LiveCD, boot from the CD (without installing) and run “fsck /dev/hda2” from there. If everything looks clean, shut down your system, take the CD out and boot normally. The next step is to check out swap. If you just run fsck on the swap partition, it will fail.

You need to disable swap on /dev/hda1 before you can scan it. Before you can do this, you need to add another swap area: you cannot run without any swap space. So, to add swap on the fly, create a swap file (1 GB in this example):
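
A hedged sketch of the sequence, using the standard dd, mkswap, swapon and swapoff tools. SWAPFILE, OLD_SWAP, DRY_RUN and the run helper are names invented for this example, and /dev/hda1 simply follows the partition layout assumed in this article. By default the script only prints the commands; set DRY_RUN=0 and run it as root to execute them.

```shell
#!/bin/sh
SWAPFILE=${SWAPFILE:-/swapfile.tmp}   # illustrative location of the temporary swap file
OLD_SWAP=${OLD_SWAP:-/dev/hda1}       # swap partition from the example layout
DRY_RUN=${DRY_RUN:-1}                 # 1 = print commands only, 0 = really run them

# Echo the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

run dd if=/dev/zero of="$SWAPFILE" bs=1M count=1024   # 1 GB of zeros
run chmod 600 "$SWAPFILE"                             # swap files should not be world-readable
run mkswap "$SWAPFILE"                                # write the swap signature
run swapon "$SWAPFILE"                                # enable the new swap area
run swapoff "$OLD_SWAP"                               # now the partition can be released
```

When you are done with the partition, reverse the last two steps: swapon the partition again, swapoff the file, and delete it.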

If you failed to identify the cause of the iowait problem, you should consider the possibility that there is no problem: perhaps your system is handling extra load and running short on resources. Take a look at the running processes and see what’s eating up memory. Perhaps you upgraded an application and now it is using more RAM, which leads to high swapping, which leads to high disk activity, which leads to high iowait.
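
A quick way to see what is eating memory and how much swap is in use, as a sketch: it assumes a ps that accepts the -e and -o options and the Linux /proc/meminfo file, and the variable name top_procs is illustrative.

```shell
#!/bin/sh
# Five processes with the largest resident set size (RSS, in KiB).
top_procs=$(ps -eo pid=,rss=,comm= | sort -k2,2nr | head -n 5)
echo "PID RSS(KiB) COMMAND"
echo "$top_procs"

# Swap in use, computed from SwapTotal and SwapFree.
if [ -r /proc/meminfo ]; then
    awk '/^SwapTotal:/ { t = $2 }
         /^SwapFree:/  { f = $2 }
         END { printf "swap used: %d KiB of %d KiB\n", t - f, t }' /proc/meminfo
fi
```

To watch swapping as it happens, run “vmstat 1” and keep an eye on the si and so columns: values that stay well above zero mean the system is actively paging to and from swap.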

The solutions are simple:

1. Install more RAM
2. Move swap to another disk or – even better – move it to another disk on a separate controller.
3. Move user applications to another disk/controller and specify default log locations outside of the system disk.