Check io-wait

Symptom: Slow IO performance investigation

One way to investigate the CQ performance is by checking request.log. If most of the requests take long time, what is the cause?

One of the possible causes would be disk performance as well as general application.

Some factors block disk performance including 'io-wait," which is the percentage of time that the CPU or CPUs wait for disk system because the system has an outstanding disk I/O request. The io-wait information from sysstat log tends to lead to misunderstanding. However, If you pay attention, it is not difficult to figure out server performance.

We can get io-wait by mpstat and iostat, on Linux. (We can get the info also on Windows, but it's not so easy to configure to collect the info.) The io-wait is calculated with just wait-time for storage device during the period. Make sure that you don't see the io-wait value on "all" of mpstat. It's an average value. We should sum up the total io-wait values of all CPUs.

For example, during the measurement period 1.0 sec under one disk system only accessed for CQ, if CPU0 waits for io-wait for 0.1sec, CPU1 for 0.2sec, CPU2 for 0.3 sec and CPU3 for 0.4sec, the system with all 4 CPUs waits for ii-wait 1.0 sec, totally 100%. That means, the system is blocked to run any process which read/write data from/to the storage due to io-wait. Meanwhile, the average value is shown as 25%. It looks like that there are enough resources. But don't use the value because the entire system does not wait for io-wait 0.25sec. If the CQ/AEM process is blocked by disk low performance, the maximum average value (shown on the column '%iowait' at CPU 'all' of mpstat) is about 100% on one CPU, 50% on two CPUs, 25% on four CPUs and 12.5% on eight CPUs. The value looks smaller if the system has more CPUs. But in fact, the total io-wait is 100% however many CPUs the system has.

...
Do not trust '24.04' on column '%iowait' at CPU 'all' line! Sum up all the %iowait values of each CPU, 14.41 + 17.63 + 56.91 + 6.25.
(If you use '24.04', multiply 24.04 by 4.)
---------------

If the system uses multiple storage devices for other application, the total io-wait value would exceed over 100%. At that time, check iostat to check how load each slice/storages eats.

Considerations

In general, if exceeded over 10%, consider that lack of I/O process would cause late response. The less io-wait, the better the performance is. To reduce io-wait unless some specified processes cause much io-wait, check the existing disk storage configuration and consider changing disk storage, such as disk with high performance, RAID configuration or SSD.