Performance Monitoring: vmstat – Finding Performance Bottlenecks

The vmstat command is a useful tool for getting a general overview of system performance and for triaging performance issues. Performance bottlenecks fall into four major categories: CPU, disk I/O, memory, and network I/O. vmstat can help to find problems within the first three categories.

Before getting started, it is important to note that the first line of output from vmstat (and the only one given if it is run with no arguments) is an average since system boot time. It is usually not very useful for diagnosing performance issues, especially if the system has been up for a long time. However, it is good information for comparison and can be used to get an overall picture of the system workload.
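In practice that means discarding the since-boot line before reading the numbers. The sketch below works on a captured sample of vmstat output (the figures are illustrative, not from a real machine); on a live system you would pipe something like `vmstat 5 5` through the same filter.

```shell
# Captured vmstat output: two header lines, then the since-boot average,
# then the interval samples. All values are illustrative.
vmstat_out=' kthr      memory            page            disk          faults      cpu
 r b w   swap  free  re  mf pi po fr de sr dd f0 s0 --   in   sy   cs us sy id
 0 0 0 123456 78901   5  20  1  0  0  0  0  1  0  0  0  400  900 300 10  5 85
 2 0 0 120000 75000   3  15  0  0  0  0  0  2  0  0  0  420  950 310 60 10 30
 3 0 0 119000 74000   4  18  0  0  0  0  0  1  0  0  0  430  970 320 70 15 15'

# Drop the two header lines and the since-boot line (line 3), keeping
# only the interval samples that reflect current activity.
samples=$(printf '%s\n' "$vmstat_out" | awk 'NR > 3')
printf '%s\n' "$samples"
```

The same `awk 'NR > 3'` filter applied to `vmstat 5 5` leaves only the interval samples worth reading.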

There are two important parts of the vmstat output that pertain to CPU performance. The first is the r column, the very first column in the output. It shows the number of threads that were in the run queue during the past interval: threads ready to run, kept waiting only because no CPU was available to them.

There are several schools of thought on what maximum number is appropriate here, but most people agree that a run queue of more than 2 to 5 times the number of CPUs on the system indicates a bottleneck (this estimate differs when using multi-core CPUs). The way to resolve this is to run fewer applications or to add CPU modules to the system.
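Applying the conservative end of that rule of thumb (2x the CPU count) is a one-line awk check. This is a self-contained sketch: the CPU count is hardcoded (on Solaris you would get it from `psrinfo | wc -l`) and the r values are illustrative samples from successive intervals.

```shell
# Flag intervals where the run queue (r) exceeds 2x the CPU count.
NCPU=2    # hardcoded for this sketch; take from `psrinfo | wc -l` on a live system
over=$(printf '%s\n' 1 3 9 12 | awk -v n="$NCPU" '$1 > 2 * n { c++ } END { print c + 0 }')
echo "intervals above the 2x-CPU threshold: $over"
```

With two CPUs the threshold is 4, so the samples 9 and 12 are flagged while 1 and 3 pass.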

The second place to look for CPU-related data is the right-hand side of the output. There are three columns, us, sy, and id, which mean the user, system, and idle percentages of CPU time; they correspond to the three modes the CPU can be working in. They should add up to 100%.

Ideally, a CPU will spend most of its time in the us and id categories. The sy category refers to time the CPU spends doing driver/kernel-level work, which is time taken away from user applications. If the CPUs are spending most of their time in this category, it could indicate excessive context switching due to CPU or memory bottlenecks, issues with kernel-level locking, or other problems. A busy system will show a constant idle percentage near zero. Be careful: a busy system does not necessarily mean the system is overloaded.
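A minimal interpretation of one interval's us/sy/id split might look like the following. The percentages are illustrative values for a kernel-bound system, and the verdict strings are my own wording, not vmstat output.

```shell
# Interpret the us/sy/id split for one vmstat interval (illustrative values).
us=20 sy=65 id=15
total=$((us + sy + id))   # should be 100, rounding aside
if [ "$sy" -gt "$us" ]; then
  verdict="system-time heavy: look at context switching and kernel locking"
else
  verdict="normal user/idle profile"
fi
echo "us+sy+id=$total; $verdict"
```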

Disk I/O Performance

The vmstat utility cannot tell us which disks have a bottleneck, but it can tell us whether there is an I/O problem overall. The important column here is the b column. It is the second column in the output and stands for "blocked": the number of threads that were blocked, waiting for I/O, during the past interval. Over time, this column should contain 0s the majority of the time. If there is constantly a number in this column, check mpstat to see whether the wt (percent wait time) column is also high. That, in conjunction with the vmstat b column, indicates a system blocked on disk I/O; it is advisable to get extended iostat -xpn or iostat -xtc output and examine it thoroughly to try to detect the bottlenecks.
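A quick way to quantify "constantly a number in that column" is to count the nonzero intervals. The b values below are illustrative samples; the follow-up commands in the comments are the ones named above.

```shell
# Count intervals where the b (blocked) column was nonzero (illustrative samples).
blocked=$(printf '%s\n' 0 0 2 3 0 1 | awk '$1 > 0 { c++ } END { print c + 0 }')
echo "threads blocked on I/O in $blocked of 6 intervals"
# A consistently nonzero count is the cue to run:
#   mpstat 5        (watch the wt column)
#   iostat -xpn 5   (extended per-device statistics)
```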

Memory Bottlenecks

Memory bottlenecks are evidenced by two different things happening on the system: paging and swapping. Paging refers to pages of memory being reclaimed by the page daemon when the system starts to get low on free memory. Swapping is more extreme, and refers to entire processes being swapped out.

To determine whether you are only paging, or also swapping, examine two columns in the vmstat output. The first is the sr column. If the value in this column is greater than zero, the page scanner is scanning memory pages to put them back on the free list for reuse.

The page scanner runs when free memory falls below the value of a system parameter known as lotsfree (default value: 1/64th of physical memory), or cachefree if priority_paging is enabled (default value: 1/128th of physical memory).
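As a worked example of the lotsfree default stated above, here is the threshold for a hypothetical machine with 4096 MB of physical memory:

```shell
# Default lotsfree threshold: 1/64th of physical memory
# (hypothetical 4096 MB machine, figure from the text).
phys_mb=4096
lotsfree_mb=$((phys_mb / 64))
echo "page scanner wakes below ${lotsfree_mb} MB free"
```

So on this machine the scanner would start running once free memory drops below 64 MB.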

You should not worry about high scan rates if you are using the file system heavily; high scan rates can be normal in many circumstances. If priority_paging is enabled, the page scanner steals pages more effectively, so file system I/O does not cause unnecessary paging of applications; priority_paging makes the sr rate higher for a good reason. Solaris 8 introduces the cyclic page cache. With the cyclic cache, the scanner is no longer used to reclaim pages during file system I/O, so on Solaris 8 an sr greater than 0 is an indication that the system is running low on memory.

To see whether you are swapping, refer to the w column. It is the third column of the output and counts entire processes that are swapped out. You can determine what those processes are by running /usr/bin/ps -e -o pid,rss,args and looking for an RSS of 0 (the sched, pageout, and fsflush processes should always have an RSS of 0).
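Filtering that ps output for swapped-out processes can be sketched as follows. The ps output below is illustrative (the PIDs and the sendmail entry are made up); on a live system you would pipe `/usr/bin/ps -e -o pid,rss,args` into the same awk filter instead.

```shell
# Spot swapped-out processes: RSS of 0, excluding the kernel daemons
# that legitimately report 0. Illustrative ps output:
ps_out='  PID  RSS ARGS
    0    0 sched
    2    0 pageout
    3    0 fsflush
  101 1234 /usr/sbin/sshd
  202    0 /usr/lib/sendmail'
swapped=$(printf '%s\n' "$ps_out" |
  awk 'NR > 1 && $2 == 0 && $3 != "sched" && $3 != "pageout" && $3 != "fsflush" { print $1 }')
echo "swapped-out PIDs: $swapped"
```

Here only PID 202 is reported: the three kernel daemons are excluded, and sshd has a nonzero RSS.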

If you have anything in the w column, you are either low on memory right now, or you have been in the past. If your system gets low on memory and processes are swapped out, it may take a long time for them to get back into memory. This is especially true if they are daemons which are not run often, because they have to receive an event in order to try to run again. This is not necessarily bad, as long as when they need to run, they will have the memory to do so.

If, over time, you see swapping, you should probably consider adding memory to the system or devising a strategy to lower overall memory usage on the system.