Back to Basics With Unix: System Visibility

Frustrated admins say Unix isn't always the most talkative when it comes to gathering performance information. With a few handy utilities you'll find on almost any machine, though, you can learn a lot.

Unix systems have forever been opaque and mysterious to many people. They generally don't have nice graphical utilities for displaying system performance information; you have to know how to coax out the information you need. Furthermore, you need to know how to interpret the information you're given. Let's take a look at some common system tools that can provide plenty of visibility into what the opaque OS is really doing.

Unfortunately, the same tools don't exist universally across all Unix variants. A few commonly underused ones do, however, and that is what we'll focus on first.

Disk Activity

A common source of "slowness" is disk I/O, or rather the lack of available I/O. On Linux especially, this can be difficult to diagnose. Often the load average will climb quickly, but without any corresponding processes in top eating much CPU. That is because Linux counts processes in uninterruptible sleep, which usually means waiting on disk I/O, when calculating the load average, so the number can rise sharply while the CPUs sit mostly idle. I've seen load numbers in the tens of thousands on more than one occasion.
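One way to check whether a high load average comes from I/O rather than CPU is to count how many processes are runnable versus stuck in uninterruptible sleep. The following is a minimal Python sketch, assuming a Linux-style /proc filesystem; the function names are made up for illustration and are not part of any standard tool.

```python
"""Sketch: explain a high load average by tallying process states (Linux /proc assumed)."""
import glob
import os
from collections import Counter


def parse_loadavg(text):
    """Return the 1-, 5-, and 15-minute load averages from /proc/loadavg text."""
    one, five, fifteen = text.split()[:3]
    return float(one), float(five), float(fifteen)


def state_of(stat_line):
    """Return the one-letter process state from a /proc/<pid>/stat line.

    The command name sits in parentheses and may itself contain spaces or
    parentheses, so split after the last ')' rather than on whitespace.
    """
    after_comm = stat_line[stat_line.rindex(")") + 1:]
    return after_comm.split()[0]


def count_states(stat_lines):
    """Tally states: R = runnable, D = uninterruptible sleep, S = sleeping."""
    return Counter(state_of(line) for line in stat_lines)


def main():
    if not os.path.exists("/proc/loadavg"):
        print("no Linux-style /proc here; nothing to report")
        return
    with open("/proc/loadavg") as f:
        print("load averages:", parse_loadavg(f.read()))
    lines = []
    for path in glob.glob("/proc/[0-9]*/stat"):
        try:
            with open(path, errors="replace") as f:
                lines.append(f.read())
        except OSError:
            continue  # the process exited between listing and reading
    counts = count_states(lines)
    print("runnable (R):", counts["R"], "| uninterruptible (D):", counts["D"])


if __name__ == "__main__":
    main()
```

If the D count is large while the R count is small, the load is most likely coming from processes waiting on I/O, and iostat is the next place to look.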

The easiest way to see what's happening to your disks is to run the "iostat" program. Via iostat, you can see how many read and write operations are happening per device, how much CPU is being utilized, and how long each transaction takes. Many arguments are available for iostat, so do spend some time with the man page on your specific system. By default, running 'iostat' with no arguments produces a report about disk IO since boot. To see what is happening "now," add a numerical argument last: iostat will then print a new report every time that number of seconds passes, each one covering only the preceding interval. On many systems the first report still shows the since-boot averages, so pay attention to the later ones.
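The difference between a since-boot report and an interval report comes down to subtracting two readings of the kernel's cumulative counters. Here is a rough Python sketch of that idea, assuming the Linux /proc/diskstats layout (device name in the third field, reads completed in the fourth, writes completed in the eighth). The function names are hypothetical, and this is not how iostat itself is implemented.

```python
"""Sketch: per-interval disk rates from two /proc/diskstats snapshots (Linux assumed)."""


def parse_diskstats(text):
    """Map device name -> (reads completed, writes completed) since boot."""
    counters = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 8:
            continue  # skip blank or malformed lines
        name = fields[2]
        counters[name] = (int(fields[3]), int(fields[7]))
    return counters


def rates(before, after, seconds):
    """Return device -> (reads/s, writes/s) for the interval between snapshots."""
    result = {}
    for name, (reads_after, writes_after) in after.items():
        if name not in before:
            continue  # device appeared during the interval; no baseline to subtract
        reads_before, writes_before = before[name]
        result[name] = ((reads_after - reads_before) / seconds,
                        (writes_after - writes_before) / seconds)
    return result
```

To use it, read /proc/diskstats, wait a few seconds, read it again, and pass both parsed snapshots to rates() along with the number of seconds waited. The raw counters alone correspond to the since-boot view.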

Linux will show the number of blocks read or written per second, along with some useful CPU statistics. This is one particularly busy server:

Notice that iowait is at 23 percent. This means that 23 percent of the time, the CPUs on this server sat idle waiting on disk I/O. Some Solaris iostat output shows a similar thing, just represented differently (iostat -xnz):

The %b (busy) column shows that device d101 is busy 100 percent of the time, meaning it always has at least one transaction in progress. The average service time isn't good either: disk reads shouldn't take 27.4ms. Arguably, Solaris's output is friendlier to read, since it gives throughput in kilobytes per second rather than blocks. We can quickly calculate that this server is reading about 19KB per read by dividing the number of KB read per second (kr/s) by the number of reads per second (r/s). In short: this disk array is being taxed by a large number of read requests.
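The per-read calculation is simple division of throughput by operation rate. As a small Python sketch, with a hypothetical function name and made-up input values chosen only to land on the same 19KB figure (they are not the numbers from the server described above):

```python
def average_read_size_kb(kb_read_per_sec, reads_per_sec):
    """Average size of one read in KB: throughput divided by operation rate."""
    if reads_per_sec == 0:
        return 0.0  # no reads in the interval, so there is no meaningful average
    return kb_read_per_sec / reads_per_sec


# Hypothetical inputs: 1900 KB read per second across 100 reads per second.
print(average_read_size_kb(1900.0, 100.0))  # 19.0
```

The same division works for writes, using the kw/s and w/s columns instead.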

Vmstat

The "vmstat" program is also universally available, and extremely useful. It, too, provides vastly different information from one operating system to the next. The vmstat utility will show you statistics about the virtual memory subsystem, or, to put it simply: swap space. In reality it covers much more than swap, as nearly every IO operation involves the VM system when pages of memory are allocated. A disk write, a network packet send, and the obvious "program allocates RAM" all impact what you see in vmstat.

On Linux, running vmstat with the -d argument will print out statistics about disk IO (the -p argument takes a partition name and reports on that partition alone). In Solaris you get some disk information anyway, as seen below:

A subtle but important difference between Solaris and Linux is that Solaris will start scanning for pages of memory that can be freed before it actually starts swapping RAM to disk. The 'sr' column, scan rate, will start increasing right before swapping takes place, and continue until some RAM is available. The usual items are available on all operating systems: swap space, free memory, pages in and out (careful, this doesn't mean swapping is happening), page faults, context switches, and some CPU idle/system/user statistics. Once you know how to interpret these items, you quickly learn to infer what they indicate about the usage of your system.
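As a rough illustration of reading these columns, the Python sketch below pairs the column names in vmstat output with the most recent row of numbers, then flags the two signals discussed here: a nonzero Solaris scan rate (sr) and nonzero Linux swap-in or swap-out (si, so). The function names are hypothetical, and any sample text you feed it should be treated as illustrative. Ordinary paging columns are deliberately ignored, since pages in and out do not by themselves mean swapping.

```python
"""Sketch: pull the latest vmstat sample into a dict and flag memory pressure."""


def parse_vmstat(text):
    """Return the last numeric row of vmstat output as {column name: value}.

    Note: Solaris prints two columns named 'sy'; the later one wins here,
    which does not matter for the columns this sketch inspects.
    """
    names, sample = [], []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if all(field.isdigit() for field in fields):
            sample = fields  # keep the most recent row of numbers
        else:
            names = fields   # the column-name row directly precedes the data
    return dict(zip(names, (int(value) for value in sample)))


def memory_pressure_signs(stats):
    """List the warning signs present in one parsed vmstat sample."""
    signs = []
    if stats.get("sr", 0) > 0:
        signs.append("page scanner active (sr)")
    if stats.get("si", 0) > 0 or stats.get("so", 0) > 0:
        signs.append("swapping (si/so)")
    return signs
```

A single nonzero sample is not proof of trouble; sustained values across several intervals are what to watch for.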

The two main programs for finding "slowness" are therefore iostat and vmstat. Before the obligatory tangent into what Dtrace can do for you, here are a few other tools that no Unix junkie should leave home without:

lsof

Lists open files (including network ports) for all processes

netstat

Lists all sockets in use by the system

mpstat

Shows CPU statistics (including IO), per-processor

Dtrace

We cannot talk about system visibility without mentioning Dtrace. Invented by Sun, Dtrace provides dynamic tracing of everything about a system. It lets you ask arbitrary questions about the state of a system by enabling "probes," instrumentation points in the kernel and in user processes that fire when the code they mark is executed. That sounds intimidating, doesn't it?

Let's say that we wanted to know what files were being read or written on our Linux server that has a high iowait percentage. There's simply no way to know. Let's ask the same question of Solaris, and instead of learning Dtrace, we'll find something useful in the Dtrace ToolKit. In the kit, you'll find a few neat programs like iosnoop and iotop, which will tell you which processes are doing all the disk IO operations. Neat, but we really want to know what files are being accessed so much. In the FS directory, the rfileio.d script will provide this information. Run it, and you'll see every file that's read or written, and cache hit statistics. There's no way to get this information in other Unixes, and this is just one simple example of how Dtrace is invaluable.

The script itself is about 90 lines, inclusive of comments, but the bulk of it is dealing with cache statistics. An excellent way to start learning Dtrace is to simply read the Dtrace ToolKit scripts.

Don't worry if you're not a Solaris admin: Dtrace is coming soon to a FreeBSD near you. SystemTap, a similar tool, will be available for Linux soon as well. Until then, and even afterward, the above-mentioned tools will still be invaluable. If you can quickly get disk IO statistics and see whether you're swapping, the majority of system performance problems are solved. Dtrace also provides amazing application tracing functionality, and if you're looking at the application itself, you already know the slowness isn't likely being caused by a system problem.