Performance Monitoring Tools for Linux

Mr. Gavin provides tools for systems data collection and display and discusses what information is needed and why.

For the last few years, I have been
supporting users on various flavors of UNIX systems and have found
the System Accounting Reports data invaluable for performance
analysis. When I began using Linux for my personal workstation, the
lack of a similar performance data collection and reporting tool
set was a real problem. It's hard to get management to upgrade your
system when you have no data to back up your claims of “I need
more POWER!” Thus, I started looking for a package to get the
information I needed, and found there wasn't one. I fell back
on the last resort: I wrote my own, using as many existing tools as
possible. I came up with scripts that collect data and display it
graphically in an X11 window or hard copy.

What Do We Want to Know?

To get a good idea of how a system is performing, watch key
system resources over a period of time to see how their usage and
availability change depending upon what's running on the system.
The following categories of system resources are ones I wished to
track.

CPU Utilization: The central
processing unit, as viewed from Linux, is always in one of the
following states:

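idle: waiting, with no work to run

user: running code in an ordinary user process

system: running kernel code, such as system calls and I/O
handling

nice: like user, but running a job with low priority that
will yield the CPU to a task with a higher priority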

By noting the percentage of time spent in each state, we can
discover overloading of one state or another. Too much idle means
nothing is being done; too much system time indicates a need for
faster I/O or additional devices to spread the load. Each system
will have its own profile when running its workload, and by
watching these numbers over time, we can determine what's normal
for that system. Once a baseline is established, we can easily
detect changes in the profile.
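
As a minimal sketch of how these percentages can be derived, the example below assumes the counters come from the aggregate "cpu" line of /proc/stat, whose first four values are cumulative ticks in the user, nice, system and idle states. It is not the collection script described in this article; the function names and sample values are made up for illustration, and any extra fields that newer kernels append to the line are ignored.

```python
def parse_cpu_line(line):
    """Return (user, nice, system, idle) tick counts from an aggregate 'cpu' line."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    user, nice, system, idle = (int(value) for value in fields[1:5])
    return user, nice, system, idle


def cpu_percentages(before, after):
    """Percentage of the interval spent in each state between two samples."""
    names = ("user", "nice", "system", "idle")
    first = parse_cpu_line(before)
    second = parse_cpu_line(after)
    deltas = [later - earlier for earlier, later in zip(first, second)]
    total = sum(deltas)
    if total == 0:
        return {name: 0.0 for name in names}
    return {name: 100.0 * delta / total for name, delta in zip(names, deltas)}


# Two illustrative samples taken some interval apart (not real measurements).
SAMPLE_1 = "cpu  1000 50 300 8650"
SAMPLE_2 = "cpu  1200 50 400 9350"

print(cpu_percentages(SAMPLE_1, SAMPLE_2))
```

Because the counters are cumulative since boot, the profile for a given period comes from the difference between two samples, not from a single reading.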

Interrupts: Most I/O devices
use interrupts to signal the CPU when there is work for it to do.
For example, SCSI controllers will raise an interrupt to signal
that a requested disk block has been read and is available in
memory. A serial port with a mouse on it will generate an interrupt
each time a button is pressed/released or when the mouse is moved.
Watching the count of each interrupt can give you a rough idea of
how much load the associated device is handling.
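
One way to watch these counts, sketched here under the assumption that they are read from /proc/interrupts, is to total each interrupt line across all CPUs and compare two readings. The function names and the sample text are hypothetical; the layout assumed is a header row naming the CPUs followed by one row per interrupt.

```python
def parse_interrupts(text):
    """Map each interrupt label to its count summed over all CPUs."""
    lines = text.strip().splitlines()
    cpu_count = len(lines[0].split())  # header row: CPU0 CPU1 ...
    totals = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        counts = []
        for token in rest.split()[:cpu_count]:
            if not token.isdigit():
                break  # reached the controller/device description
            counts.append(int(token))
        totals[label.strip()] = sum(counts)
    return totals


def interrupt_deltas(before, after):
    """Interrupts raised per line between two parsed samples."""
    return {label: count - before.get(label, 0) for label, count in after.items()}


# Illustrative text in the assumed layout (not a real capture).
SAMPLE = """\
           CPU0       CPU1
  0:        120          0   IO-APIC   2-edge      timer
 12:       4000       1500   IO-APIC  12-edge      i8042
ERR:          3
"""

print(parse_interrupts(SAMPLE))
```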

Context Switching: Time
slicing is the term often used to describe how computers can appear
to be doing multiple jobs at once. Each task is given control of
the system for a certain “slice” of time, and when that time is
up, the system saves the state of the running process and gives
control of the system to another process, making sure that the
necessary resources are available. This administrative process is
called context switching. In some operating systems, this
switching can be fairly expensive, sometimes using more
resources than the processes it is switching. Linux is very good in
this respect, but by watching the amount of this activity, you will
learn to recognize when a system has a lot of tasks actively
consuming resources.
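
A rough sketch of measuring this activity follows. It assumes the cumulative context-switch count is taken from the "ctxt" line of /proc/stat and that two samples are collected a known number of seconds apart; the helper names and the sample numbers are illustrative only.

```python
def read_counter(stat_text, name):
    """Return the first integer value on the line that starts with `name`."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == name:
            return int(fields[1])
    raise KeyError(name)


def switches_per_second(before, after, interval_seconds):
    """Average context switches per second between two samples."""
    delta = read_counter(after, "ctxt") - read_counter(before, "ctxt")
    return delta / interval_seconds


# Illustrative samples taken 60 seconds apart (not real measurements).
BEFORE = "cpu  1000 50 300 8650\nctxt 500000\nprocesses 100"
AFTER = "cpu  1200 50 400 9350\nctxt 503000\nprocesses 104"

print(switches_per_second(BEFORE, AFTER, 60))
```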

Memory: When many processes
are running and using up available memory, the system will slow
down as processes get paged or swapped out to make room for other
processes to run. When a task's time slice is exhausted, that task may
have to be written out to the paging device to make way for the
next process. Memory-utilization graphs help point out memory
problems.
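
The figures behind such a graph could be gathered along the following lines. This sketch assumes the values come from /proc/meminfo, using the MemTotal, MemFree, SwapTotal and SwapFree fields; the function names and the sample numbers are invented for the example. Note that MemFree excludes memory the kernel is using for buffers and cache, so the "used" figure here overstates what processes themselves occupy.

```python
def parse_meminfo(text):
    """Map each field name to its numeric value (reported in kB)."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts:
            values[key.strip()] = int(parts[0])
    return values


def used_percent(values, total_key, free_key):
    """Percentage of a resource in use, given its total and free fields."""
    total = values[total_key]
    if total == 0:
        return 0.0
    return 100.0 * (total - values[free_key]) / total


# Illustrative text in the assumed layout (not a real capture).
SAMPLE = """\
MemTotal:        8000 kB
MemFree:         2000 kB
SwapTotal:       4000 kB
SwapFree:        3000 kB
"""

info = parse_meminfo(SAMPLE)
print(used_percent(info, "MemTotal", "MemFree"))
print(used_percent(info, "SwapTotal", "SwapFree"))
```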

Paging: As mentioned above,
when available memory begins to get scarce, the virtual memory
system will start writing pages of real memory out to the swap
device, freeing up space for active processes. Disk drives are
fast, but when paging gets beyond a certain point, the system can
spend all of its time shuttling pages in and out. Paging on a Linux
system can also be increased by the loading of programs, as Linux
“demand pages” each portion of an executable as needed.

Swapping: Swapping is much
like paging. However, it migrates entire process images, consisting
of many pages of memory, from real memory to the swap devices
rather than the usual page-by-page mechanism normally used for
paging.
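
Paging and swapping activity can be tracked in the same before-and-after fashion. The sketch below assumes a current kernel that exposes the cumulative counters pgpgin, pgpgout, pswpin and pswpout in /proc/vmstat, which is a different source from the one available when this article was written. The function names and sample values are hypothetical, and no particular unit is assumed for the counters.

```python
PAGING_COUNTERS = ("pgpgin", "pgpgout", "pswpin", "pswpout")


def parse_vmstat(text):
    """Extract the paging and swapping counters from vmstat-style text."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] in PAGING_COUNTERS:
            counters[parts[0]] = int(parts[1])
    return counters


def paging_activity(before, after):
    """Change in each counter between two samples."""
    first = parse_vmstat(before)
    second = parse_vmstat(after)
    return {
        name: second[name] - first[name]
        for name in PAGING_COUNTERS
        if name in first and name in second
    }


# Illustrative samples (not real measurements).
BEFORE = "nr_free_pages 5\npgpgin 1000\npgpgout 2000\npswpin 10\npswpout 20"
AFTER = "nr_free_pages 3\npgpgin 1600\npgpgout 2100\npswpin 10\npswpout 320"

print(paging_activity(BEFORE, AFTER))
```

A steadily rising pswpout delta alongside a busy system is the pattern that suggests memory pressure, while a burst of pgpgin on its own may simply reflect programs being loaded.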

Disk I/O: Linux keeps
statistics on the first four disks: total I/O, reads, writes, block
reads and block writes. These numbers can show uneven loading of
multiple disks and show the balance of reads versus writes.
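
As an illustration of checking the read/write balance, the sketch below assumes a current kernel, where per-device counters are published in /proc/diskstats instead of the four-disk summary described above. In that layout each row gives the major number, minor number and device name, followed by reads completed, reads merged, sectors read, time reading, writes completed, writes merged and sectors written. The function names and sample rows are made up.

```python
def parse_diskstats(text):
    """Map each device name to its read and write counters."""
    disks = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 10:
            continue  # skip blank or truncated rows
        disks[fields[2]] = {
            "reads": int(fields[3]),
            "sectors_read": int(fields[5]),
            "writes": int(fields[7]),
            "sectors_written": int(fields[9]),
        }
    return disks


def read_share(disk):
    """Percentage of completed I/O operations that were reads."""
    total = disk["reads"] + disk["writes"]
    return 100.0 * disk["reads"] / total if total else 0.0


# Illustrative rows in the assumed layout (not a real capture).
SAMPLE = """\
   8       0 sda 3000 10 48000 900 1000 5 16000 400 0 1200 1300
   8      16 sdb 100 0 800 20 300 0 2400 60 0 70 80
"""

for name, disk in parse_diskstats(SAMPLE).items():
    print(name, disk["reads"] + disk["writes"], read_share(disk))
```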

Network I/O: Network I/O statistics can
be used to diagnose problems and examine loading of the network
interface(s). The statistics show traffic in and out, collisions,
and errors encountered in both directions.
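
A sketch of pulling out these figures is shown below, assuming they are read from /proc/net/dev, which has two header rows followed by one row per interface: eight receive columns, then eight transmit columns. The function name, dictionary keys and sample text are illustrative choices, not part of the scripts described in this article.

```python
def parse_net_dev(text):
    """Map each interface to its traffic, error and collision counters."""
    stats = {}
    for line in text.strip().splitlines()[2:]:  # skip the two header rows
        name, sep, rest = line.partition(":")
        fields = rest.split()
        if not sep or len(fields) < 16:
            continue
        stats[name.strip()] = {
            "rx_bytes": int(fields[0]),
            "rx_packets": int(fields[1]),
            "rx_errors": int(fields[2]),
            "tx_bytes": int(fields[8]),
            "tx_packets": int(fields[9]),
            "tx_errors": int(fields[10]),
            "collisions": int(fields[13]),
        }
    return stats


# Illustrative text in the assumed layout (not a real capture).
SAMPLE = """\
Inter-|   Receive                                                |  Transmit
 face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed
    lo:    1000      10    0    0    0     0          0         0     1000      10    0    0    0     0       0          0
  eth0:  500000     400    2    0    0     0          0         0   250000     300    1    0    0     5       0          0
"""

for name, counters in parse_net_dev(SAMPLE).items():
    print(name, counters)
```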

These charts can also help in the following instances:

The system is running jobs you aren't aware of
during hours when you are not present.

Someone is logging on or remotely running commands
on the system without your knowledge.

This sort of activity will often show up as a spike in the
charts at times when the system should have been idle. Sudden
increases in activity can also be due to jobs run by
crontab.
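
Spotting such spikes need not depend on eyeballing a chart. As a simple sketch, the example below flags any observation that exceeds the baseline mean by more than a chosen number of standard deviations. The threshold of three deviations, the function name and the sample readings are all assumptions made for illustration; a real baseline would come from data collected on the system in question.

```python
import statistics


def find_spikes(baseline, observations, deviations=3.0):
    """Return (timestamp, value) pairs that stand well above the baseline.

    baseline: values recorded while the system behaved normally.
    observations: (timestamp, value) pairs to check against it.
    """
    mean = statistics.mean(baseline)
    spread = statistics.stdev(baseline)  # needs at least two baseline values
    threshold = mean + deviations * spread
    return [(stamp, value) for stamp, value in observations if value > threshold]


# Illustrative overnight CPU-busy percentages (not real measurements).
BASELINE = [5.0, 6.0, 4.0, 5.0, 7.0, 5.0, 6.0, 4.0]
OVERNIGHT = [("0100", 4.0), ("0200", 6.5), ("0300", 92.0), ("0400", 5.0)]

print(find_spikes(BASELINE, OVERNIGHT))
```

A flagged time stamp can then be checked against crontab entries and login records to see whether the activity was expected.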

There's been some progress in the last 12 years or so...for example, Zoom from RotateRight ( http://www.rotateright.com ) provides a rich GUI or CLI-based system-wide profiler for Linux. It takes callstacks with every sample and can show source and assembly code for any sampled function.

The sarChart.cgi script has a bug in it. It reads from the tstamp column in each table incorrectly. To calculate the time it uses substr to extract the hour and min, but the offset parameter is off by 2 in both cases. This problem is probably due to changing the length of the year from 2 to 4 digits.

Description of the columns in the CPU output is incorrect:
0000 4690259 69915 661038 7937582
Column 1: time-stamp of observation (HHMM)
Column 2: seconds in system state since last booted
Column 3: seconds in nice state since last booted
Column 4: seconds in user state since last booted
Column 5: seconds in idle state since last booted