Performance Monitoring Tools for Linux

Mr. Gavin provides tools for systems data collection and display and discusses what information is needed and why.

What Do We Do with the Data?

I now had the data, but since columns of figures are boring,
I needed a way to look at the data and make sense of it. I had used
gnuplot for similar tools on other
systems, so it seemed to be a good choice. I started with a script
to display CPU utilization, charting the percentages of time spent
in idle, user, system and nice states.

The cpu data file has five columns that look like
this:

0000 4690259 69915 661038 7937582
0005 4690408 69964 661286 7966975

Column 1: time-stamp of observation (HHMM)
Column 2: seconds in system state since last booted
Column 3: seconds in nice state since last booted
Column 4: seconds in user state since last booted
Column 5: seconds in idle state since last booted

My reporting scheme was to get the number of seconds spent in
each state since the last observation, add up the different states
and express each one as a percentage of the total. I ran into an
interesting issue right away: what about a reboot? Booting the
system zeroes out the counters, and subtracting the old values from
the new ones generates negative values, so I had to handle it
properly to provide useful information. I decided to watch for a
counter value that was lower than the last observation's value and,
if found, reset the prior values to zero. To make the chart more
informative, a marker data point was set to 100 for a reboot and -1
for a normal record. The -1 value places the data point outside the
chart, so it is not displayed.
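
The delta-and-percentage scheme above can be sketched in a few lines of Python. This is only an illustration, not the collection or graphing script from the archive: the function names are hypothetical, and the column layout (time-stamp first, then system, nice, user and idle counters) is an assumption based on the sample rows.

```python
from typing import Iterable, List, Tuple

# Assumed order of the four counter columns that follow the HHMM time-stamp.
STATES = ("system", "nice", "user", "idle")


def parse_line(line: str) -> Tuple[str, Tuple[int, ...]]:
    """Split one record into its time-stamp and its four counters."""
    fields = line.split()
    return fields[0], tuple(int(f) for f in fields[1:5])


def cpu_percentages(lines: Iterable[str]) -> List[Tuple[str, List[float], int]]:
    """Return (time-stamp, percentages per state, reboot marker) per interval.

    The marker is 100 when a reboot was detected and -1 otherwise, so a
    chart with a 0-100 y-axis only shows the reboot points.
    """
    rows = []
    prev = None
    for line in lines:
        if not line.strip():
            continue
        stamp, counters = parse_line(line)
        if prev is None:
            # The first record only establishes the baseline.
            prev = counters
            continue
        # A counter lower than its previous value means the system rebooted.
        rebooted = any(c < p for c, p in zip(counters, prev))
        if rebooted:
            prev = (0, 0, 0, 0)
        deltas = [c - p for c, p in zip(counters, prev)]
        total = sum(deltas)
        pct = [100.0 * d / total if total else 0.0 for d in deltas]
        rows.append((stamp, pct, 100 if rebooted else -1))
        prev = counters
    return rows


if __name__ == "__main__":
    sample = [
        "0000 4690259 69915 661038 7937582",
        "0005 4690408 69964 661286 7966975",
    ]
    for stamp, pct, marker in cpu_percentages(sample):
        print(stamp, dict(zip(STATES, (round(p, 2) for p in pct))), marker)
```

Because each state is expressed as a share of the interval total, the unit of the counters does not affect the result.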

Sometimes a hard copy is preferred when presentations or
reports are needed. The gnuplot authors provide for a variety of
output formats, and the script will switch between X11 display and
PostScript output depending upon which option switches are
set.
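
As a rough sketch of how such a switch might work, the Python helper below writes the gnuplot commands that select either the X11 display or a PostScript file. It is not the author's script: the function names, the data file name cpu.pct and the assumption that its second column holds the idle percentage are all hypothetical. The gnuplot commands themselves (set terminal, set output, set yrange, plot) are standard.

```python
from typing import List, Optional


def gnuplot_header(postscript_file: Optional[str] = None) -> List[str]:
    """Return the gnuplot commands that choose the output device."""
    if postscript_file is None:
        # No file requested: draw on the X11 display.
        return ["set terminal x11"]
    # Hard copy requested: write PostScript to the named file.
    return ["set terminal postscript", f'set output "{postscript_file}"']


def cpu_plot_script(data_file: str, postscript_file: Optional[str] = None) -> str:
    """Build a small gnuplot script charting one column of data_file."""
    lines = gnuplot_header(postscript_file)
    lines += [
        'set title "CPU utilization"',
        'set ylabel "percent"',
        # A 0-100 range keeps the -1 "no reboot" markers off the chart.
        "set yrange [0:100]",
        # Hypothetical layout: column 2 of the data file is the idle percentage.
        f'plot "{data_file}" using 2 title "idle" with lines',
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(cpu_plot_script("cpu.pct"))            # X11 display
    print(cpu_plot_script("cpu.pct", "cpu.ps"))  # PostScript hard copy
```

The generated text can be saved to a file and passed to gnuplot; for the X11 case, running gnuplot with its -persist option keeps the window open after the script ends.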

Figure 1. Sample Chart

Figure 1 is a sample chart produced by the graphing script
shown in Listing 2. A breakdown of
the major parts of this script is included in the archive file on
SSC's FTP site,
ftp.linuxjournal.com/pub/lj/listings/issue56/2396.tgz.
Also included are the collection script, graphing scripts, a sample
crontab entry for running the collector script and the following
charting scripts:

cpu: CPU information as described above

ctxt: context switches per second

disk: disk utilization: total I/O, reads/writes and block reads/writes per second

eth: Ethernet packets sent and received per second, plus incoming and outgoing errors

intr: interrupts per second, charted by interrupt number

mem: memory utilization and buffer/cache/shared memory allocations

page: page-in and page-out activity

ppp: Point-to-Point Protocol packets sent and received per second, plus errors

proc: new processes created per second

swap: swap activity and swap space availability

I'm currently converting this toolkit to Perl and building a
web interface to allow these charts to be viewed as HTML pages with
the charts as GIF files.

David Gavin (dgavin@unifi.com) has worked in various support
environments since 1977, when, after COBOL training, he had the good
fortune to be assigned to the TSO (Time Sharing Option) support
group. From there he moved to MVS technical support, to VM and to
UNIX. He has worked with UNIX from mainframes to desktops,
baby-sitting Microsoft systems only when he couldn't avoid it. He
started using Linux back when it meant downloading twenty-five
disks over a 2400-baud dial-up line.



There's been some progress in the last 12 years or so...for example, Zoom from RotateRight ( http://www.rotateright.com ) provides a rich GUI or CLI-based system-wide profiler for Linux. It takes callstacks with every sample and can show source and assembly code for any sampled function.

The sarChart.cgi script has a bug in it. It reads from the tstamp column in each table incorrectly. To calculate the time it uses substr to extract the hour and min, but the offset parameter is off by 2 in both cases. This problem is probably due to changing the length of the year from 2 to 4 digits.

Description of the columns in the CPU output is incorrect:
0000 4690259 69915 661038 7937582
Column 5: seconds in idle state since last booted
Column 2: seconds in system state since last booted
Column 3: seconds in nice state since last booted
Column 4: seconds in user state since last booted
Column 1: time-stamp of observation (HHMM)