Why is my system so slow?

This is for Linux and Unix systems. Search engines find it
for Windows people looking for info about the "AVSERVE.EXE"
process, but this isn't going to help you with that. You are
probably infected by Sasser or some other worm/virus. My
Windows Performance Boosts may be helpful.

This is not a performance tuning article. If your system is
always slow, this article may not be what you are looking for. I'll
be covering some general performance related issues here, but the
main focus is for the system that was running fine yesterday but is
sucking mud today. The typical response to such problems is "Reboot
it", and while that may indeed fix the problem, it does not address
the root cause, so you are likely to have the situation again.

You need to figure out WHY it is slow. It would be nice if you
had prepared ahead of time by having such tools as "sar" enabled
for historical data or at least were familiar with what "top" and
"vmstat" look like when your system is "normal"- that is, doing
whatever tasks it is supposed to do, with the usual number of
users, and not seeming "slow". A baseline of "normal" performance
is very useful when trying to figure out what is causing abnormal
performance. For instance, here's a "sar -b" from a Linux
system:

Does the activity beginning just after 8:00 AM represent an
abnormal condition for this machine? Will the system seem slow with
this amount of disk activity? If the system is slow right now, is
this why- is it because of disk access? Suppose I now take a "sar
-b 5 10" sample and see this:

Well, in fact, this wouldn't be a particularly heavy load on the
hardware I ran it on, and wouldn't noticeably affect performance on
the single user desktop that it is. The "sar -b 5 10" shows that
whatever the disk activity is, it isn't constant. But is it
"normal"? The only way you'd know that is if you were familiar
with the system or could look at historical data- which is why it
is important to collect that data from the very first day you put a
machine into service.

Sar

Do you have the proper patches on your system?

With ANY problem, there really is no sense in chasing it very
far if you are running on systems that are not properly patched.
Updated programs and kernel fixes often make problems magically go
away.

Your beginning tool for that is "sar"- enable it on SCO Unix if
it isn't already enabled with

/usr/lib/sa/sar_enable -y

Ignore the warning about rebooting- it is not necessary.

Linux folks will find their stuff just a bit below this- hang tight.

If you already know about sar, you may want to skip ahead to
What's Wrong? now.

All that does is uncomment entries for sa1 and sa2 in the sys
and root crontabs. By default, SCO runs sar every twenty minutes
during "working hours" and every hour otherwise; you should adjust
the "sys" crontab to meet your specific needs. The daily summary
(the sa2 script run from root's crontab) creates summary files in
/var/adm/sa. These will have names like "sar01", "sar02", etc.; the
daily data files are named "sa01" through "sa31". The daily data
files are binary data. If you wanted to examine memory statistics
from the 15th of the month, you'd run

sar -f /var/adm/sa/sa15 -r

The "sar" summary files are text. The "sar15" is the output of
running

sar -f /var/adm/sa/sa15 -A > /var/adm/sa/sar15

and therefore can be viewed directly with "more", printed
directly or whatever.

On Linux, the sar files are found in /var/log/sa, and use the same
naming as on SCO. If you have been running sar for a month or more, you'll
always have one month's worth of historical data. As daily files are
overwritten by daily files from the following month, you don't have
to be concerned with using up disk space. Having this historical
data lets you quickly decide if the current sar statistics
represent an unusual condition.
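Since the daily file name is just the day of the month, picking the right file is easy to script. Here's a minimal sketch; "sa_path" is a name I made up for illustration (it is not part of sar), and the default directory assumes the Linux location mentioned above.

```shell
# sa_path DAY [DIR]: print the sar daily data file for a day of the month.
# Hypothetical helper, not part of sar itself.
sa_path() {
    day=${1#0}               # drop a leading zero so "08" isn't read as octal
    dir=${2:-/var/log/sa}    # assumed default: the Linux location
    printf '%s/sa%02d\n' "$dir" "$day"
}
```

With that defined, sar -f "$(sa_path 15)" -r shows memory statistics for the 15th; on SCO you'd pass /var/adm/sa as the second argument.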

The flags for sar vary on different OS's, so read the man page.
On all systems, sar without any arguments gives you cpu usage, but
even there the output will vary. Linux systems have a "nice" column
that SCO Unix lacks, and SCO includes a useful "wio" column not
found on Linux:

If you run sar without any numerical arguments, it will look for
today's historical data (and complain if it can't find it). If you
run it with numerical arguments, it samples what is happening now.
The first argument is the time between samples (5 seconds is a good
choice), the second is the number of samples. So "sar 5 2" gives
two samples, 5 seconds apart.

Look for a process whose time column has gone up by 3 to 5
seconds each time- if you have something like that, that's your
problem- you need to kill it. The TIME column is time on the cpu-
normally a process doesn't spend a great deal of time actually
running- it's waiting for the disk, waiting for you to type
something, etc. Most processes spend most of their time sleeping,
waiting for something else to happen, so something that gains 3
seconds or more in 5 seconds of wall time is usually suspect.

If you watch it over a few minutes, the time it gains here
divided by the elapsed wall clock time is the percentage of your
cpu this process is taking for itself. A short-lived process can
take a lot of the cpu to print, or to redraw an X screen etc., so
you have to use some good judgement here. But 3 seconds out of 5 is
very likely a real problem.
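The arithmetic is simple enough to script. This is a rough sketch, assuming your ps prints TIME as MM:SS, HH:MM:SS or DD-HH:MM:SS; both function names are mine, not standard tools.

```shell
# time_to_seconds TIME: convert a ps TIME value to seconds.
# Assumes the form [DD-][HH:]MM:SS.
time_to_seconds() {
    echo "$1" | awk '{
        days = 0; t = $0
        if (index(t, "-")) { split(t, d, "-"); days = d[1]; t = d[2] }
        n = split(t, p, ":")
        secs = 0
        for (i = 1; i <= n; i++) secs = secs * 60 + p[i]
        print days * 86400 + secs
    }'
}

# cpu_share FIRST_TIME SECOND_TIME ELAPSED_SECONDS
# Whole-number percentage of one cpu used between two readings.
cpu_share() {
    t1=$(time_to_seconds "$1")
    t2=$(time_to_seconds "$2")
    echo $(( (t2 - t1) * 100 / $3 ))
}
```

So a process that showed 00:00:42 and then 00:00:45 five seconds later gives "cpu_share 00:00:42 00:00:45 5", which works out to 60 percent.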

Of course you need to understand what you are killing: you
probably wouldn't want to kill the main Oracle database, for
example.

If you kill the errant process and another copy of it pops right
back to the top of the list, then you need to track down its
parent:

# for example, if process 15246 is the problem
ps -p 15246 -o ppid

Of course, it may go further up the chain. Here's a script that
traces back to init:
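A minimal sketch of one way to do that, assuming a ps that accepts -o and -p (the function name is mine):

```shell
# trace_to_init PID: print PID and each ancestor, one per line, up to init.
trace_to_init() {
    pid=$1
    while [ -n "$pid" ] && [ "$pid" -gt 0 ] 2>/dev/null; do
        # show the PID and the command name, if ps can still find it
        echo "$pid $(ps -o comm= -p "$pid" 2>/dev/null)"
        [ "$pid" -eq 1 ] && break
        # step up to the parent; the trailing "=" suppresses the header
        pid=$(ps -o ppid= -p "$pid" 2>/dev/null | tr -d ' ')
    done
}
```

Run it as "trace_to_init 15246". The process you most likely want is usually the first ancestor that isn't a shell or init itself.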

Sometimes you'll have a badly written network program that
starts sucking resources when its client dies. If you can't get the
supplier to fix it, you may want to write a script to track down
and kill these things. One clue that might help: the difference
between a good "xyz" process and a bad one might just be whether or
not it has an attached tty. So, if you see this:

5821 ? 00:00:42 xyz
6689 ttyp0 00:00:08 xyz
7654 ttyp1 00:00:12 xyz

It's probably the one with a "?" that will start accumulating
time. So a script that watched for and killed those might look like
this:
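Here is a hedged sketch of the selection part, assuming "ps -e" prints PID, TTY, TIME and CMD in that order as in the listing above. The name "no_tty_pids" is made up, and "xyz" stands in for whatever your problem program is called.

```shell
# no_tty_pids NAME: read "ps -e" output on stdin and print the PIDs of
# processes called NAME that have no controlling tty ("?" in the TTY column).
no_tty_pids() {
    awk -v name="$1" '$2 == "?" && $4 == name { print $1 }'
}
```

Look before you shoot: run "ps -e | no_tty_pids xyz" and check the list by hand, and only pipe it into "xargs kill" once you are satisfied that a missing tty really does mean a dead client on your system.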

And even that may not be clever enough for your particular
situation, so test and tread carefully. You may even need to do
math on the time field to see what has really happened. If all else
fails, you might be able to set ulimit on cpu time or some other
limit that wouldn't affect a "normal" running of this process.
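For the ulimit approach, something like this sketch would do. It assumes your shell's ulimit supports -t (bash, ksh and dash do; POSIX itself only promises -f), and the wrapper name is mine.

```shell
# run_with_cpu_limit SECONDS COMMAND [ARGS...]
# Run COMMAND with a cap on cpu seconds. The function body is a subshell,
# so the limit does not stick to the calling shell.
run_with_cpu_limit() (
    ulimit -t "$1" || exit 1
    shift
    exec "$@"
)
```

A process that exceeds the limit is sent SIGXCPU by the kernel. Pick a number well above what a healthy run ever accumulates, which is another place where historical data helps.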

Another thing you may see is a process that has used a lot of
time but isn't gaining time right now. I've seen that many times
where the process is "deliver"- MMDF's mail delivery agent on SCO
systems that aren't running sendmail. What happens is that for
whatever reason (a root.lock file from a crash in /usr/spool/mail
or a missing "sys" home directory), there are thousands of
undelivered messages in the subdirectories of
/usr/spool/mmdf/lock/home

The fix for that is simple if you don't care about the messages:
rm -r all those directories and recreate them empty with the same
ownership and permissions

You'd then want to verify that mail is working normally and that
whatever caused the problem isn't still happening- for example, if
/usr/sys is missing this problem will come right back again very
quickly.

Another possibility is a program that is rapidly spawning off
other programs. You should be able to see that in "ps -e". First,
is the number of processes growing?

ps -e | wc -l
sleep 5
ps -e | wc -l

Or, are there new processes briefly showing up at the end of the
listing?

ps -e | tail
sleep 5
ps -e | tail

In either case, you need to track down the parent and kill
it.

Low Memory

If sar -r shows low memory or (worse) swapping, go buy more
memory. That's going to be easy to spot on SCO's sar, but Linux is
a bit harder. Let's look at SCO first:

This machine consistently has over 200 MB of memory available-
unused (freemem pages are 4K each). Obviously no problem there,
and in fact, if this is always the case (which you'd know from sar
historical data), you may want to use some of that memory for disk
buffers- see Adding RAM to SCO Unix.

On Linux, it will look as though all memory is in use. The reason is
that Linux always uses "unused" memory for the file system buffer
cache. So, roughly 75 MB has been put to work for that (kbcached
column).

Another way to look at Linux memory right this moment is to

cat /proc/meminfo
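Because the cache is given back when programs need it, a more useful number than free memory alone is free plus buffers plus cache. A small sketch, assuming the usual "Name: value kB" layout of /proc/meminfo (the function name is mine):

```shell
# reclaimable_kb: read /proc/meminfo-style text on stdin and print
# MemFree + Buffers + Cached, in kB.
reclaimable_kb() {
    awk '$1 == "MemFree:" || $1 == "Buffers:" || $1 == "Cached:" { sum += $2 }
         END { print sum + 0 }'
}
```

Use it as "reclaimable_kb < /proc/meminfo". Newer kernels also report a MemAvailable line, which is the kernel's own estimate of the same idea; if yours has it, prefer that.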

But what's using it? Well, ps -el will show you in the SZ column
how much memory each process is using. So

# SZ is the 10th column of "ps -el"; sort on it numerically, largest first
ps -el | sort -rn -k 10,10 | head

can tell you a lot, particularly if you see that a process is
gaining memory over time.

Disk Performance

Remember, this article is not about performance tuning- it's
about specific performance degradation. However, it's almost always
true that the disk drives are the biggest performance bottleneck.
See Raid for a more general discussion of improving disk performance.

On SCO systems, you can get a good overview of disk performance
from "sar -d":

The first thing you are interested in is %busy. The "avwait"
column is the average amount of time that processes wait for the
disk to give them their data, so that and "avque" (how many
processes are trying to use the disk) can give you a clear picture
of load. The "avserve" is a measure of the disk's ability to deliver
that data, and isn't going to change much for the same
hardware.

Note that if %busy is small and avque and avwait are high, you
are probably seeing buffer cache flushes. Those can affect
performance, and there are tunables that affect how often and how
much is flushed, but those issues aren't the focus of this
article.

Linux sar doesn't have "-d" (or at least it doesn't on Red Hat
7.2), so the next best thing is iostat:

The "tps" is the number of transfers per second the disk was
asked for. The block read and write columns give you both the
number of blocks and the blocks per second.

For a sudden degradation of performance, you are only interested
in whether the disk is busier than normal. You might also be
looking at the blocks per second to compare how much data is being
moved around. Of course, if you decide that the disk activity is
unusual, your next problem is to find out why: you still have to
track down the process that is doing this.

The first place I'd look in such a situation is log files:
/var/adm on SCO, or /var/log on Linux. If there's nothing there,
and you see large amounts of writes, accompanied by decreasing disk
space (df -v), then you are looking for a growing file and the
process that is writing it. Finding that can be fairly easy:

cd /
du -s * > /tmp/a
sleep 30
du -s * > /tmp/b
diff /tmp/a /tmp/b

The directory that is growing will pop out of the diff. Change
into that directory and repeat the procedure until you find the
file that is growing, and finally use "fuser" to identify the
process that is writing the file.

fuser /tmp/growingfile      # list the PIDs that have the file open
fuser -k /tmp/growingfile   # kill them, once you are sure
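If you find yourself repeating the du comparison at several directory levels, it can be wrapped up so that only the entries that grew are printed. This is a sketch under assumptions: both function names are mine, and it relies on du separating size and name with a tab.

```shell
# du_snapshot DIR: one line per entry in DIR, as "kilobytes<TAB>path".
du_snapshot() {
    du -sk "$1"/* 2>/dev/null
}

# du_growth BEFORE AFTER: compare two snapshot files and print
# "path<TAB>kilobytes gained" for entries that grew or are new.
du_growth() {
    awk -F '\t' '
        FILENAME == ARGV[1] { before[$2] = $1; next }    # first snapshot
        $1 + 0 > before[$2] + 0 { print $2 "\t" ($1 - before[$2]) }
    ' "$1" "$2"
}
```

For example: "du_snapshot /var > /tmp/a; sleep 30; du_snapshot /var > /tmp/b; du_growth /tmp/a /tmp/b". Unlike diff, this ignores entries that shrank, which is usually what you want when chasing a growing file.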

Network Performance

A sudden problem with network performance is almost always going
to be hardware, but there are other possibilities. For example, are
your routes what they should be? If "netstat -rn" shows different
routing than what you expect, do you have routed running and some
router is giving you bad information? Kill routed and reset your
routes.

"netstat -in" will give you an idea of collisions; a bad card
somewhere in your network can cause this- if you run switches
rather than hubs you won't have collisions at all, but that bad
card could still be affecting performance. Know what "normal"
network traffic looks like, know how long "ping" response times
should be on your WAN, etc.

Mis-negotiation of network speeds and cheap NICs are
another source of network problems:

Other problems

Disk, CPU, and network are the most common areas that will cause a
sudden performance drop. However, there are other things that can
happen. The general rule is that if you have historical data,
you'll be able to spot the problem much more quickly. Some of the
other things I'd look at if everything above came up blank include
"sar -y" (tty activity) and "sar -q" (run queue size)- these may
not be available on Linux.