Monday, November 2, 2015

A troubleshooting process for Linux problems

Although it would be nice to believe that cars, home theater systems, computers,
and Linux never break, the reality is that they do.
Many people have no problems with Linux, but those who do want the
best information and guidance possible. You can obtain professional help
from a number of places. For example, if you purchased Linux from a
major vendor such as Red Hat, you are entitled to some level of service
from that vendor. In fact, what you are actually purchasing is the
service. Other help is available on the internet on various web sites
and forums. Local user groups may also be available in your geographical
area, and you may even have some friends who use Linux and are willing
to offer a hand. Do not hesitate to use any and all resources available
to you.
Most of the time those of us who use Linux prefer—even enjoy—doing our own troubleshooting.
Solving problems of any kind is an art and a science. Solving
technical problems, such as those that occur with computers, requires a
good deal of specialized knowledge as well.
Any approach to solving problems of any nature—including problems
with computers and Linux—must include more than just a list of symptoms
and the steps necessary to fix or circumvent the problems which caused
the symptoms. This so-called "symptom-fix" approach looks good on paper
to the old-style managers (those managers who do not participate in The Open Organization) but sucks in practice.

KnOWDAT

The problem solving process that I use has five basic steps:

Knowledge

Observation

Deduction

Action

Testing

You probably already follow these steps when you troubleshoot a
problem but do not even realize it. If you follow these steps each time
you engage in solving a problem, you should be successful most of the
time. These steps are universal and apply to solving almost any type of
problem, not just problems with computers or Linux.
I used these steps for years in solving electronic and computer
problems without realizing it. Having them codified for me made me much
more effective at solving problems because when I became stuck, I could
review the steps I had taken, verify where I was in the process and
restart at an appropriate step if necessary.
You may have heard a couple of other terms applied to problem solving in
the past. The first three steps of this process are also known as
problem determination, that is, finding the cause of the problem. The
last two steps are problem resolution, which is actually fixing the
problem.

Knowledge

Knowledge of the subject in which you are attempting to solve a
problem is the first step. You must be knowledgeable about Linux at the
very least, and even better, you must be knowledgeable about the other
factors that can interact with and affect Linux, such as hardware, the
network, and even environmental factors such as the temperature,
humidity, and electrical environment in which the Linux system
operates.
Knowledge can be gained by reading books and magazines about Linux
and those other topics. You can attend classes, seminars, and
conferences. You can also learn by setting up a number of Linux computers
in a networked environment and by interacting with other knowledgeable
people.
My personal preference is to play—uh, experiment—with Linux or with a
particular piece such as networking, and then take a class or two to
formalize the knowledge I have gained.
Remember that without knowledge, "resistance is futile," to paraphrase the Borg. Knowledge is power.

Observation

The second step in solving the problem is to observe the symptoms of
the problem. It is important to take note of all of the problem
symptoms. It is also important to observe what is working properly.
This is not the time to try to fix the problem; merely observe.
An important part of observation is to ask yourself questions about
what you see and what you do not see. Aside from the questions you need
to ask that are specific to the problem, there are some general
questions to ask:

Is this problem caused by hardware, Linux, application software, or perhaps by lack of user knowledge or training?

Is this problem similar to others I have seen?

Is there an error message?

Are there any log entries pertaining to the problem?

What was taking place on the computer just before the error occurred?

What did I expect to happen if the error had not occurred?

Has anything about the system hardware or software changed recently?

Other questions will reveal themselves as you work to answer these.
The important thing to remember here is to gather as much information as
possible. This increases the knowledge you have about this specific
problem and aids in finding the root cause.
Use online resources to search for similar bugs. Perhaps this problem has already been reported and there is a fix for it.
As you gather data, never assume that the information obtained from
someone else is correct. Observe everything yourself. This can be a
major problem if you are working with someone who is at a remote
location. Careful questioning is essential and tools that allow remote
access to the system in question are extremely helpful when attempting
to confirm the information that you are given.
When questioning a person at a remote site, never ask leading
questions; the person will try to be helpful by answering with what they
think you want to hear.
At other times the answers you receive will depend upon how much or
how little knowledge the person has of Linux and computers in general.
When a person knows—or thinks he knows—about computers, the answers you
receive may contain assumptions that can be difficult to disprove.
Rather than ask, "Did you check..." it is better to have the other
person actually perform the task required to check the item. And rather
than telling the person what he or she should see, simply have the user
explain or describe to you what he or she sees. Again, remote access to
the machine can allow you to confirm the information you are given.
The best problem solvers are those who never take anything for
granted. They never assume that the information they have is 100%
accurate or complete. When the information you have seems to contradict
itself or the symptoms, start over from the beginning as if you have no
information at all.

Deduction

Deduce from your observations of the symptoms what the problem might be.
This is where art applies to problem solving. Deducing the cause from
your observations of the problem and from your knowledge and past
experience is where art, and perhaps a bit of magic, mixes with science to
produce inspiration, intuition, or some other mystical mental process
that provides some clue to the root cause of the problem.
In some cases this is a fairly easy process. You can see an error
code and look up its meaning from the sources available to you. You can
then apply the vast knowledge you have to deduce—the artful part—the
cause of the problem. In other cases it can be a very difficult part of
the problem determination process.
It helps to remember that the symptom is not the problem. The problem
causes the symptom. You want to fix the true problem, not just the
symptom.

Action

Now is the time to perform the appropriate repair action. This is
usually the simple part. The hard part is what came before—figuring out
what to do. After you know the cause of the problem, it is usually easy to
determine the correct repair action to take.
This might be to replace a defective hard drive or motherboard, or it might be necessary to upgrade or even fix some software.
For software with bugs, if you do not have the skills to fix it
yourself or within your organization, the very least you should do is to
report the bug using the appropriate means. I have reported a few bugs
to Red Hat using Bugzilla. Anyone can create a Bugzilla account and
search for existing similar bugs or report a new bug.

Testing

After taking some overt repair action, the repair should be tested.
This usually means performing the task that failed in the first place or
something that exercises the broken bit.
If the repair action has not been successful, you should begin the
procedure over again starting with the observed symptoms. It is possible
that they have changed due to the action you have taken and you need to
be aware of this in order to make informed decisions during the next
iteration of the process. Even if the problem has not been resolved, the
altered symptom could be very valuable in determining how to proceed.

An example

One example of solving a problem from my own experience occurred a
few years ago in my role as a part-time Linux System Administrator in a
test lab environment. It is fairly simple but can illustrate the process
flow of the steps I have outlined.
I received an email from one of our testers indicating that an
application he had installed as part of a test was crashing. It was
giving error messages indicating that it was out of swap space. This is
the initial observation performed by the user and transmitted to me.
My knowledge told me that the system that was being
used for testing this application had 16GB of RAM and 2GB of swap space.
Previous experience (Knowledge) told me that swap space in these
computers was almost never touched and RAM usage was typically far below
25% of the 16GB of RAM in these boxes.
At this point I deduced that the problem was not
really a problem with swap space as that would seem highly improbable. I
could still hold that possibility open, though only very slightly. You
will find that many error messages provided by programs can be quite
misleading and user observations can be even more so.
I made some observations of my own. I logged into the box and used
the free command as a tool to view memory and swap space. There was lots
of free RAM, and swap space usage was at zero. I know that if swap space
usage is actually zero, then it is very likely that none of the available
swap space has ever been allocated and no paging has occurred since the
last boot.
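The memory and swap observation above can also be scripted. What follows is only a minimal sketch, not what I ran at the time: it assumes the "Name: value kB" layout that Linux uses in /proc/meminfo, and the function names and the SAMPLE figures (chosen to resemble a 16GB box with 2GB of untouched swap) are made up for illustration.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integers (values in kB)."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        fields = rest.split()
        # Keep only well-formed "Name:   12345 kB" lines.
        if sep and fields and fields[0].isdigit():
            values[name.strip()] = int(fields[0])
    return values


def swap_used_kb(meminfo):
    """Swap in use is the total swap minus the swap that is still free."""
    return meminfo["SwapTotal"] - meminfo["SwapFree"]


# Hypothetical sample data, not output captured from a real system.
SAMPLE = """MemTotal:       16384000 kB
MemFree:        12000000 kB
SwapTotal:       2097148 kB
SwapFree:        2097148 kB
"""

info = parse_meminfo(SAMPLE)
print("Free RAM (kB):", info["MemFree"])
print("Swap used (kB):", swap_used_kb(info))
```

On a real Linux system you would pass the contents of /proc/meminfo to parse_meminfo() instead of SAMPLE.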
I also deduced from previous experience (knowledge) that there might
be a kernel of truth in that error message: the application was very
likely to be out of some resource or other. The other primary consumable
resources are CPU cycles and disk space.
This did not seem like a CPU problem, so I observed disk space using
the df command, which showed that the /var filesystem was full. I deduced
that the full filesystem was the cause of the problem.
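The same disk space observation can be sketched in a few lines of code. This is an illustration under stated assumptions rather than a replacement for df: it uses shutil.disk_usage from the Python standard library, the function names and the 95% threshold are arbitrary choices of mine, and the percentage is computed as used / (used + free), which may differ slightly from the Use% column that df prints.

```python
import shutil


def percent_used(used, free):
    """Percentage of the used-plus-free space that is currently used."""
    usable = used + free
    return 100.0 * used / usable if usable else 0.0


def nearly_full(paths, threshold=95.0, usage=shutil.disk_usage):
    """Return (path, percent) pairs for filesystems at or above the threshold.

    `usage` must return a (total, used, free) tuple for a path; it is a
    parameter only so the check can be tried without a real full disk.
    """
    report = []
    for path in paths:
        try:
            total, used, free = usage(path)
        except OSError:
            continue  # The path does not exist on this system; skip it.
        pct = percent_used(used, free)
        if pct >= threshold:
            report.append((path, pct))
    return report


# Check a few common mount points; paths that do not exist are skipped.
for path, pct in nearly_full(["/", "/var", "/opt"]):
    print(f"{path} is {pct:.1f}% full")
```

A filesystem that shows up in this report is a strong candidate for the "out of some resource" symptom, but as with any observation it should be confirmed before acting on it.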
All of our systems were kickstarted with a /var filesystem of 1.5GB.
Our policy was to install application programs in /opt, which is where
the ones we tested were designed to be installed.
I discussed this with the tester and was told that he had indeed
installed the application in /var. I told him to uninstall from /var and
install the application in /opt where it belonged. After he took this action, I had him test
the corrective action by performing the operation that had
previously failed. The test was successful and the problem was solved.
As you work through a problem it will be necessary to loop back
through at least some of the steps. If, for example, performing a given
corrective action does not resolve the problem, you may need to try
another action or you may need to go back to the observation step and
gather more information about the problem.
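The looping nature of the process can be sketched as a small program. This is purely illustrative: troubleshoot(), demo(), and the toy "system" dictionary are hypothetical names invented for this sketch, and real deduction is of course not a two-line function. The knowledge step is represented only by the history that each round adds to.

```python
def troubleshoot(observe, deduce, act, test, max_rounds=5):
    """Repeat observe -> deduce -> act -> test until the test passes.

    Returns a history of (symptoms, suspected_cause, fixed) tuples, which
    stands in for the knowledge gained during each pass.
    """
    history = []
    for _ in range(max_rounds):
        symptoms = observe()                # Observation
        cause = deduce(symptoms, history)   # Deduction, informed by earlier rounds
        act(cause)                          # Action
        fixed = test()                      # Testing
        history.append((symptoms, cause, fixed))
        if fixed:
            break
    return history


def demo():
    """Toy run loosely modeled on the example: a misleading swap error."""
    system = {"var_full": True}

    def observe():
        return {"error": "out of swap space", "var_full": system["var_full"]}

    def deduce(symptoms, history):
        # The first guess takes the error message at face value;
        # after a failed round, look at other resources instead.
        return "swap" if not history else "disk"

    def act(cause):
        if cause == "disk":
            system["var_full"] = False  # e.g. move the application out of /var

    def test():
        return not system["var_full"]

    return troubleshoot(observe, deduce, act, test)


for symptoms, cause, fixed in demo():
    print(cause, "->", "fixed" if fixed else "not fixed")
```

The point of the sketch is that a failed test is not wasted effort: it feeds back into the next round of observation and deduction.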

Analyze your process

I have been teaching people to repair both hardware and software for
many years. I think that many of us use some form of problem solving
process whether it has been formalized or not. When I was taught about
this process it enabled me to understand when and where it was breaking
down for me as I worked to solve problems. That allowed me to analyze
where I was going wrong and to get back on track.
Your process may be different, and you may not realize that you
actually have a describable and repeatable process. But, if you are
successful at solving computer problems, you do. Awareness of that
process, whatever it may be for you, can help you in resolving future
problems.