Troubleshooting

Summary

It happens nearly every day: the phone call or email saying that something is “broken”. The best designs, hardware, software, standards, and procedures cannot prevent this from happening. Basically, if it is human made, it will break. Because of this, there is one skill all information technology professionals must possess, and that is troubleshooting. Sadly, this is often not the reality. In my experience, troubleshooting is a natural skill that is not easily taught and is often acquired only through difficult and sometimes painful lessons. Unfortunately, computer science and information technology curricula often skip this necessary education. The result is IT departments full of very skilled and knowledgeable individuals who often struggle to find and solve even the most basic of issues.

My intention with this article is to share my experience and some troubleshooting techniques I’ve utilized throughout my career. In the last 27 years, I have worked as a technician, help desk agent, developer, database administrator, network administrator, web administrator, technology architect and even a manager. This diverse experience has given me a broad perspective of troubleshooting. I have bloodied my knuckles and have been stymied more times than I can count. Throughout my career I have learned about and uncovered troubleshooting approaches that have greatly improved my ability to solve problems. While every situation is unique and each issue requires an individual approach, there are patterns to troubleshooting that I would like to share.

The Occam’s Razor Approach

The first technique I’d like to discuss is actually the scientific principle called “Occam’s Razor”. It states “Entities should not be multiplied unnecessarily”. Another interpretation of the principle is “Among competing hypotheses, the one with the fewest assumptions should be selected”. I translate this as “look at the obvious first” or “don’t overcomplicate the problem”. In other words, “if you hear hoof beats, think horses, not zebras”.

I think this approach is the same as the KISS (Keep It Simple Stupid) principle. I cannot count the number of times this has worked. How often are problems fixed by restarting an application or rebooting the computer? A lot!

So how do you use this principle? The first thing is to ask the obvious questions. Don’t assume that these questions have already been asked and answered. They aren’t intended to insult the user; they are asked in an effort to eliminate the obvious.

Scenario 1

You receive a call where the user states “my computer isn’t running”. I know this is a very basic complaint, but it is also a common one.

Questions I’d ask are…

What do you mean by “running”? Perhaps this isn’t a computer problem, maybe it is a software issue, network issue or even a printer issue. Many terms like “running”, “working”, “on”, and “broken” have very different meanings to typical users than to techie people. We speak a different language. Speak like a muggle!

Is it on? You may think this is a little obvious, but I’ve seen it.

Is the power light on? This is another obvious question, but it is often missed.

Is the power light on the monitor on? The computer may be on, but if the monitor isn’t on it may appear that the computer isn’t running.

Is the computer plugged in? Power cords do get bumped.

When is the last time you used the computer? You never know, maybe it is a new system that has never worked, or perhaps it is an ancient computer that hasn’t been used in years, or maybe the user is using someone else’s computer.

It was years ago, but I once had a user call me with exactly the Scenario 1 complaint. It turned out that they didn’t know how to use a mouse. They were pressing the mouse button, not on anything in particular, and nothing was happening. From the user’s perspective the computer wasn’t running.

Scenario 2

You receive a call where the user states “I cannot log in” or “I cannot access the system”.

What are you trying to log in to or access? It could be the computer, the network, an application or even Facebook. I’ve encountered them all.

Are you receiving an error message? What is the error message? It may not even be a login issue. It could be the system is down or there is a network connectivity issue or there is an issue with their computer.

Have you ever logged in to this system/application? Perhaps they are a new employee. Perhaps they were told about this great application but were never actually granted access. For all you know the application may not be installed or properly configured.

Is anyone else able to log in to the system/application? If nobody can log in, then checking security group memberships for this user or resetting their password probably won’t work.

Ask the Computer

Computer systems do communicate and can often answer questions that the user of the system cannot. Error message boxes, warning dialogs, red blinking things, and log files are all examples of how the computer is trying to communicate. Don’t ignore them!

I am regularly asked to help troubleshoot issues by IT coworkers. They will approach me with an exhausted look and weary eyes and ask “Can you help look at this issue? I’m stuck and out of options.” My first response is almost always “Have you checked the log files?” and all too frequently the answer is “No, not yet”. Log files are the computer’s way of letting you know what happened. Learn to read these files and you’ll greatly improve your troubleshooting skills.
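As a rough illustration of reading a log file with a script, here is a minimal Python sketch that pulls out the warnings and errors and counts how often each message repeats. The log format ("date time LEVEL message") and the sample entries are assumptions made up for this example; real logs will need a different pattern.

```python
import re
from collections import Counter

# Assumed line format: "<date> <time> <LEVEL> <message>". Adjust for your own logs.
LINE_PATTERN = re.compile(r"^(\S+ \S+) (DEBUG|INFO|WARNING|ERROR|CRITICAL) (.*)$")


def find_problems(lines, levels=("WARNING", "ERROR", "CRITICAL")):
    """Return (timestamp, level, message) tuples for lines at the given levels."""
    problems = []
    for line in lines:
        match = LINE_PATTERN.match(line.strip())
        if match and match.group(2) in levels:
            problems.append(match.groups())
    return problems


def summarize(problems):
    """Count how often each distinct message occurs, most frequent first."""
    return Counter(message for _, _, message in problems).most_common()


# Made-up sample entries standing in for a real log file.
sample_log = """\
2024-01-15 09:00:01 INFO Service started
2024-01-15 09:00:05 WARNING Disk usage at 91%
2024-01-15 09:02:17 ERROR Cannot connect to database 'orders'
2024-01-15 09:02:47 ERROR Cannot connect to database 'orders'
2024-01-15 09:03:00 INFO Retry scheduled
""".splitlines()

for message, count in summarize(find_problems(sample_log)):
    print(f"{count}x {message}")
```

A message that repeats many times is often the best place to start looking.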

A note to developers: please write errors and warnings to log files! The savvy developer will build in a verbose logging mode so additional detail can be recorded for troubleshooting. Detailed log entries can greatly reduce troubleshooting time. Also, don’t just write the call stack to the log file. Stack dumps are only helpful to the developer, and since most IT professionals won’t have access to the source code, the call stack doesn’t help them. One last thing: create meaningful error messages. Messages like “A correlation error has occurred” or “An unhandled exception has occurred” or “Object reference not set to an instance of an object” are meaningless to anyone other than a developer.
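To make that advice concrete, here is a small Python sketch of a verbose logging switch and an error message that says what failed, why, and what to try next. The logger name "orders_app", the helper names, and the settings file path are all hypothetical; treat this as one possible shape, not a prescribed design.

```python
import logging


def configure_logging(verbose=False):
    """Return a logger that records warnings by default and debug detail when verbose."""
    logger = logging.getLogger("orders_app")  # hypothetical application name
    logger.setLevel(logging.DEBUG if verbose else logging.WARNING)
    if not logger.handlers:  # avoid attaching a second handler on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
        logger.addHandler(handler)
    return logger


def describe_failure(action, target, cause, suggestion):
    """Build a message that says what failed, where, why, and what to try next."""
    return f"Could not {action} {target}: {cause}. Suggested action: {suggestion}"


def load_settings(path, logger):
    """Read a settings file, logging a meaningful error instead of a bare stack dump."""
    logger.debug("Reading settings from %s", path)
    try:
        with open(path, encoding="utf-8") as handle:
            return handle.read()
    except FileNotFoundError:
        logger.error(describe_failure(
            "read", f"settings file '{path}'", "the file does not exist",
            "check the install path or restore the file from backup"))
        return None


# Hypothetical path, chosen so the error branch is exercised.
log = configure_logging(verbose=True)
load_settings("no_such_directory/settings.ini", log)
```

With verbose set to False, the same call would log only the error line and skip the "Reading settings" detail.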

Big Block Theory

Don’t assume anything. Be a skeptic. Ask questions, ask lots of them! I know that this concept can appear to contradict the Occam’s Razor approach, but once you have exhausted the obvious stuff, it is time to start breaking things apart. I use a concept that I personally developed years ago that I call the “Big Block Theory”. It really isn’t anything spectacular, just a simple process of isolating and eliminating working components.

The Big Block Theory is defined as “eliminate all working components of a system, whatever remains must be the cause of the issue”. Sound familiar? It should. I acquired the idea from the famous quote “when you have eliminated the impossible, whatever remains, however improbable, must be the truth”, which was written by Sir Arthur Conan Doyle in the Sherlock Holmes detective stories.

There are three basic steps to the Big Block Theory.

Identify the largest possible blocks

Isolate the working blocks

Repeat until the problem is identified

Let’s look at the individual steps one at a time. The first step is “identify the largest possible blocks”. The purpose of this step is to take a holistic view of a system or component. For example, let’s look at a standalone computer. What are the largest components that make up a computer? I would include “power”, “output devices (e.g. monitor)”, “input devices (e.g. keyboard)”, and “core (CPU, memory, drives, etc.)”.

Once we have identified the large blocks, the next step is to isolate the working components. Please note that I said to isolate the working components, not to identify the failing ones. I do this because we need to avoid jumping to a conclusion about one particular failing component. I’ve seen many cases where a complex issue is actually caused by multiple failing components. By isolating the working components, we will be left with one or more broken components. Continuing with our standalone computer example, it should be fairly easy to identify which of these components are working and isolate them.

The last step in the theory is to repeat the process with only the remaining broken components. Let’s say we identify “power” as the failing component in our simple example. Now we identify the blocks of the power component. I would identify them as “wall socket”, “power cord”, and “power circuitry”. Ask yourself, which of these is working? Isolate the working components again and repeat. Sounds simple, right? That is the point.

Let’s take a look at a larger example. A user calls with an issue with a web application. So, what are the largest blocks we can identify? For this example, they are “client”, “web server”, “database server”, “network”, and “user”. Yup, I included “user”. In many systems the “user” must be considered part of the system. Now we need to determine which of these are functioning correctly. We start the isolation by asking questions like “Is anyone else having this issue?” and “When is the last time you used the system?”. Continue the isolation of working components by checking if the web server is on and responding to web requests and if the database server is on and responding to queries. Continue asking questions and checking each large block until we are left with the failing one(s). Now we break the failing block(s) down into smaller blocks until the problem is found. What are the components of the “web server”? What are the blocks of a “database server”? Practice asking yourself these questions and the Big Block Theory will become natural.
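For readers who think in code, the three steps can be sketched as a small recursive routine. This is only an illustration in Python: the Block class and the find_failing function are names invented for this example, and the lambdas stand in for whatever real check (a question to the user, a ping, a test query) tells you a block is working.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Block:
    """One block of a system, with a check and optional smaller blocks inside it."""
    name: str
    is_working: Callable[[], bool]
    parts: List["Block"] = field(default_factory=list)


def find_failing(blocks, path=""):
    """Set aside the working blocks, then repeat inside whatever remains."""
    failing = []
    for block in blocks:
        if block.is_working():
            continue  # working block isolated; it is no longer a suspect
        full_name = f"{path}/{block.name}" if path else block.name
        if block.parts:
            inner = find_failing(block.parts, full_name)
            # If every smaller block checks out, report the block itself.
            failing.extend(inner or [full_name])
        else:
            failing.append(full_name)
    return failing


# The standalone computer example, with fixed answers standing in for real checks.
computer = [
    Block("power", lambda: False, [
        Block("wall socket", lambda: True),
        Block("power cord", lambda: False),
        Block("power circuitry", lambda: True),
    ]),
    Block("output devices", lambda: True),
    Block("input devices", lambda: True),
    Block("core", lambda: True),
]

print(find_failing(computer))  # ['power/power cord']
```

Because the routine collects every block that fails its check, it also copes with the case where more than one component is broken at the same time.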

Test Driven Troubleshooting (TDT)

Often I get asked one question regarding the Big Block Theory. The question is “how do we know if the individual components are working?” For those components with a user interface or for which there are existing tools this is usually easy, but what about those pesky service components? Components like DLLs, web APIs, and even databases are not as easy to directly test.

There is a development process known as TDD or Test Driven Development. It relies on unit tests that exercise source code in a repeatable manner. I’ve adapted this concept to aid in troubleshooting. The idea behind TDT, just like TDD, is to test the functionality of individual components. This is where some developer skills can really pay off. Having the ability to create some basic PowerShell scripts to test a web service or make interface calls to an API will make this possible. Being able to write some SQL statements to test a database can be a great troubleshooting tool. I have even written small .NET programs for the explicit purpose of testing a single API. In short, learn some basic programming or befriend a programmer.
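Here is a minimal sketch of what a TDT script might look like, written in Python rather than PowerShell. To keep it self-contained it tests an in-memory SQLite database with a made-up "customers" table; the run_checks helper and the check names are hypothetical, and against a real system you would swap in your own connection and queries.

```python
import sqlite3


def run_checks(checks):
    """Run each named check and return {name: (passed, detail)} without stopping early."""
    results = {}
    for name, check in checks.items():
        try:
            check()
            results[name] = (True, "ok")
        except Exception as exc:  # report a failed check instead of crashing the run
            results[name] = (False, f"{type(exc).__name__}: {exc}")
    return results


def make_test_database():
    """Stand-in for a real database connection (in-memory SQLite, made-up schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("INSERT INTO customers (name) VALUES ('Ada')")
    return conn


conn = make_test_database()


def check_connection():
    assert conn.execute("SELECT 1").fetchone() == (1,)


def check_customers_readable():
    count = conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
    assert count > 0, "customers table is empty"


def check_orders_readable():
    # The orders table is deliberately missing in this sketch, so this check fails.
    conn.execute("SELECT COUNT(*) FROM orders")


results = run_checks({
    "database answers": check_connection,
    "customers readable": check_customers_readable,
    "orders readable": check_orders_readable,
})

for name, (passed, detail) in results.items():
    print(f"{'PASS' if passed else 'FAIL'} {name}: {detail}")
```

The passing checks isolate the working pieces, and the failing check points at the block to break down next, which ties TDT back to the Big Block Theory.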

Google-fu

I have witnessed technical professionals with mad Internet search skills find some of the most obscure, difficult fixes. They can do this because their “Google-fu” is strong. There are many Internet search techniques that I strongly recommend you master. Google provides a search guide with many search tips that can greatly improve the odds of finding what you are looking for. Simple things like using quotes around phrases and using Boolean and wildcard operators can be extremely powerful. Google, Microsoft Bing, Yahoo, and most other search engines provide a set of operators that you can include in your search query to better explain to the search engine what you are looking for. Practice these search techniques and become an Internet search master.
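As a small illustration, the Python sketch below assembles a query from an exact phrase in quotes, extra keywords, excluded words, and a site restriction. The build_query helper is hypothetical, and the example values are made up; the quote, minus, and site: operators are common ones, but support varies by search engine, so check the engine's own search guide.

```python
from urllib.parse import quote_plus


def build_query(phrase=None, terms=(), exclude=(), site=None):
    """Combine an exact phrase, keywords, excluded words, and a site filter."""
    parts = []
    if phrase:
        parts.append(f'"{phrase}"')  # quotes ask for the exact phrase
    parts.extend(terms)
    parts.extend(f"-{word}" for word in exclude)  # minus drops results with the word
    if site:
        parts.append(f"site:{site}")  # restrict results to one site
    return " ".join(parts)


query = build_query(
    phrase="Object reference not set to an instance of an object",
    terms=["IIS"],
    exclude=["forum"],
    site="learn.microsoft.com",
)
print(query)
print(quote_plus(query))  # URL-encoded form, ready to paste into a search URL
```

Pasting the exact error message in quotes is often the single most effective of these techniques.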

Trace Your Steps

Keep track of everything. I find that this is important not only for you but also for anyone else asked to assist. Record the questions you’ve asked and what the answers are. Record all error messages, warning messages, log files and even screenshots. If you have help desk software, then record everything in that system. If not, then I find Microsoft OneNote to be a handy way to record your troubleshooting. If you don’t keep track, you’ll eventually find yourself repeating troubleshooting steps. I’ve been stuck in the endless troubleshooting loop. It can be maddening, so do yourself a favor and trace your steps. The best thing about keeping detailed records is you’ll have a much easier time fixing the issue the next time.
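If you have neither help desk software nor a note-taking tool at hand, even a tiny script can serve as a journal. The Python sketch below is one hypothetical way to do it: the TroubleshootingLog class is invented for this example, and it flags a step that has already been tried so you can notice the loop before you are stuck in it.

```python
from datetime import datetime


class TroubleshootingLog:
    """A minimal journal of troubleshooting steps and their results."""

    def __init__(self):
        self.entries = []

    def record(self, step, result):
        """Record a step and its result; return True if the step was already tried."""
        key = step.strip().lower()  # compare steps ignoring case and stray spaces
        repeated = any(entry["key"] == key for entry in self.entries)
        self.entries.append(
            {"time": datetime.now(), "key": key, "step": step, "result": result}
        )
        return repeated

    def report(self):
        """Return the journal as text, one line per step, oldest first."""
        return "\n".join(
            f"{e['time']:%Y-%m-%d %H:%M} | {e['step']} | {e['result']}"
            for e in self.entries
        )


# Made-up steps for illustration.
journal = TroubleshootingLog()
journal.record("Restart web service", "no change")
journal.record("Check web server log", "database connection errors found")
if journal.record("restart web service", "no change"):
    print("Already tried that step; time for a different approach.")
print(journal.report())
```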

Tools

There are so many tools and utilities available that it would be impossible to create a complete list. I’ve listed below a few tools that I utilize all the time.

SQL Profiler – You need the ALTER TRACE permission on the SQL Server to use this tool (members of the sysadmin server role have it). If you do have this access, it can be really helpful in finding issues with SQL Server.

PowerShell – This is a must-learn skill for both developers and administrators.

Pull Over and Ask For Directions

We all get lost and sometimes we need to seek help. I usually try to troubleshoot issues on my own first. In fact, I can be a bit stubborn and determined to resolve it on my own. I get a great sense of accomplishment when I fix things and solve problems. However, this personal obsession with resolving issues on my own doesn’t help the user. I have learned that it is more important to solve the issue in the least amount of time than it is to feed my ego. So, I now set a limit on how much time I’ll spend troubleshooting before I ask for assistance. My rule is one hour. Yup, that is it, no more than one hour. Once the time is up, I seek help. I usually start by asking a really smart co-worker who I know has immense knowledge in the particular area I’m troubleshooting. If that doesn’t bear fruit, then it is time to call the manufacturer’s support line. Asking for help isn’t admitting defeat or that you couldn’t resolve it. Instead, it is putting the spotlight on the issue in an effort to get it fixed in the least amount of time.

In Conclusion

Troubleshooting is an art and requires practice. Having a broad basic knowledge of your company’s infrastructure and systems will help. Gaining some development skills will help. Practicing the Big Block Theory will help. Learn to read log files. Become a search master. Learn to use common troubleshooting and testing tools. Be willing to seek assistance. All of this can be used to improve your troubleshooting skills and reduce the amount of time you spend fixing stuff. Remember that information technology is really a service to support business systems and functions. Keep your troubleshooting efforts geared toward supporting the business and you will find great success.