Five debugging tips for solving software problems

Do you ever have trouble getting information out of your clients? Even when they could give you that information with just a little effort? Here's how to drill down to get to the bottom of a situation.


"It's broken," said the client.

"Broken?" I responded. "In what way?"

"It's not doing what it's supposed to do."

"Could you describe for me what it's supposed to do that it isn't?"

"You know -- it compiles fine, but when you run it, it just dies."

"Dies how?"

"It gets an error and quits."

"What's the error message?"

"I didn't write it down."

Naturally, the problem doesn't reproduce on my test system. So I have to keep the client involved at least enough to give me access to their system. The client, however, doesn't want to be involved. They just want it fixed. Without saying so, they're probably thinking "Doesn't this software quack ever test this stuff? We're paying him the big bucks just to have this blow up in our users' faces, and then paying more to have him fix it!" They're not feeling at all like helping me, but I really need some information if I'm going to help them, because when I try it on their system it doesn't reproduce there, either.

"I need to know the exact steps you followed leading up to the error message."

To them, it sounds like I'm evading responsibility, because they have no idea what steps preceded this disaster.

Most of my clients are software developers, and it puzzles me how frequently they plop huge haystacks of code in my lap and ask me to find the needle. I often try to teach them good problem-solving techniques -- you know, teach a man to fish -- but I'm amazed at the resistance I sometimes encounter. Oh well, more billable hours for Yours Truly.

Here are some general problem-solving techniques I use. Many of these apply to all sorts of problems, not just software bugs.

Make it smaller. Distill the problem down to the minimum amount of code required to reproduce it. Eliminate anything extraneous. Why go to all that trouble? One of the easiest ways to find the needle in the haystack is to get rid of most of the haystack. Sometimes, the problem goes away when you cut something out, which should give you an idea where it's hiding. Besides, if you get into an iterative debugging cycle, you'll be able to cycle much faster with less code and fewer steps.
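The "get rid of most of the haystack" step can even be automated. Here's a minimal sketch in the spirit of delta debugging -- the `shrink` helper and the toy failure predicate are invented for illustration; `fails` stands in for whatever check reproduces your bug:

```python
def shrink(items, fails):
    """Greedily drop chunks of the failing input while it still fails.
    A toy version of the delta-debugging idea; 'fails' is your repro check."""
    chunk = len(items) // 2
    while chunk >= 1:
        i = 0
        while i < len(items):
            candidate = items[:i] + items[i + chunk:]
            if fails(candidate):
                items = candidate        # smaller case still fails: keep it
            else:
                i += chunk               # this chunk mattered; move past it
        chunk //= 2
    return items

# Toy repro: the "bug" needs both "alpha" and "omega" present.
fails = lambda lines: "alpha" in lines and "omega" in lines
minimal = shrink(["alpha", "x", "y", "omega", "z"], fails)
print(minimal)   # ['alpha', 'omega']
```

Even run by hand rather than in a loop like this, the discipline is the same: keep cutting, keep checking that it still fails.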

Question your assumptions. I had a client call me the other day to report a problem in which static data was supposedly being modified by a return statement. Ah, I know what you're thinking, but there were no objects involved in this code, so no destructors were being called. To believe him would mean that there was a horrible bug in the runtime environment for the language he was using, which didn't seem at all likely to me. So I asked him how he examined the data before and after the return. The data was stored in a library module for which he did not have debug symbols, so he added calls in his code to a standard routine to query the data. But he didn't realize that this routine (in the way it was called) also had the unfortunate side-effect of modifying the data, creating an instant Heisenbug. He had been moving these calls around, trying to isolate the statement that changed the data, when the debugging statement itself was doing just as much damage.
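That Heisenbug can be sketched in a few lines of Python. The `Device` class and `get_last_error` name are invented for illustration; the point is a query routine that destroys the very data it reports:

```python
class Device:
    """Invented for illustration: a library whose status query has a
    destructive side effect, like the routine in the story."""
    def __init__(self):
        self._last_error = 42

    def get_last_error(self):
        # Reports the last error code... and clears it. Calling this
        # "just to look" modifies the very data under investigation.
        code, self._last_error = self._last_error, 0
        return code

dev = Device()
print(dev.get_last_error())   # 42 -- "let me just check the data"
print(dev.get_last_error())   # 0  -- the check destroyed the evidence
```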

Beware of false causation. How many times have you heard, "the only thing that changed is X, so the problem must be related to X." No, no, no, no, no, no, NO! More than half the time, it's something else that changed that they forgot about - or didn't even know about. When the "obvious" cause turns out to be a red herring, or even before, I always echo the famous words of Sgt. Schultz: "I know nothing... NOTHING!" OK, so it's intended more in the spirit of Socrates. But no cause should be assumed until proven. That doesn't mean that you shouldn't check out your hunches first, though. We have intuition for a reason.

Start at the result and work backwards. I often see programmers start a debug session, then step through routine after routine, examining variables along the way, hoping to stumble across the moment when things go wrong. Usually that doesn't work at all, because problems have a knack for resulting from seemingly innocuous beginnings. It may seem counter-productive because code doesn't execute backwards, but it's more efficient to start at the moment of the failure's epiphany (an error message, for instance) and examine what's wrong at that instant. Then go back to the code that led up to that point to see where it went wrong, backing up routine by routine until you find the culprit. Ideally, debuggers would be able to step backwards, but even if you have to restart your debug session a hundred times, you'll save time over trying to perceive the cause from the top.
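In a language with stack traces, "start at the result" often means starting at the innermost frame. A small Python sketch -- the `parse` and `load` functions are invented for illustration -- that begins at the point of failure and walks outward, examining what was wrong at each instant:

```python
import sys
import traceback

def parse(field):
    return int(field)              # blows up on bad input

def load(record):
    return [parse(f) for f in record.split(",")]

try:
    load("10,20,oops,40")
except ValueError:
    tb = sys.exc_info()[2]
    # Start at the failure and work backwards: the innermost frame
    # shows exactly what was wrong at that instant (field='oops'),
    # and each outer frame shows how we got there.
    for frame, lineno in reversed(list(traceback.walk_tb(tb))):
        print(f"{frame.f_code.co_name}:{lineno}  locals={frame.f_locals}")
```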

Refactor. Sometimes the complexity of the situation is part of the problem. Maybe the definition of what the "plugh" function does is not entirely consistent, and that's what's leading to a failure. By simplifying and clarifying the design, those inconsistencies often reveal themselves. But use this judiciously -- once you get started down that road, you might not be able to stop for a long, long time.

About Chip Camden

Chip Camden has been programming since 1978, and he's still not done. An independent consultant since 1991, Chip specializes in software development tools, languages, and migration to new technology. Besides writing for TechRepublic's IT Consultant blog, he also contributes to [Geeks Are Sexy] Technology News and his two personal blogs, Chip's Quips and Chip's Tips for Developers.

One thing that helps me on a really difficult bug is to flowchart the section of code where the bug is occurring. It doesn't have to be a detailed chart, but diagramming the process helps me to spot irregularities in the code.

EXPECT PROBLEMS. A fundamental principle is that both the customer and the supplier (the internal IT shop and the end-user population) should expect that problems WILL occur. When you DO expect problems to occur, you will set up environments that automatically capture significant problem-solving data for the end-user to convey to the problem-solving department. I have devoted a lifetime career to making this happen. My commercial message and MY website/business is: www.FirstFaultProblemResolution.com
Yes, if you ANTICIPATE that problems will occur, you can ensure that a major head-start is launched immediately. Often, believe it or not, boys and girls, ladies and gentlemen, you CAN solve a problem on its FIRST occurrence. IMHO.

One thing that I have found helpful is to look for a core file.
Even if I can't reproduce the error, if the app left a core, I have the exact state of the machine when it crashed.
[ assuming a fatal error and app crash on that ]
On my own apps, until a second release, I leave debug code in so that errors give useful information instead of generic messages that are useless.
The worst useless generic error message I saw was installing Kylix 3 on a Linux system running the 2.6 kernel: error code 10. Borland was no help at all. I had to struggle for a week to figure out that Kylix 3 would not work with any Linux kernel newer than the 2.4 series. What a waste of time and effort on Borland's part, to write an app that couldn't be used for more than a year or two before it wouldn't install at all.
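The leave-debug-code-in point has a cheap analogue in Python: the standard faulthandler module arms a crash-time stack dump, which is roughly what a core file buys you -- a minimal sketch:

```python
import faulthandler
import sys

# On a hard crash (SIGSEGV and friends), dump a stack trace to stderr
# instead of dying with nothing to go on -- roughly what a core file
# preserves, but cheap enough to ship enabled in early releases.
faulthandler.enable(file=sys.stderr)

print(faulthandler.is_enabled())   # True
```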

I hate it when the person on the other end keeps asking me for more information on how to reproduce the error that I am getting. The impression I get is that the person on the other end has not even tried to check whether there is a problem. Most of the time, I have found that if the person at the other end is candid enough to admit that he does not have the infrastructure for testing, or that he has tested and cannot reproduce the problem, it leads to a faster solution. If the person on the other end can reproduce the problem, then he can find the solution; if he cannot reproduce the problem, it might be something peculiar in my setup, and we stop looking for bugs in the software and start looking for environmental issues.
If the person on the other end just keeps asking me for more information and steps without giving me any feedback, it is bad form. Clear and honest communication with the client is very important in problem solving. I view this communication as the first step in debugging any issue. I have been on both sides of the table. Timely and clear communication is appreciated by all customers.

Debugging backwards seems a pain, and it seems a lot less effort if you're sure it's in routine X; however, backwards always works, and guessing is less efficient when you guess wrong.
What I like to do is bracket the error. It does make an assumption of basic linearity, but if you can bracket the execution from definitely good to definitely bad, you can home in on where things are going out of whack. It also stops you from spending hours only to find out that, say, some other program put what should be (or was assumed to be) an illegal value in the DB, or other input.
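Bracketing from definitely good to definitely bad, under that same linearity assumption, is just a binary search. A sketch -- the build numbers and the `is_bad` probe are hypothetical stand-ins for whatever you can actually test:

```python
def first_bad(steps, is_bad):
    """Binary-search a linear history for the first bad step, assuming
    everything before it is good and everything from it onward is bad."""
    lo, hi = 0, len(steps) - 1       # steps[lo] known good, steps[hi] known bad
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(steps[mid]):
            hi = mid                 # bad: the first failure is here or earlier
        else:
            lo = mid + 1             # good: it must be later
    return steps[lo]

# Hypothetical build history: build 37 introduced the bug.
builds = list(range(1, 101))
print(first_bad(builds, lambda b: b >= 37))   # 37
```

This is the same idea `git bisect` applies to commit history: each probe halves the suspect range.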
My number 5 (Refactor should always be the last thing you try) is: if what you currently know suggests rarely seen events, such as witchcraft or a micro black hole in orbit around a memory chip, you've probably gone wrong somewhere. Sort of a re-application of point two.
Don't be afraid to start again if you've confused yourself. It happens to us all.
Oh, and apply what you learnt from problems you've solved to problem avoidance.
Side effects, global variables, and misleading names are classics.

Dry runs have, on occasion, helped.
If it works, though, it's good.
We all have our ways of visualising what's going on; maybe the fuzziness and the demonstrable lack of one right answer is why academia seems to skip debugging/fault finding, as though nothing useful could be taught.

Flowcharting was the bane of my existence in the first computer science course I took, back in high school.
They'd hand me the assignment, and I'd write the code, but I could not draw up a flow chart at all.
One classmate could flowchart like a demon, but couldn't put it into code.
We got together to help each other out :)

Another piece of advice: let it die.
Remove or disable code that catches errors so you get the full advantage of your debug environment. Or at least catch a class of exception that gives you enough information, if there is one.
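The "let it die" advice looks like this in Python: a blanket except hides the evidence, while catching a specific exception class keeps the detail worth reporting. The `risky` function and the missing-file path are invented for illustration:

```python
def risky():
    # Invented failure for illustration: a missing config file.
    open("/no/such/config.ini")

# A blanket handler hides the evidence:
try:
    risky()
except Exception:
    msg = "something went wrong"          # generic and useless

# Catching a specific class keeps the detail worth reporting:
try:
    risky()
except FileNotFoundError as e:
    msg = f"config missing: {e.filename}"

print(msg)   # config missing: /no/such/config.ini
```

In a debug build you might remove even the specific handler, so the debugger stops right at the failing call with the full state intact.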

You'll get at least one request for confirmation of how to produce a fault, if it can't be reproduced. I agree you should explain why you are asking for it, though.
This isn't because I think you are a dumbass who forgot to mention something. :p
Of course, if it turns out that in your horror at the error dialog you did miss out step D, then we have to go round the whole cycle again.
Environmental issues range from the unbelievably trivial (did you say XP Home?) to complete b'stards.
There may be a tendency to hope it isn't one of these, so you'll just ask one more time. Maybe it only happens when he's drinking tea, the blinds are half drawn, and a passing duck honks.

... dealing with support people who just keep asking what seem to be stupid questions, or putting me through exercises that I know will lead nowhere. That usually occurs when I know more about the software than they do, and they're grasping at straws because they have no idea where to look.
On the other hand, I have sometimes been surprised by the result of going through the motions with them. As good as I am (and I'm damn good if I may say so), I'm not above making false assumptions myself, and these are often shaken out by taking a very methodical approach to a problem. So in spite of my impatience, I'll usually agree to rule out my stupidity first.

(another way to say what you said)
Thanks, Tony, I quite forgot to mention that technique. It sort of goes hand in hand with making the problem smaller.
Yes, when you find yourself ready to believe that quark mechanics or the phase of the moon really are significant to the behavior of a script, it's definitely time to question assumptions.
Another good point on misleading names. I don't know how many times I've tried to eyeball a problem and been taken in by two variable names that are similar but not identical. It's when you step over the statement that assigns the variable, yet it doesn't change, that you say "the language is broken!" and then you realize your mistake.
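The similar-but-not-identical trap is easy to reproduce. A toy Python sketch -- `record_count` and `record_counts` are invented names standing in for the near-twins that fooled the eyeball:

```python
record_count = 0                     # the name you meant

def tally(records):
    record_counts = len(records)     # similar but not identical -- a new local
    return record_counts             # the module-level record_count never moves

tally([1, 2, 3])
print(record_count)   # still 0 -- cue "the language is broken!"
```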


Hard-to-distinguish names are bad, but worse still are bad names.
An absolute classic, eventually pinned down the other day:
FindBySchedule(Schedule aschedule)
FindByScheduleName(String aschedulename)
There's no way you'd figure the first one sets up a predicate using aschedule.ScheduleName, is there? I mean, that would be outright lunacy, wouldn't it?
Test, build, test -- please, boss, just try it, please....
Needless to say, this was about 32 fathoms deep in the application, and I must have missed it more than once as well.
I'd gone all the way through the code; there is no bug, it's those little blue f'kers from Rigel playing jokes again.
Another thing to learn from debugging: it is going to happen, so plan for it....
Keep it simple isn't stupid.
Being too clever is.

More than a few times I've outwitted myself on revisiting old code in which I created a clever algorithm that wasn't obvious to the reader. I either eschew those now or comment them to death, er I mean, to clarity.