Marc Hedlund: Debugging Hacks, What They Never Taught You About Solving Hard Bugs

There's no doubt that debugging is a critical skill for anyone who
codes. Marc Hedlund is talking about how to tackle the really
difficult ones. I enjoyed Marc's
tutorial from last year, and picked this one on that basis.

Most bugs aren't hard. 95% of the time, you can find a fix easily
and move on. Marc's tutorial is about what to do when the simple
methods don't work anymore. He gives an example of a login that
would fail once every 10,000 times or so. Turns out the problem was
a filter that would throw out URLs with swear words in them.
Finding bugs like that can be hard.

Wrote a test case that exercises the bug and discovered Rails was a
factor.

Used source code and a debugger to gather data.

Noticed a coincidence

Reproduced failure in his test case.
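As an illustration only (the actual code isn't in these notes), a bug with that shape -- a profanity filter occasionally rejecting randomly generated login URLs -- might be reproduced like this; every name below is hypothetical:

```python
import random
import string

BANNED_WORDS = {"damn", "hell"}  # hypothetical filter list


def contains_banned_word(url):
    """The filter that throws out URLs containing swear words."""
    return any(word in url.lower() for word in BANNED_WORDS)


def make_login_url():
    """Builds a login URL with a random session token -- which, once in a
    great while, happens to spell a banned word by pure chance."""
    token = "".join(random.choices(string.ascii_lowercase, k=12))
    return f"https://example.com/login?session={token}"


if __name__ == "__main__":
    random.seed(0)  # make the "rare" failure repeatable
    failures = sum(contains_banned_word(make_login_url())
                   for _ in range(200_000))
    print(f"{failures} failures in 200,000 logins")
```

Brute-forcing the failure like this is what turns "once every 10,000 logins" into something you can study on demand.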

Here are some common mistakes:

"That doesn't look right, but it's probably fine." If you think
there's a bug, there's a bug. Pay attention to small hints. If you
can't find anything, file a bug report.

"It seems to have gone away." If you didn't fix the bug, it's still
there. If you don't understand what the problem is, it will bite you
later.

"I bet I know what this is." Wait o form theories until you have
data. Let the data lead you. He quotes Sherlock Holmes: "It is a
capital mistake to theorize before one has data. Insensibly one
begins to twist facts to suit theories, instead of theories to suit
facts."

"That's impossible." Impossible conditions are often the source of
bugs. Set up logging, exceptions, and assertions. Make sure you get
the report. Make sure you see the failure when it occurs. Asserting
the obvious is a good tool. When your Web site produces an exception,
send it to the whole engineering team.
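That advice can be sketched in miniature -- the function names and the discount example are mine, not Marc's:

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("billing")


def apply_discount(price, discount):
    # Assert the "impossible" conditions so they can't slip by silently.
    assert price >= 0, f"negative price: {price}"
    assert 0 <= discount <= 1, f"discount out of range: {discount}"
    total = price * (1 - discount)
    # Log the obvious values too -- they're what you'll grep for later.
    log.info("price=%s discount=%s total=%s", price, discount, total)
    return total


if __name__ == "__main__":
    try:
        apply_discount(100.0, 1.5)  # an "impossible" input
    except AssertionError as exc:
        # Per the talk: make sure the whole team actually sees these.
        log.error("impossible condition hit: %s", exc)
```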

"Beats me...probably a race condition." Not all hard problems are
race conditions. Usually this means "I don't know." This is forming
a theory without data.

"I'm just going to fix this other bug quickly." Don't make any
changes until you understand the bug. File and log any bugs you find
along the way--but don't fix them. Otherwise you end up suppressing
the first error and missing it.

"I think I found a bug in the OS." In all likelihood, the problem
won't be in the libraries or in the operating system. That can
happen, but you'd better have pretty good evidence.

"That not usually the problem." Beware of representativeness
errors. Sometimes 40-year olds have heart attacks. If the data
leads that way, then follow it.

"Oh, I saw this just last week." This is known as an availability
error. Third in a week could be an epidemic--or not.

"This guys too smart to make that mistake." Beware of sympathy
errors. Even engineers put CDs in upside down. Check the data no
matter the source. The opposite is also true: assuming someone's
stupid.

"I found a bug and fixed it-done." Finding a bug is
different than finding the bug.

"I haven't made any progress--it's still broken." Think of the bug
report is a collection of information. Adding data, eliminating
theories, and recording changes leads to understanding. Clearing
bugs is the end goal, but progress can be represented by other
things.

"I've got to get this out now--not time for..." Rushed fixes tend to
introduce more bugs. Stick to a good process even if the situation
is urgent. Keep suppressing a bug distinct from actually closing it.

Here's Marc's general approach to fixing bugs.

Revert any changes you made looking for a quick fix - Bring the
system to its initial state. People usually try something quick.
Getting back to the original condition as quickly as possible is
important.

Collect data from each of the components involved - Maintain a
page with the most concise problem descriptions. State everything
you know for a fact. List the questions for which you need answers.
Don't delete data; instead move it to a "probably unrelated"
section.

Reproduce the bug and automate it - You must have access to the
reporter's environment. Use virtualization and the browser version
archives where needed.
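In practice, "automate it" means turning the reporter's exact failing input into a test you can rerun on demand. A minimal sketch -- the parser and the bug number are invented:

```python
import unittest


def parse_quantity(text):
    """Hypothetical function under test: accepts comma-grouped numbers."""
    return int(text.replace(",", ""))


class ReproduceBug(unittest.TestCase):
    """Automates the failing input copied verbatim from the bug report."""

    def test_reporters_input(self):
        # Copied from the report, not retyped from memory.
        self.assertEqual(parse_quantity("1,000"), 1000)


if __name__ == "__main__":
    unittest.main()
```

Once the reproduction is a test, it can run after every candidate fix instead of being a manual recipe.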

Simplify the bug conditions as much as possible - Can you
reproduce the bug in other circumstances? Can you remove a condition
and still see it? Are there any contradictions in the conditions?
"We only see this on OSX with IE." Can you separate the problem?
Could it be an error in the data?

Look for connections and coincidences in the data - Build a set
of "that looks weird" observations. Describe all the actors and
their roles. Parallel timelines can help. Look at data from client
and server viewpoints.

Brainstorm theories and test them - State each theory
separately. Does the theory cover all of the data in the report?
Does it explain why the conditions are necessary? Does it cover all
the related reports?

When you find a fix, verify it against the report - Go back and
re-read the whole bug report. Run all of your reproduction test cases.

Check that you haven't created new bugs - It's very common for one
fix to create new bugs. Automated test suites help enormously at this
point. If X was failing under condition Y but not Z and it now
passes under Y, does it still pass under Z? Often the answer is
"no."

These steps almost always work. You might have to go through it
several times. You might need several people to make it work. You
might decide it's too costly. Even so, if you go all the way through
this process, you will get a fix.

I missed 45 minutes after the break because of a conference call I
had to join. So, there's a gap here in what Marc said and what I
heard.

The best predictor of new bugs is change rate. Code that is changing
a lot will have a lot of bugs. Direct QA efforts by counting changes
per file. Spend time testing the stuff that changed.
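Counting changes per file is easy to script. One way, in plain Python over a list of commits -- the commit data here is made up; in real life you'd feed it from your version-control log:

```python
from collections import Counter

# Each inner list = the files touched by one commit (invented data).
commits = [
    ["app/login.py", "app/filter.py"],
    ["app/filter.py"],
    ["app/filter.py", "lib/util.py"],
]

changes_per_file = Counter(f for commit in commits for f in commit)

# Highest-churn files first -- where QA attention should go.
for path, count in changes_per_file.most_common():
    print(count, path)
```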

The best estimator of code quality is the rate at which you find bugs. When
the find rate goes down, you're ready to ship. You should ignore
every other QA measure.

You can do four things with each bug:

1. Fix it

2. Suppress it

3. Record it and wait for more info

4. Ignore it

You probably can't always afford (1). Of the rest, (3) is the best
option.

There's a culture surrounding bugs. Don't scold people for bugs.
Everyone creates bugs. If bugs cause punishment, reports will be
killed and there will be severe tension with QA. If there's a
chronic problem with bugs from one person, deal with it in person.

Reopen rates measure how development deals with bugs. Lots of
reopens is a red flag for process--especially within one release.
Reopens indicate that bugs are being hidden rather than closed.

Marc has some book recommendations for people who want to understand
debugging better: