Tuesday, September 1, 2009

"One" Difference

When we're reproducing an issue, or when we're debugging a problem, we generally try to isolate the variables and then eliminate them one by one until we've found the subset of things that matter. What remains is the simplest possible way to reproduce the problem.

In the end, you get down to wanting to change one thing at a time. Let's say, for example, that you find a problem on a directory that is compressed, encrypted, being written to by 4 NFS clients and 2 CIFS clients, and happens to be named "my volume". Coming up with a list of potentially relevant variables seems easy:

directory name

number of clients

type of clients

compression

encryption

Then we just start trying and eliminating variables. Try it again with compression disabled. If the problem still reproduces, then you know that variable is not relevant (hooray!). Try it again with encryption disabled. Lather, rinse, repeat until you've got it.
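That elimination loop can be sketched in code. This is only a sketch, not a real tool: reproduces() is a hypothetical stand-in for "run the test with this configuration and see whether the bug shows up", and the variable names and "off" values are just the ones from the example above.

```python
# Baseline: the configuration where the bug was seen (from the example).
BASELINE = {
    "compression": True,
    "encryption": True,
    "nfs_clients": 4,
    "cifs_clients": 2,
    "directory_name": "my volume",
}

# A "neutral" or "off" value for each variable -- assumptions for illustration.
NEUTRAL = {
    "compression": False,
    "encryption": False,
    "nfs_clients": 0,
    "cifs_clients": 0,
    "directory_name": "plain",
}

def minimize(reproduces, config, neutral):
    """Flip one variable at a time; keep only the ones that matter."""
    relevant = dict(config)
    for name in config:
        trial = dict(relevant)
        trial[name] = neutral[name]          # change just this one variable
        if reproduces(trial):                # bug still happens?
            relevant[name] = neutral[name]   # then this variable is irrelevant
    # Whatever still differs from neutral is (apparently) what matters.
    return {k: v for k, v in relevant.items() if v != neutral[k]}
```

If the bug is really an interaction between, say, encryption and having at least one NFS client, this loop strips away everything else and leaves just those two variables.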

Here's the problem:

You haven't got anywhere near all the variables.

The problem also happened on a Tuesday. It happened on a system containing four servers that was 35% full. It happened at 8pm and a cleanup process was running on the servers. And we haven't really considered other processes on the same box, network traffic (in a multi-system configuration in particular), etc.

Ooooh...

There are a lot of variables in a system of any size. Fortunately, most of them don't matter most of the time. That bug where we can't write to directories containing underscores really doesn't care about the day of the week, or the hardware configuration, or whether the directory is compressed, or anything else.

There are three lessons we can take from this:

Lesson 1: You're never going to be able to change just one variable.

In a system of any real size, more than one thing is going to change between runs. It's just that you'll change one intentionally and others unintentionally.

Lesson 2: Most of the time that's okay.

We all have tales of some doozy of a bug that only occurred on alternate Tuesdays while standing on our heads and clicking with the left ring finger. Those are usually pretty rare. Most of the time, something that fails is failing because of the interaction between a couple of things, or just one broken thing. (Hence: pairwise testing.)
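To see why pairwise testing is attractive, compare how many runs exhaustive testing needs against how many value *pairs* actually have to be covered. A rough sketch, with made-up variables and values:

```python
from itertools import combinations, product

# Hypothetical variables and their possible values, for illustration only.
variables = {
    "compression": [True, False],
    "encryption": [True, False],
    "client_type": ["NFS", "CIFS"],
    "day": ["Mon", "Tue", "Wed"],
}

# Exhaustive testing: one run per combination of all values (2*2*2*3 = 24 runs).
exhaustive = list(product(*variables.values()))

# Pairwise testing only requires that every pair of values appear together
# in *some* run; here we just enumerate the value pairs to be covered.
pairs_to_cover = [
    (a, b)
    for (va, vb) in combinations(variables, 2)
    for a, b in product(variables[va], variables[vb])
]
```

Pairwise-generation tools (PICT is a well-known one) then build a small set of runs that covers every one of those pairs; because failures usually involve one or two variables, that small set catches most bugs at a fraction of the exhaustive cost.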

Here, the best trick I know is to divide my variables into "proximate" and "background", with "proximate" being the ones I believe are more likely to be relevant. You can judge likely relevance by gut feel if you've been working with a system long enough: base it on past bugs, the system architecture, and other things you've been testing on this build. Then manipulate the "proximate" variables and don't worry about the background variables for the moment.

Then, just do each test twice. Think you've reproduced it? Try again, preferably on a different system. If a background variable is relevant (and you haven't picked up on that) it's likely to have changed between your two so-called identical tests. Inconsistent behavior means that you've missed something and need to go digging deeper.

Lesson 3: Think first, then change.

I suspect some people are getting sick of hearing this from me! But I'll say it again since I think I need reminding myself:

Slow down. Think about what you've seen. Then make a deliberate change and proceed.

So what do I do when I'm trying to narrow down a bug?

See a potential bug

Try it again until it's somewhat reproducible

Compare times when it did happen to times when it didn't happen and come up with a list of differences. These are my "proximate" variables.

Retest, changing one of these at a time and doing each test twice (on two different systems if at all possible).

Repeat for each of my proximate variables until I can make it happen every time.
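The narrowing steps above can be sketched as a loop. This is only a sketch under assumptions: run_test() is a hypothetical stand-in for actually exercising the system, and "on two different systems" is reduced here to simply running each configuration twice.

```python
def narrow(run_test, baseline, proximate):
    """Flip one proximate variable at a time, running each test twice.

    `baseline` is the configuration where the bug was seen; `proximate`
    maps each suspect variable to the alternate value to try.
    Returns the variables whose change altered the outcome.
    """
    relevant = []
    for name, alt_value in proximate.items():
        config = dict(baseline, **{name: alt_value})
        first, second = run_test(config), run_test(config)
        if first != second:
            # A background variable changed between "identical" runs:
            # we've missed something and need to dig deeper.
            raise RuntimeError(f"inconsistent results while varying {name!r}")
        if first != run_test(baseline):
            relevant.append(name)
    return relevant
```

The double run is the cheap insurance from Lesson 2: if the two "identical" tests disagree, a background variable is in play and the narrowing can't be trusted yet.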