One of a tester's tasks is to isolate a bug. Narrowing the search space helps a developer find the cause and fix it.

By isolating the bug I mean both finding the class of inputs that generate the error output of the app and finding which component, configuration file or piece of code contains the cause of an error.

Some methods I have heard about, and sometimes combine (more or less intuitively), are:

Scientific method. I hypothesize about a possible cause and run a repeatable experiment to verify it. I sample inputs and compare the actual program output with the expected one. Even if I don't find the cause, I may find the smallest input for which the bug is reproducible.

Debugging. I inspect the state of the running program (stack, variables) using a debugger or by adding logging statements (e.g., printf). This is particularly useful when restarting the app would lose the state in which the bug occurred, or when the app runs on a remote machine that is not mine.

Reading and understanding the code base, logs, stack traces and configuration files. I want to understand why this code base produced this output.

Root cause analysis. I try to narrow down the problem to one of the possible root cause classes defined by the Software Testing Primer. I make sure the errors are neither in my test cases nor in the environment configuration. Then I check the code itself.

In the last case I also sometimes try to understand if the problem is caused by another bug, or by ambiguous/incomplete requirements/design.
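The "sample inputs and compare actual with expected output" idea from the scientific-method item above can be sketched roughly as follows. This is only an illustration: `buggy_abs` is a made-up function with a deliberately planted bug, and the built-in `abs` stands in for the expected behaviour.

```python
def buggy_abs(x: int) -> int:
    """Hypothetical function under test; it should behave like abs()."""
    # Deliberate bug for illustration: small negative numbers come back unchanged.
    return x if x > -10 else -x


def failing_inputs(actual, expected, samples):
    """Return the sampled inputs for which actual and expected output differ."""
    return [x for x in samples if actual(x) != expected(x)]


samples = range(-50, 51)  # a repeatable experiment: same samples on every run
failures = failing_inputs(buggy_abs, abs, samples)
smallest = min(failures, key=abs)  # the simplest input that still reproduces the bug

print("failing inputs:", failures)
print("smallest failing input:", smallest)
```

In this toy case the failures cluster in the range -9 to -1, which describes the class of failing inputs and gives -1 as the smallest reproducing input.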

What's your approach? Can you recommend some reading, videos, or exercises? And what are the right keywords to google for?

5 Answers

As well as "what other commenters have said", here are a few other items from my experience:

Locate similar bugs - does your bug occur in an area of the system that's known to be problematic? Does it look a lot like something that's already been logged?

Areas of change - does the bug occur in a part of the system that's under heavy development? Where I work, there will be sections of the system getting a lot of work: knowing these and checking with the developers working in that section can get you a first-pass cause.

Just fixed - is your bug in an area of the system that just had a fix applied? There's a high chance the correction caused your problem if this is the case.

Simplest reproduction - this isn't quite the same as adjusting the inputs: it's repeating the process to reproduce the bug, but leaving out steps that seem extraneous, to determine the shortest, simplest set of steps for a manual reproduction (this isn't always feasible). For example, where I work, we use transaction-based testing, where we'll make a sale and tender payment with various modifying functions called between selecting the item to purchase and making the payment. To isolate bugs, we'll remove modifications like adding discounts, until we've got the most granular process for reproduction possible. Many of the bugs we find are related to tax calculations under specific circumstances, so there are no trace logs to use: our sole information source is how the application behaves and whether the final amounts of a transaction match what they should be.
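When the reproduction can be automated, the "leave out steps that seem extraneous" approach can be sketched as a loop that drops one step at a time and keeps the shorter sequence whenever the bug still shows up. This is a minimal sketch, not how any particular team does it: the step names and the `reproduces` check are hypothetical stand-ins for running a real transaction and comparing the final amounts.

```python
from typing import Callable, List, Sequence


def minimize_steps(steps: Sequence[str],
                   reproduces: Callable[[List[str]], bool]) -> List[str]:
    """Drop steps one at a time for as long as the bug still reproduces."""
    current = list(steps)
    if not reproduces(current):
        raise ValueError("the full sequence does not reproduce the bug")
    changed = True
    while changed:
        changed = False
        for i in range(len(current)):
            candidate = current[:i] + current[i + 1:]
            if reproduces(candidate):
                current = candidate  # the removed step was not needed
                changed = True
                break
    return current


# Hypothetical check: pretend the bug needs exactly these three steps.
REQUIRED = {"select item", "apply tax exemption", "tender payment"}


def reproduces(steps: List[str]) -> bool:
    return REQUIRED <= set(steps)


full_run = ["select item", "add discount", "apply tax exemption",
            "add coupon", "tender payment"]
shortest = minimize_steps(full_run, reproduces)
print(shortest)
```

With the made-up check above, the discount and coupon steps are dropped and the three required steps remain. In a manual setting the same loop is carried out by hand, one rerun per removed step.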

In my experience there is no hard and fast set of methods to use to isolate bugs - after a while one gets familiar with the kinds of bugs that are common to a particular system in test, and methods that don't relate to that style of bug tend not to be considered.

To me those are very practical solutions. I would put the first three items under a general "umbrella" of risk areas. Hence I would also include parts of the app that we know were designed in a rush, have ambiguous requirements, or were implemented by a junior programmer, etc. These are areas defined by the same criteria that are used to determine what to test.
– dzieciou, Jun 4 '12 at 19:38

+1 for "after a while one gets familiar with the kinds of bugs that are common to a particular system". I do not have enough experience to confirm that, but I imagine that it would be more useful for the community to ask questions about particular bugs or classes of bugs, e.g.: I have the following defect in the following system, what can I do to isolate it?
– dzieciou, Jun 4 '12 at 22:01

dzieciou - that can work, so long as the system itself isn't too proprietary. I've been working with a business-to-business point-of-sale and admission tracking system for the last 7 years, so I'm intimately familiar with it, but I can't discuss it here in anything but broad generalities because it's both proprietary and not widely known outside the businesses that use it.
– Kate Paulk, Jun 6 '12 at 10:55

I both agree and disagree with this response. If you have an idea of where the problem is, then changing one thing at a time makes sense. But if you have hundreds of inputs and any one of them could be the problem, using a binary search and changing half of them at a time will reduce the time it takes to isolate the problem.
– xQbert, Jun 2 '12 at 16:51

@Qbert I agree that "change only one thing at a time" is not always the optimal approach for all debugging problems.
– user246, Jun 2 '12 at 17:33

@Qbert Correct, and sometimes changing half of the inputs can be less time-consuming than changing a single input. E.g. replacing a complex configuration file can be faster than getting the hardware replaced.
– dzieciou, Jun 2 '12 at 17:43
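The binary-search idea from the comments above can be sketched like this, under two assumptions that do not always hold in practice: there is a known-good and a known-bad configuration with the same keys, and exactly one differing setting causes the failure. The option names and the `fails` check are hypothetical.

```python
from typing import Callable, Dict


def bisect_culprit(good: Dict[str, str], bad: Dict[str, str],
                   fails: Callable[[Dict[str, str]], bool]) -> str:
    """Find the single setting that turns the good config into a failing one."""
    suspects = sorted(k for k in bad if good.get(k) != bad[k])
    if not suspects:
        raise ValueError("the two configurations do not differ")
    while len(suspects) > 1:
        mid = len(suspects) // 2
        first_half = suspects[:mid]
        # Start from the good config and apply only half of the suspect changes.
        trial = dict(good)
        trial.update({k: bad[k] for k in first_half})
        if fails(trial):
            suspects = first_half      # culprit is among the applied changes
        else:
            suspects = suspects[mid:]  # culprit is among the changes left out
    return suspects[0]


# Hypothetical example: eight options changed, only "opt5" triggers the bug.
good_cfg = {f"opt{i}": "default" for i in range(8)}
bad_cfg = {f"opt{i}": "custom" for i in range(8)}


def fails(cfg: Dict[str, str]) -> bool:
    return cfg["opt5"] == "custom"


culprit = bisect_culprit(good_cfg, bad_cfg, fails)
print(culprit)
```

With eight suspects this takes three trial runs instead of up to eight. If the failure depends on a combination of settings, plain bisection can point at the wrong one, so the result is worth confirming with a final single-change run.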

Sometimes a bug might make even more sense if you have more information, and developers tend to have such information. You may think that a bug is from area X in the business logic, but really it's in the Y code, based on database storage calls. At the very least, asking developers (or managers, or other testers, etc.) can sometimes help you avoid going in the wrong direction when isolating a bug.

But this story may not be applicable to your system, because your bug may be specific only to your system or technologies used. It may require a tester to be "intimately familiar" with the system and the technologies (as Kate Paulk commented).

Hence, what I started to try at my previous job was to ask developers and testers who are more experienced or more familiar with the system and its technologies:

How did you find this cause?

How did you isolate this bug?

What steps did you follow?

What tools/commands did you use in the process?

I tried to do this every time another developer isolated a non-trivial bug.