Effective Debugging

In the first chapter the author advises that effective debugging involves the following:
1. Work out why the software is behaving unexpectedly.
2. Fix the problem.
3. Avoid breaking anything else.
4. Maintain or improve the overall quality (readability, architecture, test coverage, performance, and so on) of the code.
5. Ensure that the same problem does not occur elsewhere and cannot occur again.

The author emphasizes that without first understanding the true root cause of the bug, we are outside the realms of software engineering and delving instead into voodoo programming or programming by coincidence.

He suggests that the empirical approach is the best way to debug, i.e. provide different inputs and observe how the system behaves.

The Core Debugging Process involves the following steps:
1. Reproduce: Find a way to reliably and conveniently reproduce the problem on demand.
2. Diagnose: Construct hypotheses, and test them by performing experiments until you are confident that you have identified the underlying cause of the bug.
3. Fix: Design and implement changes that fix the problem, avoid introducing regressions, and maintain or improve the overall quality of the software.
4. Reflect: Learn the lessons of the bug. Where did things go wrong? Are there any other examples of the same problem that will also need fixing? What can you do to ensure that the same problem doesn’t happen again?

Address one bug at a time: Picking up too many bugs at once prevents you from focusing on any one of them.

Check simple things first: Somebody may have encountered something similar and may already have a solution.

Reproduce

1. Reproduction of the error should be consistent and efficient; otherwise testing the fix becomes tedious.
2. Reproduce the error in a controlled environment to achieve consistency.
3. To keep it efficient, reduce the input that must be provided and the processing that must be done. Store the state at every step so that only the erroneous step needs to be rerun.
4. Automate the test conditions to make it quicker and easier to test the application after the fix. Replaying the log file can be a good strategy where a logging proxy was used to capture the error condition.

The following will help in reproducing the error:
1. Logging at appropriate places so that one knows what is happening in the system. Too much logging will be unacceptable in a production system.
2. Where possible, use a proxy to capture the network traffic and try to reproduce the error with this traffic.
3. If calls to libraries are problematic, or they need to be emulated in a test environment, write a shim (a proxy to a library), capture the inputs and outputs, and use these to reproduce the error. In engineering, a shim is a thin piece of material used to fill the space between objects. In computing we’ve borrowed the term to mean a small library that sits between a larger library and its client code. It can be used to convert one API to another or, as in the case we’re discussing here, to add a small amount of functionality without having to modify the main library itself.
4. Reach out to the users who are able to reproduce the error and get inputs from them. Give them specially instrumented code to help pin down the error.
5. If the problem seems to lie beyond the code that has been written, read the documentation on the system and the errors reported by others using the same platform.
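The shim idea can be sketched in a few lines. The example below is only an illustration under stated assumptions: `library_divide` is a hypothetical stand-in for a real library call, and `recording_shim` and `replay` are names invented here, not part of any library. The shim records the inputs and the outcome of every call, and the recording can later be replayed to reproduce the behaviour without the original caller.

```python
import functools


def recording_shim(func, log):
    """Wrap func so that every call's inputs and outcome are appended to log."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        entry = {"name": func.__name__, "args": list(args), "kwargs": kwargs}
        try:
            result = func(*args, **kwargs)
        except Exception as exc:
            # Record the failure, then let it propagate unchanged.
            entry["error"] = repr(exc)
            log.append(entry)
            raise
        entry["result"] = result
        log.append(entry)
        return result
    return wrapper


def replay(func, log):
    """Re-run the recorded calls and return the entries whose outcome differs."""
    mismatches = []
    for entry in log:
        try:
            outcome = {"result": func(*entry["args"], **entry["kwargs"])}
        except Exception as exc:
            outcome = {"error": repr(exc)}
        expected = {k: entry[k] for k in ("result", "error") if k in entry}
        if outcome != expected:
            mismatches.append(entry)
    return mismatches


# Hypothetical library function standing in for the real dependency.
def library_divide(a, b):
    return a / b


calls = []
divide = recording_shim(library_divide, calls)
divide(6, 3)
try:
    divide(1, 0)
except ZeroDivisionError:
    pass
```

After these two calls, `calls` holds one successful entry and one failing entry. In a real system the log would probably be written to a file so that it can be shipped back from the environment where the error occurs.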

Irreproducible Errors

Most bugs are reproducible. The few bugs that are irreproducible, or difficult to reproduce, are usually so for one of the following reasons:
1. Starting from an unpredictable initial state: C and C++ programs are prone to this error.
2. Interaction with external systems: This can happen if the other system is not running in lock-step with this software. If input from the external system arrives while the current system is in a different state each time, the error can be difficult to reproduce.
3. Deliberate randomness: Some systems, such as games, are deliberately random. These can be difficult to debug. But if the same seed is used for the pseudo-random number generator, the bug becomes easier to reproduce.
4. Multithreading: This happens because of the pre-emptive multitasking provided by the operating system. Since threads can be stalled and restarted at different times depending on the activity in the CPUs at that time, it becomes difficult to reproduce errors in such an environment. Try using sleep to simulate the stalling of one thread and the execution of another in order to emulate the error.
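As a small illustration of the seeding advice, here is a minimal sketch. `simulate_turns` is a hypothetical game loop invented for this example. It uses its own `random.Random` instance, so that running it twice with the same seed yields the same sequence of rolls.

```python
import random


def simulate_turns(seed, turns=5):
    """Hypothetical game loop: roll a die each turn using a seeded generator."""
    # A private generator, so unrelated code cannot disturb the sequence.
    rng = random.Random(seed)
    return [rng.randint(1, 6) for _ in range(turns)]


first_run = simulate_turns(seed=1234)
second_run = simulate_turns(seed=1234)
```

In practice one would log the seed at startup, so that a failing run can be repeated later with the same value.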

Good Practices of Reproducing

If a bug has taken a long time and its cause is still not identified, another bug may be masking it. Try to concentrate on a different bug in the same area, and possibly clear it, before retrying the difficult one.

Diagnosis

How to Diagnose?

1. Examine what you know about the software’s behaviour, and construct a hypothesis about what might cause it.
2. Design an experiment that will allow you to test its truth (or otherwise).
3. If the experiment disproves your hypothesis, come up with a new one, and start again.
4. If it supports your hypothesis, keep coming up with experiments until you have either disproved it or reached a high enough level of certainty to consider it proven.

Techniques of Diagnosing

1. Instrument the code to understand the flow better.
2. Use a binary search pattern and logging to locate the source of the error in the code, i.e. check for the error before and after the execution of a stretch of code. If the error occurs within that stretch, look for it in the first half; if it is not there, look in the second half. Then split the half that contains the error into two again, and repeat until the exact point of error is found.
3. Use a binary search pattern in version control to identify the version in which the error was introduced.
4. Use a binary search pattern on the data to identify the portion of the data that triggers the error.
5. Focus on the differences. If the application works for most customers but not for specific ones, check how those customers differ from the rest. Similarly, if it works in most environments but not in a particular one, figure out what is different in that environment. If the error happens for specific input files, figure out what is different in those files compared to the files that work.
6. Use debuggers when available.
7. Use interactive consoles where debuggers are not available or are not good.
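The binary search pattern can be sketched as follows. This is a minimal illustration, not the author's code: `first_bad`, the revision names and the `fails_test` check are all hypothetical. It assumes the candidates are ordered, that the last one shows the error, and that once the error appears it stays in every later candidate.

```python
def first_bad(candidates, is_bad):
    """Return the index of the first candidate for which is_bad is True.

    Assumes candidates are ordered, the last one is bad, and once a
    candidate is bad every later one is bad too.
    """
    low, high = 0, len(candidates) - 1   # invariant: candidates[high] is bad
    while low < high:
        mid = (low + high) // 2
        if is_bad(candidates[mid]):
            high = mid        # the first bad candidate is at mid or earlier
        else:
            low = mid + 1     # the first bad candidate is after mid
    return low


# Hypothetical revision history; the bug was introduced in "r6".
revisions = ["r1", "r2", "r3", "r4", "r5", "r6", "r7", "r8"]
checked = []


def fails_test(revision):
    """Stand-in for building a revision and running the failing test on it."""
    checked.append(revision)
    return int(revision[1:]) >= 6


culprit = revisions[first_bad(revisions, fails_test)]
```

The number of checks grows with the logarithm of the number of candidates, which is what makes the approach practical for long histories or large inputs. The same function works for a list of code checkpoints or slices of input data, as long as the ordering assumption holds. For version control, Git offers `git bisect`, which automates this search over commits.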

Good Practices of Diagnosing

1. When experimenting, make only one change at a time.
2. Ignore nothing. Do not shrug off the unexpected as an anomaly. It could be that our assumptions are wrong.
3. Maintain a record of experiments and results so that it is easy to trace back.
4. Anything that you don’t understand is potentially a bug.
5. Learn from others. Search the net for similar problems and the solutions offered.
6. All other things being equal, the simplest explanation is the best (Occam’s Razor).
7. Writing automated test cases helps because it lets us concentrate only on the broken cases.
8. Keep asking “Are you changing the right thing?” If the changes you’re making have no effect, you’re not changing what you think you are.
9. Validate and revalidate your assumptions.
10. Ensure that the underlying system on which the diagnosis is being done is static and not changing.
11. If you are stuck debugging a problem, one good way forward is to ask somebody else to take a look at it.

Fixing

Best Practices

1. Make sure you know how you’re going to test it before designing your fix.
2. Do not let fixes spoil the original clean design and structure of the code. Haphazardly assembled fixes can undermine the good design principles followed in the original design. Any fix should leave the code in better shape than it was before.
3. Clean up any ad hoc code changes before making the final fix so that no unwanted code gets checked in. Keep only what is absolutely necessary.
4. Use existing test cases. Modify the test cases if required, or write a failing test case and run it against the code without the fix. Then fix the code and run the failing test case again to see that it passes after the fix.
   1. Run the existing tests, and demonstrate that they pass.
   2. Add one or more new tests, or fix the existing tests, to demonstrate the bug (in other words, to fail).
   3. Fix the bug.
   4. Demonstrate that your fix works (the failing tests no longer fail).
   5. Demonstrate that you haven’t introduced any regressions (none of the tests that previously passed now fail).
5. Fix the root cause, not the symptom. E.g. if one encounters a NullPointerException, the solution is not to catch the NullPointerException and handle it, or even worse suppress it. It is necessary to figure out why the NullPointerException is occurring and fix that cause. Giving in to the temptation of a quick fix is not the right thing; making the right fix is.
6. Refactor, or change functionality, or fix a problem: one or the other, never more than one.
7. Always check in small changes. Do not check in large changes, as this makes it very difficult to find out which change actually caused a problem. Ensure check-in comments are as meaningful (and specific) as possible.
8. Diff and check what exactly is being checked in before actually checking in.
9. Get the code reviewed. This is very important as unnoticed errors
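The test-first workflow might look like the sketch below. Everything in it is hypothetical: `last_n` is an invented function whose earlier version was `items[-n:]`, which returns the whole list when `n` is 0. The new test demonstrates that bug, and the fix corrects the slice itself instead of special-casing the symptom at the call site.

```python
import unittest


def last_n(items, n):
    """Return the final n items of a list (hypothetical function under repair).

    The buggy version was `items[-n:]`, which returns the whole list when
    n is 0 because -0 is the same as 0.
    """
    start = max(len(items) - n, 0)   # clamp so that n larger than the list is safe
    return items[start:]


class LastNTest(unittest.TestCase):
    def test_typical_request(self):
        # Existing test: passed before the fix and must still pass.
        self.assertEqual(last_n([1, 2, 3], 2), [2, 3])

    def test_zero_items(self):
        # New test: written first, and it failed against the buggy version.
        self.assertEqual(last_n([1, 2, 3], 0), [])


def run_tests():
    """Run the suite and return the result object for inspection."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(LastNTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Keeping the new test in the suite afterwards means the same bug cannot return unnoticed.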

After Fixing – Reflect

Sometimes “The six stages of debugging” reads as follows:
1. That can’t happen.
2. That doesn’t happen on my machine.
3. That shouldn’t happen.
4. Why is that happening?
5. Oh, I see.
6. How did that ever work?

After fixing one needs to reflect on the following points:
• How did it ever work?
• When and why did the problem slip through the cracks?
• How to ensure that the problem never happens again?

Find out the root cause. A useful trick when performing root cause analysis is to ask “Why?” five times. For example:
• The software crashed. Why?
• The code didn’t handle network failure during data transmission. Why?
• There was no unit test to check for network failure. Why?
• The original developer wasn’t aware that he should create such a test. Why?
• None of our unit tests check for network failure. Why?
• We failed to take network failure into account in the original design.

After fixing do the following:
1. Take steps to ensure that it does not ever happen again. Educate yourself, educate others on the team.
2. Check if there are other similar errors.
3. Check if the documentation needs to be updated as a result of the fix.

Other aspects of handling and managing bugs

1. To better aid debugging, collect relevant environment and configuration information automatically.
2. Detect bugs early, and do so from day one.
3. Poor quality is contagious. This is the Broken Windows concept, a theory introduced in a 1982 article by social scientists James Q. Wilson and George L. Kelling. So do not leave bad code in place; fix it at the earliest opportunity.
4. Zero-bug software is impossible, so take a pragmatic approach and try to get as close to zero bugs as possible. Temper perfectionism with pragmatism.
5. Keep the design simple. Not only does a simple design make your software easier to understand and less likely to contain bugs in the first place, it also makes it easier to control, which is particularly useful when trying to reproduce problems in concurrent software.
6. Automate your entire build process, from start to finish.
7. Version management of code is absolutely mandatory.
8. Different source should mean a different version number, even if the change to the code is minuscule.
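Collecting environment information automatically can be as simple as the sketch below. The function name `environment_snapshot` and the choice of fields are assumptions made for illustration; a real application would add its own version number and configuration values.

```python
import json
import os
import platform
import sys


def environment_snapshot(env_keys=("PATH", "LANG")):
    """Collect basic details to attach to a bug report or to log at startup."""
    return {
        "python_version": platform.python_version(),
        "implementation": platform.python_implementation(),
        "os": platform.system(),
        "os_release": platform.release(),
        "machine": platform.machine(),
        "executable": sys.executable,
        "argv": sys.argv,
        # Only the named variables are captured, to avoid leaking secrets.
        "env": {key: os.environ.get(key) for key in env_keys},
    }


snapshot = environment_snapshot()
report = json.dumps(snapshot, indent=2, sort_keys=True)
```

Writing `report` to the log at startup, or attaching it to every crash report, saves a round of questions to the user about what they were running.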