Guide to hardware troubleshooting (Part 1)

When the Gantt charts are drawn up at the start of a project, perhaps the most difficult part for the hardware engineer to estimate is the debug phase of product development. It is also one of the most neglected parts of planning.

CAD tools have progressed over the years in terms of ease of use and integration with PCB and mechanical design. But ultimately, the design work is carried out by a person who is not only fallible, but may also be working with incomplete or incorrect data. Some bugs are inevitable on all but the simplest designs, and so the art of troubleshooting these bugs is all-important.

Bugs can range from something going BANG the first time power is applied to intermittent glitches reported in association with completely unrelated things like "it was raining" or "it only happens on his bench, not on mine". Consequently, the effort needed to fix a bug ranges from a five-minute job to months of work.

Debugging can be the most fun part of electronics design when it is going well. There is a great deal of satisfaction in finding and fixing the intractable bugs. But to succeed, it is important to take a systematic approach to fixing bugs.

This article lists the steps needed to bring such a systematic approach to troubleshooting hardware in product development. To illustrate these principles, I will refer back occasionally to work I performed years ago as a junior engineer on a system that monitored sixteen analogue audio inputs at the same time. It looked something like the figure below.

Figure: Monitoring sixteen analogue audio inputs at the same time.

The system consisted of a multi-channel ADC board, with the digital audio signal passing through an FPGA that multiplexed it onto a DSP bus. The DSP received an interrupt whenever data was ready, at which point it was to read and store that data. The FPGA logic was entirely asynchronous.

Occasionally, the DSP would stop receiving interrupts and the whole system would grind to a halt. This could happen days apart, or in a matter of just minutes. Software bugs had been eliminated as the cause, so this looked like a hardware bug and I was asked to investigate.

Step 1: Picture success

An important part of debugging is having the right mental attitude, as persistent problems can grind down your morale. In particular, it feels bad going to work two days in succession with the investigation stuck at exactly the same point. In such a case, ask yourself "Will I still be working on this bug in a year's time?" The answer: Of course not! This bug isn't forever, it's going to be fixed. It's not that there's no solution, it's that I simply haven't seen it yet.

Step 2: Keep notes

Resist the temptation to dive straight in and try to fix the bug immediately. It is important to determine first whether others have dealt successfully with a similar problem. Collect reports from multiple sources, even though they may sometimes contain conflicting data. A spreadsheet can work well here to organise what you find.
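As a rough illustration of the spreadsheet idea, the sketch below keeps each bug report as one row and writes the rows out as CSV, which any spreadsheet program can open. The field names, the helper functions and the sample entries are all assumptions made up for this example; adapt them to whatever your own reports actually contain.

```python
import csv
import io
from collections import Counter
from dataclasses import dataclass, asdict, fields


@dataclass
class BugReport:
    """One row of the notes spreadsheet (field names are illustrative assumptions)."""
    reporter: str     # who saw the failure
    unit: str         # which board or bench setup failed
    symptom: str      # what was observed, in the reporter's own words
    conditions: str   # anything noted at the time, however unrelated it seems


def write_reports(reports, stream):
    """Write the reports as CSV so they can be opened in any spreadsheet."""
    names = [f.name for f in fields(BugReport)]
    writer = csv.DictWriter(stream, fieldnames=names)
    writer.writeheader()
    for report in reports:
        writer.writerow(asdict(report))


def count_by(reports, field_name):
    """Count how often each value of one field appears across the reports."""
    return Counter(getattr(report, field_name) for report in reports)


# Invented sample entries, for illustration only.
reports = [
    BugReport("engineer A", "bench 1", "system halted", "ran for two days first"),
    BugReport("engineer B", "bench 2", "system halted", "failed within minutes"),
    BugReport("engineer A", "bench 1", "system halted", "it was raining"),
]

buffer = io.StringIO()  # swap for open("bug_notes.csv", "w", newline="") to save a file
write_reports(reports, buffer)
csv_text = buffer.getvalue()
unit_counts = count_by(reports, "unit")
```

Counting by a single column, as `count_by` does, is a quick way to see whether the reports cluster around one bench, one unit or one set of conditions, even when individual reports conflict.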

Step 3: Reproduce the problem

This is often the hardest and most time-consuming part. The frequency with which bugs show themselves varies enormously. So at this point, based on the information you have collected, you need to create the conditions under which you can make the bug happen at your command.
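One way to search for those conditions is to run the same test repeatedly under each candidate setup and compare how often the bug appears. The following is only a minimal sketch of that idea. Every name in it is hypothetical, and `make_fake_trial` is a software stand-in with an invented failure probability; on a real bench each trial would exercise the hardware and report whether the failure was seen.

```python
import random


def failure_rate(trial, runs):
    """Call `trial()` `runs` times and return the fraction of calls that failed.

    `trial` is any callable that returns True when the bug appeared.
    """
    failures = sum(1 for _ in range(runs) if trial())
    return failures / runs


def rank_conditions(trials, runs):
    """Return (name, failure rate) pairs, most failure-prone condition first."""
    rates = {name: failure_rate(trial, runs) for name, trial in trials.items()}
    return sorted(rates.items(), key=lambda item: item[1], reverse=True)


def make_fake_trial(probability, seed):
    """Stand-in for a real bench test: fails with an invented probability."""
    rng = random.Random(seed)
    return lambda: rng.random() < probability


# Hypothetical candidate conditions with made-up failure probabilities.
candidate_conditions = {
    "condition A": make_fake_trial(0.02, seed=1),
    "condition B": make_fake_trial(0.40, seed=2),
}

ranking = rank_conditions(candidate_conditions, runs=200)
for name, rate in ranking:
    print(f"{name}: bug seen in {rate:.0%} of runs")
```

The condition at the top of the ranking is the best candidate for making the bug appear on demand. For a bug that shows up only rarely, the number of runs has to be large enough for the difference between conditions to stand out from chance.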

At this point diagnosis can begin. The initial bug report may be "It stopped working", "It crashed", or something equally vague. Keep working until you have all the information you can get from the person who reported the problem, and enough to narrow down the range of possible causes.