The 2 Faces of Debug – by Neil Johnson

May 16, 2018

Recently, I was part of a discussion about the different types of bugs that can pop up for verification teams. We were talking about the differences in bugs found in subsystems and those found by teams verifying large integrated subsystems and entire devices. One fellow on the call made the comment that bugs found in integration tend to be easier to deal with than those found in the individual subsystems simply because blocks are verified extensively before they’re integrated.

I agreed.

But then I thought back to the integration bugs I’ve seen in the past and the havoc they inspired. Even simple integration bugs like unconnected wires or disabled features can become a nightmare. It got me thinking about how we characterize debug, how we deal with bugs themselves, and how subsystem debug presents problems that are more technical while integration presents problems that are more organizational. I wonder if acknowledging the differences in how we react to bugs in both situations could improve how we handle them. That’s what we’ll tackle here: characterizations of the debug effort required in individual subsystems and the debug effort required in integrated subsystems. We’ll call them technical debug and organizational debug.

Of the two types of debug, technical debug is the one that developers identify with more easily because it directly involves code and classic engineering problem solving. Technical debug is prevalent during active development and is usually initiated by functional tests. Either a test fails or the person running the test notices invalid or missing behaviour. Technical debug ensues: a period of diagnosis followed by formulating a patch, applying it and validating it.

A primary feature of technical debug is scope: not necessarily in terms of the size or complexity of a bug but with respect to the number of people required to resolve it. Technical debug happens in situations where it is relatively easy to identify the person best equipped to resolve a bug. Quickly identifying the right person reduces debug to almost a purely technical effort for one or two people. Involving very few people requires little coordination (or no coordination at all if you’re patching a bug you created yourself!). If you diagnose a bug and patch it yourself, or you can contact a specific person and between the two of you the bug is diagnosed and patched, this is technical debug. More than a couple of people, maybe three at the most, and you move to organizational debug.
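The rule of thumb above can be written down as a tiny classifier. This is only a sketch of the idea, not a tool anyone needs: the names `DebugKind`, `characterize_debug` and `MAX_PEOPLE_FOR_TECHNICAL` are made up for illustration, and the cutoff of three is an assumption taken from the "maybe three at the most" remark.

```python
from enum import Enum


class DebugKind(Enum):
    TECHNICAL = "technical"
    ORGANIZATIONAL = "organizational"


# Assumption: "a couple of people, maybe three at the most" is read as a
# hard cutoff of three. A real team would pick its own number.
MAX_PEOPLE_FOR_TECHNICAL = 3


def characterize_debug(people_involved: int, owner_known: bool) -> DebugKind:
    """Classify a debug effort by head count and by whether the person
    best equipped to resolve the bug has been identified."""
    if people_involved < 1:
        raise ValueError("at least one person must be involved")
    if owner_known and people_involved <= MAX_PEOPLE_FOR_TECHNICAL:
        return DebugKind.TECHNICAL
    # Either too many people are involved, or nobody knows who should own it.
    return DebugKind.ORGANIZATIONAL
```

For example, `characterize_debug(1, owner_known=True)` describes patching a bug you created yourself, while `characterize_debug(2, owner_known=False)` is already organizational debug: few people are involved, but nobody yet knows who should be.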

Organizational debug is technical debug but with additional coordination. Coordination can be the extra dimension that turns resolution into a lengthy and convoluted process.

Organizational debug still starts with an observation of invalid or missing behaviour, but the person that initially observes that behaviour does not have the ability to quickly diagnose the issue nor is it immediately obvious to them who is best equipped to handle the issue. Those unknowns lead naturally to one of three options:

Guess at who is best equipped and reach out to them directly;

Send a mass email including many people on the development team, usually including various managers, describing the bug with a request for input; or

File a bug report with a description of the bug that is then forwarded to several people on the development team, usually including various managers.

Best case scenario is a good guess that finds the right person to resolve the bug. Second best is that someone receiving the email or bug report recognizes and clearly assumes responsibility for the issue, then sees it to resolution. A distant third is that the mass email or bug report spawns multiple independent and diverging conversations that consume effort from many people on the team. It’s typical those threads grow to include additional members of the team, often blowing up to include additional managers, team leads and meetings.

The epicenter of organizational debug is typically the verification engineers responsible for deciphering test results from large integrated subsystems composed of unfamiliar design and testbench code inherited from other teams. Organizational debug can be incredibly damaging from a time-to-resolution standpoint and, because it consumes time from so many people, from a productivity standpoint as well. Side effects, both positive and negative, can extend to team cohesion and bureaucracy. Cohesion improves with rapid acknowledgement of responsibility and knowledge sharing; it deteriorates with avoidance and poor communication. Additional bureaucracy can be positive when it is geared specifically toward effective diagnosis and communication; it can be negative when issue tracking is prioritized over issue resolution (unfortunately, this is a common reaction to organizational debug).

Reducing the impact of technical debug has been actively addressed for many years by tool vendors and teams. Somewhat disappointingly, solutions tend to be reactive, aimed at patching poor quality, instead of proactive, aimed at improving the initial quality of design and testbench code. But solutions exist nonetheless, and they are improving.

The key to reducing the impact of organizational debug is improved diagnosis and communication, reducing the number of people required for resolution, and quickly identifying the people best equipped for resolution. Basically, it means turning organizational debug situations into technical debug situations as quickly as possible.

Debug is a major productivity killer for hardware development teams. The next time you encounter a bug, try characterizing your response as either technical debug or organizational debug for a faster and more effective resolution.