Monday, 14 June 2010

Extreme debugging

This is a tale of the hardest bug I have ever debugged.

I was working on a baseband module in the DSP level as an integrator. The system run on a chip with three cores, a receiver, a transmitter and application core which dealt with user and control planes.

We had a problem that sometimes one of the cores seemed to crash because it went unresponsive. Halting the core in the debugger it looked like it was still running, though. The place was different so it looked like it should work, it just wasn't replying to messages from other cores etc. Depending on the source package it could take anything from 30 seconds to hours before this occured and even within a package this was pretty random (but still some packages crashed faster than others).

After quite a lot of debugging and writing test code it became apparent that one core alone would not crash so this had something to do with several cores doing something simultaneously. One thing I for instance did was a trace system for this particular problem. The SW already had one but that was more about the logical way the SW worked rather than at a function and interrupt level. This only provided some help, problem was we didn't have any tools available to debug several cores at the same time. We did have an emulator that we could use to trace the SW a core at a time. I tried my own trace first because the log that the emulator created was absolutely massive as it traced every single assembler level operation the core did. And because the amount was so large and the USB connection so slow I could only collect the program counter (PC). This of course could be merged with the SW image and C code so one can walk through the code in the order it was run.

I spent hours and hours looking at traces and this was not helped by the fact that often it was some other core crashing than the one I had connected the emulator to. Problem was, the tails of the traces didn't usually seem to have anything in common. In other words the call stacks were very different.

But having watched how the code behaved and looking at the traces I still started to get a pattern. There were always one of two things happening some time (thousands or tens of thousands of cycles) before the crash. There was either an inter-core message or access to the HW semaphore block. It was not easy to spot because it was so much earlier so that's why it took literally weeks before this pattern started to appear in my head.

So I started suspecting that maybe there was an issue when one core was sending a message and another was accessing the semaphore block. I didn't have a way to test this because it was impossible to synchronize the cores to such extent that I could write some logger for this as the way the cores were started required messages from upper layer going to each core etc. Any timestamps would have been in an order of tens of thousands of cycles off and even when one core crashed the others would still run for quite some time (speaking from DSP's point of view).

In the end I took the baseband system to our customer who happened to have logic analyzers with high enough frequenct support and our schematics. The schematics unfortunately showed that there weren't too much debugging support from HW level. The chip could have supported all sorts of signals but they were not connected anywhere. Basically all I had that I could use was seven GPIO pins.

So working on my hunch I wrote some simple macros that turned a pin on or off and inserted these to strategic places in the code. The logic analyzer was then connected to the board to show the state of the pins. After several crashes it started to look like my hunch was right: there was always two operations from different cores very close to each other. So it was time to check the chip block diagram and there was an interesting thing there: all the blocks in the chip are connected to a common DMA crossbar (from all three cores). As it happens, both the block involved with the message passing code and the HW semaphore were behind the same bridge (which in turn was connected to the crossbar).

A colleague then wrote code that basically didn't do much else than to use those two blocks from two cores and upon running it one of the cores crashed immediately. Proof positive that we had found the source of the problem: the chip's DMA engine had an issue.

We sent this code to the chip vendor and only took a day before they responded that there was indeed a bug in the HW. It took another week before full analysis was delivered:

There were several HW blocks connected to the same bridge. Therefore the bridge had a queue when ever accesses were made. Each access was given a number in the queue and the block was to send a message to the bridge when that queue item was free again (=access request was handled by the block). However, two of the blocks had a bug (HW semaphore and another we didn't use) where, if this freeing operation wasn't acknowledged back by the bridge they wouldn't try to send it again. This could happen if two blocks tried to send it at exactly the same time. In that case the bridge would only know of one of them.

What made the bug so damn hard to debug was that this wasn't the point when things failed. Instead the core continued to run and use the queue (which was used as a circular buffer) until it came back to the item that was not freed. The bridge was programmed to assume that all resources were used up and stall the core until the item was freed. Of course because the block had already sent the command and didn't send it again this caused an eternal stall for that core. So depending on what all of the cores were doing the time difference between the offending command and the stall varied greatly.

Moral of the story? Debugging is labour but sometimes also instinct. Work only gets you so far.