Bug of the Year - 2011

Dec 24, 2011

My favourite bug of all time was uncovered because I was just too impatient.

I was working on a fairly large and complex embedded project. A microcontroller
was interfacing with some custom hardware. It would configure, control, and
monitor the hardware. The microprocessor was running an RTOS with a lightweight
TCP/IP stack, so remote clients could connect to the device via Ethernet and
perform remote configuration. We also used this for diagnostics and testing
during development. We had successfully used a similar design on several
products, with great results. We had also designed a protocol for monitoring
and control that was used in earlier models, and we extended it here - though
this product was significantly larger and more complex than its predecessors.
What could possibly go wrong?

The first thing the microprocessor does when you connect is to dump the status
of the entire system. With smaller devices, this was very fast. But this new
model was so large and had many more components and subsystems, it was taking a
very long time every time you connected. This device can also serve many
concurrent clients, so this slow connection startup would be repeated for every
client. The processing and time cost was not trivial, and this delay would
limit how quickly clients could start updating their own settings.

Slow and inefficient code bothers me. Eventually I got sick of waiting a good
10-20 seconds or so every time I connected before I could start working with the
hardware to test my latest changes, so I decided to track down the performance
problem. As a baseline, I measured the system performance before changing
anything, and then got to work. It had to be something simple, after all - we’ve
had this same code working on other products for many months in the field.

I figured a few hours would probably be enough to solve it. The debugging
process was made somewhat more difficult by the fact that there was no keyboard
or monitor on this hardware; all debugging was either via JTAG and a serial USB
connection, or by the Ethernet port - which was part of the problem.

The actual time to solve the mystery would be more like a week, with several
long, late, pizza-fuelled nights, drilling through many layers and thousands of
lines of code to find the very surprising root cause.

First layer: application protocol

The most obvious point to start with was the application level buffering. A
protocol layer abstracted away the TCP/IP networking (which also allowed for
serial interfaces) and had its own read and write buffers. I guessed it might
help to increase the buffer size a little, to have fewer networking calls. A
little increase didn’t help much overall, nor did a big increase. Ok, so it’s
not the protocol buffering.

The status dump on connection has to traverse many data structures, build
strings and create structured text for output. I reviewed the algorithms,
memory management and improved the code by reducing reallocations and copies,
tidied things up and used pre-allocated buffers more efficiently.

I measured the performance improvement and noticed a modest difference. All that
work helped - but it was still not enough. It was just too slow for my liking.
According to my calculations, the system should be capable of much higher
throughput. I decided to keep digging.

Second layer: networking and OS

Underneath the protocol code was a high level interface for networking. Perhaps
this wasn’t as efficient as it could be? After poring over the code, analysing
potential performance issues, I found a few small issues, but no smoking gun.

Now this RTOS has many tasks (threads) running, only one of which is the
networking support. Could this networking task be getting starved for processing
time? The interrupt handling latency should be guaranteed and well within the
required time. I disabled all non-essential tasks, tried increasing the
priority of the networking task, and various other tweaks. None had any impact.
The RTOS kernel was perfectly happy and running smoothly.

Keep digging…

Third layer: TCP/IP tuning

The TCP/IP stack we were using has a boatload of parameters you can configure at
build time, including its internal buffers. This was a prime candidate for
performance issues. I dug up the documentation, went through our configuration,
and sure enough - bingo! Several parameters were not at the recommended values
for this version of the library. Some buffer sizes needed to be multiples of the
packet sizes (eg MSS), and tuned to match other significant parameters. This
could have caused fragmented packets or memory buffers, and introduce small but
potentially disruptive delays to the flow.

This tuning process took many many hours, and eventually resulted in a decent
improvement in throughput. But was it enough? No - the big dump when the
connection was established wasn’t just slow now, it was noticeably jerky and
very bursty. I really needed to see exactly what was happening between the two
socket endpoints. I needed to understand why it was bursty - and fix it.

Fourth Layer: On the Wire

Having calculated the theoretical peak throughput, I decided there was no good
reason this microprocessor shouldn’t be able to maintain a much higher level
of throughput. Time to do some low-level packet analysis.

I set up Wireshark and started capturing packets. At first, everything seemed
ok but looking at the timestamps showed clearly that the transmissions were very
bursty. Sometimes there were delays of a few seconds between packets! No
wonder it was taking so long for a full status dump… but what was causing
this?

Looking at the IP layer, I decoded and inspected the session piece by piece,
from the very first packet. SYN, SYN-ACK, ACK... All good so far. But after
transmitting only a few data packets: NAK. Retries? Backoff? Delays! What on
earth was going on? The trace showed the micro was resending packets it had
successfully sent. Yet by matching up the sequence numbers, it showed the
packets were being ACKed by the other end. Eventually after receiving a few
out-of-order packets, the receiver tried to back off by increasing timeouts.
This perfectly illustrates the bursty nature of the traffic. But what could
be causing it?

Not leaving anything to chance, I tried changing Ethernet cables to make sure
it wasn’t a dodgy connection causing the fault. No dice.

At this point, my best hunch pointed to a bug in the TCP/IP library. Resending
an already acknowledged packet? Madness! Since we had found bugs in this library
before, it was quite conceivable. I upgraded the stack to the absolute latest
version and reran all the tests. Same problem. Yet according to the forums and
bug tracker, nobody else had reported this kind of problem with this stack
before.

I decided some major action was needed. I needed to partition the problem and
eliminate large components to isolate the fault.

Isolation

First stop, to write a simple socket server which would accept a client
connection, and then send out packets in a tight loop, as fast as it could. This
would exercise the TCP/IP stack, driver and hardware without any of the protocol
or application code. The packets contained a monotonic counter so I could see if
any packets were being corrupted or lost.

Running this test and capturing packets on the wire revealed the same problem. A
burst of traffic, a flurry of ACKs and NAKs followed by timeouts and
resends. Curses, foiled again!

Ok, how do I eliminate the TCP/IP stack from the equation? I constructed a UDP
ping packet by hand, using parts of the wire capture data to fill in the
relevant fields (such as MAC addresses). I kept a monotonic sequence counter
and copied this into the binary ping blob at the correct offset, which I passed
directly to the driver, with my workstation hardcoded as the destination. I
started with a small delay, in the order of 100ms between ping packets. This
seemed to work ok. But as I decreased the delay, packets were being dropped.
Dropped?!

The only thing between this test and the wire is the device driver and hardware.
Could the driver be corrupting or dropping packets?

Fourth layer: device driver

A code review of the device driver didn’t show up anything suspicious. Looking
at the memory management, interrupt handling - it all seemed quite carefully
written. Many hours later, no closer to the problem.

I pulled up the datasheet for the Ethernet controller and started querying the
status registers, halting the microprocessor and printing a diagnostic. There
were no clear errors to be found, so the driver did not appear to be causing the
hardware to fail sending or receiving data.

Fifth layer: hardware

The microprocessor has onboard Ethernet support, which is connected to a
separate MAC (Media Access Control) chip. This MAC performs the actual
electrical interfacing, and is the last piece of silicon before the wire. I
started reading the datasheet and looking at the initialisation sequence in the
driver, which configures the registers in the MAC on powerup. I verified the
correct register flags and values, but while I was reading I noticed there were
some counter registers which collected counts of certain types of media
(physical layer) errors.

I added some code to my minimalist hand-crafted ping test to read these counters
from the MAC registers, showing the values before and after the ping burst. Sure
enough, the counters were 0 on powerup, and after the ping test one of the error
counters had a very large number. Ok, I think we’re finally on to something…

Back on the wire

I modified the test program to send out hand-crafted ARP packets. The only
other code in play was the driver. I went back to Wireshark and captured
another session. This time, I exported the trace data to a file and analysed
the timing information in the headers.

I then stepped through and counted the number of successful packets sent before
a failure. Then the next, and the next. And I started to notice a sort of
pattern. The gaps were something like 9, 17, 33, 51… and eventually it would
come back down and repeat. A regular pattern is very interesting, but what
could be causing this kind of failure?

Stepping back and looking at the regular pattern of successes and failures over
time was like looking at an interference pattern. Like ripples in a pond,
where the waves met, packets were dropped. A colleague observed that this
looked a bit like there were two slightly different frequencies involved…
Wait a minute!

Don’t blame the Hardware

It was nearly midnight, and I desperately wanted to talk to the Hardware
Engineer who designed the system. But it would have to wait until the morning.
I fired off an email explaining what we had found, and went home exhausted.

The next day, I walked up to the Hardware Engineer who had a big grin on his
face. “I think I found your problem…”, he opened. I was skeptical, but
excited and urged him to explain. “In the last board spin, I rerouted the clock
source of the MAC controller. So the Microprocessor and the MAC were actually
running off two different clocks!”

I was elated. This perfectly explained the “interference pattern” we had
observed. The frequencies of the two clocks were supposed to be the same, but
were not perfectly aligned. Even a slight difference in frequency would cause a
‘beating’ effect as they drifted in and out of phase. Much like you can hear
when tuning a guitar, and two strings are almost, but not quite, in tune and you
hear a lower frequency ‘wow’.

So - while the two clocks were aligned, the microprocessor and the MAC
controller chip could reliably communicate, and the packets flowed normally. But
as the clocks drifted slightly out of phase, their chip-to-chip communication
was corrupted as the rising and falling signals between them became misaligned.
This explained why packets appeared to be sent or received at the higher layers,
but were in fact lost in the intermittently garbled transfers. It’s almost a
marvel TCP/IP worked at all!

The fix

In the end, it was a simple matter of ensuring both chips used a common clock
source - which required a modificaiton to the PCB routing. But for now, to test
the issue, the Hardware Engineer broke out the soldering iron and fixed the
routing by hand on our development system. (We were fortunate that the clock
signal was on an outer layer of the PCB!) I started the test again, very
nervous to see the results. After days of chasing ghosts, I didn’t want to get
my hopes up.

It worked. The hand-crafted arp and ping tests would run for as long as we
liked, and never skipped a beat, running as fast as it could go. Finally, full
throughput was achieved. I queried the registers for protocol and link errors,
and it was all good. I checked the TCP/IP layer diagnostics for errors and
statistics, and there were no red flags. I went back to the original application
firmware and tested out the protocol, monitoring the wire for good measure.

This time, it took less than a second for a full status dump. Finally - success!

Wrapup

So - what began as a seemingly simple software performance problem eventually
turned out to be caused by a hardware design fault. And it revealed several
other opportunities for improvement along the way. This was a great learning
experience, and a very satisfying puzzle to solve!

Random thoughts:

TCP/IP is really, really good at surviving unreliable hardware and problems in layers below.

Don’t mix clock sources between chips!

Don’t assume that the first problem you find is causing the problem.

Don’t assume the improvement you make is sufficient. Measure!

Performance is relative. What can you reasonably expect from the system in front of you?

Performance tuning is all about careful measurments and consistent tests. And changing
one thing at a time!

It’s hardly ever the fault of the OS. But it could be. And it’s hardly ever the fault of the
hardware. But it could be.

Don’t be satisfied with a fix until you understand how it works and why it fixes the problem.

It is sometimes possible to diagnose hardware problems through analysing software.