Debugging technique: Time the problem

I once had a ISDN router that got incredibly slow now and then. People on the far side of the router lost service for 10-15 seconds every now and then.

The key to finding the problem was timing how often the problem happened. I used a simple once-a-second "ping" and logged the times that the outages happened.

Visual inspection of the numbers turned up no clues. It looked random.

I graphed how far apart the outages happened. The graph looked pretty random, but there were runs that were always 10 minutes apart.

I graphed the outages on a timeline. That's where I saw something interesting. The outages were exactly 10 minutes apart PLUS at other times. I wouldn't have seen that without a graph.

What happens every 10 minutes and other times too? In this case, the router recalculated its routing table every time it got a route update. The route updates came from its peer router exactly every 10 minutes plus any time an ISDN link went up or down. The times I was seeing a 10-minute gap was when we went an entire 10 minutes with no ISDN links going up or down. With so many links, and the fact that they were home users intermittently using their connections, meant that it was pretty rare to go the full 10 minutes with no updates. However, by graphing it the periodic outages were visible.

I've seen other outages that happened 300 seconds after some other event: a machine connects to the network, etc. A lot of protocols do things in 300 second (5 minute) intervals. The most common is ARP: A router expires ARP entries every 300 seconds. Some vendors extend the time any time they receive a packet from the host, others expire the entry and send another ARP request.

What other timeouts have you found to be clues of particular bugs? Please post in the comments!

No TrackBacks

Another classic interval is 49.7 days, or 2^32 milliseconds. Timer wraps cause all kinds of interesting buggy behaviors. That one is a little more maddening to track, though most mature code has timer wrap protection or 64 bit counters. 64 bit counters take more than 584 thousand years to wrap and I have yet to see any system with that kind of uptime.

I remember WD Velociraptor SATA disks, that caused a random disk failure every 2^32 milliseconds. If there was disk activity at the timer rollover, the disk went offline. Since all of the disks are started at the same time (at power-on) this is a nice way to lose a RAID array completely.

We once had a problem reported by web hosting folks that FTP file transfers done to web servers that were using NFS shares, failed. It turned out they had a script killing idle FTP session that had a bug and was killing everything instead. We discovered it by doing an FTP transfer and recording the time it failed.. turned out it was exactly every 5 minutes.. and they were able to track it down. It was fun to troubleshoot their own systems from outside.

Debugging admin complaints for Exim on exim-users, you quickly learn that 30 second delays are DNS timeouts, ie they have broken DNS, and 5 second delays are because ident traffic being dropped, instead of allowing RSTs to return.

People experiencing slow POP3 service are using some stupid POP3 client which not only polls every minute, but for some unearthly reason thinks that it should poll *on* the minute, rather than every 60 seconds after starting.

Running ISP mail-servers, you see POP3 load spike every minute, on the minute. Customers with less-broken clients got rather more responsive service. (But then, less broken clients are unlikely to hang the UI rendering based on talking to a remote server).