I figured I'd take a break from being assailed from all sides regarding the state of Perl (though I did find this very interesting, and especially this comment). No, this week I decided to go in a vastly different direction.

Every so often, I like to highlight curious or extreme examples of troubleshooting. Besides being good reads, much like a detective novel, these write-ups lodge in little crevices of our minds, lying dormant until a similar situation crops up weeks, months, or years later. What makes a good troubleshooter is the ability to eliminate possible causes until the actual source of the issue is revealed, and equally, the ability to instantly call upon tiny flecks of information from the distant past, then apply that knowledge to the current situation.

This is troubleshooting at a level that most people never reach. When we're investigating issues of a similar nature, such as intermittent network connectivity problems, we step through the usual suspects, and 999 times out of 1,000, we replace a patch cord, update a driver, or do something equally pedestrian. But with an issue like the one Kristian faced, with multiple reports of problems on brand-new hardware platforms, in completely different infrastructures, at completely different clients, involving behavior that appeared to be OS-independent, there are no usual suspects.

The best part of this write-up is that it's not only a very peculiar real-world case that keeps you guessing, but Kristian also details the tools he used and shows examples of how he arrived at the final answer. For those just entering the fray as network engineers, or even those who want to add to their skill set, reading through his methodology and playing around with the tools he references can only sharpen their skills. If you're at all interested in networks, being on a first-name basis with everything from Wireshark to tcpreplay to Ostinato will make fixing many problems far easier.

This jaunt into the wild also serves to underline a little-recognized phenomenon: exactly how much we trust the lowest-level code in our infrastructures.

The fundamental problem that Kristian faced was bad code in an Intel network controller, a bug that would shut down the interface if certain conditions were met -- conditions that occurred intermittently on a general-purpose network. Across the millions upon millions of Intel-driven NICs out in the world, a bug like this is so rare that the controller is absolutely the last place you'd look for the problem, as Kristian found out. You certainly don't look at a NIC EEPROM as the first step in deducing the cause of a network issue.