Experienced a network outage? I feel your pain.

By now, all of us in networking have read about the four-day BlackBerry outage or experienced our own personal Crackberry withdrawal. We also know that a core switch failure and a backup switch gone bad are shouldering the blame for it. (According to RIM co-CEO Mike Lazaridis, the outage occurred “when a dual-redundant, dual capacity core switch failed and its backup switch failed to activate.”)

I feel for those who were working in the networking department at RIM. I can’t think of a worse position to be in. Wait, actually I can: it would be worse to be the support engineer from whichever core switch vendor they used, on the phone for the last 24 hours. For that person or persons, tonight is probably going to be another long night, complete with a sleeping bag next to the cubicle. Live production networking is a tough gig, too tough for me, which is why I moved out of it and into network testing. In fact, I don’t think I could handle live production networks anymore, especially not with all the outages lately.

So to all network engineers who have been experiencing outages: I feel for you. I really do. We can’t undo what happened, so my only suggestion, if you are reading this, is to please consider doing more stress testing. That may seem a trite statement at this stressful time, because I’m sure you did test, but consider more testing with line-rate load generators such as the ones Spirent makes, to truly create stress and cause outages in the lab so you understand the full impact before they happen for real. Push this up your supply chain, too. Urge your vendors to recreate this scenario in their labs, and suggest they use Spirent tools (Spirent TestCenter, Spirent Landslide, and Spirent Avalanche) to generate the loads you experienced when the failover event occurred.

This should henceforth be a mandatory test case that every major core switch vendor runs their equipment through. I’m sure none of us wants to see these kinds of things happen again, and neither do any of the millions of customers. And of course, neither do the poor support engineers who have to deal with this.