We experienced an odd outage a week ago that makes me wonder where the whole datacenter design concept is heading.

We experienced an enterprise-wide network outage at about 4:30 EST. The core switch and its redundant backup both failed, and all traffic came to a screeching halt. The guys in the command center were scrambling to figure out not only what had happened, but what to do about it. Our network is outsourced, so a call was placed for support. They accessed the switches and made multiple attempts to reboot them, with no success.

Meanwhile, the command center was trying to switch the workload over to the DR site. But since almost all the applications had been up when the network died, they were literally frozen mid-stream in attempts to write data to the backend databases. The command center tried to put together a conference bridge and get everyone on it. Email was out. Internet was out. IM was out. All of their contact lists, documents, and procedures are stored on the LAN, so those were unavailable too. Even our on-call messaging system couldn't work, since the call-out details it needs are stored in a database that was unreachable.

Eventually, after about two hours, some hardware in those switches was replaced and they came back up on reboot. Everything was back to normal by that evening.

I've never understood the appeal of running high-speed data I/O over a TCP/IP network rather than over direct, point-to-point hard-wired connections. It has always seemed like a recipe for failure under the wrong circumstances.