Try as we might to keep chaos at bay, there will come a time when the perfect storm hits and everything falls apart. Usually a confluence of elements triggers total meltdown, but sometimes one overlooked weak link fails and causes a cascade of problems that takes an entire network offline.

These situations are never easy to deal with and are generally compounded by the fact that admins are feverishly working to fix problems while being bombarded with alarms from other systems that are also failing due to the initial outage. It’s like trying to rebuild a house while it's falling down on top of you.

To make matters worse, depending on the nature of the problem, tools that might be used for the fix are not available. In some cases, this includes Internet access. I can recall a time when the network was down and the only Internet source was pre-iPhone cellphones. There was no cell signal in the data center and no wireless access anywhere, so someone stood outside Googling for answers and relaying information back down to a crew member with a laptop jacked into a console port.

Diabolical dependencies

I've seen admins try to boot a broken virtual server to a PXE rescue image to recover a corrupted disk caused by a bad NIC, only to realize after several minutes that the PXE services are provided by the server they're trying to fix. They then dig through the ISO share located on a nonproduction storage array for the boot image, but discover that nobody bothered to configure the array with a static IP and the DHCP lease just expired -- because the broken VM was also the DHCP server. Utter mayhem.
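One way to catch this kind of trap while the seas are still calm is to write the dependencies down and ask which recovery tools would vanish along with a given failure. The following is only a rough sketch of that idea, not a real inventory tool: the service names and the dependency map are hypothetical, loosely modeled on the PXE, DHCP, and storage scenario above.

```python
from collections import deque

# Hypothetical dependency map: each service lists what it needs to function.
# The names are illustrative assumptions, not taken from any real inventory.
DEPENDS_ON = {
    "infra_vm": [],
    "pxe_rescue_boot": ["infra_vm"],      # PXE services run on the VM
    "dhcp": ["infra_vm"],                 # so does the DHCP server
    "storage_array_ip": ["dhcp"],         # array has no static IP
    "iso_share": ["storage_array_ip"],    # boot images live on the array
}


def blast_radius(failed, depends_on):
    """Return every service that directly or transitively depends on `failed`."""
    # Invert the map so we can walk from a service to its dependents.
    dependents = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(service)

    affected = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for dependent in dependents.get(current, ()):
            if dependent not in affected:
                affected.add(dependent)
                queue.append(dependent)
    return affected


def unusable_recovery_tools(failed, recovery_tools, depends_on):
    """List the recovery tools that go down together with `failed`."""
    return sorted(set(recovery_tools) & blast_radius(failed, depends_on))


if __name__ == "__main__":
    tools = ["pxe_rescue_boot", "iso_share"]
    print(unusable_recovery_tools("infra_vm", tools, DEPENDS_ON))
```

In this made-up map, both rescue paths depend on the very VM they are supposed to repair, which is exactly the situation you want to discover on paper rather than in the middle of an outage.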

It's during crises like these that some admins find religion and start promising they'll do things differently in the future if only they could get this problem fixed right now. Lack of proper backups, lack of a backup plan, and lack of hardware support services make all of this much more challenging, but it's always harder to recognize that fact when the seas are calm. If the absence of a $100 replacement part is keeping an entire network offline, you may want to rethink your strategy and budget.

These days, we enjoy a significant concentration of technologies that make day-to-day IT work much simpler than in the past. The era of corporate infrastructure built without virtualization is behind us, and we revel in the fact that we're running so many virtual servers on so few physical hosts. We can throw virtual machines around the infrastructure with abandon and completely renovate the underlying physical hosts without taking down a single production server. It's miraculous -- until a cog in that highly condensed world snaps and takes down many more systems than if we had never virtualized at all.