Despite our best attempts to design for the worst, the failure of high-availability systems is shockingly common. Here's how to avoid career-ending mistakes.

InfoWorld|Oct 4, 2010

If you've been in IT long enough, you've seen it happen: The crown jewel mission-critical application, built from the ground up to be highly available, goes down in flames and stays down, though a multitude of expensive backups and safeguards are in place. The reasons why this kind of "impossible" failure occurs are wide and varied, but ultimately trace back to two major factors that generally go hand in hand: complexity and plain old human error.

Complexity everywhere

The complexity of even the smallest business networks today dwarfs that of the enterprises of yesteryear. While I absolutely love server virtualization, virtual machine migration, SAN arrays, snapshots, replication, converged networks, and a whole host of other relatively new technologies, implementing them comes at a severe cost that many tend to overlook.

In the good old days, the functionality of a single application might depend upon nothing more than its own internal hardware and the network functioning properly. Today, that dependency tree is likely to include a group of centralized storage devices together with their ever-growing firmware code base, a virtualization hypervisor packed with features, and a more elaborate network architecture to support it all.

On balance, I think we should be happy about all of that -- maintaining tens or hundreds of stand-alone servers, each with its own compute and storage hardware, is fantastically wasteful and massively time-consuming. The complexity is simply a result of forward progress, but it does come at a cost.

Anytime you have to ask "What's it doing now?" you're essentially paying that bill. The solutions we deploy are far more complicated than any of us can really understand completely. If you've spent a few hours sifting through a pile of arcane log files trying to figure out why something that really should work isn't working, you know exactly what I'm talking about.

Perhaps partially as a result of that complexity, but also due to modern business's much larger reliance on technology to function, maintaining high availability has become more and more critical in IT departments of all sizes. Fifteen years ago, many businesses would see the 24-hour failure of a tier-one application as unpleasant, but not disastrous. When an outage like that occurs today, heads roll.

This uptime imperative results in lots of system redundancy: clustering, replication, and warm sites, to name a few popular solutions. While these systems usually accomplish their goals if they are designed, implemented, and maintained properly, what they mean, in effect, is: "Our systems are too complex and might fail, so we're going to add another level of complexity to solve that." It sounds dumb when it's written that way, but that's exactly what we do. And it works -- most of the time.

The human element

I suspect that if you placed every major IT failure under the same scrutiny that the NTSB applies to airline crashes, you'd see human error listed as the sole factor or a contributing factor in nearly every one of them. As much as IT is about racks of equipment, cables, telecommunications lines, and software, it's more about the people who design, build, and run all that stuff.

Product design. The human error parade starts before your fancy new equipment ever shows up on your doorstep. This isn't a particularly new phenomenon. Everyone has probably dealt with equipment that died because it wasn't assembled correctly or included some bad components.

Today, though, the complexity present in the systems we use results in much more insidious types of failures. As a case in point, a well-known SAN vendor recently released new firmware for its flagship storage product. The firmware supported a number of very cool features, and it was an exciting release -- that is, until it started crashing arrays and occasionally hosing customer data.

While I don't have the inside track on exactly what the problem was, you can bet it was probably a failure to do enough regression testing. As the solutions these vendors ship become more and more complex, the challenge of testing them against all of the bizarre scenarios that customers will run them through in the field becomes dramatically more difficult. That's not to say that I don't fault them for the failure, but it has become the status quo to expect new software to break something. That's sad, but that's where we are.

In the good old days, if you had a centralized storage device made by a trusted storage company die on you, chances are a card or a drive fried and you'd have well-equipped support technicians rappelling out of the ceiling to fix it in short order. Today, the somewhat unsurprised offshore support tech on the other end of the phone will likely be trying to figure out which one of the unpublicized critical software bugs you've just bumped into.

Implementation. Ultimately, it doesn't matter how good the product is -- if it's not implemented properly, chances are it's going to break or at the very least perform poorly. Complexity makes incorrect implementation more likely. Fortunately, with proper testing, most implementation errors can be rooted out before systems go into production. Failure to perform adequate acceptance testing will leave the discovery of the worst of these problems until the systems are under load.

Maintenance and testing. In my experience, lack of appropriate maintenance and lack of testing are the two largest factors that contribute to downtime of all kinds. The reasons why this is true should be obvious to anyone working in IT: We're all being asked to do more work with fewer resources.

I honestly can't recall the last time I saw an IT department where an employee didn't have enough work to do to justify his or her job. It's usually the opposite: The business is asking for new functionality faster than IT can deliver it -- so that regular maintenance and appropriate levels of testing fall by the wayside.

What you can do about it

If you do absolutely nothing else, test. Test everything as frequently as you can. Test backups, failover clusters, redundant switches, and SAN snapshots -- test anything that you've spent good money on to save your bacon if something breaks. Make sure to test under non-ideal circumstances -- don't check to see that everything is working properly before you test, because you won't have that luxury in a real failure. Don't shut things down cleanly; pull the plug. If you haven't tested something in the past three months, or since major architectural changes were made, assume that it just won't work like it's supposed to.
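
The "three months or since a major change" rule is simple enough to keep in a small script or spreadsheet. Here is a minimal Python sketch of that check. Everything in it is an illustrative assumption: the `Safeguard` record, the `is_suspect` and `suspects` helpers, the 90-day stand-in for "three months," and the sample dates. It is not a real tool.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional

# Assumption: "three months" is approximated as 90 days.
MAX_TEST_AGE = timedelta(days=90)


@dataclass
class Safeguard:
    """One thing you are counting on in a failure (hypothetical record)."""
    name: str
    last_tested: Optional[date]        # None means it has never been tested
    last_major_change: Optional[date]  # None means no major change on record


def is_suspect(item: Safeguard, today: date) -> bool:
    """Return True if the safeguard should be assumed NOT to work."""
    if item.last_tested is None:
        return True
    if today - item.last_tested > MAX_TEST_AGE:
        return True
    # A major architectural change made after the last test invalidates it.
    if item.last_major_change is not None and item.last_major_change > item.last_tested:
        return True
    return False


def suspects(items: List[Safeguard], today: date) -> List[str]:
    """Names of every safeguard that is overdue for a real test."""
    return [item.name for item in items if is_suspect(item, today)]


# Made-up inventory for illustration only.
inventory = [
    Safeguard("backups", date(2010, 9, 1), None),
    Safeguard("failover cluster", date(2010, 5, 1), None),
    Safeguard("SAN snapshots", date(2010, 9, 15), date(2010, 9, 20)),
    Safeguard("redundant switches", None, None),
]

overdue = suspects(inventory, today=date(2010, 10, 4))
print(overdue)
```

With the sample data above, everything except the backups is flagged: the cluster test is stale, the SAN changed after its last test, and the switches were never tested at all. A list like this is also handy evidence when you ask for testing time.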

If you ask management or business stakeholders for the necessary time to do the testing and are denied it for whatever reason, make it crystal clear that you can't guarantee failover systems will function properly in a failure scenario. This will not make you popular. But trust me, it's a heck of a lot better than being shown the door because the failover system you're responsible for didn't work.