Beyond Backups: The Next Steps for Fault Tolerance, Pt. 1

For many organizations, particularly smaller ones, the concept of fault tolerance extends only as far as doing a nightly tape backup. In many cases, the reason cited for using just this measure is a lack of available funds, but perhaps more prevalent is simply a lack of appreciation for the impact of a downed server. Tape backups provide an insurance policy against one thing -- data loss. They do not protect against downtime and quite often constitute the slowest part of a system recovery process.

A fact that organizations must understand is the distinction between fault tolerance and data protection. The data is of value, obviously, and the hardware is of value, but the cost of downtime is somewhat harder to determine. Backups protect the data, and the hardware is protected by virtue of its being kept in a safe location, but the prevention of downtime is a little more complex. Even those working in environments where clustering (http://www.webopedia.com/TERM/c/clustering.html) and fail-over systems are used must consider downtime.

Many of the same organizations that use only tape backups are very willing to implement fault-tolerant measures after a damaging event has occurred. As with most things, the benefit of hindsight is great. The principle of fault tolerance is that an ounce of prevention is worth a pound of cure, and it should be viewed as an investment just like any other aspect of business. This is even truer nowadays as more companies find themselves unable to function without the use of a server, and the price of hardware that can be used to provide fault tolerance continues to fall.

I remember that when I was teaching technical training courses some years ago, the downside cited for disk mirroring was that it costs 50% of your disk space. In today's market, disk space is one of the cheapest commodities we have. So should we all mirror our drives? In the absence of a RAID 5 array, I would say yes, why not? Heck, for the sake of a few hundred bucks you could even consider implementing disk duplexing, but more about that in part two.
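
To put a number on that 50% figure, here is a minimal sketch in Python that compares the space given up by a two-disk mirror with that of a RAID 5 array. The function names are invented for illustration, and the sketch assumes identical disks and the textbook layouts only: a mirror keeps one full copy of the data, and RAID 5 gives up one disk's worth of space to parity.

```python
def usable_capacity_gb(disk_count: int, disk_size_gb: float, layout: str) -> float:
    """Usable space for an array of identical disks (hypothetical helper)."""
    if layout == "mirror":
        # A simple mirror pairs two disks; one holds a full copy of the other.
        if disk_count != 2:
            raise ValueError("this sketch models a two-disk mirror only")
        return disk_size_gb
    if layout == "raid5":
        # RAID 5 needs at least three disks and spends one disk's worth on parity.
        if disk_count < 3:
            raise ValueError("RAID 5 needs at least three disks")
        return (disk_count - 1) * disk_size_gb
    raise ValueError(f"unknown layout: {layout}")


def overhead_fraction(disk_count: int, disk_size_gb: float, layout: str) -> float:
    """Fraction of raw disk space given up to redundancy."""
    raw = disk_count * disk_size_gb
    return 1 - usable_capacity_gb(disk_count, disk_size_gb, layout) / raw


print(overhead_fraction(2, 100, "mirror"))  # two 100 GB disks, mirrored
print(overhead_fraction(4, 100, "raid5"))   # four 100 GB disks in RAID 5
```

The mirror gives up half of the raw space regardless of disk size, while the RAID 5 overhead shrinks as disks are added. Either way, the cost in dollars falls as the price of disk space falls.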

For each fault-tolerant step you consider, you must look at a number of factors. Possibly the biggest consideration is how likely a given component is to fail. I recently attended a seminar given by Intel, where we were discussing a feature called Fault Resilient Booting (FRB): if one processor fails, the system disables the failed processor and reboots. Someone had to ask the question, so I did. How often does a processor fail? (In my 13 years on the job I have never, to my knowledge, had a failed processor.) The answer was 'very, very seldom', though of course what would you expect someone from Intel to say? Unless you are looking to create a supremely fault-tolerant system, features that protect against 'very, very seldom' occurrences must be weighed against those that protect a more susceptible component. But that raises another question. What is a susceptible component?
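
One way to do that weighing is to compare what each measure costs with the downtime cost it could be expected to save. The Python sketch below is only an illustration of the idea: the class, the function names and every number in it are made up for the example and are not measured failure rates or real prices.

```python
from dataclasses import dataclass


@dataclass
class Measure:
    name: str
    annual_cost: float             # what the measure costs per year
    failures_per_year: float       # expected failures of the protected component
    hours_down_per_failure: float  # downtime each failure would cause without it


def downtime_cost_avoided(measure: Measure, downtime_cost_per_hour: float) -> float:
    """Expected yearly downtime cost the measure would prevent."""
    return (measure.failures_per_year
            * measure.hours_down_per_failure
            * downtime_cost_per_hour)


def rank_measures(measures, downtime_cost_per_hour: float):
    """Sort measures by expected saving minus cost, best first."""
    def net_benefit(m: Measure) -> float:
        return downtime_cost_avoided(m, downtime_cost_per_hour) - m.annual_cost
    return sorted(measures, key=net_benefit, reverse=True)


# Placeholder inputs, invented purely to show the comparison.
candidates = [
    Measure("processor failover", annual_cost=500.0,
            failures_per_year=0.001, hours_down_per_failure=8.0),
    Measure("disk mirroring", annual_cost=300.0,
            failures_per_year=0.1, hours_down_per_failure=8.0),
]
ranked = rank_measures(candidates, downtime_cost_per_hour=1000.0)
for m in ranked:
    print(m.name)
```

With these placeholder figures the measure protecting the more failure-prone component ranks first. The hard part in practice is supplying honest inputs, particularly the hourly cost of downtime, which is exactly the figure most organizations find difficult to determine.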

Some years ago, when working for a major financial institution, I arrived at work one Monday morning (have you ever noticed how these things always happen on a Monday?) to find that three drives in the RAID array of one of the servers had gone down. The cause? No, it wasn't a faulty batch of drives -- it was a faulty backplane. This server was the full meal deal -- 'biggie sized'. It had dual power supplies, adapter teaming, RAID 5 with a hot spare and a vastly oversized UPS. None of which could prevent the system falling foul of a $90 component. The fact is that no matter how many fault-tolerant measures are in place, there is always an unknown factor. In other words, the Holy Grail of reliability, 100% uptime, is not attainable. But increased availability can be achieved.
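
The backplane story can be expressed with the standard availability arithmetic. The Python sketch below assumes that component failures are independent, and the availability figures in it are placeholders rather than vendor data. Components that must all work multiply together, so a single unprotected part caps the whole system, while redundant copies of a part fail together far less often.

```python
HOURS_PER_YEAR = 24 * 365


def series_availability(availabilities) -> float:
    """Availability of a chain in which every component must work."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


def redundant_availability(availability: float, copies: int) -> float:
    """Availability of a group that is down only if every copy is down."""
    return 1 - (1 - availability) ** copies


def annual_downtime_hours(availability: float) -> float:
    """Expected hours of downtime per year at a given availability."""
    return (1 - availability) * HOURS_PER_YEAR


# Placeholder figures: dual power supplies, plus one non-redundant backplane.
power = redundant_availability(0.99, copies=2)
backplane = 0.999
system = series_availability([power, backplane])

print(system)
print(annual_downtime_hours(system))
```

However many redundant power supplies are added, the result in this model can never be better than the availability of the lone backplane, and it never reaches 100%.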

In part two of this article, we will look in more detail at some of the options available for fault tolerance on server-based systems, and evaluate their effectiveness in relation to investment.

Drew Bird (MCT, MCNI) is a freelance instructor and technical writer. He has been working in the IT industry for 12 years and currently lives in Kelowna, BC, Canada.