High Availability

As companies become increasingly dependent on their
information systems simply to function, the availability of those
systems becomes increasingly important.
Outages can cost millions of dollars an hour in lost revenue, to say nothing
of the potential damage to a company’s image. To add to the problem, a number of natural disasters have shown that
even the best data center designs can’t withstand events like tsunamis,
causing many companies to implement or re-evaluate their disaster recovery
plans and systems. Practically every
customer I talk to asks about disaster recovery (DR) and how to configure their
systems to maximize availability and support DR. This series of articles will cover some of
the information I share with these customers.

The first thing to do is define availability and how it is
measured. The definition I prefer is this:
availability represents the percentage of time a system is able to correctly
process requests within an acceptable time period during its normal operating
period. I like this definition because it
allows for times when a system isn’t expected to be available, such as during
evening hours or a maintenance window. That being said, more and more systems are expected to be
available 24x7, especially as more and more businesses operate globally and
there are no common evening hours.

Measuring availability is pretty easy: simply put, it is the ratio of the time a
system was available to the time the system should have been available. I know, not rocket science. While it’s good to measure availability, it’s
usually better to be able to predict availability for a given system, in order
to determine whether it will meet a company’s availability requirements. To predict availability for a system, one
needs to know a few things, or at least have good estimates for them. The first is the mean time between failures,
or MTBF. For single components like a
disk drive, these numbers are pretty well known. For a large computer system the computation
gets much more difficult. More on the MTBF
of complex systems later. The next
thing one needs to know is the mean time to repair, or MTTR, which is simply how
long it takes to put the system back into working order.
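The measurement described above, the ratio of time the system was available to the time it should have been available, can be sketched in a few lines of Python. The function name and the outage figures here are made up for illustration:

```python
# Measured availability: time the system was actually available divided
# by the time it should have been available, expressed as a percentage.
# The outage numbers below are illustrative, not from real logs.

def measured_availability(scheduled_hours: float, outage_hours: list[float]) -> float:
    """Availability in percent over some measurement window."""
    downtime = sum(outage_hours)
    return (scheduled_hours - downtime) / scheduled_hours * 100.0

# A 30-day month of 24x7 operation with two outages: 2 hours and 30 minutes.
print(f"{measured_availability(30 * 24, [2.0, 0.5]):.2f}%")  # prints "99.65%"
```

Note that `scheduled_hours` is the *expected* operating period, so a system with a nightly maintenance window would pass in fewer hours than a 24x7 one.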

Obviously, the higher the MTBF of a system and the lower its MTTR,
the higher its availability. In mathematical terms the system availability in percent is:

Availability = MTBF / (MTBF + MTTR) x 100

So if the MTBF is 1,000 hours
and the MTTR is 1 hour, then the availability would be 99.9%, often called “three
nines.” To give you an idea of how
much downtime per year various numbers of nines equate to, here is a table
showing the various levels, or classes, of availability:

Availability | Total Down Time per Year | Class or # of 9s | Typical application or type of system
------------ | ------------------------ | ---------------- | -------------------------------------
90%          | ~36 days                 | 1                |
99%          | ~4 days                  | 2                | LANs
99.9%        | ~9 hours                 | 3                | Commodity Servers
99.99%       | ~1 hour                  | 4                | Clustered Systems
99.999%      | ~5 minutes               | 5                | Telephone Carrier Servers
99.9999%     | ~1/2 minute              | 6                | Telephone Switches
99.99999%    | ~3 seconds               | 7                | In-flight Aircraft Computers
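As a quick check on the formula and the table, here is a short Python sketch (illustrative values only) that predicts availability from MTBF and MTTR and converts an availability percentage into downtime per year, assuming a 24x365 operating period:

```python
# Predicted availability from MTBF and MTTR, and the downtime per year
# that a given availability percentage implies for a 24x365 system.

HOURS_PER_YEAR = 24 * 365  # 8,760 hours

def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Predicted availability in percent: MTBF / (MTBF + MTTR) x 100."""
    return mtbf_hours / (mtbf_hours + mttr_hours) * 100.0

def downtime_hours_per_year(avail_percent: float) -> float:
    """Expected downtime per year if the system must be up 24x365."""
    return (1.0 - avail_percent / 100.0) * HOURS_PER_YEAR

# The example from the text: MTBF of 1,000 hours, MTTR of 1 hour.
print(f"{availability(1000, 1):.1f}%")  # prints "99.9%", i.e. three nines

# Downtime implied by a few of the availability classes in the table:
for pct in (99.0, 99.9, 99.99, 99.999):
    print(f"{pct}% -> {downtime_hours_per_year(pct):.2f} hours/year")
```

Running the loop reproduces the table: 99% works out to about 87.6 hours (~4 days) and 99.9% to about 8.76 hours (~9 hours) per year.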

As you can see, the amount of allowed downtime gets very
small as the class of availability goes up. Note, though, that these times assume the system must be available
24x365, which isn’t always the case.