In the course of architecting systems for different verticals (health care, automotive, financial services, etc.), my discussions with the customer eventually get to availability and the customer’s expectations of how much downtime is permissible. The demand from the customer almost always starts off as “100% uptime”. That’s before they realize the cost of absolutely no downtime, and that’s about when they start exploring the possibility of “the nines” (99.999% or 5 nines, or in some cases 99.9999% or 6 nines). The truth is that hardly any system can be up all the time, every time. The cost is prohibitive and the liability for the vendor is too high. What is usually negotiated instead is the hours of the day during which 100% uptime is needed. This is usually very doable as long as those time frames are clearly defined. Consider a brokerage that does all its business between 9:30 AM and 4 PM. During those hours, it absolutely needs 100% uptime; otherwise the loss to the business is huge. This is a defined time frame that can be achieved with proper practices in place.

Most system vendors will market their “5 nines” or “6 nines” of availability. However, I try to make my customers aware of other important factors that they need to take into account when signing contracts. Not all downtime is created equal. The consequence of downtime is very important: downtime that drives my customers away and downtime that causes only a minor inconvenience are two very different scenarios. How the downtime is distributed matters too: is it spread out over days, or does it all happen at once?

So what exactly is downtime? Definitions vary: some say that it’s when a component in the chain is not functioning, while others say they are experiencing downtime when the network is slow. In my mind, you are experiencing downtime when the system prevents you from getting your work done on time. What causes downtime? It could be different things:

human error

natural disasters

network issues

hardware issues

software issues

viruses

etc

So when people talk of 5 nines or 6 nines, what exactly does that mean? If you do the math, you’ll realize that achieving 6 nines (99.9999%) means having about 0.6 seconds of downtime a week (or 31.5 seconds a year!). This is very hard to achieve and quite often an unrealistic goal. Achieving a high degree of reliability and availability means looking at each link in your chain, strengthening the weak links and improving the strong ones, because, at the end of the day, it takes just one weak link to bring down your system. So if you think about it, what are the links in your chain that need attention? The answer is everything from start to finish:

hardware

software

networks

file servers

printers

databases

applications – crashes, hangs, bugs

web servers

application servers

security

backups
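To make the arithmetic behind the nines concrete, here is a minimal Python sketch. The function names are my own, and the chain calculation assumes the links fail independently of each other and that the system is down whenever any one link is down; real systems rarely behave that neatly, so treat the numbers as a rough guide.

```python
SECONDS_PER_WEEK = 7 * 24 * 60 * 60
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def downtime_seconds(nines, period_seconds):
    """Downtime allowed in a period at an availability of `nines` nines.

    5 nines means 99.999% available, i.e. unavailable 10**-5 of the time.
    """
    unavailability = 10.0 ** -nines
    return period_seconds * unavailability


def chain_availability(link_availabilities):
    """Availability of a chain in which every link must be up.

    Assumption: links fail independently, so availabilities multiply.
    """
    result = 1.0
    for availability in link_availabilities:
        result *= availability
    return result


if __name__ == "__main__":
    for nines in (3, 4, 5, 6):
        per_week = downtime_seconds(nines, SECONDS_PER_WEEK)
        per_year = downtime_seconds(nines, SECONDS_PER_YEAR)
        print(f"{nines} nines: {per_week:.2f} s/week, {per_year:.1f} s/year")

    # Hypothetical chain: ten links, each individually at 3 nines.
    chain = chain_availability([0.999] * 10)
    print(f"ten links at 99.9% each: {chain:.4%} overall")
```

The second function shows why every link matters: a chain is always less available than its weakest link, and the losses compound as links are added.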

As you can see, designing a system with an eye towards minimum downtime is not an easy task. A lot of practices come into play: clustering, load balancing, RAID disks, SANs, NAS, virtualization, redundant networks, virus protection, proper documentation, application design geared towards monitoring and self-healing, server farms and much more. In subsequent posts, we’ll explore some of these in greater detail.
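As a rough illustration of why redundancy practices such as clustering and load balancing help, the sketch below estimates the availability of a group of redundant nodes. It is a simplified model under loud assumptions: the nodes fail independently, failover is instant and flawless, and one surviving node can carry the whole load. The function names are hypothetical.

```python
def redundant_availability(node_availability, node_count):
    """Availability of `node_count` redundant nodes where one survivor suffices.

    Assumption: independent failures, so the group is down only when
    every node is down at the same time.
    """
    all_down = (1.0 - node_availability) ** node_count
    return 1.0 - all_down


def nodes_needed(node_availability, target_availability):
    """Smallest node count whose modeled availability meets the target."""
    if not 0.0 < node_availability < 1.0:
        raise ValueError("node_availability must be between 0 and 1")
    if not 0.0 < target_availability < 1.0:
        raise ValueError("target_availability must be between 0 and 1")
    count = 1
    while redundant_availability(node_availability, count) < target_availability:
        count += 1
    return count


if __name__ == "__main__":
    # Hypothetical nodes that are each 99% available on their own.
    for count in (1, 2, 3):
        combined = redundant_availability(0.99, count)
        print(f"{count} node(s): {combined:.6%}")
    print("nodes needed for 5 nines:", nodes_needed(0.99, 0.99999))
```

In practice shared dependencies (power, network, a bad deployment) make failures correlated, so the real gain from adding nodes is smaller than this model suggests.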