A designer knows he has achieved perfection not when there is nothing left to add, but when there is nothing left to take away. ~Antoine de Saint-Exupery -- Note, the opinions stated here are mine alone and are not those of any past, present, or future employer. --

Saturday, August 16, 2008

Availability Enlightenment

The air in the room was still and slightly cool; some might complain too cool. Sitting near the window, which let in light filtered by partially drawn vertical shades, was a man. His hair was streaked with glints of gray that indicated either age or stress. Yet the hair was somewhat betrayed by the bright Hawaiian shirt that fit loosely over his shoulders. The room was shadowed enough that the glow of the LCD screen lit his face.

Approaching the room at a hurried pace was a young engineer. Stress was clear upon his face. He was obviously in need of guidance, help with some crisis. As he approached the desk, the older engineer turned and met him with a calm, steady stare. Our young engineer quickly introduces himself: "Hello, my name is Bill. They say you are the fellow." The older man nods, offers Bill a seat with a motion of his hand, and says, "You can call me TF for short." And our story begins.

Bill: "I have a terrible crisis on my hands. Several months ago we were asked to achieve 99.999% availability."

TF simply returns Bill's gaze.

Bill: "That's only 5 minutes of downtime per year!"

TF nods, "How did you approach it?"

Bill: "Well, we sought out the most reliable components we could buy. We surveyed all of the vendors and selected them carefully. We deployed our applications on this new configuration."

TF: "And your availability?"

Bill: "Well," glancing at his shoes now, "it actually went down."

TF nods again, "You were led to believe that reliable components result in a reliable system, weren't you?"

Now it was Bill's turn to simply nod.

TF: "And nobody expressed any concern that this might not be right?"

Bill shook his head slowly, almost with shame.

TF: "Of course not, as the wisdom of the many is that reliability comes from reliable vendors, who purvey their products as the solution to the availability their customers so desperately desire. But alas, this is not the path to reliability, only the road to stress."

Bill nods, "But then what is the secret to availability?"

TF: "Well, if reliable components aren't the answer, what do you think?"

Bill looks puzzled.

TF: "What is the opposite of reliable components?"

Bill: "Unreliable components?"

TF: "Exactly!"

Bill: "Wait, to get better availability, I should begin with components that break more often?"

TF nods.

Bill: "I don't follow. If my components are failing, then how do I keep my system up?"

TF: "What do you think you should do if a component is down but your system still needs to be up?"

Bill ponders it for a moment, "Design around component failures?"

TF smiles broadly and nods, "You are a chosen one. Assume nothing in your system can be counted upon to work. Everything breaks, regularly. With that as a baseline, now you can design your system to compensate for failures. Through this approach you can achieve new levels of availability."

Bill: "But I can only design the software. What about failures in other components?"

TF shakes his head, disappointed. "Bill, you are building a system, not a collection of parts. It has to be designed from the customer all the way to the silicon and ferrous atoms or you will never achieve the nirvana you seek."

Bill's face suddenly brightens. "Wait, if I assume nothing is reliable, then the argument for vendor supported software loses a lot of credibility. I can choose less expensive open source solutions."

TF nods. "I assume you know that open source doesn't imply that software is less reliable. In fact, in many cases it can be argued it is more reliable."

Bill: "Yes, I know, but there are chains of approval that have to be convinced."

TF nods again.

Bill: "But still, five 9's. Can it be done on inexpensive components including open source?"

TF: "It has already been done, Bill. I helped build it. Actually, this approach can be used to achieve almost any level of availability. You simply decide which component failures you'll design around and which you'll accept as having the ability to bring your system down. It gives you significant control over the tradeoff between cost and availability."

Bill looks a bit shocked. "So let me see if I can summarize the path to high availability. Assume nothing works, design for all possible failures, and design the system holistically."

TF: "Yes, Bill, you are on the path to availability enlightenment."

Bill: "I wish I had come to you first."

TF: "The path to enlightenment is often only visible after failure, if at all."
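
For readers who want to check the arithmetic behind the story, here is a minimal sketch in Python. It is not taken from the system TF describes; the function names are hypothetical, and it assumes that replicas fail independently and that failover between them is instantaneous and perfect, which real systems only approximate. Under those assumptions it shows where Bill's "5 minutes" comes from and why several unreliable components in parallel can beat one reliable component.

```python
# Illustrative sketch only. Assumes independent failures and perfect,
# instantaneous failover; all names here are hypothetical.

MINUTES_PER_YEAR = 365.25 * 24 * 60  # average year, including leap years


def downtime_minutes_per_year(availability):
    """Yearly downtime budget implied by an availability fraction."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def redundant_availability(component_availability, replicas):
    """Availability of N replicas when the system is up if any one is up."""
    return 1.0 - (1.0 - component_availability) ** replicas


def replicas_needed(component_availability, target):
    """Smallest replica count whose combined availability meets the target."""
    if not 0.0 < component_availability < 1.0:
        raise ValueError("component_availability must be between 0 and 1")
    if not 0.0 < target < 1.0:
        raise ValueError("target must be between 0 and 1")
    n = 1
    while redundant_availability(component_availability, n) < target:
        n += 1
    return n


if __name__ == "__main__":
    # Five 9's leaves a budget of a little over five minutes per year.
    print(round(downtime_minutes_per_year(0.99999), 2))

    # A "two 9's" component on its own is down for days each year,
    # yet a handful of them in parallel can meet a five 9's target.
    print(round(downtime_minutes_per_year(0.99) / (24 * 60), 2))
    print(replicas_needed(0.99, 0.99999))
```

The independence assumption is the fragile part: replicas that share a power feed, a network switch, or a software bug fail together, which is one reason TF insists the system be designed as a whole rather than as a collection of parts.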

Comments

This is a neat contemporary addition to the hacker canon. I have not yet had the pleasure of building a distributed system on this scale but I am nonetheless interested.

Do you know where I might get involved with such a project in my spare time? I liked a recent writeup on the High Scalability blog about the 'Bumper Sticker' Facebook app being created as a proof of concept for a high-traffic cloud app built with Ruby on Rails. I believe the LinkedIn team built it.