Design

Designing for Reliability

By Dennis Shasha, April 24, 2011

Whether you are designing a data center or a manned space tourism vehicle, you will probably design in redundancy to ensure reliability

If you've been reading Dr. Dobb's for many years, you may remember "Dr. Ecco's Omniheurist Corner." In that column, Professor Shasha posed puzzles that required computer solutions. Dennis has returned to Dr. Dobb's, but with a new twist on puzzles. The idea is to take inspiration from the challenging apps that readers have faced and to turn those into puzzles.

If you have written an application that required extensive use of heuristics (because no algorithm would solve it) or that is otherwise algorithmically interesting, please send mail to Dennis at shasha@cs.nyu.edu. Put "Tough Apps" in the subject line. If he likes it, he will contact you. Then he will try to understand what you've done and create a puzzle explaining the algorithmic core. You will write a little blurb about yourself and the problem if you want.

This column came out of such a conversation.

Pat McGuire's organization North American Aviation didn't make normal products in the 1960s and '70s. They designed Minuteman nuclear missiles. Fortunately, the missiles were never fired in anger.

But to make them credible as a deterrent (and to economize on the number of them required), the weapons had to be reliable. Increasing their reliability at a good price was McGuire's job.

"We went to various vendors and asked them how much they could improve the reliability of their components (transistors, diodes, and the like), and how much that would cost. In the end, we paid for whatever gave the most bang for the buck. That is, what could give us the greatest gain in missile performance at minimum cost."

By the end, the mean time between failures was over 5.5 years according to a recent Wikipedia article. The parts included an on-board D17B computer (a solid state cylindrical 10 cubic foot computer designed in the 1960s) that had over 9000 electronic components on 75 circuit boards. Is your 2011-vintage data center that reliable?

I asked Pat whether he valued improvements in parts that were redundant in the missile more or less than others. He said there was little redundancy in those missiles.

Warm-Up: Suppose you have a bunch of non-redundant parts, all of which must work for your device to work. The parts have different reliabilities. (Let's define the reliability of a part as the probability that the part will behave properly over the next year.) Suppose that you could increase the reliability of any part by 0.1. If each such increase costs the same, which part would you pick?

Solution to Warm-Up: The one with the least reliability. That may seem intuitive, but the math is also easy. The reliability of a bunch of independently failing parts is just the product of their probabilities, viz. R = p1 * p2 * p3 ... The partial derivative with respect to π is just the product of the other probabilities. That product is greatest when π is smallest.

For example, if you have two parts having probabilities of success 0.5 and 0.9 respectively, then the overall reliability R = 0.5 * 0.9 = 0.45. Increasing the reliability of the first part by 0.1 gives R = 0.6 * 0.9 = 0.54. Increasing the reliability of the second part by 0.1 gives R = 0.5 * 1 = 0.5.

Whether you are designing a data center or a manned space tourism vehicle, you will probably design in redundancy. (I don't know about the current generation of missiles. It's a subject I find more than a little scary.) How does the analysis change?

Let's take a simple model as illustrated in Figure 1.

[Click image to view at full size]

Figure 1.

Here we have a start and then a "step" composed of three redundant elements P1, P2, and P3. Then another step composed of Q1 and Q2 and then an end. The idea is that in every step, at least one component must be reliable. That is, the whole device "succeeds" provided at least one Pi and at least one Qj is operational.

We assume for this simple analysis that the components have independent failure probabilities. For the purposes of the puzzle, we assume that all the components for a given step have the same probability of working over the one-year period.

Warm-Up: Suppose each Pi has a probability of 0.4 of staying reliable over a year and each Qj has a probability of 0.6. Then the probability that the whole system will work the whole year is the probability that the P step works times the probability that the Q step works. The probability that the P step works is 1 - the probability that the P step fails = 1 - (0.6*0.6*0.6) = 0.784. The reason is that, for the P step to fail, all three P components must fail and they fail independently. By similar reasoning, the Q step will succeed with probability 1 - (0.4*0.4) = 0.84. Both steps will therefore succeed with probability 0.784 * 0.84, which is close to 0.66.

Notice that this model, as simple as it is, permits us to model hierarchical redundancy quite easily. For example, we could have the setting shown in Figure 2:

[Click image to view at full size]

Figure 2.

If each Ri has a probability of being reliable of 0.3, then the bottom path has a probability of being reliable of 1 - (0.74) = 0.76. Recall that the top path has a reliability probability of 0.66. But now, we can conceive of these two paths as having the form of Figure 3.

[Click image to view at full size]

Figure 3.

Here, U1 and B1 are redundant and U1 has a reliability of 0.66 and the bottom has a reliability of 0.76. So the system overall has a probability of being reliable of 1 - ((1-0.66)*(1-0.76)) = 0.92.

The main problem with this model is the assumption that failures are independent. Redundant hardware can sometimes be modeled by the independent failure model, but software failures rarely are. In one project I consulted on a few years ago, 10 physically independent devices all failed because of an unfortunate software bug. (So, if someone claims 99.999% reliability to you sometime, ask about software.)

Puzzles

That caveat aside, let's see if we can solve some interesting puzzles. Here the problem is to take a device that has 10 steps. Each step will be implemented by a component having a certain reliability and a certain number of redundant sisters of that component. We represent those as a pair: (probability each component will remain reliable the whole year, number of components). For example, for the two step P and Q steps of Figure 1, we would represent the steps as: (0.4, 3), (0.6, 2).

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task.
However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

Video

This month's Dr. Dobb's Journal

This month,
Dr. Dobb's Journal is devoted to mobile programming. We introduce you to Apple's new Swift programming language, discuss the perils of being the third-most-popular mobile platform, revisit SQLite on Android
, and much more!