IMACST: VOLUME 1 NUMBER 1 DECEMBER 2010

1857-7202/1008007

Abstract This paper is a brief tutorial exposition of some recent developments in the evaluation of the reliability of partially-redundant (k-out-of-n) systems. A novel contribution of the paper is that it identifies many practical examples of such systems, which are spread across a wide spectrum of engineering disciplines, including, in particular, the areas of computer and telecommunication engineering. Some formulas for the reliability and life expectancy of these systems are discussed in the case of equal-reliability components. Certain celebrated formulas are shown to be numerically unstable and totally useless in the case of large systems with high-reliability components. In fact, these formulas are highly susceptible to round-off errors and severely suffer from catastrophic cancellations. The paper also reviews how the Boole-Shannon expansion (or equivalently, the pivoting or factoring technique) is used to derive pertinent recursive relations, leading to a highly efficient algorithm for k-out-of-n reliability evaluation. This algorithm has a nice interpretation in terms of a regular Mason signal flow graph, which turns out to be (a) a reduced ordered binary decision diagram representing a monotone symmetric switching function, and (b) analogous to the minimal circuit realization of this function. In the worst case, the temporal and spatial complexities of this algorithm are shown to be quadratic and linear, respectively, in the number of system components. The paper lists some extensions and applications of this algorithm and compares it with a few related algorithms. The paper concludes with a quick consideration of some important issues in the area of k-out-of-n system reliability, including the issues of useful redundancy, criticality measures, and cost.

I. INTRODUCTION

The reliability R(t) of a component or a system is the probability that it will adequately perform its specified function for a specified semi-closed period of time (0, t] under specified environmental conditions. Implicit in this definition is the assumption that the pertinent component or system is initially good (i.e., at t = 0, R = 1.0). The field of system reliability deals with the relation between the reliability of a system and the reliabilities of its components. At the core of this field is the concept of the k-out-of-n:G(F) system (also called a partially-redundant system), which is a system of n components that functions (fails) if at least k out of its n components function (fail) [1-14]. Typically, the k-out-of-n system (for 1 ≤ k < n) is intended to provide useful redundancy, i.e., to have a reliability better than the simplex (single-component) reliability. This requires that the simplex reliability itself be good enough (based on the values of k and n). For the same simplex reliability, more useful redundancy is achieved the lower k is for a fixed n, and the higher n is for a fixed k.

The k-out-of-n system has many attractive features. It has a symmetric structure that has many convenient mathematical descriptions such as Boolean expressions, recursive equations, generating functions and so on. Nevertheless, for 1 < k < n, the k-out-of-n system lacks a graphical representation in the form of a network or a fault tree, unless replicated network edges or fault tree inputs are allowed. The k-out-of-n system plays a central role for the general class of coherent systems, as it can be used to approximate the reliability of such systems [15], and its reliability function is the steepest function among all coherent reliability functions of order n [8]. While virtually all nontrivial network reliability problems are known to be NP-hard for general networks [16], the regular structure of the k-out-of-n system allows the existence of simple efficient algorithms for its reliability analysis that are of quadratic-time linear-space complexity in the worst case [2-4, 6].

The k-out-of-n system is rediscovered in the literature from time to time without being identified as such (see, e.g., [17] and [18], where that system, as well as extensions and compositions thereof, appear in disguise). We hope that the present exposition may remedy this situation (at least partially) by familiarizing a large readership with an extensive number of examples of the k-out-of-n system. Though the results of this paper relate system reliability to component reliabilities, they are also applicable in the context of availability.

Partially-Redundant Systems: Examples, Reliability, and Life Expectancy
Ali Muhammad Ali Rushdi

This paper is intended to be a review or a tutorial exposition, and we hope that it will be of significant pedagogical utility. We strive for simplicity and clarity so that a non-expert reader can easily follow our discussion. Therefore, we deliberately include certain details and explanations of terminology that experts might consider obvious or even trivial. The absence of such details in some papers has occasionally led to misunderstanding and pitfalls. Mostly, we do not start our reliability analysis directly in the probability (algebraic) domain, but we initiate it in the switching (Boolean) domain, and then transform our results to the algebraic domain. We try to fully utilize the wealth of knowledge available for switching (Boolean) algebra [19]. We adopt the special symbols representing the operators of this algebra instead of borrowing symbols from real algebra, which could be confusing and misleading [20]. In particular, we use the symbol (∨) (rather than the symbol (+)) for the OR switching operation. For convenience, we will use $R(\mathbf{p})$ or $R(p)$ to denote the system reliability for non-identical component reliabilities $\mathbf{p}$ or a common component reliability $p$, i.e., we will make the time dependence of R implicit through the time dependence of $\mathbf{p}$ or $p$.

The organization of the remainder of this paper is as follows. Section II lists the assumptions, notation and nomenclature employed throughout the rest of the paper. Section III identifies many reliability systems related to the k-out-of-n system, while Section IV lists many practical examples of that system, which are spread across a wide spectrum of engineering disciplines, including, in particular, the areas of computer and telecommunication engineering. Closed-form formulas for k-out-of-n system reliability and life expectancy are given in Section V for the case of equal-reliability components. Section V also adds some pictorial insight to the topic, and points out a pitfall that went practically unnoticed in the reliability literature. Section VI discusses how the Boole-Shannon expansion (or equivalently, the pivoting or factoring technique) is employed to derive binary-recursive relations as well as an efficient iterative algorithm for computing the reliability of a k-out-of-n system with non-identical components. Section VII discusses complexity issues for this algorithm and compares it with other notable algorithms. Section VIII concludes the paper.

II. ASSUMPTIONS, NOTATION, AND NOMENCLATURE

A. Assumptions

- Both the components and the system are of two states, i.e., either good or failed.
- Component states are statistically independent.
- The system is a mission-type one, i.e., without repair.

B. Notation

[...] $p_{m+1} \ldots p_n]^T$.

$\mathbf{q}$ = a vector of n elements representing the component unreliabilities = $\mathbf{1.0} - \mathbf{p}$, where $\mathbf{1.0}$ is an n-tuple of real 1s.

$\mathbf{p}/p_m$ = a vector of (n−1) elements obtained by omitting the mth element of vector $\mathbf{p}$.

$\mathbf{p}|j_m$ = a vector of n elements obtained by setting the mth element of $\mathbf{p}$ to j, which is either 0 or 1.

$R(\mathbf{p})$, $U(\mathbf{p})$ = reliability and unreliability of the system. Both $R(\mathbf{p})$ and $U(\mathbf{p})$ are real values in the closed real interval [0.0, 1.0]. $R(\mathbf{p}) = \Pr[S(\mathbf{X}) = 1] = E[S(\mathbf{X})]$. $U(\mathbf{p}) = 1.0 - R(\mathbf{p})$.

$R(k, n, \mathbf{p})$ = reliability of a k-out-of-n:G system of component reliabilities $\mathbf{p}$, 0 ≤ k ≤ (n+1).

$R(k, n, p)$ = the value of $R(k, n, \mathbf{p})$ when component reliabilities are all equal to a common value p.

$U(k, n, \mathbf{p})$ = unreliability of a k-out-of-n:G system of component reliabilities $\mathbf{p}$, 0 ≤ k ≤ (n+1).

$U'(k, n, \mathbf{q})$ = unreliability of a k-out-of-n:F system of component unreliabilities $\mathbf{q}$, 0 ≤ k ≤ (n+1). This is the complemented dual of $R(k, n, \mathbf{p})$ in the sense that the successes of the k-out-of-n:G system and the k-out-of-n:F system are dual switching functions.

$$U(k, n, \mathbf{p}) = U'(n-k+1, n, \mathbf{q}). \tag{1}$$

c(k, n) = the binomial (combinatorial) coefficient = the number of ways of choosing k objects from a set of n objects, when repetition is not allowed and order does not matter. Binomial coefficients satisfy Pascal's identity


$$c(k, n) = c(k, n-1) + c(k-1, n-1), \quad 0 < k < n, \tag{2}$$

together with the boundary conditions

c(k, n) = 1, (k = 0 or k = n) and n > 0. (3)
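As a minimal sketch (in Python, with the function name `c` chosen here only to mirror the notation above), the coefficients can be tabulated row by row from Pascal's identity (2) and the boundary conditions (3). The row-by-row tabulation is an illustrative choice, not a prescribed method.

```python
def c(k: int, n: int) -> int:
    """Binomial coefficient c(k, n): number of ways of choosing k objects from n.

    Built row by row from Pascal's identity (2) with boundary values of 1.
    The argument order (k first, then n) follows the paper's notation.
    """
    if k < 0 or k > n:
        return 0
    row = [1]  # the single entry for n = 0 (assumed here as the starting row)
    for _ in range(n):
        # Interior entries: c(j, n) = c(j, n-1) + c(j-1, n-1); both ends are 1.
        row = [1] + [row[j] + row[j - 1] for j in range(1, len(row))] + [1]
    return row[k]


print([c(k, 4) for k in range(5)])
```

Only integer additions are used, so the tabulated values are exact for any n.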

E/F = the set difference of sets E and F, E/F = {j | j ∈ E, j ∉ F}.

|Y| = cardinality of the finite set Y = the number of elements in the set Y.

⌈y⌉ = the ceiling of the real number y = the smallest integer greater than or equal to y.

C. Nomenclature

Duality: strictly speaking, the dual of a switching function is obtained by complementing the function and all its switching arguments (inverting both output and inputs) [6]. In the reliability literature, "duality" is sometimes loosely used to indicate "similarity," "analogy," or "being mirror images."

Monotone: a monotone system is one whose reliability function is a non-decreasing function in each component reliability, i.e.,

$$R(\mathbf{p}\,|\,1_m) - R(\mathbf{p}\,|\,0_m) = \partial R(\mathbf{p}) / \partial p_m \ge 0.0, \quad 1 \le m \le n. \tag{4}$$

Relevant: component number m is relevant to the system if there exists a valid value for $\mathbf{p}$ such that $\partial R(\mathbf{p}) / \partial p_m \neq 0.0$. Relevancy means that $R(\mathbf{p})$ is not vacuous in (independent of) $p_m$.

Coherent: a coherent system is a monotone system whose components are all relevant [21]. If the reliability function R(p) of a coherent system with equal-reliability components is plotted versus p within the square 0.0 ≤ p ≤ 1.0, 0.0 ≤ R(p) ≤ 1.0, then it satisfies R(0.0) = 0.0 and R(1.0) = 1.0, and exhibits an S-shape, i.e., the curve R(p) versus p is monotonically non-decreasing and if it crosses the diagonal (p versus p), it does so only once and from below [8, 22].

k-out-of-n:G system: a system that is good if and only if (iff) at least k out of its n components are good.

k-out-of-n:F system: a system that is failed iff at least k out of its n components are failed.

k-out-of-n (partially-redundant) system: a collective name for k-out-of-n:G and k-out-of-n:F systems; a k-out-of-n:F system is equivalent to an (n−k+1)-out-of-n:G system (as indicated by equation (1)). A k-out-of-n system is a coherent system in the practical case of 1 ≤ k ≤ n, while it is only monotone for the hypothetical or fictitious limiting cases of k = 0 and k = (n+1).

s-p complex: a coherent system is series-parallel (s-p) complex iff it has no components in series or in parallel [23]. A k-out-of-n system is s-p complex for 1 < k < n, and hence cannot be treated (even partially) by series-parallel reductions.

Pivoting: by pivoting on component number m, the system reliability R(p) can be written as

$$R(\mathbf{p}) = q_m \, R(\mathbf{p}\,|\,0_m) + p_m \, R(\mathbf{p}\,|\,1_m), \tag{5}$$

where R(p | 0m) and R(p | 1m) are the reliabilities of the minors or subsystems of the original system with respect to component m. Pivoting is also called factoring and is equivalent to the total probability theorem [11] in the algebraic domain or to the Boole-Shannon expansion [19] in the Boolean domain.
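The pivoting relation (5) can be checked numerically on a small case. The following sketch is illustrative only: the helper names `reliability` and `two_out_of_three` and the component reliabilities 0.9, 0.8, 0.7 are assumptions made here, and reliability is computed by brute-force enumeration of component states, which is practical only for small n.

```python
from itertools import product


def reliability(success, p):
    """R(p) = E[S(X)], computed by enumerating all 2^n component states."""
    total = 0.0
    for x in product((0, 1), repeat=len(p)):
        prob = 1.0
        for xi, pi in zip(x, p):
            prob *= pi if xi else 1.0 - pi  # independent components
        total += prob * success(x)
    return total


def two_out_of_three(x):
    """Success function of a 2-out-of-3:G system."""
    return 1 if sum(x) >= 2 else 0


p = [0.9, 0.8, 0.7]                  # hypothetical component reliabilities
m = 0                                # pivot on the first component (0-based)
p_good = p[:m] + [1.0] + p[m + 1:]   # p | 1_m
p_bad = p[:m] + [0.0] + p[m + 1:]    # p | 0_m

lhs = reliability(two_out_of_three, p)
rhs = ((1.0 - p[m]) * reliability(two_out_of_three, p_bad)
       + p[m] * reliability(two_out_of_three, p_good))
print(lhs, rhs)
```

Both sides agree up to round-off, as expected from the total probability theorem.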

III. RELATED SYSTEMS

The k-out-of-n:G system covers many interesting or limiting-case systems as special cases [6]. These include the fictitious perfectly reliable system (k = 0), the parallel system (k = 1), the voting or N-modular redundancy (NMR) system (k = ⌈(n+1)/2⌉), the fail-safe system (k = n − 1), the series system (k = n), and the fictitious totally unreliable system (k = n+1). Note that as k increases (for a fixed n) from 0 to (n+1), the usefulness of the k-out-of-n:G system declines and finally vanishes. For 1 < k < n, the k-out-of-n system is sometimes called a partially-redundant system [6, 24], as it lies somewhere between the extreme cases of the (non-redundant) series system and the (fully-redundant) parallel system. In this paper, we will view these two extremes of zero and total redundancies as limiting cases of partial redundancy, and hence consider a k-out-of-n system as synonymous with a partially-redundant system. The k-out-of-n:G system and the k-out-of-n:F system are dual or mirror images of one another; the k-out-of-n:G system is exactly equivalent to the (n−k+1)-out-of-n:F system (as formally indicated by equation (1) above).

The k-out-of-n:G system is a subclass of some important systems. Besides being a coherent system for the values 1 ≤ k ≤ n, it is s-p complex for 1 < k < n. It is also

a) a special case {r = n} of the consecutive-(n−k+1)-out-of-r-from-n:F system [25],

b) a limiting case {l = n} of the generally non-coherent k-to-l-out-of-n system [10, 26-28], which is useful in approximating the class of non-coherent systems [15], but does not precisely exhaust or cover such a class (contrary to a claim made in [28]),

c) a binary-state limiting case of the multi-state k-out-of-n system model [29].

An important generalization of the k-out-of-n:G(F) system is the threshold system, which need be neither symmetric nor coherent. A threshold system is a system whose success is a threshold (linearly-separable) switching function in the successes of its components [30]. This system is successful if and only if the weighted arithmetic sum of its component successes is equal to or exceeds a certain threshold. Therefore, a threshold system is characterized by (n+1) coefficients, namely, its threshold and the set of its n component weights (which are not necessarily unique). An important special case of the threshold system is the weighted k-out-of-n:G system, which is a coherent non-symmetric system of strictly positive weights and a threshold equal to k [12, 31]. If, further, all the weights are equal to 1, the weighted k-out-of-n:G system reduces to the ordinary k-out-of-n:G system. Therefore, the k-out-of-n:G system can be defined as a threshold system with a common positive weight for its components and a threshold equal to k multiplied by this common weight [30].
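The relation between the threshold system and the k-out-of-n:G system can be sketched as follows. The function names and the weights (4, 2, 1) with threshold 5 are hypothetical, chosen here only to illustrate the definitions above.

```python
from typing import Sequence


def threshold_success(x: Sequence[int], weights: Sequence[float], threshold: float) -> bool:
    """True iff the weighted arithmetic sum of component successes reaches the threshold."""
    return sum(w * xi for w, xi in zip(weights, x)) >= threshold


def k_out_of_n_success(x: Sequence[int], k: int) -> bool:
    """Ordinary k-out-of-n:G system: unit weights and a threshold equal to k."""
    return threshold_success(x, [1] * len(x), k)


# Hypothetical weighted system: weights 4, 2, 1 and threshold 5.
weights = [4, 2, 1]
print(threshold_success((1, 0, 1), weights, 5))  # weighted sum 4 + 1 = 5
print(threshold_success((0, 1, 1), weights, 5))  # weighted sum 2 + 1 = 3
print(k_out_of_n_success((1, 1, 0), 2))
```

With unequal weights, the order of the components matters, which is why the weighted system is non-symmetric while the unit-weight case is symmetric.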

A system that is closely related to the k-out-of-n:G(F) system is the consecutive k-out-of-n:G(F) system, which functions (fails) if at least k consecutive components function (fail) [32, 33]. The k-out-of-n:G(F) system and the consecutive k-out-of-n:G(F) system are not generally comparable since neither of them is a subclass of the other (except when they overlap at their limiting cases k ≤ 1 and k ≥ n). The k-out-of-n system is a threshold system, but generally the consecutive-k-out-of-n system is not. The k-out-of-n system is structurally symmetric, i.e., the order of its components is immaterial, while the set of components in a consecutive k-out-of-n system is an ordered one (either on a line or on a circle, corresponding to linear and circular versions of the system). The failure switching function of the consecutive k-out-of-n:F system implies that of the k-out-of-n:F system, and hence the reliability of the latter system is a lower bound on that of the former system.
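The lower-bound relation stated above can be illustrated by brute-force enumeration for a small linear system. This is a sketch under assumed values (n = 4, k = 2, and a common component reliability of 0.9); the function names are introduced here for illustration only.

```python
from itertools import product


def state_probability(x, p):
    """Probability of state vector x (1 = good) for independent components of reliability p."""
    good = sum(x)
    return p ** good * (1.0 - p) ** (len(x) - good)


def fails_k_out_of_n(x, k):
    """k-out-of-n:F system: failed iff at least k components are failed."""
    return len(x) - sum(x) >= k


def fails_consecutive(x, k):
    """Linear consecutive-k-out-of-n:F system: failed iff at least k consecutive components are failed."""
    run = 0
    for xi in x:
        run = 0 if xi else run + 1
        if run >= k:
            return True
    return False


def reliability(fails, n, k, p):
    """Sum the probabilities of all states in which the system has not failed."""
    return sum(state_probability(x, p)
               for x in product((0, 1), repeat=n) if not fails(x, k))


n, k, p = 4, 2, 0.9  # hypothetical values
r_ordinary = reliability(fails_k_out_of_n, n, k, p)
r_consecutive = reliability(fails_consecutive, n, k, p)
print(r_ordinary, r_consecutive)
```

Every state that fails the consecutive system also fails the ordinary one, so the first printed value cannot exceed the second.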

Yet another system that is also closely related to the k-out-of-n:G(F) system is the (n, f, k) system. This system is defined in [34] (based on a proposal in [35]) to consist of n components ordered in a line or a cycle, such that the system fails if, and only if, there exist at least f failed components or at least k consecutive failed components. This system reduces to the k-out-of-n:F system for f ≤ k (in that case k consecutive failures already imply at least f failures, so the system behaves as an f-out-of-n:F system). A generalization of the (n, f, k) system is one with weighted components [36].

Occasionally, we might have a system consisting of certain subsystems, which in turn consist of lower-level subsystems, and so on, until we reach some innermost subsystems that consist of final components. If the relation between every (sub)system and its constituent components or subsystems is a k-out-of-n relation, the overall system is a k-out-of-n composition [8, 22]. Under appropriate conditions, the k-out-of-n composition can serve to achieve a dramatic increase in reliability by constructing ultra-highly-reliable systems out of ordinary but somewhat good components. In such a composition, it is preferable to locate the more useful redundancies in the lower or innermost levels of the composition.

A system that would have been related to the k-out-of-n:G(F) system is the so called strict consecutive-k-out-of-n:G(F) system (or simply, strict system) proposed in [37]. The original definition of this system suffered from ambiguity or inconsistency. Rushdi [38, 39] attempted to produce well-defined versions of this system, either by accounting for statistical dependencies among its components, or by employing conditional probabilities. However, these versions lacked the claimed utility of the original system. Later, Rushdi [40] published his concerns about the strict system demanding a unique, precise, and consistent definition of it, enquiring about the real nature of some of its states, questioning whether it is coherent or non-coherent, and asking for an example showing its utility as a model for a real-life system. Evans (then Editor of the IEEE Transactions on Reliability) [40] responded that "the problem as originally (and implied) [37] is an example of a collection of words that appears to make sense, but is actually self contradicting." Fortunately, the questions posed by Rushdi [40] prevented the appearance of more algorithms that represent correct but irrelevant mathematics. These questions are hailed by Hwang [41] as an example of a critical review of literature that prunes overgrown branches in order to keep the growth in control. Such a pruning is always necessary when enthusiasm of expansion overextends its usefulness [41].

IV. EXAMPLES OF PARTIAL REDUNDANCY

A model is a useful representation that captures the essence of a real system and behaves sufficiently like it that conclusions can be drawn from the model's behavior to aid in making prudent decisions about the real system. Situations in which the k-out-of-n:G(F) system serves as a useful model are frequently encountered in engineering practice and include the following examples:

a. A piece of stranded wire with n strands in which at least k strands are necessary to pass the required current behaves as a k-out-of-n:G system. The same concept generalizes to applications involving supply-type components with identical fixed ratings for their capacity, flow, throughput, strength or the like, such that system success is achieved when a minimum supply is met, or when a certain threshold is exceeded (see, e.g., [24, 30]).

b. A three-engine airplane which can stay in the air if and only if at least two of its three engines are functioning is a 2-out-of-3:G system (also called a 2-out-of-3:F system or a triple modular redundancy (TMR) system) [13]. A space vehicle requiring three out of its four main engines to operate in order to achieve orbit is a 3-out-of-4:G and also a 2-out-of-4:F system [9]. The common practice of having a spare tire in a 4-wheel car constitutes a 4-out-of-5:G and also a 2-out-of-5:F system. All these are examples of a fail-safe system, i.e., a system that tolerates the failure of one (and only one) component, since such a failure reduces the system to a series system, which is still a working system (albeit with no more redundancy, and hence no capability to withstand any further failure). The idea of a fail-safe system works well provided the assumption of statistical independence among components holds. A car driver cannot rely on a single spare tire on a rough unpaved road that might result in a double flat tire (simultaneous or common-cause failures). Thanks to a merciful Providence, the fail-safe concept is entrenched in many biological systems. For example, a human being can survive the failure of one of his two kidneys, which constitute a 1-out-of-2:G system.

c. [...] used in the realization of ultra-reliable systems that are based on multichannel computations [43]. Likewise, voting is commonly used in faulty distributed computing systems to achieve mutual exclusion among groups of isolated nodes.

d. A bridge with n main supports that can survive an earthquake if and only if at least k supports remain intact is approximately modeled as a k-out-of-n:G system. Here, the modeling is qualitative rather than quantitative, since a bridge is usually not structurally symmetric with respect to its supports, while a k-out-of-n system is structurally symmetric with respect to its components.

e. In a binary communication channel, an error-correcting code might be employed in the transmission of n-bit code-words [44]. If the code is capable of correcting up to k bit errors, word transmission becomes a (k+1)-out-of-n:F and also a (n-k)-out-of-n:G system. A code lacking any error-correcting capability (k = 0) {such as the BCD code without any parity check} is a series system, while a code of a Hamming distance 3 (k = 1) {such as the famous (7,4,1) Hamming code} is a fail-safe system.

f. A bus-structured multiprocessor computer system consists of n processors sharing m memory units via b common buses. If this system is required to operate in MIMD mode (i.e., with Multiple Instruction streams and Multiple Data streams), then it is logically equivalent to the series connection of a k1-out-of-n:G system, a k2-out-of-m:G system, and a 1-out-of-b:G system, where k1 ≥ 2 and k2 ≥ 1, with the precise values of k1 and k2 being determined by system requirements [45].

g. In the majority voting (MV) algorithm for managing replicated data, out of n copies of an object, ⌈(n+1)/2⌉ copies must be up to form a quorum [18]. This is a ⌈(n+1)/2⌉-out-of-n:G system. A generalization of this algorithm, the hierarchical quorum consensus (HQC) algorithm, is a multilevel system in which the availability of a group at level i, expressed in terms of the availability of its subgroups at level (i+1), constitutes a ki-out-of-ni:G system. This means that HQC is nothing but an iterative or repeated composition (see, e.g., [8, pp. 202-203] or [22, pp. 203-206]) of the MV structure. Such a composition improves availability (making HQC superior to MV) provided the basic component availability is higher than a certain value [8, 22, 46].

h. The k-out-of-n model is useful in the study of multistage interconnection networks [47]. For example, the terminal reliability of a Gamma network [48] is represented by a ladder network (of unreliable nodes and perfect links) whose behavior can be approximated by that of a k-out-of-n system. Specifically, the ladder network in [48, Fig. 5] and in [47, Fig. 1] is logically equivalent to a series connection of two components with a structure that has 8 cut sets of two components each. The reliability of a 2-out-of-6:F system is a lower bound for the reliability of this structure.

i. The k-out-of-n model is used in the petro-chemical industry in the evaluation of the life of furnace systems and in decision making on when to replace the furnaces [49, 50]. The furnaces are considered to be systems while the tubes in the furnaces are the components of the corresponding systems. A tube is designed to provide an environment for methane, steam, and a catalyst to react at a high temperature to produce hydrogen. A tube is considered failed when it is unable to perform its intended function any more (for example, when it is ruptured or pinched). The function of a furnace is to produce hydrogen at a certain output, temperature, pressure, and efficiency [49]. If too many tubes (k or more out of n) have failed, the furnace's proper operation is affected, and it is considered failed. This is a k-out-of-n:F system.

j. In mining operations, a shovel-truck system in an open mine usually consists of a shovel and a fleet of n trucks (say 20 trucks). The system functions properly if at least k trucks (say 15 trucks) and the shovel are good. This is a series system of 2 subsystems: the shovel and a 15-out-of-20:G system. If the shovel is assumed to be perfectly reliable, the system becomes simply a 15-out-of-20:G system. In general, the k-out-of-n concept is useful in modeling many types of fleets of vehicles, including aircrafts, ships, buses, and trains.

k. In a perfect secret sharing scheme (PSSS), a secret is divided into shares that are distributed among members. The secret can be determined, i.e., the system works, when k or more distinct members collaborate, since only k shares are required to reconstruct the secret. In the context of perfect secret sharing, the secret can be reconstructed by any k or more members, but (k−1) or fewer members cannot reveal anything about the secret [51]. The PSSS is nothing but a k-out-of-n:G system, in the sense that the reliability of this system expresses the probability of constructing the secret as a function of the probabilities of member contributions to such a construction.
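Several of the examples above can be evaluated with a few lines of code once a common component reliability is assumed. The sketch below uses the probability of at least k successes in n independent trials; the function name and all numerical values (0.9, 0.99, 0.95) are hypothetical and serve only as illustrations.

```python
from math import comb


def r_k_out_of_n(k: int, n: int, p: float) -> float:
    """Probability of at least k successes in n independent trials, each succeeding with probability p."""
    return sum(comb(n, m) * p**m * (1.0 - p) ** (n - m) for m in range(k, n + 1))


# Example b: car with a spare tire, a 4-out-of-5:G system (assumed tire reliability 0.9).
tire = r_k_out_of_n(4, 5, 0.9)

# Example e: 7-bit code-word with single-error correction, a 6-out-of-7:G system
# (assumed probability 0.99 that a bit is received correctly).
word = r_k_out_of_n(6, 7, 0.99)

# Example j: shovel in series with a 15-out-of-20:G fleet of trucks
# (assumed shovel reliability 0.95 and truck reliability 0.9).
shovel = 0.95
fleet = shovel * r_k_out_of_n(15, 20, 0.9)

print(tire, word, fleet)
```

The series connection in example j is modeled by multiplying subsystem reliabilities, which relies on the independence assumption of Section II.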

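V. SYSTEMS WITH COMPONENTS OF EQUAL RELIABILITIES

A. Reliability

The reliability of a k-out-of-n:G system (with independent components of identical reliabilities p) equals the probability of at least k successes in n Bernoulli trials, and hence it is given by [6]:

$$R(k, n, p) = \sum_{m=k}^{n} c(m, n)\, p^{m} (1.0 - p)^{n-m}, \tag{6}$$

or, equivalently, by the polynomial in p

$$R(k, n, p) = \sum_{m=k}^{n} (-1)^{m-k}\, c(k-1, m-1)\, c(m, n)\, p^{m}. \tag{7}$$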

Formula (7) is considered more suitable than formula (6) for hand calculation [11, 52], because (7) expresses R(k, n, p) as a polynomial of p only, while (6) involves powers of both p and (1.0 − p). In fact, formula (7) seems more convenient for the symbolic differentiation needed to express the instantaneous hazard rate (−(dR(t)/dt) / R(t)), and for the symbolic integration required to find the life expectancy according to the forthcoming equation (8).
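As a small worked check, take formula (6) to be the binomial tail sum and formula (7) to be its expansion as an alternating-sign polynomial in p (the form consistent with equation (10)). For the 2-out-of-3:G system, the two forms then read:

```latex
\begin{align*}
\text{(6):}\quad R(2,3,p) &= c(2,3)\,p^{2}(1.0-p) + c(3,3)\,p^{3} = 3p^{2}(1.0-p) + p^{3},\\
\text{(7):}\quad R(2,3,p) &= c(1,1)\,c(2,3)\,p^{2} - c(1,2)\,c(3,3)\,p^{3} = 3p^{2} - 2p^{3}.
\end{align*}
```

Expanding the first line gives the second, and at p = 0.9 both yield 3(0.81) − 2(0.729) = 0.972. The second form is a plain polynomial in p, which is what makes it convenient for term-by-term differentiation and integration.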

B. Life Expectancy The life expectancy or Mean Time To Failure (MTTF) of a general non-repairable or mission-type system is given (under appropriate assumptions on the behavior of R(t) as t tends to infinity) by

$$T = \mathrm{MTTF} = \int_{0}^{\infty} R(t)\, dt. \tag{8}$$

For a k-out-of-n:G system having components subject to a common constant failure rate (CFR) λ, the component reliability is

$$p(t) = e^{-\lambda t}, \quad t \ge 0, \tag{9}$$

so that the MTTF of a single component is (1/λ), while the MTTF of the system itself is obtained from equations (7)-(9) as

$$T = \int_{0}^{\infty} \sum_{m=k}^{n} (-1)^{m-k}\, c(k-1, m-1)\, c(m, n)\, e^{-m \lambda t}\, dt = \frac{1}{\lambda} \sum_{m=k}^{n} \frac{(-1)^{m-k}}{m}\, c(k-1, m-1)\, c(m, n). \tag{10}$$

We call the dimensionless product (λT) the normalized life expectancy of the system, since it is the quotient of the actual life expectancy of the system by the life expectancy of a single component. A simpler expression for λT can be obtained from the Markovian state diagram for a k-out-of-n:G system [53], namely

$$\lambda T = \sum_{m=k}^{n} \frac{1}{m}. \tag{11}$$

C. Computational Accuracy

Figure 1 demonstrates our computational experience with formulas (6) and (7). Specifically, Fig. 1 presents the reliability R(k, n, p) versus n for a k-out-of-n:G system with k = 20 and a component reliability p = 0.9. In Figs. 1(a) and 1(b), the system reliability R(k, n, p) is computed via formulas (6) and (7), respectively, for n varying from 20 to 80. While formula (6) remains stable, with values of R(20, n, 0.9) approximately equal to 1.0 as n grows, formula (7) produces results that grow to extremely high values (up to the order of 10^18, drastically violating the restriction R ≤ 1.0). As a matter of fact, the two formulas yield equal or almost equal results up to about n = 40 before the results produced by formula (7) start to deviate and overflow, as demonstrated in Fig. 1(c). To quantify the comparison of (6) versus (7), numerical values of the computed reliability are summarized in Table I, for values of n between 20 and 80.

Figure 1. Reliability R(20, n, 0.9) versus the number of system components n for a 20-out-of-n:G system, as computed via: (a) Equation (6) for n between 20 and 80, (b) Equation (7) for n between 20 and 80, and (c) Equations (6) and (7) overlapped for n between 20 and 45.
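The kind of experiment reported in Fig. 1 can be sketched as follows. The sketch assumes that formula (6) is the binomial tail sum and formula (7) the alternating-sign polynomial in p; the function names are introduced here, and exact rational arithmetic (Python's `fractions.Fraction`) is used only as a reference. The printed double-precision values depend on the platform's floating-point behavior and need not match the figures digit for digit.

```python
from fractions import Fraction
from math import comb


def r_additive(k, n, p):
    """Formula (6): a sum of non-negative binomial terms."""
    return sum(comb(n, m) * p**m * (1 - p) ** (n - m) for m in range(k, n + 1))


def r_alternating(k, n, p):
    """Formula (7): an alternating-sign polynomial in p (for 1 <= k <= n)."""
    return sum((-1) ** (m - k) * comb(m - 1, k - 1) * comb(n, m) * p**m
               for m in range(k, n + 1))


k = 20
for n in (20, 40, 60, 80):
    exact = r_additive(k, n, Fraction(9, 10))  # exact rational reference
    print(n, float(exact), r_additive(k, n, 0.9), r_alternating(k, n, 0.9))
```

In exact arithmetic the two formulas coincide; in double precision the individual terms of (7) become many orders of magnitude larger than the result, so their rounding errors swamp the sum.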

Similarly, Figure 2 illustrates the computational accuracy of formulas (11) and (10). In Figs. 2(a) and 2(b), the normalized life expectancy (λT) is computed via formulas (11) and (10), respectively, for n varying from 20 to 80. Like formula (7), formula (10) quickly goes to very high erroneous values as n grows, while formula (11) continues to produce accurate values. Figure 2(c) verifies that the results of the two life expectancy formulas stay almost equal up to about n = 40 before the results produced by formula (10) start to deviate and overflow. To quantify the comparison of (11) versus (10), numerical values of the normalized life expectancy are summarized in Table I, for values of n between 20 and 80.


[Figure 2, panels (a), (b), and (c): normalized life expectancy λT versus the number of system components n; curves labeled Equation (11) and Equation (10).]

Figure 2. Normalized life expectancy (λT) versus the number of system components n for a 20-out-of-n:G system, as computed via: (a) Equation (11) for n between 20 and 80, (b) Equation (10) for n between 20 and 80, and (c) Equations (11) and (10) overlapped for n between 20 and 45.
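A corresponding sketch for the normalized life expectancy is given below. It assumes that formula (10), after normalization by the component MTTF, is the alternating sum of c(k−1, m−1) c(m, n)/m and that formula (11) is the sum of 1/m for m from k to n. The function names and the `one` parameter (used to switch between double precision and exact rational arithmetic) are hypothetical.

```python
from fractions import Fraction
from math import comb


def life_alternating(k, n, one=1.0):
    """Formula (10), normalized: sum of (-1)^(m-k) c(k-1, m-1) c(m, n) / m."""
    return sum((-1) ** (m - k) * comb(m - 1, k - 1) * comb(n, m) * (one / m)
               for m in range(k, n + 1))


def life_additive(k, n, one=1.0):
    """Formula (11), normalized: sum of 1/m for m = k, ..., n."""
    return sum(one / m for m in range(k, n + 1))


for n in (20, 40, 60, 80):
    exact = life_additive(20, n, Fraction(1))  # exact rational reference
    print(n, float(exact), life_additive(20, n), life_alternating(20, n))
```

Passing `one=Fraction(1)` to both functions shows that they agree exactly, so any disagreement in double precision is purely a round-off effect.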

In passing, we need to alert the reader to a smoothing effect exhibited in the small-size drawings of Figs. 1(a) and 2(a). Though these two figures represent numerically stable computations, their initial values at n = 20 are indistinguishable from zero though they are not zeroes (albeit somewhat small). According to Table I, the value of R(20, 20, 0.9) is (0.9)^20 = 0.121576654 or approximately 0.1216, while the value of λT for n = 20 is (1.0/20) = 0.05. For n > 40, the graph of Fig. 1(a) is not distinguishable from 1.0, which is typical for highly- or ultra-highly-reliable systems. Table I is no better than Fig. 1 in this case, since it reports values of 1.0000 for R(20, n, 0.9) and n > 40. Better accuracy or significance is attained for highly- or ultra-highly-reliable systems if we report their unreliabilities in floating-point arithmetic instead of their reliabilities in fixed-point arithmetic. This point is clarified by Table II, which presents some k-out-of-n reliabilities in fixed-point arithmetic together with their complements or corresponding unreliabilities in floating-point arithmetic. Certainly a value of R(20, 45, 0.9) expressed as (1.0 − 3.6637 × 10^-15) is more informative and easier to grasp than when expressed as (0.999999999999996).
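One way to obtain an unreliability with its significant digits intact is to sum the failure-side binomial terms directly rather than to form 1.0 − R. The sketch below illustrates this under the equal-reliability assumption; the function name is introduced here. Summing directly avoids the subtraction 1.0 − R, which in double precision cannot resolve differences much below 10^-16, so the trailing digits of the two approaches may differ.

```python
from math import comb


def unreliability(k, n, p):
    """U(k, n, p): probability of at most k-1 successes in n trials, summed directly."""
    q = 1.0 - p
    return sum(comb(n, m) * p**m * q ** (n - m) for m in range(0, k))


u = unreliability(20, 45, 0.9)
print(f"U(20, 45, 0.9) = {u:.4e}")
```

All terms in this sum are non-negative, so no cancellation occurs and the relative error stays near machine precision even when the result is extremely small.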

Despite the fact that formulas (7) and (10) are frequently cited in the reliability literature, and despite their suitability for hand calculations, they become totally worthless from the numerical point of view when dealing with large systems of high component reliabilities. These formulas are highly susceptible to round-off errors and suffer severely from catastrophic cancellations, and therefore they produce the highly erratic plots in Figures 1(b) and 2(b). In retrospect, the undesirable behavior of formulas (7) and (10) should have been anticipated since they are essentially outcomes of the numerically-notorious inclusion-exclusion principle [6]. By contrast, formulas (6) and (11) are purely additive formulas and are only minimally sensitive to round-off errors.

D. Pictorial Representation

The unreliability U(k, n, p) of a k-out-of-n:G system is the probability of at most (k−1) successes in n Bernoulli trials, and hence it equals the Cumulative Distribution Function (CDF) B(k−1, n, p) of the binomial distribution. Figure 3 shows a very regular Mason signal flow graph (SFG) that illustrates the computation of B(i, j, p) [6, 54, 55]. Note that each diagonal arrow has a transmission equal to p, while each horizontal arrow carries a transmission equal to q = 1.0 − p. There are two types of nodes: (a) source nodes of known values, which are either black or white (a black node has a value of 1.0 while a white node has a value of 0.0), and (b) non-source nodes drawn as shaded ones, which include (at least) one sink node whose value is the final result sought. Figure 3 is therefore a pictorial representation of the computation of U(i+1, j, p). It can be made to represent the computation of R(i+1, j, p) by either interchanging the colors (and values) of the source nodes, or swapping the symbols p and q [6]. If, further, algebraic multiplication and addition are replaced by their logical counterparts (ANDing and ORing), Fig. 3 can also be made to represent the computation of the success function S(i+1, j, X) [6], and can then be identified as a Reduced Ordered Binary Decision Diagram (ROBDD), which is well

known to be the state-of-the art data structure for encoding and manipulating switching functions [10, 56, 57]. Figure 3 is also analogous to tank circuits, which are minimal pass-network realizations of symmetric switching functions [58]. Moreover, Fig. 3 has certain similarities and minor dissimilarities with Pascal's triangle that reflect the similarities and dissimilarities of the forthcoming recursive equations (15a) and the boundary conditions (15c&d) with their counterparts (2) and (3). Fig. 3 has also good similarities and minor dissimilarities with the signal flow graph that illustrates the computation of the probability mass function (pmf) of the generalized binomial distribution [54, 55]. VI. RELIABILITY ANALYSIS FOR K-OUT-OF-N SYSTEMS WITH NON-IDENTICAL COMPONENTS There are three general classes of methods for system reliability analysis, namely; (1) the inclusion-exclusion method, (2) the methods of disjoint products, and (3) the Boole-Shannon expansion or equivalently pivoting (pivotal decomposition or factoring) [6]. These classes of methods are applicable (and have been extensively applied) to the reliability analysis of k-out-of-n systems. The inclusion-exclusion principle was the readily-available tool from classical probability theory [11], and naturally became the basis of early attempts for evaluating the reliability of k-out-of-n systems with non-identical components (See, e. g., [6, 59]). Though the computational disadvantages of this principle are now well known [6], new methods are still being forwarded that are reproducing its results, sometimes without identifying them as such. For example, the recent direct method in [60] reproduces (in disguise) the improved inclusion-exclusion formula (5.29) of [6] as its main result (1). The fact that these two formulas are identical becomes clear if one identifies the tabulated coefficient bk(m) of [60] as a shifted binomial coefficient c(k1, m+k2). 
Nevertheless, [60] contributes a novel elegant proof via mathematical induction for the inclusion-exclusion result.

The study of system reliability has been classically achieved in terms of purely real-algebraic structure functions [8]. An equivalent approach is more insightful and less error-prone for the methods of disjoint products or the Boole-Shannon expansion. This approach is a logical formulation that utilizes the isomorphism between the algebra of events (set algebra) and the bivalent or 2-valued Boolean algebra (switching algebra). In this approach, one expresses the system success as a logical (switching or Boolean) function of the component successes. Next, one moves from the Boolean domain to the probability domain so as to obtain the system reliability as a function of the component reliabilities. This is facilitated by converting the switching (Boolean) expression for the system success into a probability-ready expression (PRE), i.e., into an expression that is directly convertible, on a one-to-one basis, to a probability expression. Note that in a PRE: (a) all ORed terms (products) are disjoint, and (b) all ANDed alterms (sums) are statistically independent. The conversion from a PRE to a probability expression is trivially achieved by replacing Boolean variables by their expectations, AND operations by multiplications, and OR operations by additions [61-63]. The most powerful class of algorithms producing PREs are algorithms based on the use of the Boole-Shannon expansion [61-63]. In the following subsections, we describe how the Boole-Shannon expansion is used to derive pertinent recursive relations, leading to a highly-efficient algorithm for k-out-of-n reliability evaluation.
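As a small worked illustration (not taken from the cited references), consider a 2-out-of-3:G system with component successes X_1, X_2, X_3 and reliabilities p_i = 1 − q_i, which are symbols assumed here for the example. Expanding about X_3 and then rewriting the remaining OR as a disjoint sum yields a PRE that converts term by term into a probability expression:

```latex
\begin{align*}
S &= X_1 X_2 \vee X_1 X_3 \vee X_2 X_3 \\
  &= \bar{X}_3\,(X_1 X_2) \;\vee\; X_3\,(X_1 \vee \bar{X}_1 X_2), \\
R &= q_3\, p_1 p_2 + p_3\,(p_1 + q_1 p_2).
\end{align*}
```

For equal reliabilities p_1 = p_2 = p_3 = p, the last line reduces to the familiar 3p² − 2p³.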

Figure 3. A Mason signal flow graph that illustrates the computation of the CDF B(i, j, p) of the binomial distribution.

A. Recursive Techniques

The Boole-Shannon expansion [19] for a (totally) symmetric switching function of n variables X about one of these variables, namely, X_m, 1 ≤ m ≤ n, can be stated as follows [3, 6]

$$Sy(A, \mathbf{X}) = \bar{X}_m \, Sy(B, \mathbf{X}/X_m) \vee X_m \, Sy(C, \mathbf{X}/X_m), \tag{12}$$

where the characteristic set A is a subset of Z_{n+1} = {0, 1, ..., n} and the sets B and C are subsets of Z_n = {0, 1, ..., n−1}. Let A be given by A = {a_0, a_1, ..., a_u}, u ≤ n; then we have

$$B = A \cap Z_n = \begin{cases} A, & \text{if } a_u \neq n, \\ A/\{n\}, & \text{if } a_u = n, \end{cases}$$

$$A_1 = \{a_0 - 1,\ a_1 - 1,\ \ldots,\ a_u - 1\},$$

$$C = A_1 \cap Z_n = \begin{cases} A_1, & \text{if } a_0 \neq 0, \\ A_1/\{-1\}, & \text{if } a_0 = 0. \end{cases}$$

Now, the success function of a k-out-of-n:G system is the symmetric monotone switching function [3, 6]
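For orientation, a form of this function and of its expansion that is consistent with (12) and the set definitions above can be sketched as follows. The notation S(k, n, X) follows the success function named in the discussion of Fig. 3; the layout and any correspondence to the numbered displays (14) are assumptions. Taking A = {k, ..., n} and m = n, the set B becomes {k, ..., n−1} and the set C becomes {k−1, ..., n−1}:

```latex
\begin{align*}
S(k, n, \mathbf{X}) &= Sy(\{k, k+1, \ldots, n\}, \mathbf{X}), \\
S(k, n, \mathbf{X}) &= \bar{X}_n \, S(k, n-1, \mathbf{X}/X_n) \vee X_n \, S(k-1, n-1, \mathbf{X}/X_n),
  \qquad 1 \le k \le n.
\end{align*}
```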

Since the right hand side of Eq. (14a) is a disjoint sum of products of statistically independent expressions, it is a PRE that is readily convertible, on a one-to-one basis, into a probability expression. Hence, the following recursive relation for the reliability of a k-out-of-n:G system is obtained [3, 6]
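A recursion of the kind described here, written for the pivot variable X_n, can be sketched as below. It is obtained by replacing each variable by its expectation, AND by multiplication and OR by addition in the expansion of the success function. The symbols p_n and q_n = 1 − p_n are assumed notation, and which lines correspond to the numbered relations (15a) to (15d) is likewise an assumption.

```latex
\begin{align*}
R(k, n, \mathbf{p}) &= q_n \, R(k, n-1, \mathbf{p}/p_n) + p_n \, R(k-1, n-1, \mathbf{p}/p_n),
  \qquad 1 \le k \le n, \\
R(0, n, \mathbf{p}) &= 1, \\
R(n+1, n, \mathbf{p}) &= 0.
\end{align*}
```

The two boundary values state that a system requiring no good components always succeeds, while one requiring more good components than it has always fails.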

It is possible (though less intuitive) to obtain (15b) directly in the probability domain through the application of the pivoting (pivotal decomposition or factoring) [64] which is simply a version of the well-known total probability theorem [11]. The unreliability U(k, n, p) satisfies recursive relations similar to (15a) or (15b) and boundary conditions complementary to those in (15c) and (15d).

B. An Iterative Algorithm Based on Binary Recursion

Based upon the recursive relation (15b) and boundary conditions (15c) and (15d), an efficient non-recursive algorithm for computing R(k, n, p) and U(k, n, p) has been reported by Rushdi [3, 4, 6]. This algorithm has a nice pictorial interpretation in terms of an SFG generalizing the one in Fig. 3. In fact, if we replace the graph transmissions p and q preceding column j by the subscripted symbols p_j and q_j, respectively, the node (i, j) represents the unreliability U(i+1, j, p), and if we replace these two graph transmissions by the swapped subscripted symbols q_j and p_j, respectively, the node (i, j) represents the reliability R(i+1, j, p). The algorithm constructs an array of values inclusively bounded in the ij-plane by the four straight lines i = 1, i = k, i = j, and i = (j−n+k), which are the edges of a parallelogram with corners (i, j) at (1, 1), (k, k), (k, n) and (1, n−k+1). The algorithm has three versions depending on the order of traversing or sweeping the aforementioned parallelogram elements, namely:

1- The vertical-sweep version: Nodes are visited column-wise, starting from the leftmost column (j = 1) and ending at the rightmost column (j = n). Within each column j, the bottom node (i = min(j, k)) is visited first, followed by the upper nodes until the top node (i = max(1, j−n+k)) is reached.

2- The horizontal-sweep version: Nodes are visited row-wise, starting from the topmost row (i = 1) and ending at the bottom row (i = k). Within each row i, the algorithm proceeds from left (j = i) to right (j = i+n−k).

3- The diagonal-sweep version: Nodes are visited diagonal-wise, starting from the leftmost diagonal (j−i = 0) and ending at the rightmost one (j−i = n−k). Within each diagonal, the algorithm proceeds downwards from the top row (i = 1) to the bottom row (i = k).
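The vertical-sweep version can be sketched in a few lines of Python. This is a minimal illustration written from the description above, not a transcription of the code in [3, 4, 6]; the function name and the argument layout (a list `p` of component reliabilities) are assumptions. It keeps a single array of (k+1) scalars and performs one multiplication and two additions per visited node.

```python
def k_out_of_n_reliability(k, p):
    """Reliability of a k-out-of-n:G system with independent components.

    p[j-1] is the reliability of component j (hypothetical argument layout).
    """
    n = len(p)
    if k <= 0:
        return 1.0          # a system that needs no good components always works
    if k > n:
        return 0.0          # more good components required than exist
    # r[i] holds R(i, j) for the current column j; r[0] = R(0, j) = 1.
    r = [1.0] + [0.0] * k
    for j in range(1, n + 1):            # columns, left to right
        top = max(1, j - n + k)
        bottom = min(j, k)
        # Visiting the bottom node first keeps r[i-1] at its column j-1 value.
        for i in range(bottom, top - 1, -1):
            # Equivalent to r[i] = q_j * r[i] + p_j * r[i-1].
            r[i] += p[j - 1] * (r[i - 1] - r[i])
    return r[k]


print(k_out_of_n_reliability(2, [0.9, 0.8, 0.7]))
```

Updating the array in place is what limits storage to (k+1) scalars; a horizontal or diagonal sweep would visit the same nodes in a different order.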

Our efficient algorithm has a local memory requirement of (k+1) scalars. Its temporal complexity is measured by N = k(n−k+1) multiplications and 2N additions. In the worst case (k ≈ (n+1)/2), the algorithm has a linear spatial complexity of (n+3)/2 and a quadratic temporal complexity of (n+1)²/4. There exist "dual" versions of the algorithm that compute the unreliability U(k, n, p) instead of the reliability R(k, n, p) with no change whatsoever in complexity [3, 6]. The algorithm can be shown to be correct, since when it is given a valid input it produces the right output in a finite amount of time, and it also passes the tests in [65]. To date, this algorithm has the least temporal complexity within the class of algorithms that basically use real multiplications and additions to compute R(k, n, p). It is believed [6] to be optimal in the worst case, in the sense that it is unlikely that there is an algorithm in the same class that performs fewer basic operations in the worst case.
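The worst-case figures follow from maximizing the node count over k. A brief derivation, treating k as a continuous variable (an assumption that is exact when n is odd), reads:

```latex
\begin{align*}
N(k) &= k\,(n - k + 1), \\
\frac{dN}{dk} &= n + 1 - 2k = 0 \;\Longrightarrow\; k = \frac{n+1}{2}, \\
N_{\max} &= \left(\frac{n+1}{2}\right)^{2} = \frac{(n+1)^2}{4}, \qquad
k + 1 = \frac{n+3}{2}.
\end{align*}
```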

The Rushdi algorithm in [3, 4, 6] is the basis for efficient algorithms that compute the reliabilities of more general systems such as the k-to-ℓ-out-of-n system [26, 27], the threshold system [30], the combined k-out-of-n:F, consecutive-k-out-of-n:F, and linear connected-(r, s)-out-of-(m, n):F system [66], and the multi-state k-out-of-n system [29]. The algorithm was successfully applied in the analysis or design of some practical real-life systems such as furnace systems [49, 50] and static synchronous compensators (STATCOM) used in electric power systems [67]. It is also applicable in the analysis and design of fleets of aircraft [68] and systems of pervasive computing [69].

The Rushdi algorithm in [3, 4, 6] and its extensions for the k-to-ℓ-out-of-n system [26, 27] and the threshold system [30] have been rediscovered repeatedly in the literature. The concept of a weighted k-out-of-n:G system in [12, 31] is a special case of that of a threshold system in [30]. The recursive relations and the algorithms in [12, 31] are also strongly similar to those in [30]. In another direction, Dutuit and Rauzy [10] paraphrased the Rushdi algorithm into an algorithm that they admitted is "strongly similar" to the Rushdi algorithm in [3, 4]. They also paraphrased the extension of the Rushdi algorithm for the k-to-ℓ-out-of-n system [26, 27], and believed that the resulting algorithm is new, obviously unaware of the work in [26, 27].

The above statements should never be understood to belittle the visionary insights in [10], which significantly enhanced the utility of switching (Boolean) algebra in reliability evaluations through the prudent use of the highly efficient and extremely popular ROBDD data structure [56, 57].

The Rushdi algorithm in [3, 4, 6] has an elegant technique of handling two-dimensional recursion, reminiscent of the use of Pascal's triangle in computing combinatorial (binomial) coefficients. This technique is very useful in tackling other problems of two- or multi-dimensional recursion, such as the computation of Stirling numbers, and the computation of multinomial coefficients and probabilities.

VII. COMPLEXITY ISSUES

The k-out-of-n system has taken a considerably large share of the reliability literature. Virtually all the major techniques of system reliability analysis have been applied to k-out-of-n systems. The outcome is a potpourri of algorithms that have been surveyed in [6], where careful attention has been paid to ensure a uniform treatment of the various algorithms and to point out similarities, differences and interrelations among them. Notable among these algorithms is an algorithm due to Barlow and Heidtmann [2], which uses the coefficients in a generating-function expansion to express the probabilities of exactly m successes in n Bernoulli trials, and then employs an efficient technique to obtain their summation for k ≤ m ≤ n. This algorithm has a spatial complexity of (k+2) scalars, and a non-symmetric temporal complexity of ((k+1)(n−k+1)−1) multiplications plus ((2k+1)(n−k+1)−1) additions. In the worst case (k ≈ (n+1)/2), these complexities reduce to (n+5)/2 memory cells and ((n+1)(n+3)/4 − 1) multiplications, respectively. This means that the Rushdi algorithm [3, 4, 6] discussed herein has a slight advantage over the Barlow-Heidtmann algorithm. Both algorithms are temporally O(n²/4). They are the best in temporal terms (among algorithms using real operations), in addition to being good space economizers. The similarity between the two algorithms is so strong that they are sometimes mistaken to be the same. To summarize the minor difference between the two algorithms, we note that the Rushdi algorithm is based on explicit recursion related to the Cumulative Distribution Function (CDF) of the generalized binomial distribution. The Barlow-Heidtmann algorithm, however, does not use recursion explicitly, but it has been shown by Rushdi [3, 6] to have implicit recursion, which turns out to be related to the probability mass function (pmf) of the same distribution [54, 55]. More detailed comparisons between these two algorithms are available in Rushdi [6], and in Kuo and Zuo [12].
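The generating-function idea behind the pmf-based approach can be illustrated as follows. The sketch multiplies out the polynomial whose factors are (q_j + p_j z), so that the coefficient of z^m is the probability of exactly m successes, and then sums the coefficients for k ≤ m ≤ n. It is not the Barlow-Heidtmann algorithm itself: it stores the full pmf of n+1 values rather than (k+2) scalars, and the function names are hypothetical.

```python
def success_count_pmf(p):
    """Coefficients of prod_j (q_j + p_j z): pmf[m] = P(exactly m successes)."""
    pmf = [1.0]                      # empty product: zero successes with certainty
    for pj in p:
        qj = 1.0 - pj
        nxt = [0.0] * (len(pmf) + 1)
        for m, c in enumerate(pmf):
            nxt[m] += qj * c         # component j fails: count stays at m
            nxt[m + 1] += pj * c     # component j works: count becomes m + 1
        pmf = nxt
    return pmf


def reliability_from_pmf(k, p):
    """Sum the pmf over m = k..n to obtain the k-out-of-n:G reliability."""
    return sum(success_count_pmf(p)[k:])


print(success_count_pmf([0.9, 0.8, 0.7]))
print(reliability_from_pmf(2, [0.9, 0.8, 0.7]))
```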

In 1995, absolute optimality of the Rushdi algorithm and the Barlow-Heidtmann algorithm was lost to a new algorithm by Belfore [7], which has a temporal complexity of O(n (log₂ n)²). This algorithm combines the generating-function expansion concept [2, 70] with a recursive application of the Fast Fourier Transform (FFT). The FFT uses complex arithmetic, and involves multiplications by complex roots of unity, which are equidistant points on the unit circle in the complex plane. Recursively, the FFT facilitates the computation of the convolution of two sequences, and hence the evaluation of the product of two generating functions. It is not desirable to apply the FFT for smaller problems since large overheads are involved [7]. The Belfore algorithm is faster than other algorithms for n > 4000 [7]. It is very hard to use (compared to the Rushdi algorithm) in manual computations for small-size systems, and hence does not provide a similar pedagogical insight.
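The following sketch shows how an FFT-based convolution can multiply generating functions in a divide-and-conquer fashion. It is only an illustration of the idea and makes no claim to reproduce the algorithm of [7]: the textbook recursive radix-2 FFT, the function names, and the splitting strategy are all assumptions chosen for brevity, and no attempt is made to reduce overheads.

```python
import cmath


def fft(a, invert=False):
    """Recursive radix-2 FFT; len(a) must be a power of two (unscaled inverse)."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for t in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * t / n)   # complex root of unity
        out[t] = even[t] + w * odd[t]
        out[t + n // 2] = even[t] - w * odd[t]
    return out


def convolve(a, b):
    """Product of two polynomials (coefficient lists) via the FFT."""
    length = len(a) + len(b) - 1
    size = 1
    while size < length:
        size *= 2
    fa = fft(a + [0.0] * (size - len(a)))
    fb = fft(b + [0.0] * (size - len(b)))
    prod = fft([x * y for x, y in zip(fa, fb)], invert=True)
    return [prod[i].real / size for i in range(length)]


def success_count_pmf_fft(p):
    """pmf of the number of successes; p must be a non-empty list."""
    if len(p) == 1:
        return [1.0 - p[0], p[0]]
    mid = len(p) // 2
    return convolve(success_count_pmf_fft(p[:mid]), success_count_pmf_fft(p[mid:]))


def reliability_fft(k, p):
    return sum(success_count_pmf_fft(p)[k:])


print(reliability_fft(2, [0.9, 0.8, 0.7]))
```

For a handful of components the plain recursion over the parallelogram is far simpler, which is consistent with the remark that the FFT pays off only for large n.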

Patel et al. [71] make an important observation about numerical algorithms, which they emphasize by calling it a "folk theorem." This so-called "theorem" states that "If an algorithm is amenable to 'easy' hand calculation, it is probably a poor method if implemented in the finite floating-point arithmetic of a digital computer." The "converse of the folk theorem" states that "Many algorithms that are now considered fairly reliable in the context of finite arithmetic are not amenable to hand calculation." Notable pertinent examples that fully support the "folk theorem" include formula (7) above for computing the reliability of a k-out-of-n:G system with equal-reliability components, and also formula (10) for computing its normalized life expectancy. Fortunately, the Rushdi algorithm for computing the reliability of a k-out-of-n:G system with non-identical components can serve as a "concrete counterexample" for a stronger version of the folk theorem that lacks the qualifying word "probably," and also for a stronger version of the converse theorem in which the qualifying word "Many" is omitted. The Rushdi algorithm is very convenient for hand calculations, and yet it is one of the fastest and most reliable and robust methods for digital computations.

VIII. CONCLUDING REMARKS

This paper presented a brief tutorial exposition of some recent developments in the evaluation of the reliability of partially-redundant (k-out-of-n) systems. A novel contribution of the paper is that it identified many practical examples in which such systems serve as a useful model. Some formulas for the reliability and life expectancy of these systems were discussed in the case of equal-reliability components. Certain celebrated formulas were shown to be numerically unstable and totally useless in the case of large systems with high-reliability components. The paper also reviewed how the Boole-Shannon expansion (or equivalently, the pivoting or factoring technique) is used to derive pertinent recursive relations, leading to a highly efficient algorithm for k-out-of-n reliability evaluation. This algorithm has a nice interpretation in terms of a regular Mason signal flow graph, which turns out to be (a) a reduced ordered binary decision diagram representing a monotone symmetric switching function, and (b) analogous to the minimal circuit realization of this function. In the worst case, the temporal and spatial complexities of this algorithm were shown to be quadratic and linear, respectively, in the number of system components. The paper listed some extensions and applications of this algorithm and compared it with a few related algorithms. In the following, a few additional remarks are added.


For the parallel system, which is totally or fully redundant, redundancy is always useful in the sense that using several components is always better than using a single component. This is not the case, however, when strict partial redundancy is used. Rushdi and Al-Hindi [46] utilized the Rushdi algorithm in [3, 6] in computing and tabulating values for the lower boundary of the region of useful redundancy for k-out-of-n systems, i.e., the region in which system reliability is better than that of a single component. If the components have constant failure rates, their lifetimes are exponentially distributed, and this lower boundary can be used to determine the point that divides the time axis into intervals of short and long missions, respectively [11]. In a different direction, Rushdi and Al-Thubaity [72] modified the Rushdi algorithm in [3, 6] to obtain efficient algorithms for computing the first-order sensitivity of k-out-of-n system reliability. These algorithms are useful in computing several criticality or importance measures and in evaluating the instantaneous failure rate of the system.
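For equal-reliability components, the lower boundary of the region of useful redundancy can be located numerically as the component reliability at which the system reliability equals that of a single component. The sketch below uses a plain bisection and an additive binomial sum. It is an illustration only and not the tabulation procedure of [46]; the function names, the search interval, and the restriction to 1 < k < n are assumptions.

```python
from math import comb


def reliability_equal(k, n, p):
    """k-out-of-n:G reliability for n components of common reliability p."""
    q = 1.0 - p
    return sum(comb(n, m) * p**m * q**(n - m) for m in range(k, n + 1))


def useful_redundancy_boundary(k, n, tol=1e-12):
    """Bisection for p* in (0, 1) with R(k, n, p*) = p*; assumes 1 < k < n."""
    lo, hi = 1e-6, 1.0 - 1e-6        # R < p near 0 and R > p near 1
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if reliability_equal(k, n, mid) > mid:
            hi = mid                 # redundancy already useful at mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


print(useful_redundancy_boundary(2, 3))
print(useful_redundancy_boundary(2, 4))
```

For the 2-out-of-3 system the boundary is p = 0.5, in agreement with the closed form 3p² − 2p³ > p for 0.5 < p < 1.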

An important problem of interest is to study the characteristics of k-out-of-n systems under a set of assumptions other than the one employed throughout this paper. Examples of these include k-out-of-n systems with statistically dependent components, common-cause failures, two failure modes, repair, or spares. Typically, Markov-chain modeling is appropriate and effective in these cases [5, 11, 14, 73].

While simple reliability-cost metrics (such as reliability per cost or life expectancy per cost) can be used to guide the selection of a system from among several systems that are candidates for providing the same performance in a given mission, other more elaborate metrics (such as the cost elasticity of reliability or cost elasticity of life expectancy) have been developed [53] for partially-redundant systems. These metrics can be used to assess the cost-benefit aspect of adding redundancy to a system with the purpose of enhancing its reliability.

Ali Muhammad Ali Rushdi was born in Port Said, Egypt, on May 24, 1951. He received the B.Sc. degree (Honors) in Electrical Engineering (Electronics and Communications) from Cairo University, Cairo, Egypt, in 1974, and the M.S. and Ph.D. degrees in Electrical Engineering from the University of Illinois at Urbana-Champaign, USA in 1977 and 1980, respectively. He maintained a perfect GPA of 5.0/5.0 throughout his study.

In 1974 he was appointed Demonstrator and Instructor in the Department of Electronics and Electrical Communications Engineering of Cairo University. From 1976 to 1980 he was Research Assistant in the Electrical Engineering Department of the University of Illinois at Urbana-Champaign. Since 1980 he has been with King Abdul-Aziz University in Jeddah, Saudi Arabia, where he is now Professor of Electrical and Computer Engineering as well as Head and Coordinator of the Computer Engineering Group. At King Abdul-Aziz University he has structured and taught a variety of graduate and undergraduate courses, supervised master's theses and senior projects, and contributed significantly to accreditation activities. He served as a member of the Editorial Board of the IEEE Transactions on Instrumentation and Measurement (1986-1994), and was a frequent reviewer for the IEEE Transactions on Reliability (1983-1998). He is currently a member of the Editorial Board of the Journal of King Abdul-Aziz University: Engineering Sciences, and is an Associate Editor of Reliability and Computer Engineering for the International Magazine on Advances in Computer Science and Telecommunications (IMACST). His research interests and publications over the past four decades have spanned the areas of Electromagnetic Communications Engineering, Computer Engineering, Reliability Engineering, Digital Design, Engineering Education, Neural and Switching Networks, Advanced Mathematics, Boolean Algebra and Logic, Engineering Design, Inferential Thinking, and Problem Solving.

Prof. Rushdi is an initiated member of the Honorary Societies Eta Kappa Nu and Phi Kappa Phi. He is a Senior Member of the Institute of Electrical and Electronics Engineers (IEEE).