Abstract

Hard real-time systems are usually required to provide an absolute guarantee that all tasks will always complete by their deadlines. In this paper we address fault tolerant hard real-time systems, and introduce the notion of a probabilistic guarantee. Schedulability analysis is used together with sensitivity analysis to establish the maximum fault frequency that a system can tolerate. The fault model is then used to derive a probability (likelihood) that, during the lifetime of the system, faults will not arrive faster than this maximum rate. The framework presented is a general one that can accommodate transient 'software' faults, tolerated by recovery blocks or exception handling, or transient 'hardware' faults dealt with by state restoration and re-execution.

1 Introduction

Scheduling work in hard real-time systems is traditionally dominated by the notion of absolute guarantee. Static analysis is used to determine that all deadlines are met even under the worst-case load conditions. With fault-tolerant hard real-time systems this deterministic view is usually preserved even though faults are, by their very nature, stochastic. No fault tolerant system can, however, cope with an arbitrary number of errors in a bounded time. The scheduling guarantee is thus predicated on a fault model. If the faults are no worse than those defined in the fault model then all deadlines are guaranteed. The disadvantage of this separation of scheduling guarantee and fault model is that it leads to simplistic analysis; either the system is schedulable or it is not. In this paper we bring together scheduling issues and errors to justify the notion of a probabilistic guarantee even for a hard real-time system. By 'probabilistic guarantee' we mean a scheduling guarantee with an associated probability. Hence,

a guarantee of 99.95% does not mean that 99.95% of deadlines are met. Rather, it implies that the probability of all deadlines being met during a given period of operation is 99.95%. Instead of starting with the fault model and using scheduling tests to see if it is feasible, we start with the scheduling analysis to derive a threshold interval between errors that can be tolerated, and then employ the fault model to assign a probability to this threshold value. To provide the flexibility needed to program fault tolerance, fixed priority preemptive scheduling will be used [13]. The faults of interest are those that are transient. Castillo et al [6], in their study of several systems, indicate that transient faults occur 10 to 50 times more frequently than permanent faults. In some applications this frequency can be quite large; one experiment on a satellite system observed 35 transient faults in a 15 minute interval due to cosmic ray ions [5]. We attempt to keep the framework as general as possible by accommodating 'software' faults tolerated by either exception handling or some form of recovery block, and 'hardware' faults dealt with by state restoration and re-execution. Error latencies will be assumed to be short. Other authors have studied the probability of meeting deadlines in fault-tolerant systems; however, only some facets of this problem have been considered. For instance, Hou and Shin [9] have studied a related problem, the probability of meeting deadlines when tasks are replicated in a hardware-redundant system, but they only consider permanent faults without repair or recovery. A similar problem was studied by Shin et al [18]. Kim et al [12] consider another related problem: the probability of a real-time controller meeting a deadline when subject to permanent faults with repair. The rest of the paper is organised as follows. Section 2 briefly describes the scheduling analysis that is applicable to non-fault-tolerant systems.
Section 3 presents the fault model and the framework for the subsequent analysis. In Section 4 the scheduling analysis for a fault tolerant system is presented. This enables the threshold fault interval (TFI) to be derived. Section 5 then uses the fault model and the TFI to assign a probability to the threshold. Conclusions are presented in Section 6.

2 Standard Scheduling Analysis

For the standard fixed priority approach, it is assumed that there is a finite number (N) of tasks (τ1 .. τN). Each task has the attributes of minimum inter-arrival time, T, worst-case execution time, C, deadline, D, and priority, P. Each task undertakes a potentially unbounded number of invocations; each must be finished by the deadline (which is measured relative to the task's invocation/release time). All tasks are deemed to start their execution at time 0. We assume a single processor platform and restrict the model to tasks with D ≤ T. For this restriction, an optimal set of priorities can be derived such that Di < Dj ⇒ Pi > Pj for all tasks i, j [15]. Tasks may be periodic or sporadic (as long as two consecutive releases are separated by at least T). Once released, a task is not suspended other than by the possible action of a concurrency control protocol surrounding the use of shared data. A task, however, may be preempted at any time by a higher priority task. System overheads such as context switches and kernel manipulations of delay queues etc. can easily be incorporated into the model [11, 4] but are ignored here. The worst-case response time (completion time) Ri for each task (i) is obtained from the following [10, 1]:

Ri = Ci + Bi + Σ_{j ∈ hp(i)} ⌈Ri / Tj⌉ Cj    (1)

where hp(i) is the set of higher priority tasks (than i), and Bi is the maximum blocking time caused by a concurrency control protocol protecting shared data. The most common and effective concurrency control protocol assigns a ceiling priority to each shared data area. This ceiling is the maximum priority of all tasks that use the shared data area. When a task enters the protected object that contains the shared data, its priority is temporarily increased to this ceiling value. As a consequence (on a single processor system):

1. Mutual exclusion is assured (by the protocol itself).
2. Each task is blocked at most once during each invocation.
3. Deadlocks are prevented (by the protocol itself).

The value of Bi is simply the maximum computation time of any protected object that has a ceiling equal to or greater than Pi and is used by a task with a priority lower than Pi. To solve equation (1), a recurrence relation is produced:

ri^{n+1} = Ci + Bi + Σ_{j ∈ hp(i)} ⌈ri^n / Tj⌉ Cj    (2)

where ri^0 is given an initial value of Ci (although more efficient initial values can be found). The value ri^n can be considered to be a computational window into which an amount of computation Ci is attempting to be placed. It is a monotonically non-decreasing function of n. Note that when ri^{n+1} becomes equal to ri^n then this value is the worst-case response time, Ri [4]. However, if ri^n becomes greater than Di then the task cannot be guaranteed to meet its deadline, and the full task set is thus unschedulable. Table 1 describes a simple 4-task system, together with the response times calculated by equation (2). Priorities are ordered from 1, with 4 the lowest value, and blocking times have been set to zero for simplicity. Scheduling analysis is independent of time units and hence simple integer values are used (they can be interpreted as milliseconds). To illustrate how these values are obtained, consider τ4: r4^0 is given the initial value of 30, and r4^1 is then just the addition of all the computation times (30 + 35 + 25 + 30 = 120). With this value, a further release of τ1 falls within the window and gives rise to another hit (of 30),

Task  P  T    C   D    B  R    Schedulable
1     1  100  30  100  0  30   TRUE
2     2  175  35  175  0  65   TRUE
3     3  200  25  200  0  90   TRUE
4     4  300  30  300  0  150  TRUE

Table 1: Example Task Set

and hence r4^2 is 150. This value is then stable and hence is the required response time. All tasks are released at time 0. For the purpose of schedulability analysis, we can assume that their behaviour is repeated every LCM, where LCM is the least common multiple of the task periods. When faults are introduced it will be necessary to know for how long the system will be executing. Let L be the lifetime of the system. For convenience we assume L is an integer multiple of the LCM. This value may however be very large (for example LCM could be 200ms, and L fifteen years!).
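The recurrence of equation (2) is straightforward to mechanise. The following Python sketch (illustrative only, not from the paper) iterates the recurrence for each task of Table 1; task data are those of Table 1, with tasks indexed in priority order.

```python
import math

def response_time(i, C, T, B, D):
    """Worst-case response time of task i via the recurrence of
    equation (2); tasks are indexed in priority order (0 = highest).
    Returns None if the window exceeds D[i] (unschedulable)."""
    r = C[i]                               # r^0 = C_i
    while True:
        # interference from each higher-priority task j in hp(i)
        interference = sum(math.ceil(r / T[j]) * C[j] for j in range(i))
        r_next = C[i] + B[i] + interference
        if r_next == r:                    # fixed point: R_i = r
            return r
        if r_next > D[i]:                  # deadline exceeded
            return None
        r = r_next

# Task set of Table 1 (blocking times zero)
T = [100, 175, 200, 300]
C = [30, 35, 25, 30]
D = [100, 175, 200, 300]
B = [0, 0, 0, 0]
R = [response_time(i, C, T, B, D) for i in range(4)]
print(R)   # [30, 65, 90, 150], matching the R column of Table 1
```

Note how the fourth task's window grows 30 → 120 → 150 before stabilising, exactly as in the worked example above.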

3 Fault Model

We assume that a single transient fault will cause just one error, and that this error will manifest itself in just a single task. With 'software' faults this is a reasonable assumption. With 'hardware' faults we are concerned with errors that manifest themselves in the processing unit (including internal busses, cache, etc.) rather than in memory, where the error latencies may be very large. We assume that only the executing task is affected. (An alternative model, in which all non-terminated tasks are affected, could also have been used; this would make no fundamental difference to the analysis but would complicate the scheduling equations used in Section 4.) Faults that affect the kernel must either be masked or lead to permanent damage that can only be catered for by replication at the system level. To make the subsequent analysis simpler we assume perfect error recognition coverage; a probabilistic (non-zero) measure of coverage could be used with a straightforward effect upon the analysis. We make the common homogeneous Poisson process (HPP) assumptions that the fault arrival rate is constant and that the distribution of the fault count for any fixed time interval can be approximated using a Poisson probability distribution. This is an appropriate model for a random process where the probability of an event does not change with time and the occurrence of one fault event does not affect the probability of another such event. A HPP process depends only on one parameter, viz. the expected number of events, λ, in unit time; here events are transient faults, with λ = 1/MTBF, where MTBF is the mean time between transient faults. (MTBF usually stands for mean time between failures, but as the systems of interest are fault tolerant, many faults will not cause system failure; hence we use the term MTBF to model the arrival of transient faults.)

By the definition of the Poisson distribution,

Pr_n(t) = e^{-λt} (λt)^n / n!

gives the probability of n events during an interval of duration t. If we take an event to be an occurrence of a transient fault and Y to be the random variable representing the number of faults in the lifetime of the system (L), then the probability of zero faults is given by

Pr(Y = 0) = e^{-λL}

and the probability of at least one fault by

Pr(Y > 0) = 1 − e^{-λL}

Other useful values are:

Pr(Y = 1) = λL e^{-λL}        Pr(Y < 2) = e^{-λL}(1 + λL)    (3)
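These Poisson quantities are trivial to compute; the short Python sketch below (illustrative values of MTBF and L are our own, not from the paper) evaluates the probabilities of equation (3) for a 10-hour mission with an MTBF of 100 hours.

```python
import math

def pr_n_faults(lam, t, n):
    """Pr_n(t): probability of exactly n faults in an interval of
    length t under an HPP with rate lam = 1/MTBF."""
    return math.exp(-lam * t) * (lam * t) ** n / math.factorial(n)

lam = 1.0 / 100.0                        # MTBF = 100 hours (illustrative)
L = 10.0                                 # 10-hour mission
p0 = pr_n_faults(lam, L, 0)              # Pr(Y = 0) = e^{-lam L}
p_lt2 = p0 + pr_n_faults(lam, L, 1)      # Pr(Y < 2) = e^{-lam L}(1 + lam L)
print(p0, p_lt2)
```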

We are concerned, in this paper, with the probability of the system being schedulable. We shall write Pr(S) and Pr(U) to denote the probabilities of schedulability and unschedulability. Of course Pr(S) = 1 − Pr(U). The analysis given in the next section will determine the threshold fault interval. This gives the maximum sustainable frequency at which faults can occur with the system still meeting all its deadlines. Let this frequency be represented by the minimum time interval allowed between faults, TF. It follows that if W is the shortest interval between fault arrivals during a mission, then

Pr(U) = Pr(U | no faults) · Pr(no faults)
      + Pr(U | W ≥ TF and there are faults) · Pr(W ≥ TF and there are faults)
      + Pr(U | W < TF and there are faults) · Pr(W < TF and there are faults)

Since we are dealing with systems which are schedulable 'under no faults', we can assume Pr(U | no faults) is zero. Also, TF has been defined so that Pr(U | W ≥ TF) is zero. Hence

Pr(U) = Pr(U | W < TF) · Pr(W < TF)

In this paper we will make the conservative assumption that Pr(U | W < TF) is one, and hence we are left with the evaluation of Pr(W < TF), i.e. the probability that at least two faults arrive so close together in time that they cannot both be tolerated. This is done in Section 5. (To simplify notation and avoid the need to mention special cases in the remainder of the paper, we will regard the events of no faults and one fault as being subsets of the event W ≥ TF. The simplest formal mechanism to achieve this is to define W to be an improper random variable, taking the value W = ∞ when there is exactly one fault, or no faults at all, during the system lifetime. Thus W is always realised with some value, so that Pr(W < TF) + Pr(W ≥ TF) = 1.) Although the assumption that Pr(U | W < TF) is one is conservative (and hence safe), it is clearly possible to give less pessimistic values. The above formulation will allow such values to be combined with the estimates of Pr(W < TF) given in Section 5. Issues concerned with implementing the features suggested by the fault model are well addressed by Fetzer and Cristian [8].

Typical Values of Key Parameters

Before proceeding with the analysis it is worth noting the ranges in value of the key parameters of the model. In most applications of interest, the 'lifetime' over which a probability of failure is required is the duration of one mission. Mission times for civil aircraft are typically 3-20 hours, but for satellites 15 years of execution may be expected. The iteration periods for control loops can be as short as 20ms; other loops and signals may have T values of a few seconds. Precise values for MTBF are not generally known, but in a friendly operating environment perhaps 100 hours is not unreasonable; in more hostile conditions, 20 seconds may be more typical. Although TF is derived from the characteristics of the task set under consideration, it is worth noting that very small values are unlikely (as a task will not make progress if it suffers repetitive faults), and faults spaced out beyond the LCM of the task periods will easily be catered for; hence: 200ms < TF < 5 seconds. Table 2 summarizes these viable ranges for the key parameters (in hours and hours^{-1}).

Parameter  Range
L          3 − 10^5
T          10^-6 − 10^-2
λ          10^-2 − 10^2
TF         10^-5 − 10^-3

Table 2: Typical Values of Key Parameters

4 Schedulability Analysis for Fault Tolerant Execution

Let Fk be the extra computation time needed by τk if an error is detected during its execution. This could represent the re-execution of the task, the execution of an exception handler or recovery block, or the partial re-execution of a task with checkpoints. In the scheduling analysis, the execution of task τi will be affected by a fault in τi or in any higher priority task. We assume that any extra computation for a task will be executed at the task's (fixed) priority.

Hence, if there is just a single fault, equation (1) will become [16, 2]:

Ri = Ci + Bi + Σ_{j ∈ hp(i)} ⌈Ri / Tj⌉ Cj + max_{k ∈ hep(i)} Fk    (4)

where hep(i) is the set of tasks with priority equal to or higher than τi, that is, hep(i) = hp(i) ∪ {τi}. This equation can again be solved for Ri by forming a recurrence relation. If all Ri values are still less than the corresponding Di values then a deterministic guarantee is furnished. Given that a fault tolerant system has been built, it can be assumed (although this would need to be verified) that it will be able to tolerate a single isolated fault. Hence the more realistic problem is that of multiple faults; at some point all systems will become unschedulable when faced with an arbitrary number of fault events. To consider maximum arrival rates, first assume that Tf is a known minimum arrival interval for fault events. Also assume the error latency is zero (this restriction will be removed shortly). Equation (4) becomes [16, 2]:

Ri = Ci + Bi + Σ_{j ∈ hp(i)} ⌈Ri / Tj⌉ Cj + ⌈Ri / Tf⌉ max_{k ∈ hep(i)} Fk    (5)

Thus in the interval (0, Ri] there can be at most ⌈Ri/Tf⌉ fault events, each of which can induce at most max Fk extra computation. The validity of this equation comes from noting that fault events behave identically to sporadic tasks, and they are represented in the scheduling analysis in this way [1]. Note the equation is not exact (but it is sufficient): faults need not always induce a maximum re-execution load. There is a useful analogy between release jitter and error latency. If a fault can lie dormant for time Af, then this may cause two errors to appear to come closer together than Tf. This will increase the impact of the fault recovery. Equation (5) can be modified to include error latency in the same way that release jitter is incorporated into the standard analysis [1]:

Ri = Ci + Bi + Σ_{j ∈ hp(i)} ⌈Ri / Tj⌉ Cj + ⌈(Ri + Af) / Tf⌉ max_{k ∈ hep(i)} Fk    (6)

Table 3 gives an example of applying equation (6). Here full re-execution is required following a fault, and two different fault arrival intervals are considered. For one the system remains schedulable, but for the shorter interval the final task cannot be guaranteed. In this simple example, blocking and error latency are assumed to be zero. Note that for the first three tasks, the new response times are less than the shorter Tf value, and hence will remain constant for all Tf values greater than 200. (Throughout, we assume that in the absence of faults the task set is schedulable.) The above analysis has assumed that the task deadlines, the Ds, remain in effect even during a fault handling situation. Some systems allow a relaxed deadline

Task  P  T    C   D    F   R (Tf = 300)  R (Tf = 200)
1     1  100  30  100  30  60            60
2     2  175  35  175  35  100           100
3     3  200  25  200  25  155           155
4     4  300  30  300  30  275           UNSCH

Table 3: Example Task Set - Tf = 300/200

when faults occur (as long as faults are rare). This is easily accommodated into the analysis.
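The fault-interference term of equation (5) slots directly into the standard recurrence. The Python sketch below (illustrative only, not from the paper; error latency Af is taken as zero, so (5) and (6) coincide) reproduces the two response-time columns of Table 3.

```python
import math

def ft_response_time(i, C, T, F, Tf, B, D):
    """Worst-case response time under equation (5): faults arrive at
    most once every Tf, each inducing the largest recovery cost F_k
    over hep(i).  Returns None if D[i] is exceeded."""
    fmax = max(F[: i + 1])                 # max F_k over hep(i)
    r = C[i]
    while True:
        r_next = (C[i] + B[i]
                  + sum(math.ceil(r / T[j]) * C[j] for j in range(i))
                  + math.ceil(r / Tf) * fmax)
        if r_next == r:
            return r
        if r_next > D[i]:
            return None
        r = r_next

T = [100, 175, 200, 300]; C = [30, 35, 25, 30]
D = [100, 175, 200, 300]; F = [30, 35, 25, 30]; B = [0, 0, 0, 0]
R300 = [ft_response_time(i, C, T, F, 300, B, D) for i in range(4)]
R200 = [ft_response_time(i, C, T, F, 200, B, D) for i in range(4)]
print(R300)   # [60, 100, 155, 275] -- schedulable, as in Table 3
print(R200)   # [60, 100, 155, None] -- task 4 UNSCH, as in Table 3
```

With Tf = 200 the final task attracts a second recovery hit within its window and its recurrence overshoots D4 = 300, which is exactly the UNSCH entry of Table 3.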

Limits to Schedulability

Having formed the relation between schedulability and Tf, it is possible to apply sensitivity analysis to equation (6) to find the minimum value of Tf that leaves the system just schedulable. As indicated earlier, let this value be denoted TF (it is the threshold fault interval). Sensitivity analysis [19, 14, 13, 17] is used with fixed priority systems to investigate the relationship between values of key task parameters and schedulability. For an unschedulable system it can easily generate (using simple branch and bound techniques) factors such as the percentage by which all Cs must be reduced for the system to become schedulable. Similarly, for schedulable systems, sensitivity analysis can be used to investigate the amount by which the load can be increased without jeopardising the deadline guarantees. Here we apply sensitivity analysis to Tf to obtain TF. When the above task set is subject to sensitivity analysis it yields a TF value of 275. The behaviour of the system with this threshold fault interval is shown in Table 4. A value of 274 would cause τ4 to miss its deadline.

Task  P  T    C   D    R (TF = 275)
1     1  100  30  100  60
2     2  175  35  175  100
3     3  200  25  200  155
4     4  300  30  300  275

Table 4: Example Task Set - TF set at 275
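The paper's sensitivity analysis uses branch and bound; since schedulability is monotone in Tf here (a larger Tf can only reduce interference), a plain binary search suffices as a sketch. The following Python fragment (our own illustration, not the paper's algorithm; integer granularity assumed) recovers TF = 275 for the example task set.

```python
import math

def ft_response_time(i, C, T, F, Tf, B, D):
    # worst-case response time under the fault term of equation (5)
    fmax = max(F[: i + 1])
    r = C[i]
    while True:
        r_next = (C[i] + B[i]
                  + sum(math.ceil(r / T[j]) * C[j] for j in range(i))
                  + math.ceil(r / Tf) * fmax)
        if r_next == r:
            return r
        if r_next > D[i]:
            return None
        r = r_next

def schedulable(Tf, C, T, F, B, D):
    return all(ft_response_time(i, C, T, F, Tf, B, D) is not None
               for i in range(len(C)))

def threshold_fault_interval(C, T, F, B, D, lo=1, hi=10**6):
    """Binary search for the smallest integer Tf that keeps the task
    set schedulable -- a simple form of sensitivity analysis."""
    while lo < hi:
        mid = (lo + hi) // 2
        if schedulable(mid, C, T, F, B, D):
            hi = mid
        else:
            lo = mid + 1
    return lo

T = [100, 175, 200, 300]; C = [30, 35, 25, 30]
D = [100, 175, 200, 300]; F = [30, 35, 25, 30]; B = [0, 0, 0, 0]
TF = threshold_fault_interval(C, T, F, B, D)
print(TF)   # 275, as in Table 4; 274 already fails
```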

5 Evaluating Pr(W < TF)

We need to calculate the probability that during the lifetime, L, of the system no two faults will be closer than TF. Two approaches are considered. The attraction of the first is that it shows that a relatively intuitive and uncomplicated approach yields upper and lower bounds on Pr(W < TF) which, for a wide range of parameter values, provide a maximum approximation error which cannot be much greater than a factor of 3 (since upper bound / lower bound ≈ 3). With the second approach a more cumbersome but exact formulation is derived. Despite the inclusion of this latter exact formulation, we believe that, given that it is often the order of magnitude of the failure probability that is the primary concern (rather than an exact value), the mathematically significantly easier reasoning of the first, bounding approach retains some importance.

5.1 Upper and Lower Bounds for Evaluating Pr(W < TF)

We are concerned with two faults being closer than TF over the mission time L. Since in practice L ≫ TF, we can assume, without loss of generality, that L is an even integer multiple of TF. Let mishap be the undesirable event of two faults indeed occurring closer than TF, i.e.

Pr(mishap during L) = Pr(W < TF)

We derive the required upper and lower bounds via the following theorems:

Theorem 1 If L/(2TF) is a positive integer then

Pr(mishap during L) < 1 + [e^{-λTF}(1 + λTF)]^{L/TF − 1} − 2 [e^{-2λTF}(1 + 2λTF)]^{L/(2TF)}

Theorem 2 If L/(2TF) is a positive integer then

Pr(mishap during L) > 1 − [e^{-λTF}(1 + λTF)]^{L/TF}

Proof of Theorem 1

Let the mission time be split into a series of 'even' time intervals with boundaries 0, 2TF, 4TF, ..., L, as shown in Figure 1. Similarly, a set of 'odd' intervals starting at times TF, 3TF, 5TF, ..., L−TF can be defined (extending the lifetime slightly to L+TF, the end point of the last odd interval, by continuing the same HPP fault model). Each set has L/(2TF) intervals. Let a mishap be said to lie in an interval if both of its faults occur during that interval. It follows from the geometry of these intervals that

mishap during L  ⇒  mishap in some even interval[s], or mishap in some odd interval[s]

This property comes directly from the definition of the intervals; if a mishap (two faults closer than TF) occurs, it must lie in either an even or an odd interval. Hence

Pr(mishap during L) < Pr(mishap in some even interval[s] or mishap in some odd interval[s])

Actually the intersection of the two events on the right hand side has non-zero probability. One way that they can occur together is that a single mishap could lie in the overlap between an even and an odd interval. Call these overlaps 'half-intervals': they are of length TF, and there are L/TF − 1 of them, respectively starting at times TF, 2TF, ..., L−TF. So from the basic axioms of probability

Pr(mishap in some even interval[s] or mishap in some odd interval[s]) < Pr(mishap in some even interval[s]) + Pr(mishap in some odd interval[s]) − Pr(mishap in some half-interval[s])

Now, given the symmetry of the construction and the HPP process assumption,

Pr(mishap in some even interval[s]) = Pr(mishap in some odd interval[s])

Hence

Pr(mishap during L) < 2 Pr(mishap in some even interval[s]) − Pr(mishap in some half-interval[s])    (7)

The event 'mishap in a particular even interval' is independent of events in all other even intervals, and it has the same probability for every even interval. Thus

Pr(mishap in some even interval[s]) = 1 − [Pr(no mishap in 2TF)]^{L/(2TF)}    (8)

(The difference between the two sides of the implication above is the, normally tiny since L ≫ TF, probability that the second fault of the first mishap occurs during L < t ≤ L+TF. Note also that the term 'half-interval' excludes the early half of the first even interval and the latter half of the last odd interval, since they do not arise as overlaps.)

For an interval of length 2TF not to contain a mishap, it is sufficient (but not necessary) that it contain 0 or 1 fault. Hence, from equation (3),

Pr(no mishap in 2TF) > e^{-2λTF}(1 + 2λTF)    (9)

Combining equations (8) and (9) yields

Pr(mishap in some even interval[s]) < 1 − [e^{-2λTF}(1 + 2λTF)]^{L/(2TF)}    (10)

By a similar argument for the half-intervals (for which, the intervals being of length TF, containing 0 or 1 fault is both sufficient and necessary),

Pr(mishap in some half-interval[s]) = 1 − [e^{-λTF}(1 + λTF)]^{L/TF − 1}    (11)

and now combining equations (7), (10), and (11) delivers the theorem statement. □

Proof of Theorem 2

In a similar way to the previous proof, consider the series of intervals of length TF starting at times 0, TF, 2TF, 3TF, ..., L−TF. There are L/TF of these, a mishap in any one of which implies a mishap during L (but not vice versa). Hence

Pr(mishap during L) > Pr(mishap in some interval[s])

but

Pr(mishap in some interval[s]) = 1 − Pr(no mishap in any interval) = 1 − [e^{-λTF}(1 + λTF)]^{L/TF}

and the proof follows directly (as in the proof of Theorem 1). □

Both the upper and lower (exact) bounds are in mathematically non-intuitive forms, but simple approximations can be derived for most of the parameter range within which the probability of mishap is small enough to be of interest.

Corollary 3 An approximation for the upper bound on Pr(W < TF) given by Theorem 1 is (3/2)λ²LTF, provided that λTF and λ²LTF are small, and L ≫ TF.

Corollary 4 An approximation for the lower bound on Pr(W < TF) given by Theorem 2 is (1/2)λ²LTF, provided only that λTF and λ²LTF are small.

Proof of Corollary 3

The term e^{-λTF}(1 + λTF) can be approximated by a Taylor series in which terms (λTF)³ and beyond are ignored. Thus

e^{-λTF}(1 + λTF) ≈ 1 − (λTF)²/2    (12)

Another approximation comes from noting that, for small xz²,

(1 − z²/2)^x ≈ 1 − xz²/2    (13)

where terms z⁴ and higher powers of z can be ignored. Hence, under the assumptions that λTF and λ²LTF are small, and L ≫ TF, we can write

[e^{-λTF}(1 + λTF)]^{L/TF − 1} ≈ 1 − (1/2)λ²LTF

Applying (12) and (13) in this way to the conclusion of Theorem 1, Corollary 3 is proved. □

Corollary 4 follows by a similar argument. Strictly, the bounds in Theorems 1 and 2 have only been proved here for L an even multiple of TF. However, the realistic assumption L ≫ TF allows the approximations given in the two corollaries still to extend to other values of TF and L. In fact, where L/TF is an exact integer, this L ≫ TF assumption is not actually required, either for the derivation of the exact bounds in Theorems 1 and 2, or for the lower-bound approximation of Corollary 4: for high accuracy in Corollary 4, we need only the assumption of small λTF and λ²LTF. Corollary 3 is the exception, relying on the L ≫ TF assumption at one place in its derivation: the exponent applied to the square-bracketed term in its proof is 'out by 1' (L/TF − 1 rather than L/TF), and we needed TF/L to be small in order to justify effectively ignoring this fact. For very short mission times, such that we do not have L ≫ TF, we can in fact 'retreat slightly' to a slacker upper bound for Pr(mishap during L) by using 1 as an upper bound for this square-bracketed term in a modified version of Theorem 1, thus avoiding the awkward exponent L/TF − 1. Then, for positive integral L/TF, the resulting equivalent of Corollary 3 produces an accurate approximation 2λ²LTF to this slacker upper bound on Pr(W < TF). That is, under the assumptions only that λTF and λ²LTF are small and that L/TF is a positive integer, but without now the requirement that L ≫ TF, the methods of this subsection provide bounds on Pr(W < TF) which are approximately in the ratio upper bound / lower bound ≈ 4.

The important upper bound approximation of Corollary 3 can be written in the form (3/2)(λL)(λTF). It will often be the case that λTF < 10^-2; indeed this constraint is what allowed the approximations to deliver useful values. But λL can vary quite considerably, from 10^-2 or less in friendly environments to 10^3 or more in long-life, hostile domains. Clearly, low probability levels for this latter case will be extremely difficult to achieve by the scheduling approach defined in this paper.
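As a numerical check on the bounds (our own illustration, not from the paper), the Python sketch below evaluates the exact expressions of Theorems 1 and 2, together with the approximations of Corollaries 3 and 4, at the example point λL = 10⁻², λTF = 10⁻⁵ discussed in Section 5.3; note that L/(2TF) = 500 is an integer there, as the theorems require.

```python
import math

def theorem_bounds(lam, L, TF):
    """Exact lower/upper bounds of Theorems 2 and 1 on Pr(W < TF).
    Assumes L/(2*TF) is a positive integer."""
    g1 = math.exp(-lam * TF) * (1 + lam * TF)          # Pr(<=1 fault in TF)
    g2 = math.exp(-2 * lam * TF) * (1 + 2 * lam * TF)  # Pr(<=1 fault in 2TF)
    k = L / TF
    lower = 1 - g1 ** k                                # Theorem 2
    upper = 1 + g1 ** (k - 1) - 2 * g2 ** (k / 2)      # Theorem 1
    return lower, upper

lam, L, TF = 1.0, 1e-2, 1e-5          # so lam*L = 1e-2, lam*TF = 1e-5
lower, upper = theorem_bounds(lam, L, TF)
approx_lo = 0.5 * lam**2 * L * TF     # Corollary 4
approx_up = 1.5 * lam**2 * L * TF     # Corollary 3
print(lower, upper)                   # ~0.5e-7 and ~1.5e-7; ratio ~3
```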

λ \ L     1           10          10^2        10^4
1         1.1×10^-4   1.1×10^-3   1.1×10^-2   1
10^-2     1.1×10^-8   1.1×10^-7   1.1×10^-6   1.1×10^-4
10^-4     1.1×10^-12  1.1×10^-11  1.1×10^-10  1.1×10^-8

Table 5: Upper bound on Non-Schedulability due to Faults

The example introduced in Section 4 had a TF value of 275ms. Table 5 gives the upper bound on the probability guarantee for various values of λ and L (in hours and hours^{-1}).

When λL < 10^-2, λL approximates the probability of any fault happening during the mission of duration L. So (λTF)^{-1} represents the gain that is achieved by the use of fault tolerance, under the other assumptions stated. For example, in Table 5, when λ = 10^-2 and L = 1 the gain is approximately 10^6.
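The entries of Table 5 follow directly from Corollary 3 with TF = 275ms converted to hours; a short Python sketch (our own, not from the paper) reproduces them, capping the bound at 1 since it bounds a probability.

```python
TF = 0.275 / 3600                      # 275 ms expressed in hours

def upper_bound(lam, L):
    """Corollary 3 upper bound (3/2) * lam^2 * L * TF, capped at 1."""
    return min(1.0, 1.5 * lam * lam * L * TF)

for lam in (1.0, 1e-2, 1e-4):          # faults per hour
    row = [upper_bound(lam, L) for L in (1.0, 10.0, 100.0, 1e4)]
    print(lam, row)                    # each row of Table 5, ~1.1x10^...

# Gain over a non-fault-tolerant system, whose failure probability is
# roughly lam*L for small lam*L:
lam, L = 1e-2, 1.0
gain = (lam * L) / upper_bound(lam, L)
print(gain)                            # of the order of 10**6
```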

5.2 Exact Formulation for Evaluating Pr(W < TF)

Unlike the bounding argument used in the last section, our exact derivation of the probability Pr(W < TF) proceeds in two stages, first conditioning on the total number n of faults seen in the lifetime L of the system. It is a well known property of the HPP process [7] that if we condition on the number n of events occurring within a specified time interval, and then define X1, X2, ..., Xn as the ordered positions of these n points within that interval, expressed as proportions of its length, then the Xi are (conditionally given n) jointly distributed as the order statistics of an i.i.d. random sample from a uniform distribution on the unit interval [0, 1]. This being accepted, we now first fix u with 0 ≤ u ≤ 1 and ask the question 'What is the probability, P_{n,u} say, that no two of these points are closer than u (conditionally given n)?'. We can obtain the answer by n-dimensional integration. This is reported in an extended version of this paper, available as a technical report [3], which says essentially that P_{n,u} is just the nth power of the total amount of 'slack' remaining within the unit interval after our u-separation constraint is imposed.

This solution conditional on n enables us to complete the exact derivation of the final, unconditional Pr(W < TF) relatively straightforwardly by using the 'chain rule' of conditional probability to 'uncondition on n'. Another fundamental property of the homogeneous Poisson process is that the distribution of n, the count of the number of events occurring within a fixed time interval, upon which the probabilities are conditioned, is Poisson with parameter equal to its mean, which in our case is λL. Then, working for convenience with the 'probability of no mishap in a time interval of length L', we have

Pr(W ≥ TF) = Σ_{n=0}^{∞} P_{n, TF/L} · e^{-λL} (λL)^n / n!

           = e^{-λL} { 1 + λL + Σ_{n=2}^{⌈L/TF⌉} (1 − (n−1)TF/L)^n (λL)^n / n! }

           = e^{-λL} { 1 + λL + Σ_{n=2}^{⌈L/TF⌉} λ^n (L − (n−1)TF)^n / n! }    (14)

A few remarks about this exact expression

We remark firstly that (14) is essentially a function of just two arguments, λL and λTF, rather than three (as are the bounds derived in Section 5.1). Thinking now of the function mathematically in these terms, without much concerning ourselves about the physical interpretation of the arguments, if we agree to confine ourselves to the ranges 0 < λL < ∞ and 0 ≤ λTF < ∞, then we remark that the expression (14) continues to give the correct mathematical Poisson process probability at all points of this domain, including the value of 1 obtained at TF = 0. (This is on the understanding that, at TF = 0, the ⌈L/TF⌉ occurring as the upper limit of the sum denotes a sum to infinity in the usual sense of a mathematical limit.) The purpose of stating this last point about the argument domain now to be assumed for this function is related to the practical computation problem associated with (14), which we address briefly in [3]. Note that, apart from this TF = 0 case, the expression (14) represents a finite sum throughout the domain identified, although, for certain argument values, the number of terms summed can be astronomically large, which can make a simple-minded numerical computation rather slow. Moreover, some of these awkward parameter ranges may be of real practical interest to us in our application (see the end of Section 3). Note that we can use the common notation for the 'positive part function' h₊, associated with any real-valued function h, to obtain the following slightly different expression, valid throughout the argument domain we have just specified (including TF = 0):

Pr(W ≥ TF) = e^{-λL} { 1 + λL + Σ_{n=2}^{∞} [ (λL − (n−1)λTF)₊ ]^n / n! }    (15)
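For the parameter ranges of interest the sum in (15) converges after comparatively few terms, so a direct evaluation is practical. The Python sketch below (our own, not from the paper) evaluates (15), truncating once the positive-part base vanishes or the term count exceeds a cap beyond which terms are negligible; at λL = 10⁻², λTF = 10⁻⁵ it recovers the eight-significant-figure value quoted in Section 5.3.

```python
import math

def pr_no_mishap(lam, L, TF, nmax=150):
    """Pr(W >= TF) via the finite sum of equation (15).  Terms with
    lam*L - (n-1)*lam*TF <= 0 vanish (positive-part function); terms
    beyond nmax are negligible for the parameters used here."""
    a, u = lam * L, lam * TF
    s = 1.0 + a
    for n in range(2, nmax + 1):
        base = a - (n - 1) * u
        if base <= 0.0:
            break
        s += base ** n / math.factorial(n)
    return math.exp(-a) * s

lam, L, TF = 1.0, 1e-2, 1e-5
p_mishap = 1.0 - pr_no_mishap(lam, L, TF)
print(p_mishap)            # ~0.99948e-7, cf. the value in Section 5.3
```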

5.3 Some Numerical Results on Pr(W < TF)

We tested the accuracy of our numerical approximations experimentally, and found that, over the physically realistic parameter ranges of concern to us, the approximations defined are extremely accurate, even at very low order in the Taylor series. This enabled us to produce Figure 2, a contour plot indicating the dependence of the exact value of Pr(W < TF) on its two arguments λL and λTF. The function plotted is, in fact, the log odds of Pr(W < TF), chosen in order to ensure that

there are some contours near each extreme, Pr(W < TF) = 0 and Pr(W < TF) = 1. In the top right hand corner the contours bunch too closely as the probability of a 'mishap during L' becomes extremely close to absolute certainty. (It is difficult to imagine a situation in which the precise values of these large probabilities would be of practical interest.) The rectangular box indicates a subdomain of the arguments over which we have also plotted the accuracy of our Taylor series approximation to this exact Pr(W < TF) function. The technical report [3] contains plots of the percentage inaccuracy that results from the truncation of the approximation after one or two terms.

[Figure 2 appears here: contours of log10[Pr(W<TF)/Pr(W>TF)], plotted against log10(λL) on the horizontal axis and log10(λTF) on the vertical axis.]

Figure 2: Plots of Exact Value. Notice the log-log scale.

We can illustrate the interpretation of our numerical results and plots in more detail by briefly examining one particular case. Assume λL = 10^-2 and λTF = 10^-5. That is, our system encounters faults with an MTBF of 100 times its lifetime, and it is guaranteed to be schedulable provided that it does not, during its lifetime, experience two faults separated by less than one thousandth of the lifetime duration. In such circumstances, we would clearly expect the system to be schedulable with a high probability; let P denote the small complementary probability Pr(W < TF) of a mishap. Figure 2 is a log-odds contour plot, so the proximity of this point to the −7 contour indicates that the odds in favour of a system being schedulable with these parameter values are approximately 10^7 to 1. In fact, the bounds on P, in this situation, obtained by the 'order-of-magnitude' argument of Section 5.1, are 0.4999967 × 10^-7 and 1.500477 × 10^-7. The approximations to these bounds, obtained in the two corollaries in Section 5.1, are 0.5 × 10^-7 and 1.5 × 10^-7, exactly. The Taylor series approximation allows, in this case, almost arbitrarily accurate calculation of the true value of P with comparatively few terms of the series. In fact we proved that all even-order partial sums, up to the 1000th-order

sum, are lower bounds on P, and all odd-order partial sums, up to the 1001st-order sum, are upper bounds. With these particular arguments, the modulus of the fourth-order term in the series is less than 10^-19, so the sum to only three terms would give an accuracy guaranteed to be better than approximately 11 or 12 significant decimal figures. To eight significant figures, the value of P is 0.99948496 × 10^-7, corresponding to a log-odds very close to −7 in Figure 2 (at coordinates (−2, −5)). The series gives first- and second-order Taylor approximations of 10^-7 and 0.99948495 × 10^-7, respectively. (These numbers are both exact.) See [3] for a detailed derivation of these results concerning high numerical accuracy.

6 Conclusion

We have developed the notion of a probabilistic scheduling guarantee and shown how it can be derived from the stochastic behaviour of fault events. It is reasonable to assume that a fault tolerant system will be designed so as to remain schedulable when dealing with a single fault. The main result of the paper is thus the derivation of a probabilistic guarantee for systems experiencing multiple faults. To do this it has been necessary to formulate a prediction of the likelihood of faults occurring closer together than some specified distance in time. It has also been necessary to use sensitivity analysis to determine the limits to schedulability; that is, the minimum tolerable interval between faults. Although exact analysis is given for the likelihood of faults arriving faster than the rate obtained from the sensitivity analysis, perhaps the main result of this paper is a simple derived upper bound for this probability (as given in Corollary 3). A typical outcome of this analysis is that, in a system with a lifetime of 10 hours, a mean time between transient faults of 1000 hours, and a tolerance of faults that do not appear closer together than 1/100 of an hour, the probability of missing a deadline is upper bounded by 1.5 × 10^-7. A lower bound is also derived (Corollary 4), and this yields a value of 0.5 × 10^-7. For these parameters the exact analysis produces a value very close to 1.0 × 10^-7.

Interestingly (and perhaps not entirely intuitively), the upper, lower and exact formulations for the probabilistic scheduling guarantee all indicate that the threshold value derived from the scheduling and sensitivity analysis has a linear relationship to the probabilistic guarantee: if the threshold value TF is halved, the probability of missing a deadline is halved. Similarly, the length L of execution of the system has a linear impact. The main obstacle to the use of some of the analysis given in this paper is the lack of empirical data concerning fault arrival times. In the future we aim to address fault clustering and less favourable fault process models. We also aim to move away from the conservative assumption that the system is unschedulable (with probability 1) when faults arrive closer than the threshold value.