Fault Tolerance for Real-Time Systems: Analysis and Optimization of Roll-back Recovery with Checkpointing

In the modern world of today, we are surrounded with many electronic devices which offer previously unseen performance. These devices constitute a large part of the everyday consumer electronics such as laptops, tablets, smart phones, etc., but are also used in wide variety of domains such as automotive industry, avionics, medicine, etc. The constant demand for high performance has resulted in a rapid development of semiconductor technologies. Technology scaling has pushed the boundaries enabling fabrication of miniature devices. With such miniature devices, it is possible to integrate an entire system on a single chip, commonly referred to as System-on-Chip. For example, in a recent smart phone,... (More)

Popular Abstract in English

In the modern world of today, we are surrounded with many electronic devices which offer previously unseen performance. These devices constitute a large part of the everyday consumer electronics such as laptops, tablets, smart phones, etc., but are also used in wide variety of domains such as automotive industry, avionics, medicine, etc. The constant demand for high performance has resulted in a rapid development of semiconductor technologies. Technology scaling has pushed the boundaries enabling fabrication of miniature devices. With such miniature devices, it is possible to integrate an entire system on a single chip, commonly referred to as System-on-Chip. For example, in a recent smart phone, the size of such chip is less than one square centimeter, and within this area, the number of transistors (the fundamental building block of modern electronic devices) is over one billion. This example shows that the gains of technology scaling are enormous. However, this comes at a cost, and that is that devices manufactured in the latest technologies may be affected by errors which cause malfunction. Lately, soft errors have been named as one of the most serious threats to computer systems designed in the latest semiconductor technologies.

Soft errors occur as a result of a particular type of faults, known as transient faults. Transient faults have a limited lifetime, namely these faults occur, remain present for a short time, but disappear afterwards. However, these faults, despite their short duration, often result in soft errors that may lead to a system failure. Therefore, soft errors have a significant impact on the reliability of the computer systems manufactured in the latest semiconductor technologies. While in the past soft errors were only a threat for devices which operate in harsh environments such as nuclear plants where high level of radiation exists, or avionics where at higher altitudes the cosmic radiation is higher, nowadays soft errors are a threat for all devices irrespective of the operational environment. The reason for this is that the major source of soft errors is the radiation of alpha-particles which are emitted from the package of the device itself. In the digital world where everything revolves around bits (a binary digit ``1'' or ``0''), a transient fault can be interpreted as the outside force that flips a bit from ``1'' to ``0'', or vice-versa, and the flipped bit is the representation of a soft error. If a soft error occurs in your smart phone, tablet or laptop you easily handle it by restarting the device. However, what happens if such an error occurs in a vital component of an airplane or a nuclear reactor? Can we simply restart?

The field of research which tries to give answers to the previous questions is called Fault Tolerance. As the name suggests, fault tolerance enables correct operation of a device even in the presence of faults (errors). As a research topic, fault tolerance has been established along with the rise of the very first devices used in safety-critical applications such as avionics. Lately, the popularity of fault tolerance has been increased, especially when the manufacturing process has moved down to deep sub-micron semiconductor technologies where the size of the transistors has shrunk substantially, and their operation has become more susceptible to soft errors.

To enable correct operation in the presence of errors, fault tolerance provides techniques that are capable of error-detection, i.e. detect the presence of errors, and error-recovery, i.e. recover the system from errors. Usually, this is achieved by introducing a hardware and time redundancy.

Hardware redundancy techniques cope with errors by designing devices or computer systems, such that multiple copies (replicas) of the physical building blocks are used. Every block performs a given operation and provides some kind of an output based on some inputs. If two identical copies process the same inputs, it is expected that they would both produce the same outputs. While these techniques ensure correct operation in the presence of errors, the main drawback is that these techniques are rather expensive. The cost of a device which contains multiple replicas is higher.

In contrast to the expensive hardware redundancy techniques, time redundancy techniques cope with errors by repeating the same operation utilizing the given hardware resources. The correct output is obtained by repeating the same operation at least twice. This increases the time required to obtain the final outcome, and therefore it results in much higher time overhead.

Roll-back Recovery with Checkpointing (RRC) is a well-known fault tolerance technique that efficiently copes with soft errors. Unlike traditional time redundancy techniques, where upon error detection the program is restarted from the beginning, RRC stores checkpoints (intermediate states of the execution of the program), and when errors are detected, it forces the program to roll-back to the latest stored checkpoint. The advantage of this technique over other fault tolerance techniques is that it does not require a substantial amount of hardware redundancy. However, the major drawback of RRC is that it introduces time overhead that depends on the number of checkpoints that are used. Thus, RRC introduces a time overhead that may have a negative impact on the computer system where it is used.

In general, computer systems are classified into real-time systems (RTSs) and non-RTSs, depending on the requirement to meet time constraints. For RTSs, the correct operation is defined as producing the correct output while satisfying a given time constraint (deadline). Depending on the consequences when deadlines are violated, RTSs are divided into soft and hard RTS. For soft RTSs, the consequences are not very severe. One example of a soft RTS can be a mobile phone where eventual deadline violation results in a dropped call. On the other hand, violating the deadlines in hard RTSs usually results in catastrophic consequences. An example of a hard RTS can be the braking control in a vehicle. RTSs are also affected by soft errors, and therefore there is a need to employ fault tolerance in RTSs as well. However, special consideration should be taken when employing fault tolerance in RTSs, due to the fact that fault tolerance usually introduces a time overhead.

The time overhead due to usage of fault tolerance in RTSs, may result in a missed deadline. To mitigate this effect, it is important to optimize the usage of fault tolerance in RTSs. The optimization objectives differ among soft and hard RTSs. For soft RTSs, where eventual deadline violation results in some performance degradation, it is more important to minimize the average execution time (the average time needed for the operation to complete), while for hard RTSs, where it is crucial to meet the deadlines, it is more important to maximize the probability that the deadlines are met.

During the early design stage of an RTS, a designer of an RTS receives a specification of the RTS that is to be implemented. During this stage, the designer needs to explore different fault tolerance techniques and choose the one that satisfies the given specification requirements. To assist the designer in the decision making process, in this thesis, we provide an optimization framework for RRC when used in RTSs. By using this framework, the designer of an RTS can first decide if RRC is a suitable fault tolerance technique for the RTS that is to be implemented, and then if RRC is applicable, the designer can acquire knowledge on the number of checkpoints that need to be used and how these checkpoints need to be distributed. The proposed optimization framework considers multiple optimization objectives that are important for RTSs. In particular, for soft RTSs the optimization framework considers optimization of RRC with respect to AET. For hard RTSs, the optimization framework considers optimization of RRC with the goal to maximize the Level of Confidence (LoC), i.e. the probability that the deadlines are met. Since a specification of an RTS that is to be implemented may include some reliability requirements, in this thesis, we have introduced the concept of Guaranteed Completion Time, i.e. a completion time that satisfies a given reliability (LoC) constraint. The Guaranteed Completion Time varies with the number of checkpoints used in RRC. Therefore, the optimization framework considers optimization of RRC with respect to Guaranteed Completion Time. (Less)

Abstract

Increasing soft error rates in recent semiconductor technologies enforce the usage of fault tolerance. While fault tolerance enables correct operation in the presence of soft errors, it usually introduces a time overhead. The time overhead is particularly important for a group of computer systems referred to as real-time systems (RTSs) where correct operation is defined as producing the correct result of a computation while satisfying given time constraints (deadlines). Depending on the consequences when the deadlines are violated, RTSs are classified into soft and hard RTSs. While violating deadlines in soft RTSs usually results in some performance degradation, violating deadlines in hard RTSs results in catastrophic consequences. To... (More)

Increasing soft error rates in recent semiconductor technologies enforce the usage of fault tolerance. While fault tolerance enables correct operation in the presence of soft errors, it usually introduces a time overhead. The time overhead is particularly important for a group of computer systems referred to as real-time systems (RTSs) where correct operation is defined as producing the correct result of a computation while satisfying given time constraints (deadlines). Depending on the consequences when the deadlines are violated, RTSs are classified into soft and hard RTSs. While violating deadlines in soft RTSs usually results in some performance degradation, violating deadlines in hard RTSs results in catastrophic consequences. To determine if deadlines are met, RTSs are analyzed with respect to average execution time (AET) and worst case execution time (WCET), where AET is used for soft RTSs, and WCET is used for hard RTSs. When fault tolerance is employed in both soft and hard RTSs, the time overhead caused due to usage of fault tolerance may be the reason that deadlines in RTSs are violated. Therefore, there is a need to optimize the usage of fault tolerance in RTSs.

To enable correct operation of RTSs in the presence of soft errors, in this thesis we consider a fault tolerance technique, Roll-back Recovery with Checkpointing (RRC), that efficiently copes with soft errors. The major drawback of RRC is that it introduces a time overhead which depends on the number of checkpoints that are used in RRC. Depending on how the checkpoints are distributed throughout the execution of the job, we consider the two checkpointing schemes: equidistant checkpointing, where the checkpoints are evenly distributed, and non-equidistant checkpointing, where the checkpoints are not evenly distributed. The goal of this thesis is to provide an optimization framework for RRC when used in RTSs while considering different optimization objectives which are important for RTSs.

The purpose of such an optimization framework is to assist the designer of an RTS during the early design stage, when the designer needs to explore different fault tolerance techniques, and choose a particular fault tolerance technique that meets the specification requirements for the RTS that is to be implemented. By using the optimization framework presented in this thesis, the designer of an RTS can acquire knowledge if RRC is a suitable fault tolerance technique for the RTS which needs to be implemented. The proposed optimization framework includes the following optimization objectives.

For soft RTSs, we consider optimization of RRC with respect to AET. For the case of equidistant checkpointing, the optimization framework provides the optimal number of checkpoints resulting in the minimal AET. For non-equidistant checkpointing, the optimization framework provides two adaptive techniques that estimate the probability of errors and adjust the checkpointing scheme (the number of checkpoints over time) with the goal to minimize the AET.

While for soft RTSs analyses based on AET are sufficient, for hard RTSs it is more important to maximize the probability that deadlines are met. To evaluate to what extent a deadline is met, in this thesis we have used the statistical concept Level of Confidence (LoC). The LoC with respect to a given deadline defines the probability that a job (or a set of jobs) completes before the given deadline. As a metric, LoC is equally applicable for soft and hard RTSs. However, as an optimization objective LoC is used in hard RTSs. Therefore, for hard RTSs, we consider optimization of RRC with respect to LoC. For equidistant checkpointing, the optimization framework provides (1) for a single job, the optimal number of checkpoints resulting in the maximal LoC with respect to a given deadline, and (2) for a set of jobs running in a sequence and a global deadline, the optimization framework provides the number of checkpoints that should be assigned to each job such that the LoC with respect to the global deadline is maximized. For non-equidistant checkpointing, the optimization framework provides how a given number of checkpoints should be distributed such that the LoC with respect to a given deadline is maximized.

Since the specification of an RTS may have a reliability requirement such that all deadlines need to be met with some probability, in this thesis we have introduced the concept Guaranteed Completion Time which refers to a completion time such that the probability that a job completes within this time is at least equal to a given reliability requirement. The optimization framework includes Guaranteed Completion Time as an optimization objective, and with respect to the Guaranteed Completion Time, the framework provides the optimal number of checkpoints, while assuming equidistant checkpointing, that results in the minimal Guaranteed Completion Time. (Less)

@phdthesis{4a5448ba-1d49-411f-b6eb-68689cf7d9ce,
abstract = {Increasing soft error rates in recent semiconductor technologies enforce the usage of fault tolerance. While fault tolerance enables correct operation in the presence of soft errors, it usually introduces a time overhead. The time overhead is particularly important for a group of computer systems referred to as real-time systems (RTSs) where correct operation is defined as producing the correct result of a computation while satisfying given time constraints (deadlines). Depending on the consequences when the deadlines are violated, RTSs are classified into soft and hard RTSs. While violating deadlines in soft RTSs usually results in some performance degradation, violating deadlines in hard RTSs results in catastrophic consequences. To determine if deadlines are met, RTSs are analyzed with respect to average execution time (AET) and worst case execution time (WCET), where AET is used for soft RTSs, and WCET is used for hard RTSs. When fault tolerance is employed in both soft and hard RTSs, the time overhead caused due to usage of fault tolerance may be the reason that deadlines in RTSs are violated. Therefore, there is a need to optimize the usage of fault tolerance in RTSs.<br/><br>
<br/><br>
To enable correct operation of RTSs in the presence of soft errors, in this thesis we consider a fault tolerance technique, Roll-back Recovery with Checkpointing (RRC), that efficiently copes with soft errors. The major drawback of RRC is that it introduces a time overhead which depends on the number of checkpoints that are used in RRC. Depending on how the checkpoints are distributed throughout the execution of the job, we consider the two checkpointing schemes: equidistant checkpointing, where the checkpoints are evenly distributed, and non-equidistant checkpointing, where the checkpoints are not evenly distributed. The goal of this thesis is to provide an optimization framework for RRC when used in RTSs while considering different optimization objectives which are important for RTSs. <br/><br>
<br/><br>
The purpose of such an optimization framework is to assist the designer of an RTS during the early design stage, when the designer needs to explore different fault tolerance techniques, and choose a particular fault tolerance technique that meets the specification requirements for the RTS that is to be implemented. By using the optimization framework presented in this thesis, the designer of an RTS can acquire knowledge if RRC is a suitable fault tolerance technique for the RTS which needs to be implemented. The proposed optimization framework includes the following optimization objectives.<br/><br>
<br/><br>
For soft RTSs, we consider optimization of RRC with respect to AET. For the case of equidistant checkpointing, the optimization framework provides the optimal number of checkpoints resulting in the minimal AET. For non-equidistant checkpointing, the optimization framework provides two adaptive techniques that estimate the probability of errors and adjust the checkpointing scheme (the number of checkpoints over time) with the goal to minimize the AET.<br/><br>
<br/><br>
While for soft RTSs analyses based on AET are sufficient, for hard RTSs it is more important to maximize the probability that deadlines are met. To evaluate to what extent a deadline is met, in this thesis we have used the statistical concept Level of Confidence (LoC). The LoC with respect to a given deadline defines the probability that a job (or a set of jobs) completes before the given deadline. As a metric, LoC is equally applicable for soft and hard RTSs. However, as an optimization objective LoC is used in hard RTSs. Therefore, for hard RTSs, we consider optimization of RRC with respect to LoC. For equidistant checkpointing, the optimization framework provides (1) for a single job, the optimal number of checkpoints resulting in the maximal LoC with respect to a given deadline, and (2) for a set of jobs running in a sequence and a global deadline, the optimization framework provides the number of checkpoints that should be assigned to each job such that the LoC with respect to the global deadline is maximized. For non-equidistant checkpointing, the optimization framework provides how a given number of checkpoints should be distributed such that the LoC with respect to a given deadline is maximized.<br/><br>
<br/><br>
Since the specification of an RTS may have a reliability requirement such that all deadlines need to be met with some probability, in this thesis we have introduced the concept Guaranteed Completion Time which refers to a completion time such that the probability that a job completes within this time is at least equal to a given reliability requirement. The optimization framework includes Guaranteed Completion Time as an optimization objective, and with respect to the Guaranteed Completion Time, the framework provides the optimal number of checkpoints, while assuming equidistant checkpointing, that results in the minimal Guaranteed Completion Time.},
author = {Nikolov, Dimitar},
isbn = {978-91-7623-147-0},
issn = {1654-790X},
keyword = {soft errors,fault tolerance,real-time systems,checkpointing,reliability},
language = {eng},
pages = {232},
publisher = {Lund University},
school = {Lund University},
title = {Fault Tolerance for Real-Time Systems: Analysis and Optimization of Roll-back Recovery with Checkpointing},
year = {2014},
}