3 Radiation Effects
- Transient faults (or soft errors)
  - Occur when particles strike a device, depositing or removing energy and inverting transistor state
  - Usually observed as a bit-flip
- To study these effects in the lab, some form of fault injection can be used
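
A single bit-flip of the kind described above can be modelled with an XOR mask. This is only a minimal sketch: the function name `flip_bit` and the 32-bit default width are assumptions made for illustration and do not come from the slides.

```python
def flip_bit(value: int, bit: int, width: int = 32) -> int:
    """Return `value` with one bit inverted, modelling a single soft error."""
    if not 0 <= bit < width:
        raise ValueError("bit index outside the register width")
    # XOR with a one-hot mask inverts exactly one bit; the final AND keeps
    # the result inside the assumed register width.
    return (value ^ (1 << bit)) & ((1 << width) - 1)


original = 0b00001010
corrupted = flip_bit(original, 2)
print(f"{original:08b} -> {corrupted:08b}")  # 00001010 -> 00001110
```

Applying the same flip twice restores the original value, which is convenient when a simulated fault has to be undone.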

4 Hardware Fault-Injection
- Using a radiation beam or electromagnetic interference
  - Similar to what a device would experience in a harsh environment
- Using probes to introduce voltage or current changes
- Advantage
  - Closely resembles real-world effects on the device
- Disadvantages
  - Possible to damage the device under test
  - Device under test must be modified to perform injection

5 Software Fault-Injection
- Compile-time injection
  - Corrupts an application’s instructions during compilation
- Runtime injection
  - Uses a trigger mechanism to inject faults during execution
- Faults can be targeted at any software-visible component
- Advantage
  - Device under test does not need to be modified
- Disadvantage
  - Possible to disturb the processing workload in unintended ways

6 Simulation Fault-Injection
- Fault injection can be performed in a simulation of the system
- Advantages
  - Injections are transparent to the target system
  - Simulation offers the greatest amount of controllability and observability
- Disadvantages
  - Building a simulation for the target device is not a trivial task
  - Faults in the physical system may not manifest in simulation

7 Fault Tolerance
- Usually involves some form of redundancy
- Hardware Fault-Tolerance
  - Memory and caches can be protected with ECC or parity
  - TMR (Triple Modular Redundancy) is one of the most common forms of HW FT
  - [Figure: example of TMR]
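
The voting step at the heart of TMR can be sketched in a few lines. The function name `tmr_vote` is hypothetical, and the three integer arguments stand in for the outputs of three redundant modules; a hardware voter would implement the same logic in gates.

```python
def tmr_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority of three redundant outputs.

    Each result bit is 1 when at least two of the three inputs have a 1 in
    that position, so a fault confined to one module is outvoted.
    """
    return (a & b) | (a & c) | (b & c)


good = 0b1010
faulty = good ^ 0b0100          # one module suffers a single bit-flip
print(bin(tmr_vote(good, faulty, good)))  # 0b1010
```

The voter masks any fault confined to one module, but it cannot correct two modules that are wrong in the same bit position.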

8 Fault Tolerance
- Hardware Fault-Tolerance (cont’d)
  - Hardware devices can also be fabricated using processes that are less susceptible to radiation effects
  - Radiation hardening can be prohibitively expensive and time consuming
  - RadHard devices are generations behind their COTS counterparts in terms of performance and power consumption
- Software Fault-Tolerance
  - Very cost-effective approach compared to hardware FT
  - Does not require any modification to the device architecture
  - Leverages high-performance, low-power commercial off-the-shelf (COTS) components

11 Overview
- Detailed Verilog model created for a microprocessor architecture, similar in complexity to the Alpha or AMD Athlon
- Created a methodology for performing fault injection on a detailed latch-level simulation of a complex processor
- Studied the propagation and/or masking of faults from the micro-architectural level to the architectural level

13 Fault-Injection Methodology
- A time at which to inject the fault is first selected
  - Randomly selected from start points
- Then the bit to corrupt is randomly selected
  - Injected faults are a single bit-flip of a state element
- The trial is monitored for up to 10,000 cycles
  - At each cycle, architectural state is verified against a non-injected golden execution
- Trials are placed into four categories depending on the outcome
- Each experiment consists of 25,000-30,000 trials
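
The trial procedure above can be sketched against a toy machine. Everything here is an assumption made for illustration: the three-register machine, the `step` function, and the names `run_trial` and `architectural` are hypothetical stand-ins, not the latch-level Verilog model the study used.

```python
import random

WIDTH = 16
MASK = (1 << WIDTH) - 1


def step(state):
    """Advance the toy machine one cycle; state = [counter, acc, scratch]."""
    counter, acc, scratch = state
    return [(counter + 1) & MASK, (acc + counter) & MASK, scratch]


def architectural(state):
    """Software-visible part of the state (assumed: the first two registers)."""
    return tuple(state[:2])


def run_trial(rng, start_points, max_cycles=10_000):
    """Inject one bit-flip and compare against a golden run every cycle."""
    inject_cycle = rng.choice(start_points)   # 1. pick the injection time
    element = rng.randrange(3)                # 2. pick the state element...
    bit = rng.randrange(WIDTH)                #    ...and the bit to corrupt

    golden = [0, 0, 0]
    for _ in range(inject_cycle):
        golden = step(golden)
    faulty = list(golden)
    faulty[element] ^= 1 << bit               # single bit-flip

    # 3. monitor for up to max_cycles, checking architectural state each cycle
    for cycle in range(max_cycles):
        if architectural(faulty) != architectural(golden):
            return element, bit, inject_cycle + cycle
        golden, faulty = step(golden), step(faulty)
    return element, bit, None                 # no architectural divergence seen


rng = random.Random(0)
trials = [run_trial(rng, start_points=[10, 100, 1000], max_cycles=500)
          for _ in range(20)]
print(sum(1 for t in trials if t[2] is None), "of", len(trials),
      "trials showed no architectural divergence")
```

In this toy model a flip in the `scratch` register never reaches architectural state, which loosely mirrors faults that stay hidden at the micro-architectural level.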

14 Trial Outcome Categories
- Micro-architectural state match
  - Occurs when every bit of state in the machine is equivalent to a non-fault-injected simulation
- Termination
  - Premature termination of the workload (execution error)
- Silent data corruption
  - Trials that result in software-visible register or memory corruption (data error)
- Gray area
  - Trial that does not result in failure (termination or silent data corruption) or micro-architectural state match
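
The four categories can be expressed as a small classifier. This is a sketch under assumptions: the three boolean inputs and the order in which they are checked are illustrative choices, since the slides do not say how the study's tooling resolved a trial that meets more than one condition.

```python
from enum import Enum


class Outcome(Enum):
    UARCH_MATCH = "micro-architectural state match"
    TERMINATION = "termination"
    SDC = "silent data corruption"
    GRAY_AREA = "gray area"


def classify(terminated_early: bool,
             arch_state_corrupted: bool,
             uarch_state_matches: bool) -> Outcome:
    """Map the observations from one trial to an outcome category."""
    if terminated_early:                 # execution error
        return Outcome.TERMINATION
    if arch_state_corrupted:             # data error visible to software
        return Outcome.SDC
    if uarch_state_matches:              # fault fully masked
        return Outcome.UARCH_MATCH
    return Outcome.GRAY_AREA             # neither failed nor fully masked


print(classify(False, False, False).value)  # gray area
```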

16 Results
- This chart shows which types of state (relative to their contribution to overall state) contribute to silent data corruption and terminated results
- Register file corruption is the leading cause of silent data corruption (data errors) and terminated (execution errors) outcomes

17 Results
- Although noise is present in the graph, a correlation between processor utilization and benign fault rate can be seen
- As the number of valid instructions (those that will commit results) in the pipeline decreases, the benign fault rate increases
  - Benign faults do not affect program correctness

18 Shortfalls
- Some instructions of the Alpha ISA were not implemented in the processor model
- 10,000 cycle limit for monitoring is quite low
  - Certainly not enough time for most benchmarks to complete
- Certain components were ignored for fault injection
  - These include caches and prediction structures
- Corrupted registers were considered application failures
  - However, I have observed in my research that the majority of faults targeted at registers do not affect program execution or output
  - In my research I use the Simics cycle-accurate system simulation environment to perform fault injections into the register file of the Freescale P2020 dual-core PowerPC-based processor

20 Simics Simulation Fault-Injection Results
- The Simics simulation does not have the level of detail needed to perform fault injection at the micro-architectural level, but does allow for register file fault-injection
- The chart below shows results obtained when injecting single-bit faults into each of the general purpose registers during a matrix multiplication application

23 Process-Level Redundancy
- Similar to the TMR hardware fault-tolerance scheme
- Creates a set of redundant processes for an application and compares each output to ensure correct execution
- Leverages multiple processing cores by allowing the operating system to schedule redundant processes to available cores
- Biggest challenge is maintaining determinism
- Transparency can be achieved by maintaining user-expected process semantics
- Does not require any modifications to the target application, operating system, or device architecture
  - Important for legacy binaries whose source is no longer available

24 Sphere of Replication
- Specifies the boundary for fault detection and containment
  - Data entering the SoR is replicated
  - All execution within the SoR is redundant
  - Any data leaving the SoR is compared to check for faults
  - Any execution outside the SoR is not protected
- A typical hardware-centric SoR is shown on the left
- PLR’s software-centric SoR is shown on the right

26 Maintaining Process Semantics
- Example semantics:
  - Each application is assigned a process identifier (PID), which exists throughout execution and is returned to the operating system after completion
  - When an application exits, it returns the correct exit code
  - A signal that is sent to a valid PID will have the intended effects (e.g. SIGKILL will kill the process)
- Figurehead process
  - Original process becomes the figurehead process after redundant processes are created
  - Does not perform any real work

27 Maintaining Process Semantics
- Figurehead process (cont’d)
  - Sleeps and waits for redundant processes to complete
  - Receives the application exit value and exits correctly
  - Responsible for forwarding incoming signals to all redundant processes
- Monitor process
  - Certain signals are not easily forwarded
    - A SIGKILL signal would kill the figurehead process, but leave behind all redundant processes
  - Monitor process polls the state of the figurehead process
  - If the figurehead is killed or stopped, the monitor process will kill or stop the redundant processes

28 Maintaining Determinism & Transparency
- System call emulation unit
  - Responsible for input replication, output comparison, and system call emulation
  - Responsible for ensuring that redundant processes interacting with the system appear as if only the original process is executing
  - System calls that return nondeterministic data (such as the system time) must be emulated to ensure all processes use the same data
- Master vs. slave processes
  - System calls that modify any system state are only executed by the master process
  - Other system calls are performed once for the master process and replicated for the slave processes
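
The division of work between master and slave processes can be sketched with an in-process stand-in for the emulation unit. The class `SyscallEmulationUnit`, its methods, and the `writes` list that stands in for externally visible system state are all hypothetical; PLR itself intercepts real system calls, which this sketch does not attempt.

```python
import time


class SyscallEmulationUnit:
    """Toy model: one real call per request, results replicated to replicas."""

    def __init__(self, num_replicas, clock=time.time):
        self.num_replicas = num_replicas
        self.clock = clock
        self.writes = []          # stands in for system state outside the SoR

    def get_time(self):
        """Nondeterministic input: performed once, then replicated."""
        value = self.clock()
        return [value] * self.num_replicas

    def write(self, buffers):
        """State-modifying call: outputs compared, executed once (master)."""
        if len(buffers) != self.num_replicas or len(set(buffers)) != 1:
            raise RuntimeError("output mismatch between redundant processes")
        self.writes.append(buffers[0])
        # Every replica sees the same return value as the master.
        return [len(buffers[0])] * self.num_replicas


unit = SyscallEmulationUnit(num_replicas=3)
times = unit.get_time()
print(len(set(times)) == 1)           # True: all replicas see one timestamp
print(unit.write([b"result"] * 3))    # [6, 6, 6]
```

Because the clock is read only once, the replicas cannot diverge on the timestamp, and the write reaches the outside world a single time.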

29 Fault Detection
- The system call emulation unit is responsible for providing fault detection and recovery
- A fault causing the application to hang can be detected by a watchdog timer attached to the emulation unit
  - The timer begins when a process enters the unit
  - If the rest of the processes do not enter the unit within a specified amount of time, an execution error is signaled
- Faults causing control-flow errors can also be detected if the processes do not all request the same system call when entering the emulation unit
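
Both detection checks can be sketched as one function over recorded arrival times. The function `check_rendezvous`, its status strings, and the `arrivals` mapping are assumptions made for illustration; a real watchdog would run a timer rather than inspect timestamps after the fact.

```python
def check_rendezvous(arrivals, expected_pids, timeout):
    """Check one entry into the emulation unit.

    arrivals maps pid -> (syscall_name, arrival_time_seconds) for the
    processes that have entered the unit so far.
    Returns (status, suspect_pids).
    """
    if not arrivals:
        return "idle", []
    first = min(t for _, t in arrivals.values())   # watchdog starts here
    late = [pid for pid in expected_pids
            if pid not in arrivals or arrivals[pid][1] - first > timeout]
    if late:
        return "watchdog timeout", late            # execution error
    if len({name for name, _ in arrivals.values()}) > 1:
        return "system call mismatch", sorted(arrivals)   # control-flow error
    return "ok", []


pids = [101, 102, 103]
print(check_rendezvous({101: ("write", 0.0), 102: ("write", 0.1)}, pids, 1.0))
# ('watchdog timeout', [103])
```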

30 Fault Recovery
- If an output mismatch occurs, a majority vote can be used to kill the process producing incorrect data
  - The bad process is then replaced by forking a correct process
- A watchdog timeout can occur in two cases
  - If a faulty process calls the emulation unit while the other processes are executing, it is killed and replaced by forking a correct process at the next system call
  - If a faulty process hangs while the other processes are waiting in the emulation unit, it is killed and replaced by a correct process
- If a process fails, it is simply replaced by duplicating one of the remaining processes
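
Majority voting followed by replacement can be sketched with plain data structures. The dictionaries with `state` and `output` keys and the function `vote_and_recover` are hypothetical, and `copy.deepcopy` merely stands in for forking a correct process.

```python
import copy
from collections import Counter


def vote_and_recover(replicas):
    """Pick the majority output and replace any replica that disagrees.

    replicas is a list of dicts with 'state' and 'output' keys.
    Returns (majority_output, indices_of_replaced_replicas).
    """
    counts = Counter(r["output"] for r in replicas)
    winner, votes = counts.most_common(1)[0]
    if votes <= len(replicas) // 2:
        raise RuntimeError("no majority; voting cannot identify the fault")
    donor = next(r for r in replicas if r["output"] == winner)
    replaced = []
    for i, replica in enumerate(replicas):
        if replica["output"] != winner:
            replicas[i] = copy.deepcopy(donor)   # stand-in for fork()
            replaced.append(i)
    return winner, replaced


procs = [{"state": {"r3": 7}, "output": b"42"},
         {"state": {"r3": 7}, "output": b"42"},
         {"state": {"r3": 9}, "output": b"43"}]   # corrupted replica
print(vote_and_recover(procs))  # (b'42', [2])
```

With only two replicas a mismatch can be detected but not attributed, which is why the sketch raises when no strict majority exists.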

31 Results
- PLR eliminates all failed, abort, and incorrect cases
- Output comparison converts abort and incorrect cases to mismatches
- PLR detects failed cases, converting them into sighandler cases
- A small number of failed cases are detected as mismatch with PLR
  - The mismatch is caught before the application can fail
- Some floating-point benchmarks actually caused correct outcomes to become mismatches with PLR enabled
  - The specdiff tool included with the benchmarks uses a tolerance when checking output data, whereas PLR’s output comparison checks raw data
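
The difference between the two checks is easy to reproduce with ordinary floating-point arithmetic. The helper names and the relative tolerance of 1e-9 are assumptions chosen for illustration; they are not the settings used by specdiff or by PLR.

```python
import math
import struct


def raw_equal(a: float, b: float) -> bool:
    """Compare the raw 8-byte encodings, as a byte-wise output check would."""
    return struct.pack("<d", a) == struct.pack("<d", b)


def tolerant_equal(a: float, b: float, rel_tol: float = 1e-9) -> bool:
    """Compare within a relative tolerance, as a tolerance-based check would."""
    return math.isclose(a, b, rel_tol=rel_tol)


x = 0.1 + 0.2      # differs from 0.3 in the last bit of the mantissa
y = 0.3
print(raw_equal(x, y), tolerant_equal(x, y))  # False True
```

A result that differs only in its last bits passes the tolerant check but fails the raw one, which is how an outcome counted as correct can turn into a mismatch.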

33 Shortfalls
- The functionality of the system call emulation unit is described in detail, but few implementation details are provided
  - Replicating the results would be hard without more specific implementation details
- Faults occurring during PLR code or operating system execution are not protected against
- Only supports single-threaded applications
- May not function as intended if using more redundant processes than physical cores available
  - Timeouts assume all processes are running concurrently

34 Conclusions
- Simulation Fault-Injection
  - Allowed for injections to target areas not accessible to software or hardware fault-injection tools
  - Showed that many faults are masked before they are even visible to software
- Process-Level Redundancy
  - Software fault-tolerance scheme
  - Similar to the triple modular redundancy hardware scheme
  - Transparent to the system and target application
  - Does not require any user intervention to apply protection
  - Able to detect all application failures and incorrect output