Functional safety implementations in modern MCUs

Implementation of safety measures is on the rise in today's automotive world in order to minimize the hazards in case of system malfunction. Today's automobiles run various safety-critical applications such as ABS, electronic power steering, airbag sensors, radar sensing, and other chassis-related applications. All these safety-critical automotive operations need compliance with the ISO 26262 (ASILx) and IEC 61508 (SILx) standards, as their safe operation is directly linked to human and social safety.

This article discusses the key functional safety features present in modern semiconductor devices, allowing customers to run safety-relevant tasks in their applications. Later we will give some examples using Freescale Semiconductor devices such as the MPC5675K, MPC5643L, and MPC574xx.

Functional safety requirements

Functional safety is concerned with minimizing the hazards resulting from a faulty system. The faults in a system may occur because of hardware or software errors, may be permanent or transient, and may be random or systematic. The following are the possible reactions when an error occurs:

Fail-dangerous: Possibly causes a hazard in the case of a failure

Fail-inconsistent: Provided results will be noticeably inconsistent in the case of a failure

Fail-stop: Completely stops itself in the case of a failure

Fail-safe: Returns to or stays in a safe state in the case of a failure

Fail-operational: Continues to work correctly in the case of a failure

Fail-silent: Stops producing output in the case of a failure rather than emitting incorrect results, so it does not disturb the rest of the system

Fail-indicate: Indicates to its environment that it has failed

The implementation of functional safety in a system typically means "mapping" the first three reactions above into one of the last four, ensuring that minimal hazard results from a system failure.

The next section discusses various functional safety implementations available in system-on-chips (SoCs) that allow device operation in any of the last four reactions listed above in case of system failure.

General safety implementation

Before discussing the key modules related to functional safety, let us first briefly discuss general industry-standard implementations:

1) Checker core: Ensuring safe operation of the core in an SoC is one of the prime requirements for functional safety. Generally, this is achieved by implementing a checker core which executes the same instructions as the main core; the address and data buses from the two cores are compared in a checker unit to detect operational deviations. Depending on the nature of the error, the system may generate a reset or a maskable/non-maskable interrupt. From a software viewpoint, the system behaves as a single core. (See figure 1 below for a block diagram.)
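The checker-core arrangement can be modeled in software as two redundant computations compared each step. A minimal sketch, with illustrative function names (not any Freescale API):

```python
def lockstep_step(main_core, checker_core, inputs, raise_fault):
    """Model of one lockstep cycle: both cores execute the same inputs,
    and a checker unit compares their outputs before they leave the core."""
    out_main = main_core(inputs)
    out_checker = checker_core(inputs)
    if out_main != out_checker:
        # In hardware this would trigger a reset or (non-)maskable interrupt.
        raise_fault(out_main, out_checker)
        return None
    return out_main  # software only ever sees a single core's result

# A healthy pair agrees; a deviating checker core flags a fault.
faults = []
report = lambda m, c: faults.append((m, c))
alu = lambda x: x + 1
assert lockstep_step(alu, alu, 41, report) == 42
assert lockstep_step(alu, lambda x: x + 2, 41, report) is None
assert faults == [(42, 43)]
```

Note that, as the last assertion shows, the compare only detects the deviation; it cannot tell which core is wrong.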

Apart from the core, other safety-relevant modules such as the eDMA (enhanced direct memory access), interrupt controller, cache, and RAM can be similarly replicated in the system, maintaining physical separation on the die so that common-cause faults (CCFs) do not affect both copies of a module in the same way.

2) Safe clock mechanism: To keep the system independent of external clocks during safe operation, Freescale automotive SoCs implement a safe clock. This safe clock is provided by an internal RC (IRC) oscillator which is available as soon as the device comes out of reset. Its availability ensures that the system has a clock to operate on even if the internal PLL fails for some reason; for the same reason, all safety-critical modules should run only on the safe clock. The IRC oscillator is trimmable to maintain clock consistency across PVT (process, voltage, and temperature).
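The fallback behavior can be captured as a tiny state machine. A sketch only: the class name and `IRC_HZ` value are hypothetical, not register-level detail from any specific device:

```python
IRC_HZ = 16_000_000  # hypothetical trimmed internal RC oscillator frequency

class SafeClockMux:
    """Toy model of a clock monitor that falls back to the internal RC
    oscillator (the safe clock) whenever PLL lock is lost."""
    def __init__(self):
        self.pll_locked = False   # PLL not yet locked out of reset
        self.source = "IRC"       # IRC is available immediately after reset

    def monitor(self):
        # Use the PLL only while it reports lock; any loss of lock
        # moves the system clock back to the safe IRC source.
        self.source = "PLL" if self.pll_locked else "IRC"
        return self.source

clk = SafeClockMux()
assert clk.monitor() == "IRC"   # running on the safe clock out of reset
clk.pll_locked = True
assert clk.monitor() == "PLL"
clk.pll_locked = False          # PLL failure
assert clk.monitor() == "IRC"   # the system still has a clock
```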

3) ECC implementation in memory: All memory storage operations can be protected by implementing ECC (error correction code) with SECDED (single error correct, double error detect) and a Hamming distance of 4. The ECC is computed over the data, address, and control signals and is stored along with the data in the memory during writes. When a read operation is initiated, the ECC is re-calculated over the address, data, and control signals and verified against the stored ECC.

Key safety implementation

Let us now discuss in depth the key safety features available on some Freescale devices meant for automotive safety applications.

End-to-end ECC (E2EECC) protection

In Freescale MPC574x devices, for instance, instead of the general ECC implementation, there is an E2EECC implementation which allows detection of data corruption on all data paths between the "masters" and any "client" with at least 99% coverage. The mechanism is as follows:

Data from the masters is encoded using ECC-SECDED code. This data encoding includes coverage of addressing information.

At the client side, the control signals and address decoding are monitored to verify the correctness of the data initiated from the master.
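The master/client steps above can be sketched with a toy check code. Note the hedges: a real E2EECC uses SECDED check bits rather than the single parity bit here, and all names are illustrative:

```python
def check_bits(address, data):
    """Toy stand-in for SECDED check bits, computed over address and data
    together so that addressing faults are also detectable."""
    word = (address << 32) | data
    parity = 0
    while word:
        parity ^= word & 1
        word >>= 1
    return parity

def master_write(memory, address, data):
    # Master side: encode once, store the check bits alongside the data.
    memory[address] = (data, check_bits(address, data))

def client_read(memory, address):
    # Client side: re-derive the check bits and verify before use.
    data, stored = memory[address]
    if check_bits(address, data) != stored:
        raise IOError("E2EECC: corruption detected on the data path")
    return data
```

Because the check bits travel with the data from master to client, a flip anywhere along the path is caught at the point of use rather than silently consumed.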

The above approach ensures that data corruption on the data paths does not go undetected. There is a central Memory Error Management Unit (MEMU) present in the system which collects and reports error events associated with the ECC logic used on SRAM, peripheral system RAM, and flash memory. When any correctable (single-bit ECC) or uncorrectable (multiple-bit ECC) error occurs, the MEMU receives an error signal which causes an event to be recorded and the corresponding error flags to be set and reported to the FCCU (fault collection and control unit).
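The reporting chain can be sketched as follows; the class and callback names are illustrative, not the actual MEMU/FCCU register interface:

```python
class ToyMEMU:
    """Collects ECC error events and escalates uncorrectable ones to an
    FCCU-like fault handler."""
    def __init__(self, fccu_notify):
        self.fccu_notify = fccu_notify
        self.correctable = []      # single-bit ECC events: record and continue
        self.uncorrectable = []    # multi-bit ECC events: record and escalate

    def report(self, source, address, correctable):
        if correctable:
            self.correctable.append((source, address))
        else:
            self.uncorrectable.append((source, address))
            self.fccu_notify(source, address)  # FCCU chooses the fault reaction

fccu_events = []
memu = ToyMEMU(lambda src, addr: fccu_events.append((src, addr)))
memu.report("SRAM", 0x40000000, correctable=True)    # logged only
memu.report("flash", 0x00800000, correctable=False)  # reaches the FCCU
assert fccu_events == [("flash", 0x00800000)]
```

The design point this mirrors is centralization: individual memories do not decide the fault reaction; they only report, and the FCCU applies a single system-wide policy (interrupt, safe state, or reset).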

Of course, SIL4 and fault tolerance are not the same concept, but SIL4 requires in almost all cases a fault-tolerant architecture (as you say as well). The issue is that the standards often allow the safety risk to be minimized because they define SIL as a function of probability of occurrence, severity, and controllability. Very subjective. And because they rely on a Hazard and Risk Analysis, most likely not covering all the risks (as there is not always a historical track record, especially in automotive). Not to speak of the unpredictable nature of the environment and the operator.

The other issue is that SIL thinking is rooted in the times when things were mostly linear (analog, continuous domain), where probabilities and graceful degradation (still) apply. Digital electronics and software, however, are in the digital (discrete, non-linear) domain. One small bit flip and 20 nanoseconds later, the system can have failed. The state space is so large that fault tree analysis techniques can never go to this level of detail. The point is also that software is like a virtual machine sitting on top of a discrete state machine sitting on top of a semiconductor device (which is again in the continuous domain). The hidden assumption for software is not so much that it is error-free (more or less true when using formal methods), but that the hardware is always fault-free. Hence we have a hierarchy of levels: at the chip level, reliability margins apply; at the discrete level, micro-redundancy; at the software level, block-level redundancy; and at the system level, macro-level redundancy. There is an additional level that takes into account residual common mode failures and that requires diversity as well. We have developed a criterion, called ARRL (Assured Reliability and Resilience Level), that takes this analysis into account. Draft white paper on request (I need an email to send it to).

The benefit of this approach is that it becomes possible to characterise components (or subsystem entities) in terms of how they deal with failures, and one can reuse them from one domain to another, also in the context of safety-critical systems (in essence, the components carry a contract with them). One can also define rules on how to reach higher ARRL (and hence SIL) levels by composition. Note also that SIL and ARRL are complementary: they meet in the middle (just like a HARA and an FMEA do).

The point I wanted to make is that there is no reason why an MCU can't be made "fault tolerant" by default. Gates are almost free these days. And while lockstepping CPUs can help, they are not a miracle solution for safety. They basically only allow you to detect that there is a fault, not to correct it. Safety comes from masking out such internal faults so that the system continues to deliver its service. Using 2 such chips (2oo4) is a higher-level remedy (but watch out for common mode failures, e.g. power issues). The other, in my view more interesting, approach is already in use in space (and as far as I know in high-end server chips like IBM's POWER7): make the logic cells fault tolerant (triplicate the gates). A very nice and recent example is Microsemi's SmartFusion2. And it is not expensive. Certainly less expensive than developing a fault detection and correction architecture around traditional chips.
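Gate-level triplication as described in this comment reduces to majority voting. A generic bitwise 2-of-3 voter is one line of logic (a textbook sketch, not Microsemi's implementation):

```python
def majority_vote(a, b, c):
    """Bitwise 2-of-3 majority: any single faulty replica is masked,
    so the fault is corrected, not merely detected."""
    return (a & b) | (b & c) | (a & c)

good = 0b1011_0010
assert majority_vote(good, good, good) == good
assert majority_vote(good, good ^ 0b0100, good) == good  # one corrupt copy masked
```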

Fault tolerant and SIL4 are not equivalent terms. Fault tolerant refers to the ability of a system or function to operate correctly even though one or more of its component parts are malfunctioning. SIL4 refers to the required or achieved probability or rate of failure of a safety system or function. Fault-tolerant systems vary by how many simultaneous faults they can detect and by how many of those faults they can correct. It is only implicit that higher SIL levels generally require greater degrees of fault tolerance.
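The detect-versus-correct distinction can be made concrete: a lone lock-step (duplex) channel can only detect one fault, while pairing two such channels lets the system ride through a first fault. A generic sketch under those assumptions, not any vendor's architecture:

```python
def duplex(a, b):
    """Lock-step pair: a mismatch is detectable, but there is no way to
    tell which replica is wrong, so nothing can be corrected."""
    return a if a == b else None   # None = fault detected

def two_out_of_four(pair1, pair2):
    """Two duplex channels (2oo4): one faulty pair is survivable because
    the other self-checking pair still delivers a trusted result."""
    for pair in (pair1, pair2):
        result = duplex(*pair)
        if result is not None:
            return result
    raise RuntimeError("both pairs disagree: fail-safe stop")

assert two_out_of_four((7, 7), (7, 7)) == 7
assert two_out_of_four((7, 9), (7, 7)) == 7   # first fault: still operational
```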

I am confused by this statement. Functional Safety refers to the part of the overall safety of a system or function that depends on the system or function operating correctly in response to its inputs. Thus, Functional Safety depends on the hazard and on what the correct function is. These microcontrollers >are< fault tolerant to the degree to which they are capable. ECC-SECDED means that the microcontroller can tolerate up to 2 simultaneous bit-flip faults in any word at any time: 1 bit flip results in no effect, and 2 bit flips result in a trigger that can be used to safe the microcontroller. That is fault tolerant, but whether it is fault tolerant enough depends on the particular Functional Safety requirements placed upon the microcontroller. Dual lock-step cores are fault tolerant in that one fault is detectable. That is enough for some Functional Safety cases, but not for others. Triplication can provide 1-fault correction, but end-to-end triplication is exceptionally complex, and in distributed functions exposes the system to Byzantine faults. The trend to accomplish guarantees of 1-fault correction is not triplication but Quadruple Modular Redundancy (2oo4); that is, the pairing of lock-step microcontrollers, or the implementation of 2 pairs of lock-step cores in a microcontroller (see FSL's QUASAR project).
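The SECDED behaviour described here (one flip corrected transparently, two flips raising a trigger) can be demonstrated with a classic extended Hamming(8,4) code. This is a generic textbook construction, not the bit layout any particular MCU uses:

```python
def hamming_secded_encode(nibble):
    """Encode 4 data bits into an 8-bit SECDED codeword:
    Hamming(7,4) plus an overall parity bit (Hamming distance 4)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    p1 = d[0] ^ d[1] ^ d[3]                        # covers positions 1,3,5,7
    p2 = d[0] ^ d[2] ^ d[3]                        # covers positions 2,3,6,7
    p4 = d[1] ^ d[2] ^ d[3]                        # covers positions 4,5,6,7
    bits = [p1, p2, d[0], p4, d[1], d[2], d[3]]    # codeword positions 1..7
    word = 0
    for i, b in enumerate(bits, start=1):
        word |= b << i
    overall = 0
    for b in bits:
        overall ^= b
    return word | overall                          # bit 0 = overall parity

def hamming_secded_decode(word):
    """Return (data, status): 'ok', 'corrected' (1 flip), or 'double' (2 flips)."""
    bits = [(word >> i) & 1 for i in range(8)]
    syndrome = 0
    for pos in range(1, 8):
        if bits[pos]:
            syndrome ^= pos        # non-zero syndrome points at the bad position
    overall = 0
    for b in bits:
        overall ^= b
    if syndrome == 0 and overall == 0:
        status = 'ok'              # clean read: no effect
    elif overall == 1:
        bits[syndrome] ^= 1        # single flip: corrected transparently
        status = 'corrected'
    else:
        return None, 'double'      # two flips: trigger, data not trusted
    data = bits[3] | (bits[5] << 1) | (bits[6] << 2) | (bits[7] << 3)
    return data, status

w = hamming_secded_encode(0b1011)
assert hamming_secded_decode(w) == (0b1011, 'ok')
assert hamming_secded_decode(w ^ 0b0100) == (0b1011, 'corrected')
assert hamming_secded_decode(w ^ 0b0101) == (None, 'double')
```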

Freescale is committed to helping system manufacturers more easily achieve system compliance with functional safety standards (ISO 26262 and IEC 61508). Through our new SafeAssure functional safety program, engineers can easily identify Freescale hardware and software solutions that are optimally designed to support functional safety implementations. There’s more info about these as well as our safety processes and support at Freescale.com/safeassure
-Aaron McDonald, Freescale

Although a very relevant article, in terms of functional safety this is not fault tolerant. It allows the system to fail "safely" (as when driving 200 km/hr).
The extra cost for triplication and voting is very minimal (given today's silicon dimensions) and could seriously reduce the development cost of fault tolerance support. The default approach should be fault tolerant (SIL4), so that when there is a failure the system drops to SIL3: still fully functional, but only a second failure leads to a fail-safe stop.
eric.verhulst (at=@) altreonic.com