New High-Reliability MCUs & Some of the 'Why' Behind Them

When designing a high-reliability microcontroller (MCU), many things can impact reliability: electro-migration and other thermally induced and stress-induced hardware faults; soft errors, or SEUs (single event upsets); timing errors in the logic due to poor system clock specification and construction; HIRF (high-intensity radiated fields); radiated and conducted susceptibility; power integrity issues; and software issues, to name but a few.

One of the larger sources of failure in the integrated circuits (ICs) used in embedded environments, such as those found in automotive applications, is electro-migration and other thermally induced effects. Subsystems like on-board flash memory can fail when exposed to high temperatures, which can cause the charge to "leak" out of the floating gates. Sophisticated MCUs with features like hardware floating-point and DSP functions can be adversely affected by high temperatures when running at speeds over 100 MHz, because the large number of logic gates combined with the high clock speed draws more power and heats the die internally. IC manufacturers can use several techniques to combat these effects -- high-temperature IC processes are each chip maker's "secret sauce" and are unique to each foundry.

Sad to relate, one can invest $5 billion in a one-of-a-kind wafer fab and still not really know how well a chip will do without building and selling millions of devices and seeing how temperature and aging affect them. This leaves the architecture of the MCU as another means of reducing this risk.

Another area of concern is soft errors. There are whole cities located in areas where the dust naturally creates soft errors. Even exposing a board to air during service can coat the board with radioactive particles. This can also happen in hospitals, where radiology departments -- along with a host of other diagnostic tests and therapies -- can potentially put some radioactive content in the air. Aircraft can suffer from the effects of cosmic rays, as can automotive systems at high northern latitudes and high altitudes. Even items used in underground mining can be exposed to radioactive particles.

Electrical noise upsetting logic can also be an issue. No matter how good a decoupling network one uses, every capacitor effectively has a series inductance and series resistance built in, which makes it non-ideal (see also Become a Decoupling Capacitor Network Guru). HIRF, susceptibility, and power-integrity problems are all potential sources of trouble as well. No enclosure, power supply, or circuit board is completely immune from these effects. Timing errors due to a poor choice of timing source can also cause problems. Some implementations turn out much better than others, but at some level all can potentially have failure modes.
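To put rough numbers on the non-ideal capacitor mentioned above, here is a minimal sketch that models a decoupling capacitor as an ideal capacitance in series with a resistance (ESR) and an inductance (ESL). The function names and the component values (100 nF, 20 milliohms, 1 nH) are assumptions chosen purely for illustration; they do not come from any datasheet.

```python
import math


def capacitor_impedance(freq_hz, c_farads, esr_ohms, esl_henries):
    """Impedance magnitude (ohms) of a series R-L-C capacitor model."""
    omega = 2 * math.pi * freq_hz
    # Net reactance: inductive part minus capacitive part.
    reactance = omega * esl_henries - 1.0 / (omega * c_farads)
    return math.hypot(esr_ohms, reactance)


def self_resonant_frequency(c_farads, esl_henries):
    """Frequency (Hz) where inductive and capacitive reactances cancel."""
    return 1.0 / (2 * math.pi * math.sqrt(esl_henries * c_farads))


# Hypothetical part values, for illustration only.
C, ESR, ESL = 100e-9, 0.020, 1e-9

f0 = self_resonant_frequency(C, ESL)
for f in (1e6, f0, 1e9):
    z = capacitor_impedance(f, C, ESR, ESL)
    print(f"{f / 1e6:10.2f} MHz  |Z| = {z:.3f} ohm")
```

Under this model the impedance bottoms out at the ESR at the self-resonant frequency and rises again above it, where the part behaves like an inductor rather than a capacitor. That is one reason a single capacitor value cannot decouple a rail across all frequencies.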

Software is yet another area where problems can arise. Some systems even implement the software and hardware in dissimilar ways and compare the results to detect issues in development, thereby reducing the risk of design-related errors impacting operation.
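The idea of comparing dissimilar implementations can be sketched in a few lines. In this hypothetical example, two independently written integer square root routines (one using Newton's iteration, one using binary search) are run on the same input and their results are compared. All names here are invented for illustration, and a real system would typically run the two versions on separate hardware rather than in one process.

```python
class DissimilarMismatch(Exception):
    """Raised when the two independent implementations disagree."""


def isqrt_newton(n):
    """Integer square root using Newton's iteration."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if n == 0:
        return 0
    x = n
    y = (x + 1) // 2
    while y < x:
        x = y
        y = (x + n // x) // 2
    return x


def isqrt_bisect(n):
    """Integer square root using binary search (a different algorithm)."""
    if n < 0:
        raise ValueError("n must be non-negative")
    lo, hi = 0, n + 1  # invariant: lo*lo <= n < hi*hi
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if mid * mid <= n:
            lo = mid
        else:
            hi = mid
    return lo


def checked_isqrt(n, impl_a=isqrt_newton, impl_b=isqrt_bisect):
    """Run both implementations and only accept a result they agree on."""
    a, b = impl_a(n), impl_b(n)
    if a != b:
        raise DissimilarMismatch(f"n={n}: {a} != {b}")
    return a


print(checked_isqrt(15), checked_isqrt(16))
```

Because the two routines share no logic, a coding mistake in one is unlikely to produce the same wrong answer in the other, so the comparison flags it. A bug in the shared requirements would still slip through.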

With all of the above having the potential to cause problems in a design, and with something invariably going wrong in one form or another, MCU vendors like Texas Instruments (TI) are looking to other methods to further reduce risk in safety-critical systems. To that end, MCUs like the TMS570 series are coming into use.

The TMS570 design features dual lock-step CPUs, which means that a hard or soft fault in one of the CPUs is almost assured to cause a miscompare between the two CPUs. In turn, this will either halt or reboot the system. Additionally, these MCUs feature ECC/EDAC (Error Correcting Code/Error Detection and Correction) on all RAM and Flash memory. This allows multiple soft or hard errors in different words in the Flash and RAM to be corrected on the fly.
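To illustrate how an error-correcting code can repair a flipped bit on the fly, here is a minimal sketch of a Hamming(7,4) code, which protects 4 data bits with 3 check bits. This is a toy example chosen for brevity and is not the code the TMS570 uses; the article does not describe that scheme, and production memories protect much wider words and usually add a further parity bit so that double-bit errors are detected. The function names are hypothetical.

```python
def hamming74_encode(nibble):
    """Encode 4 data bits into a 7-bit codeword (bit i = position i+1)."""
    d = [(nibble >> i) & 1 for i in range(4)]  # d[0] is the LSB
    p1 = d[0] ^ d[1] ^ d[3]  # checks positions 3, 5, 7
    p2 = d[0] ^ d[2] ^ d[3]  # checks positions 3, 6, 7
    p4 = d[1] ^ d[2] ^ d[3]  # checks positions 5, 6, 7
    # Layout by position 1..7: p1, p2, d0, p4, d1, d2, d3
    bits = [p1, p2, d[0], p4, d[1], d[2], d[3]]
    return sum(b << i for i, b in enumerate(bits))


def hamming74_decode(codeword):
    """Return (nibble, position); position is 0 if no error was seen."""
    bits = [(codeword >> i) & 1 for i in range(7)]
    # XOR of the positions of all set bits is 0 for a valid codeword,
    # and equals the position of the flipped bit after a single error.
    syndrome = 0
    for pos in range(1, 8):
        if bits[pos - 1]:
            syndrome ^= pos
    if syndrome:
        bits[syndrome - 1] ^= 1  # correct the single-bit error
    nibble = bits[2] | (bits[4] << 1) | (bits[5] << 2) | (bits[6] << 3)
    return nibble, syndrome


stored = hamming74_encode(0b1011)
upset = stored ^ (1 << 4)  # simulate a soft error at position 5
value, fixed_at = hamming74_decode(upset)
print(f"recovered {value:04b}, corrected position {fixed_at}")
```

Each stored word carries its own check bits, so in a scheme like this one flipped bit in each of many different words can be corrected independently. Two flips inside the same word would defeat this particular toy code.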

If you are interested in learning more, the TMS570 Microcontroller USB Kit can be used to quickly evaluate code development and performance of the TMS570 MCU.

Max -- Believe the Space Shuttle Computers actually used 3 voting, and one hot spare, plus a tertiary back up.

This is pretty interesting for a sub-$10 part, as it gives one a safety-certified CPU, OS, and tools at quite a reasonable price (heaven knows cars are expensive these days). Speed is up to 180 MHz for an ARM Cortex-R4 core with floating point, so it should offer enough zip for many of the calculations needed to boost fuel economy, cut emissions, etc.

For many applications one just wants to detect a fault and restart or halt, as one may not know if a mechanical fault (most common at the system level), a power supply fault (most common electrical fault), or some other fault has happened.

(Obvious you have not done much work on your own car, or gotten into a helicopter you have had to help work on, and head up a mountain)

Believe in one (Hard) the operations in the two CPUs occur at the same time; in Soft there is a time delay (to prevent a common error, such as power rail noise, ionizing radiation, or some other soft or hard error, from producing incorrect results). (Lockstep referred originally to prisoners marching at close interval. In the Royal Marines this was known as Half-Interval March.)

Another approach to reliability is to implement the application with two different types of designs. You can have different programmers implement the design differently, which reduces the possibility of a software bug failing the same way on both CPUs, as it would if a single design were simply copied to each. Another approach is to use a different technology (perhaps an FPGA) to implement the second design. This reduces even further the chance of a bug showing up in both implementations at the same time.

@wmwmurray01: Obvious you have not done much work on your own car, or gotten into a helicopter you have had to help work on, and head up a mountain

Guilty as charged -- cars are one of those things that I understand theoretically -- but don't have a clue what I'm doing when I'm lying underneath one with oil dripping on my head from the big whatchamacallit next to the doohickey.