Embedded networked systems are increasingly called upon to control vast sections of the industrial infrastructure in the modern economy. Some systems require extraordinary safety and reliability to eliminate, as far as possible, failures that can result in dramatic financial losses or loss of life. Familiar examples of these safety-critical applications are mass transportation, power generation, and oil drilling and transport. Embedded systems are also used in applications where failures are not catastrophic but can still cause significant losses in process or manufacturing efficiency. Detecting faults before they turn into failures avoids these material and efficiency losses. Additionally, a networked system is not really safe if it is not secure. Malicious users can hijack an embedded system, or the system can become the (perhaps unintentional) target of a virus or worm. These types of attacks can damage or render inoperable an entire system or complex. Clearly, in many cases both advanced reliability and advanced security capabilities will be requirements in networked embedded designs.

Looking at an example design is perhaps the best way to illustrate the key aspects and implementation options when improved reliability and security are required. Process control systems are among the most useful examples to consider, particularly since the discovery of network-transmitted worms that attack not only traditional PC operating systems but embedded control systems as well (such as the so-called Stuxnet computer worm). A block diagram of an example embedded process control system is shown in Figure 1, below.

Figure 1: Example embedded networked process control system

An Industrial Ethernet Switch connects the controller to the network via an upstream node and a downstream node. A System Controller manages the overall operation of the Process Control System, including the Ethernet Switch and the power subsystem. A separate Equipment Controller, supervised by the System Controller, manages the equipment interface. The Equipment Controller implements any low-level control loop processes required by the system. Higher-level process management resides within the System Controller under supervision via the network, perhaps by a centralized system that manages the entire manufacturing or chemical processing complex. This separation of control functions simplifies the implementation of the real-time aspects of both equipment control and network traffic management (for example, interrupt response time, memory bandwidth allocation, and active task priority determination). Let's look at ways to make this example system more reliable and secure.

System failure rates

Every system can fail, since it is impossible to design a system with an absolute zero failure rate. Each application should therefore be designed to a target acceptable failure rate. The IEC 61508 standard specifies acceptable failure rates for a range of Safety Integrity Levels (SILs) based on the consequences of a system failure. The specification originally applied solely at the system level but has also been applied to products and components by addressing electrical, electronic, and programmable electronic elements in both hardware and software. We will assume that our design falls within SIL 2 (perhaps because the controller manages a hazardous liquid as part of its function).

Table 1: IEC 61508 Safety Integrity Levels
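As a rough illustration of how a SIL target translates into a numeric budget, the sketch below maps a probability of dangerous failure per hour (PFH) to a SIL band. The band limits are the values commonly cited for the high-demand or continuous mode of IEC 61508 and are an assumption here; they should be checked against the standard itself. The names `PFH_BANDS`, `sil_for_pfh`, and `meets_sil` are hypothetical, and a numeric failure rate on its own does not establish compliance with a SIL.

```python
from typing import Optional

# Assumed, commonly cited IEC 61508 bands for high-demand / continuous mode:
# (SIL, lower bound inclusive, upper bound exclusive), in dangerous failures per hour.
PFH_BANDS = [
    (4, 1e-9, 1e-8),
    (3, 1e-8, 1e-7),
    (2, 1e-7, 1e-6),
    (1, 1e-6, 1e-5),
]


def sil_for_pfh(pfh: float) -> Optional[int]:
    """Return the SIL whose band contains pfh, or None if it falls outside all bands."""
    for sil, low, high in PFH_BANDS:
        if low <= pfh < high:
            return sil
    return None


def meets_sil(pfh: float, target_sil: int) -> bool:
    """Return True if pfh is below the upper limit of the target SIL band."""
    for sil, _low, high in PFH_BANDS:
        if sil == target_sil:
            return pfh < high
    raise ValueError(f"unknown SIL: {target_sil}")


if __name__ == "__main__":
    estimated_pfh = 5e-7  # hypothetical estimate for the example controller
    print(sil_for_pfh(estimated_pfh), meets_sil(estimated_pfh, 2))
```

For the SIL 2 assumption made above, the estimated dangerous failure rate of the whole controller would need to stay below the upper limit of the SIL 2 band.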

Looking at the example design shown in Figure 1, we can imagine some possible failure modes and their effect on the overall system. An error in the Equipment Controller might allow hazardous liquid to build up in the system until a rupture takes place, creating a life-threatening system failure. Similarly, an error in the System Controller might cause it to miss warnings from the Equipment Controller, which could also result in life-threatening failures. An error in the Ethernet Switch (a constant message broadcast, for example) could bring down the entire network and threaten the entire complex, not just a single node. Note that the System Controller also manages the power supply subsystem (not an unusual feature of embedded controllers), so an error associated with the power supply could cause a dramatic system failure. This is also a potential weakness that a malicious attacker could exploit to inflict permanent damage on the system.

We also need to look at possible failure modes when remote code updates or other sensitive messages are sent over the network. Without a sufficient level of data protection, transmission errors or malicious attacks could alter program code execution, incorrectly adjust trigger levels, or capture sensitive operating parameters. Standard error detection functions (such as a Cyclic Redundancy Check, or CRC) can be used to protect messages from transmission errors. The Ethernet Switch will automatically check messages for errors using this technique. If required, the System Controller can implement additional error detection and correction functions. Cryptographic protocols and standard encryption algorithms can be used to improve the security of network traffic within the system by securing the data in transit and authenticating remote facilities.
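The difference between protecting against transmission errors and protecting against tampering can be sketched in a few lines. The example below appends a CRC-32 to a message to catch accidental corruption, and separately appends an HMAC-SHA256 tag, computed with a shared secret key, to catch deliberate modification. The frame layout, the function names, and the example message are assumptions made for illustration; this is not the Ethernet frame format or any particular industrial protocol.

```python
import hashlib
import hmac
import zlib

CRC_LEN = 4   # bytes in a CRC-32 trailer
TAG_LEN = 32  # bytes in an HMAC-SHA256 tag


def add_crc(payload: bytes) -> bytes:
    """Append a big-endian CRC-32 of the payload (hypothetical frame layout)."""
    return payload + zlib.crc32(payload).to_bytes(CRC_LEN, "big")


def check_crc(frame: bytes) -> bool:
    """Recompute the CRC-32 over the payload and compare it with the trailer."""
    if len(frame) < CRC_LEN:
        return False
    payload, trailer = frame[:-CRC_LEN], frame[-CRC_LEN:]
    return zlib.crc32(payload).to_bytes(CRC_LEN, "big") == trailer


def add_tag(payload: bytes, key: bytes) -> bytes:
    """Append an HMAC-SHA256 tag computed with a shared secret key."""
    return payload + hmac.new(key, payload, hashlib.sha256).digest()


def check_tag(frame: bytes, key: bytes) -> bool:
    """Verify the HMAC tag using a constant-time comparison."""
    if len(frame) < TAG_LEN:
        return False
    payload, tag = frame[:-TAG_LEN], frame[-TAG_LEN:]
    expected = hmac.new(key, payload, hashlib.sha256).digest()
    return hmac.compare_digest(tag, expected)


if __name__ == "__main__":
    key = b"example-shared-secret"      # placeholder key for the sketch
    message = b"SET_TRIGGER_LEVEL=42"   # hypothetical control message
    frame = add_crc(message)
    corrupted = bytes([frame[0] ^ 0x01]) + frame[1:]  # single bit error
    print(check_crc(frame), check_crc(corrupted))
    print(check_tag(add_tag(message, key), key))
```

A CRC alone does not stop an attacker, who can simply recompute it over an altered message; the keyed tag is what ties the message to a party that holds the secret.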

Single event upsets (SEUs) as a source of errors

The Single Event Upset phenomenon was first discovered in 1979 by Intel and Bell Labs as failures in DRAMs, and is attributed to stray alpha particles or neutrons 'flipping' the state of a memory cell. In 1999 Sun Microsystems noticed errors in cache SRAMs for mission-critical servers. In space and aviation applications the effects of radiation on electronics are well understood, because neutron flux is higher at operational altitudes. However, the SEU phenomenon is increasingly becoming a concern at sea level as well. The continuous drive to smaller semiconductor geometries reduces the charge stored in each SRAM cell, and the ever-increasing electronics content of fielded systems increases the likelihood of SEU-related SRAM errors. Note that Flash memories, which require a significantly higher energy level to 'flip' state, are largely immune to these types of SEU events.

Mitigation of errors via redundancy and design diversity

In safety-critical systems, redundancy is mandatory if the system is to keep operating properly in the event of a failure. Two well-known techniques are widely used: Dual Modular Redundancy (DMR) and Triple Modular Redundancy (TMR). In Dual Modular Redundancy, duplicate designs work in parallel. Each processing element receives the same input, and a fail-safe certification engine checks the outputs for consistency. If a fault is identified, preventive action must be taken to avoid a failure. Triple Modular Redundancy uses three duplicate designs whose outputs are presented to a voting circuit, which sets the output state that receives the most votes. This arrangement can withstand the complete failure of one subsystem and allows a supervisor circuit to attempt to fix the fault or alert an operator.
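The two schemes can be modeled in software as a minimal sketch, assuming each redundant channel produces an integer output word. The function names are hypothetical. The DMR comparison can only detect a disagreement, while the TMR voter takes a bitwise majority and also reports which channels disagree with the voted result so that a supervisor could act on them.

```python
from typing import List, Optional, Tuple


def dmr_compare(a: int, b: int) -> Tuple[bool, Optional[int]]:
    """DMR: return (True, value) if both channels agree, else (False, None).

    A mismatch reveals that a fault exists but not which channel is wrong,
    so the caller must move the system to a safe state.
    """
    if a == b:
        return True, a
    return False, None


def tmr_vote(a: int, b: int, c: int) -> Tuple[int, List[int]]:
    """TMR: bitwise majority vote over three channel outputs.

    Returns the voted word and the indices of channels that differ from it.
    """
    voted = (a & b) | (a & c) | (b & c)
    disagreeing = [i for i, value in enumerate((a, b, c)) if value != voted]
    return voted, disagreeing


if __name__ == "__main__":
    print(dmr_compare(0b1010, 0b1010))
    print(dmr_compare(0b1010, 0b1110))
    print(tmr_vote(0b1010, 0b1010, 0b1110))
```

With a single faulty channel the voter masks the error and identifies the channel; if more than one channel is wrong, the list of disagreeing channels grows and the voted value can no longer be trusted.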

A design diversity methodology is sometimes employed to further improve reliability. With this methodology, the parallel designs are not simply duplicates; each performs the same function using a different implementation. For example, an FPGA might be used for one of the designs and an MCU for the parallel design. This diversity in the target implementations increases reliability further, since errors caused by complex design or implementation 'bugs' are unlikely to be duplicated in dramatically different targets.
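A software analogy of design diversity is sketched below. It is not the FPGA and MCU arrangement described above, only an illustration of the same idea: two independently written implementations of one function (here an integer square root, chosen arbitrarily) run on the same input, and a checker refuses to return a result when they disagree. All function names are hypothetical.

```python
def isqrt_newton(n: int) -> int:
    """Integer square root using Newton's iteration."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if n == 0:
        return 0
    x = n
    y = (x + 1) // 2
    while y < x:
        x = y
        y = (x + n // x) // 2
    return x


def isqrt_bitwise(n: int) -> int:
    """Integer square root built one bit at a time, from the top bit down."""
    if n < 0:
        raise ValueError("n must be non-negative")
    result = 0
    bit = 1 << (max(n.bit_length() - 1, 0) // 2)  # highest possible result bit
    while bit:
        candidate = result | bit
        if candidate * candidate <= n:
            result = candidate
        bit >>= 1
    return result


def diverse_isqrt(n: int) -> int:
    """Run both implementations and return a result only if they agree."""
    a = isqrt_newton(n)
    b = isqrt_bitwise(n)
    if a != b:
        raise RuntimeError(f"diverse implementations disagree: {a} != {b}")
    return a


if __name__ == "__main__":
    print(diverse_isqrt(1_000_000))
```

Because the two routines share no algorithmic steps, a coding mistake in one is unlikely to produce the same wrong answer in the other, which is the property that diverse hardware targets aim for as well.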