Monday, September 29, 2014

Testing alone is insufficient to ensure safety in critical systems. Other technical approaches and software development process management approaches must also be used to assure sufficient software integrity.

Consequences:
Relying upon just system functional testing to achieve safety can be expected to eventually lead to an unsafe situation in a widely released product. Even if system functional testing is completely representative of situations that will happen in practice, such testing normally won’t be long enough to see all of the infrequent events that will occur with a much larger fleet of vehicles deployed for a much longer period of time.
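
To make the exposure gap concrete, here is a rough back-of-the-envelope sketch. It assumes a rare triggering event arrives as a Poisson process with a constant per-hour rate, and every number in it (event rate, fleet sizes, operating hours) is a hypothetical value chosen only for illustration, not data from any real vehicle program.

```python
import math

def prob_at_least_one(rate_per_hour, vehicles, hours_per_vehicle):
    """Chance that a rare event shows up at least once across a whole fleet,
    assuming a Poisson arrival model with a constant per-hour rate."""
    exposure_hours = vehicles * hours_per_vehicle
    return 1.0 - math.exp(-rate_per_hour * exposure_hours)

# Hypothetical numbers, for illustration only.
RATE = 1e-7  # one triggering event per 10 million operating hours

test_fleet = prob_at_least_one(RATE, vehicles=100, hours_per_vehicle=1_000)
deployed_fleet = prob_at_least_one(RATE, vehicles=1_000_000, hours_per_vehicle=4_000)

print(f"test fleet sees the event:     {test_fleet:.3f}")
print(f"deployed fleet sees the event: {deployed_fleet:.3f}")
```

Under these assumed numbers the test fleet has roughly a 1% chance of ever seeing the event, while the deployed fleet is all but certain to see it.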

Accepted Practices:

Specifically identify and follow a process to design in safety rather than attempting to test it in after the product has already been built. The MISRA Guidelines describe an example of an automotive-specific process.

Include defined activities beyond hiring smart designers and performing extensive functional testing. While details might vary depending upon the project, as an example, an acceptable set of practices for critical software by the late 1990s would have included the following (assuming that MISRA Safety Integrity Level 3 were an appropriate categorization of the functions): precisely written functional specifications, use of a restricted language subset (e.g., MISRA C), a way of ensuring compilers produced correct code, configuration management, change management, automated build processes, automated configuration audits, unit testing to a defined level of coverage, stress testing, static analysis, a written safety case, deadlock analysis, justification/demonstration of test coverage, safety training of personnel, and availability of written documentation for assessment of safety (auditability of the process). (The required level of care today is, if anything, even more rigorous for such systems.)

Discussion:
There is a saying about quality: “You can’t test in quality; you have to design it in from the start.” It is well known that the same is true of safety.

Assuring safety requires more than just using capable designers and performing extensive testing (although those two factors are important). Even the best designers – like all humans – are imperfect, and even the most extensive system-level functional testing cannot hope to find everything that can go wrong in a large deployed fleet of automobiles. It should be apparent that everyone can make a mistake, even careful designers. But beyond that, system-level functional testing (e.g., driving a car around in a variety of circumstances) cannot be expected to find all the defects in software, because there are just too many situations that can occur to experience them all in testing. This is especially true if a combination of events that causes a software failure just happens to be one that the testers didn’t think of putting into the test plan. (Test plans have bugs and gaps too.) Therefore, it has long been recognized that creating safe software requires more than just trying hard to get the design right and trying really hard to test well.

Accepted practices require a holistic approach to safety, including executing a well-defined process, having a written plan to achieve safety, using techniques to ensure safety such as fault tree analysis, and auditing the process to ensure all required steps are being performed.

An accepted way of ensuring that safety has been considered appropriately is to have a written document that argues why a system is safe (sometimes called a safety case or safety argument). The safety case should give quantitative arguments as to why safety is inherent in the system. An argument that says “we tested for X hours” would be insufficient – unless it also said “and that covered 99.999% of all anticipated operating scenarios as well as thoroughly exercising every line of code” or some other type of argument that testing was thorough. After all, running a car in circles around a track is not the same level of testing as a cross-country drive over mountains. Or one that goes to Alaska in the winter and Death Valley in the summer. Or one that does so with 1000 cars to catch situations in which things inside one of those many cars just happen to line up in just the wrong way to cause a system failure. But even with the significant level of testing done by automotive companies, the safety case must also include things such as the level of peer reviews conducted, whether fault tree analysis revealed single points of failure, and so on. In other words, it’s inadequate to say “we tried really hard” or “we are really smart” or “we spent a whole lot of time testing.” It is essential to also justify that broad coverage was achieved using a variety of relevant techniques.

Selected Sources:
Beatty, in a paper aimed at educating embedded system practitioners, explains that code inspections and testing aren’t sufficient to detect many common types of errors in complex embedded systems (Beatty 2003, pg. 36). He identifies five areas that require special attention: stack overflows, race conditions, deadlocks, timing problems, and reentrancy conditions. He states that “All of these issues are prevalent in systems that employ multitasking real-time designs.”

Lists of techniques that could be applied to ensure safety beyond just testing have been well known for many years, with a relatively comprehensive example being IEC 61508 Part 7.

Even if you could test everything (which you can’t), dealing with low-probability faults that can be expected to affect a huge deployed fleet of automobiles just takes too long. “It is impossible to gain confidence about a system reliability of 100,000 years by testing,” (written in reference specifically to drive-by-wire automobiles and their requirement for a mean-time-to-failure of 1 billion hours) (Kopetz 2004, p. 32, emphasis per original).

Butler and Finelli wrote the classical academic reference on this point, stating that attaining the software reliability needed for safety critical applications will “inevitably lead to a need for testing beyond what is practical” because the required testing time is on the order of the reciprocal of the acceptable catastrophic software failure rate. (Butler 1993, p. 3, paper entitled “The infeasibility of quantifying the reliability of life-critical real-time software.”)
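
The scale of the problem can be sketched with a little arithmetic. The snippet below assumes a simple constant-failure-rate (exponential) model, under which T failure-free test hours support a failure rate bound with confidence 1 - exp(-rate * T). That model is a common simplification and is not necessarily the exact derivation used by Butler and Finelli. The 95% confidence level and the 1000-unit test fleet are assumptions for illustration.

```python
import math

HOURS_PER_YEAR = 24 * 365.25

def failure_free_hours_needed(max_failure_rate_per_hour, confidence):
    """Failure-free test hours needed to claim the failure rate is below the
    given bound at the given confidence (constant-rate exponential model)."""
    return -math.log(1.0 - confidence) / max_failure_rate_per_hour

# Target from the Kopetz quote: mean time to failure of 1 billion hours.
target_rate = 1e-9
mttf_years = (1.0 / target_rate) / HOURS_PER_YEAR

hours = failure_free_hours_needed(target_rate, confidence=0.95)  # assumed 95%
years_one_unit = hours / HOURS_PER_YEAR
years_with_1000_units = years_one_unit / 1000  # assumed 1000 units tested in parallel

print(f"1e9 hours is about {mttf_years:,.0f} years")
print(f"failure-free test hours needed: {hours:.3g}")
print(f"calendar years with 1000 units: {years_with_1000_units:,.0f}")
```

Even with a thousand units running around the clock and zero observed failures, the assumed numbers work out to centuries of testing.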

Knutson gives an overview of software safety practices, and makes it clear that testing isn’t enough to create a safe system: “Even if we are wary of these dangerous assumptions, we still have to recognize the limitations inherent in testing as a means of bringing quality to a system. First of all, testing cannot prove correctness. In other words, testing can show the existence of a defect, but not the absence of faults. The only way to prove correctness via testing would be to hit all possible states, which as we’ve stated previously, is fundamentally intractable.” (Knutson 2000, pg. 34). Knutson suggests peer reviews as a technique beyond testing that will help.

Kendall presents a case study for an electronic throttle control (with mechanical fail-safes) using a two-CPU approach (a “sub Processor” and a “Main Processor”). The automotive supplier elected to follow the IEC 1508 draft standard (a draft of the IEC 61508 standard), also borrowing elements from the MISRA software guidelines. Steps that were performed include: preliminary hazard analysis with mapping to MISRA SILs; review of standards and procedures to ensure they were up to date with accepted practices; on-site audits of development processes; FMEA by an independent agency; FTA by an independent agency; Markov modeling (a technique for analyzing failure probabilities); independent documentation review; mathematical proofs of correctness; and safety validation testing (Kendall 1996). Important points from this paper relevant to this case include: “it is well accepted that software cannot be shown to be suitable for [its] intended use by testing alone” (id. pg. 6); “Software robustness must be demonstrated by ensuring the process used to develop it is appropriate, and that this process is rigorously followed.” (id., pg. 6); “safety validation must consider the effect of the vehicle under as many failure conditions as is possible to generate.” (id., p. 7).

Roger Rivett from Rover Group wrote a paper in 1997 based on a collaborative government-sponsored research effort that specifically addresses how automotive manufacturers should proceed to ensure the safety of vehicles. He makes an important point that rigorous use of good software practice is required in addition to testing (Rivett 1997, pg. 3). He has four specific conclusions for achieving a level of “good practice” for safety: use a quality management system; use a safety integrity level approach; be compliant with a sector standard (e.g., MISRA Software Guidelines); and use a third party assessment to ensure that high-integrity levels have been achieved (Rivett 1997, pg. 10).

MISRA Development Guidelines, section 3.6.1, provides a set of points that make it clear that testing is necessary, but not sufficient, to establish safety (MISRA Guidelines, pg. 49):

[Figure: MISRA Testing Guidance (MISRA Software Guidelines, p. 49)]

This last point of the MISRA Guidelines is key – testing can discover if something is unsafe, but testing alone cannot prove that a system is safe.

"Testing on its own is not adequate for assessing safety-related software." (MISRA report 2 pg. iv) In particular, system-level testing (such as at the vehicle level) cannot hope to uncover all the possible faults or exceptional situations that can result in mishaps.

Thursday, September 11, 2014

Some systems base their safety arguments on the presence of “fail-safe” behaviors. In other words, if a failure occurs, the argument is that the system will respond in a safe way, such as by shutting down in a safe manner. If you have fail-safe mechanisms, you need to test them with a full range of faults within the intended fault model to make sure they work properly.

Consequences:

Failing to specifically test for mitigation of single points of failure means that there is no way to be sure that the mitigation really works, putting safety of the system into doubt.

As an example, if a hardware watchdog timer is not turned on, it won’t reset the system, but there might be no way to tell whether the watchdog timer is on or not (or set to the wrong value, or otherwise used improperly) without specifically testing whether the watchdog works or not. Thus, you can’t take credit for having a watchdog timer unless you have actually tested that it works for each fault that matters (or, if there are many such faults, argue that you have attained sufficient coverage with the tests that are run).

Accepted Practices:

Each and every fail-safe mechanism and fault management mechanism must be tested, preferably on a fully integrated system. Such tests may be difficult to perform in normal functional testing and may require intentional fault injection from the outside of the system (e.g., breaking a sensor) or fault injection at test points inside the system (e.g., intentionally killing a task using special test support infrastructure).

Discussion:

Fault injection is the process of intentionally inducing a hardware or software fault and determining its effect upon the system.

Fault management mechanisms, and especially fail-safe mechanisms, are often the key points upon which an argument as to the safety of a system rests. As an example, a safety case based on a watchdog timer detecting task failures requires that the watchdog timer actually work. While it is of course important to make sure that the system has been designed properly, there is no substitute for testing whether the watchdog timer is actually turned on during system test. (To revisit a point on system testing made elsewhere in my postings – system testing is not sufficient to ensure safety, but thorough system testing is certainly an important thing to do.) It is similarly important to specifically test every fault mode that must be handled by the system to ensure fault handling is done correctly.

Some examples of fault tests that should be performed include: killing each task independently to ensure that the death of any task is caught by the watchdog (and, by extension, cannot cause an unsafe system state); overloading the system to ensure that it behaves safely in an unanticipated CPU overload situation; checking that diagnostic fail-safes detect the faults they are supposed to and react by putting the system into a safe state; disabling sensors; disabling actuators; and others.
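
The first test in that list, killing each task in turn, can be sketched as a small simulation. This is a toy model rather than a real RTOS or hardware watchdog: the `Watchdog` class, the `run_system` helper, and the task names are all hypothetical, and a real test would inject the fault on the integrated target.

```python
class Watchdog:
    """Toy watchdog: trips unless every monitored task has checked in
    since the last time it was serviced."""

    def __init__(self, task_names):
        self.expected = set(task_names)
        self.seen = set()
        self.tripped = False

    def check_in(self, task_name):
        self.seen.add(task_name)

    def service(self):
        # Called once per period; a missing check-in trips the watchdog.
        if not self.expected <= self.seen:
            self.tripped = True
        self.seen = set()


def run_system(task_names, killed=None, periods=5):
    """Run the toy system for a few periods, optionally with one dead task.
    Returns True if the watchdog tripped."""
    wd = Watchdog(task_names)
    for _ in range(periods):
        for name in task_names:
            if name != killed:  # a killed task never checks in
                wd.check_in(name)
        wd.service()
    return wd.tripped


TASKS = ["sensor", "control", "actuator", "diagnostics"]  # hypothetical task set

baseline_tripped = run_system(TASKS)  # no fault injected
results = {name: run_system(TASKS, killed=name) for name in TASKS}

for name, tripped in results.items():
    print(f"kill {name:12s} -> watchdog tripped: {tripped}")
```

The point of looping over every task is that a watchdog that catches the death of three tasks out of four still leaves a hole, and only the per-task test reveals which one.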

Another perspective on this topic is that ensuring safety usually involves arguing that all single points of failure have been mitigated to make the system safe. To demonstrate that the reasoning is accurate, a system must have corresponding failures injected to make sure that the mitigation approaches actually work, since the system’s safety case rests upon that assumption. This might include intentionally corrupting bits in memory, corrupting computations that take place, corrupting stack contents, and so on.
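
As a minimal sketch of what injecting memory corruption can look like, the example below protects a 16-bit variable by storing an inverted copy alongside it, then flips every single bit in either copy to confirm the consistency check notices. The inverted-copy scheme is one common mitigation picked here for illustration; the source does not prescribe it, and the function names are hypothetical.

```python
MASK = 0xFFFF  # model a 16-bit variable

def store(value):
    """Store a value together with its bitwise-inverted copy."""
    return (value & MASK, ~value & MASK)

def is_consistent(stored):
    """The mitigation under test: the two copies must be exact inverses."""
    value, inverse = stored
    return value == (~inverse & MASK)

def flip_bit(stored, which_copy, bit):
    """Inject a single bit flip into one of the two stored copies."""
    copies = list(stored)
    copies[which_copy] ^= 1 << bit
    return tuple(copies)

original = store(0x1234)
clean_ok = is_consistent(original)

# Inject every possible single-bit fault and record any that slip through.
undetected = [
    (which_copy, bit)
    for which_copy in (0, 1)
    for bit in range(16)
    if is_consistent(flip_bit(original, which_copy, bit))
]

print(f"clean value passes check: {clean_ok}")
print(f"undetected single-bit faults: {len(undetected)}")
```

This campaign only covers the single-bit fault model. A flip of the same bit position in both copies would leave them consistent, which is exactly the kind of gap that shows up once the fault model and the injected faults are written down explicitly.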

It is important to note that ordinary system functional testing tends to do a poor job at exercising fault mitigation mechanisms. As an example, if a particular task is never supposed to die, and testing has been thorough, then that task won’t die during normal functional testing (if it did, the system would be defective!). The point of detecting task death is to handle situations you missed in testing. But that means the mechanism to detect task death and perform a restart hasn’t been tested by normal system-level functional tests. Therefore, testing fail-safe mechanisms requires special techniques that intentionally introduce faults into the system to activate those fail-safes.

Selected Sources:

Safety critical systems are deemed safe only if they can withstand the occurrence of any single point fault. But there is no way to know if they will really do that unless testing includes actually injecting representative single point faults to see if the system will respond in a safe manner. You can’t know if a system is safe if you don’t actually test its safety capabilities, and doing so requires fault injection. For example, if you expect a watchdog to detect failed tasks, you need to kill each and every task in turn to see if the watchdog really works. Arlat correctly states that “physical fault injection will always be needed to test the actual implementation of a fault tolerant system” (Arlat 1990, pg. 180).

The need to actually test fail-safe mechanisms to see if they really work should be readily apparent to any engineer. Pullum discusses this topic by suggesting the use of fault injection (intentionally causing faults as a testing technique) in the context of “verification of integration of fault and error processing mechanisms” for creating dependable systems (Pullum 2001, pg. 93).

“Fault injection is important to evaluating the dependability of computer systems. … It is particularly hard to recreate a failure scenario for a large complex system.” (Hsueh et al., 1997 pg. 75, speaking about the need for fault injection as part of testing a system). Mariani refers to the IEC 61508 safety standard and concludes that “fault-injection will be mandatory for soft error sensitivity verification” for safety critical systems (Mariani03, pg. 60). “A fault-tolerant computer system’s dependability must be validated to ensure that its redundancy has been correctly implemented and the system will provide the desired level of reliable service. Fault injection – the deliberate insertion of faults into an operational system to determine its response – offers an effective solution to this problem.” (Clark 1995, pg. 47).

Fault injection must include all possible single-point faults, not just faults that can be conveniently injected via the pins or connectors of a component. Rimen et al. compared internal vs. external fault injection, and found that only 9%-12% of bit flip faults that occur inside a microcontroller could be tested via external pin fault injection (Rimen et al. 1994, p. 76). In 1994, Karlsson reported on the effectiveness of using a radioactive isotope to inject faults into a microcontroller (Karlsson 1994). Later fault injection work by Karlsson’s research group was performed on automotive brake-by-wire applications, sponsored by Volvo (Aidemark 2002), clearly demonstrating the applicability of fault injection as a relevant technique for safety critical automotive systems. Other similar work found defects in a safety critical automotive network protocol (Ademaj 2003).

A test specifically on an engine control program using fault injection caused “permanently locking the engine’s throttle at full speed.” (Vinter 2001).

There are numerous other scholarly works in this area. An early example is Bossen (1981). Some others include: Arlat et al. (1989), Barton et al. (1990), Benso et al. (1999), Han (1995), and Kanawati (1995). As a more recent example, Baumeister et al. performed fault injection on an automotive braking controller via irradiating it and measuring the errors, finding that unprotected SRAM and unprotected microcontroller paths were both sensitive to upsets (Baumeister 2012, pg. 5).

By the late 1990s fault injection tools had become quite sophisticated, and were capable of injecting faults while a system was running at full speed even if source code was not available (e.g., Carreira 1998).

An example of a testing approach along these lines is E-GAS (E-GAS), which includes numerous tests based on auto manufacturer experience to ensure that various faults will be handled safely.

It is important to note that while mitigation techniques such as watchdog timers are a good practice if implemented properly, they are not sufficient to guarantee safety in the face of random errors. For example, Gunneflo presents experimental evidence indicating that watchdog effectiveness is less than perfect, and depends heavily on the particular software being run. Gunneflo recommends: “To accurately estimate coverage and latency for watch-dog mechanisms in a specific system, fault injection experiments must be carried out with the final implementation of the system using the real software.” (Gunneflo 1989, pg. 347). In other words, even if you have a watchdog timer, you need to perform fault injection to understand whether there are holes in your fault tolerance approach.
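
A fault injection campaign of this kind boils down to two numbers: coverage (the fraction of injected faults that were detected) and detection latency. The toy model below shows how those might be tallied and how they depend on which tasks the software actually monitors. Everything here is a hypothetical illustration: the task names, the period-based timing, and the `detection_latency` and `campaign` helpers stand in for a real harness running the final software.

```python
from statistics import mean

ALL_TASKS = ["sensor", "control", "actuator", "diagnostics"]  # hypothetical

def detection_latency(monitored, killed, kill_period=3, timeout_periods=2, periods=10):
    """Kill one task at kill_period. Return the number of periods until the toy
    watchdog trips, or None if the fault is never detected."""
    missed = 0
    for period in range(periods):
        alive = {t for t in ALL_TASKS if not (t == killed and period >= kill_period)}
        if set(monitored) <= alive:
            missed = 0  # every monitored task checked in this period
        else:
            missed += 1
            if missed >= timeout_periods:
                return period - kill_period + 1
    return None

def campaign(monitored):
    """Inject one task-death fault per task and summarize the outcome."""
    latencies = [detection_latency(monitored, killed) for killed in ALL_TASKS]
    detected = [x for x in latencies if x is not None]
    coverage = len(detected) / len(latencies)
    mean_latency = mean(detected) if detected else None
    return coverage, mean_latency

full = campaign(ALL_TASKS)                 # every task is monitored
partial = campaign(["sensor", "control"])  # two tasks were left out

print(f"all tasks monitored: coverage={full[0]:.0%}, mean latency={full[1]} periods")
print(f"two tasks monitored: coverage={partial[0]:.0%}, mean latency={partial[1]} periods")
```

The watchdog hardware is identical in both runs; only the software configuration differs, and coverage drops by half.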

Monday, September 1, 2014

Every line of critical embedded software should be peer reviewed via a process that includes a physical face-to-face meeting and that produces an auditable peer review report.

Consequences:
Failing to perform peer reviews can reasonably be expected to increase the defect rate in software for several reasons. All real-world projects have limited time and resources, so by skipping or skimping on peer reviews developers have missed an easy chance to eliminate defects. With inadequate reviews, developers are spread thin chasing down bugs found during testing. Additionally, peer reviews can find defects that are impractical to find in most types of testing, especially in cases of fault management or handling unexpected/infrequent operating scenarios.

Accepted Practices:

Every line of code must be reviewed by at least one independent, technically skilled person. That review must include actually reading the entirety of the code rather than just looking at selected portions.

Peer reviews must be documented so that it is possible to audit the fact that they took place and the effectiveness of the reviews. At a minimum this includes recording the name of the reviewer(s), the code reviewed, the date of the review, and the number of defects found. If no auditable documentation of software quality is available for incorporated components (e.g., safety certification or peer review reports), then new peer reviews must be performed on that third-party code.
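
One possible shape for such a record, and a simple audit over a set of them, is sketched below. The `ReviewRecord` fields mirror the minimum items listed above; the file names, reviewer names, and file-level granularity are assumptions made for the sake of a short example (a real audit would track coverage down to the line).

```python
from dataclasses import dataclass
from datetime import date
from typing import Tuple

@dataclass(frozen=True)
class ReviewRecord:
    """Minimum auditable facts about one peer review (hypothetical schema)."""
    reviewers: Tuple[str, ...]
    files: Tuple[str, ...]
    review_date: date
    defects_found: int

def unreviewed_files(source_files, records):
    """Files in the code base that no review record covers."""
    reviewed = {f for r in records for f in r.files}
    return sorted(set(source_files) - reviewed)

def incomplete_records(records):
    """Records missing a reviewer, a file list, or a sane defect count."""
    return [r for r in records
            if not r.reviewers or not r.files or r.defects_found < 0]

# Hypothetical project data.
source_files = ["throttle.c", "sensor.c", "watchdog.c"]
records = [
    ReviewRecord(("reviewer_a", "reviewer_b"), ("throttle.c", "sensor.c"),
                 date(2014, 9, 1), defects_found=3),
]

missing = unreviewed_files(source_files, records)
bad = incomplete_records(records)
print("files with no review on record:", missing)
print("incomplete records:", len(bad))
```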

Discussion:
Peer reviews involve having an independent person – other than the author – look at source code and other design documents. The main purposes of the review are to ensure that code conforms to style guidelines and to find defects missed by the author of the code. Running a static analysis tool is not a substitute for a peer review, and neither is an in-person discussion that solely discusses the output of a static analysis tool. A proper peer review requires having an independent person (or, much better, a small group of independent reviewers) read the code in its entirety to ensure quality. The everyday analogy to a peer review is having someone else proof-read something you’ve written. It is nearly impossible to see all our own mistakes whether we are writing software or writing English prose.

It is well known that more formal reviews provide more efficient and effective results, with the gold standard being what is known as a Fagan Style Inspection (a “code inspection”) that involves a pre-review, a formal meeting with defined roles, a written review report, and follow up actions. Regardless of the type of review, accepted practice is to record the results of reviews and audit them to make sure every single line of code has been reviewed when written, and re-reviewed when a module has been modified.
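
The re-review requirement is easy to lose track of by hand. One way to audit it, sketched here under the assumption that a content fingerprint is recorded for each module at review time, is to compare current fingerprints against the ones on record. The module names and contents are hypothetical, and a real project might tie this to version control revisions instead.

```python
import hashlib

def fingerprint(source_text):
    """Content hash recorded when a module is reviewed."""
    return hashlib.sha256(source_text.encode("utf-8")).hexdigest()

def modules_needing_review(current_sources, reviewed_fingerprints):
    """Modules that were never reviewed, or that changed since their last review.

    current_sources: module name -> current source text
    reviewed_fingerprints: module name -> fingerprint at last review
    """
    stale = []
    for name, text in sorted(current_sources.items()):
        if reviewed_fingerprints.get(name) != fingerprint(text):
            stale.append(name)
    return stale

# Hypothetical modules as they stood at review time.
current = {
    "throttle.c": "int limit = 100;\n",
    "sensor.c": "int gain = 2;\n",
}
reviewed = {name: fingerprint(text) for name, text in current.items()}

# Later changes: one module is edited and a new one is added.
current["throttle.c"] = "int limit = 120;\n"
current["watchdog.c"] = "int timeout_ms = 50;\n"

stale = modules_needing_review(current, reviewed)
print("modules needing (re-)review:", stale)
```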

MISRA requires a “structured program review” for SIL 2 and above. (MISRA Report 2 p. ix). MISRA specifically lists “Fagan Inspection” as a type of review (MISRA Software Guidelines p. 12), and devotes two appendices of a report on verification and validation to “walkthroughs,” listing structured walkthroughs, code inspections, Fagan inspections, and peer reviews (MISRA Report 6 pp. 132-136). MISRA points out that walkthroughs (their general term for peer reviews) “are acknowledged to be an effective process for identifying errors in programs – indeed they can be more effective than computer-based testing for certain types of error.”

MISRA also points out that fixing a bug may make things worse instead of better, and says that code reviews and analysis should be used to validate bug fixes. (MISRA Report 5 p. 135)

Peer reviews are somewhat labor intensive, and might account for 10% of the effort on a project. However, it is common for good peer reviews to find 50% or more of the defects in a code base, and thus finding defects via peer review is much cheaper than finding them via testing. Ineffective reviews can be diagnosed by the fact that they find far fewer defects. Acceptable peer reviews normally find defects that would be missed by testing, especially in parts of the code that are difficult to test thoroughly (for example, exception and failure management code).

Selected Sources:
McConnell devotes Chapter 24 to a discussion of reviews and inspections (McConnell 1993). Boehm & Basili summarized best practices for reducing software defects, and included the following point relevant to peer reviews: “Peer reviews catch 60 percent of the defects.” (Boehm 2001, pg. 137).

Ganssle lists four steps that should be the first steps taken to improve software quality. They are: “1. Buy and use a version control system; 2. Institute a Firmware Standards Manual; 3. Start a program of Code Inspection; 4. Create a quiet environment conducive to thinking.” #3 is his term for peer reviews, indicating his recommendation for formal code inspections. He also says that he knows companies that have made all these changes to their software process in a single day (Ganssle 2000, p. 13). (Ganssle’s #2 item is coding style, discussed in Section 8.6.)

MISRA Software Guidelines list the following as techniques on a one-picture overview of the software lifecycle: “Walkthrough, Fagan Inspection, Code Inspection, Peer Review, Argument, etc.” (MISRA Guidelines 1994, pg. 20) indicating the importance of formal peer reviews in a safety critical software lifecycle. Integrity Level 2 (which is only somewhat safety critical) and higher integrity levels require a “structured program review” (pg. 29). That document also gives these rules: “3.5.2.2 Before dynamic testing begins the code should be reviewed in accordance with the software verification plan to ensure that it does conform to the design specification” (pg. 56) and “3.5.2.3 Code reviews and/or walkthroughs should be used to identify any inconsistencies with the specifications” (pg. 56) and “4.3.4.3 The communication of information regarding errors to design and development personnel should be as clear as possible. For example, errors found during reviews should be fully recorded at the point of detection.”

MISRA C rule 116 states: “All libraries used in production code shall be written to comply with the provisions of this document, and shall have been subject to appropriate validation” (MISRA C pg. 55). Within the context of embedded systems, an operating system such as OSEK would be expected to count as a “library” in that it is code included in the system that is relied upon for safety, and thus should have been subject to appropriate validation, which would be expected to include peer reviews. If there is no evidence of peer review or safety certification, the system designer should perform peer reviews on the OS code (which is an excellent reason to use a safety certified OS!)

Fagan-style inspections are a formal version of a “peer review,” which involves multiple software developers looking at software and other design artifacts to find defects. Fagan-style inspections originated at IBM (Fagan 1976). A later paper presented updated techniques, concluding that “inspections increase productivity and improve final program quality. Furthermore, improvements in process control and project management are enabled by inspections.” (Fagan 1986). It is widely recognized that Fagan-style inspections are a best practice, and that some sort of effective peer review technique is an accepted practice.

Fagan-style Formal Inspections are recommended by the FAA (FAA 2000, p. J-23). IEC 61508-3 highly recommends performing some sort of design review on all software at all SILs, and recommends Fagan inspections at SIL 4 (p. 91).

About Me

I've done embedded systems for big industry, the US military, startup companies, and now Carnegie Mellon University. I'm the author of the book Better Embedded System Software, which goes into more detail on most of the topics discussed in my corresponding blog. As with any blog, these posts often contain speculative and partially formed thoughts, and should not be interpreted as a fully considered opinion unless stated otherwise.