Monday, November 17, 2014

The majority of time creating software is typically spent making sure that you got the software right (or if it's not, it should be!). But, sometimes, this focus on making software right gets in the way of actually ensuring the software works. Instead, in many cases it's more important to make sure that software is not wrong. This may seem like just a bit of word play, but I've come to believe that it reflects a fundamental difference in how one views software quality.

Consider the following code concept for stopping four wheel motors on a robot:

(Feel free to make obvious assumptions, such as leftFront being an const uint_t, and modify to your favorite naming conventions and style guidelines. None of that is the point here.)

Which code is better? Well, from our intro to programming course probably we want to write the first type of code. It is more flexible, "elegant," and uses the oh-so-cool concept of iteration. On the other hand, the second type of code is likely to get our fingers smacked with a ruler in a freshman programming class. Where's the iteration? Where is the flexibility? What if there are more than four items that need to be turned off? Where is the dazzling display of (not-so-advanced) computer science concepts?

But hold on a moment. It's a four wheeled robot we're building. Are you really going to change to a six wheeled robot? (Well, maybe, and I've even worked a bit with such robots. But sticking on two extra wheels isn't all that common after the frame has been welded!) So what are you really buying by optimizing for a change that is unlikely to happen?

Let's look at it a different way. Many times I'm not interested in elegance, clever use of computer science concepts, or even number of lines of code. What I am interested in is that the code is actually correct. Have you ever written the loop type of code and gotten it wrong? Be honest! There is a reason that "off by one error" has its own wikipedia entry. How long did it take you to look at the loop and make sure it was right? You had to assume or go look up the value of maxWheels, right? If you are writing unit tests, you need to somehow figure out a way to test that all the motors got turned off -- and account for the fact that maxWheels might not even be set to the correct value.

But, with the second set of code, it's pretty easy to see that all four wheels are getting turned off. There's no loop to get wrong. (You do have to make sure that the four wheel consts are set properly.) There is no loop to test. Cyclomatic complexity is lower because there is no looping branch. You don't have to mentally execute the loop to make sure there is no off-by-one error. You don't have to ask whether the loop can (or should) execute zero times. Instead, with the second set of codes you can just look at it and count up that the wheels are being turned off.

Now of course this is a pretty trivial example, and no doubt one that at least someone will take exception to. But the point I'm making is the following. The first code segment is what we were all taught to do, and arguably easier to write. But the second code segment is arguably easier to check for correctness. In peer reviews someone might miss a bug in the first code. I'd say that missing a bug is less likely in the second code segment.

Am I saying that you should avoid loops if you have 10,000 things to initialize? No, of course not. Am I saying you should cut and paste huge chunks of code to avoid a loop? Nope. And maybe you actually do want to use the loop approach in your situation even with 3 or 4 things to initialize for a good reason. And if you get lots of lines of code that is likely to be more error-prone than a well-considered loop. All that I'm saying is that when you write code, consider carefully the tradeoff between writing concise, but clever, code and writing code that is hard to get wrong. Importantly, this includes the risk of a varying style causing an increased risk of getting things wrong. How that comes out will depend on your situation. What I am saying is think about what is likely to change, what is unlikely to change, and what the path is to least risk of having bugs.

By the way, did you see the bug? If you did, good for you. If not, well, then I've already proved my point, haven't I? The wheel number should start at 0 or the limit should be maxWheels+1. Or the "<" should be "<=" -- depending on whether wheel numbers are base 0 or base 1. As it is, assuming maxWheels is 4 as you'd expect, only 3 out of 4 wheels actually get turned off. By the way, this wasn't an intentional trick on the reader. As I was writing this article, after a pretty long day, that's how I wrote the code. (It doesn't help that I've spent a lot of time writing code in languages with base-1 arrays instead of base-0 arrays.) I didn't even realize I'd made that mistake until I went back to check it when writing this paragraph. Yes, really. And don't tell me you've never made that mistake! We all have. Or you've never tried to write code at the end of a long day. And thus are bugs born. Most are caught, as this one would have been, but inevitably some aren't. Isn't it better to write code without bugs to begin with instead of find (most) bugs after you write the code?

Instead of trying to write code and spending all our effort to prove that whatever code we have written is correct, I would argue we should spend some of our effort on writing code that can't be wrong. (More precisely, code that is difficult to get wrong in the first place, and easy to detect is wrong.) There are many tools at our disposal to do this beyond the simple example shown. Writing code in a style that makes very rigorous static analysis tools happy is one way. Many other style practices are designed to help with this as well, chief among them being various type checking strategies, whether automated or implemented via naming conventions. Avoiding high cyclomatic complexity is another way that comes to mind. So is creating a statechart before actually writing the code, then using a switch statement that traces directly to the statechart. Avoiding premature optimization is also important in avoiding bugs. The list goes on.

So the next time you are looking at a piece of code, don't ask yourself "do I think this is right?" Instead, ask yourself "how easy is it to be sure that it is not wrong?" If you have to think hard about it -- then maybe it really is incorrect code even if you can't see a bug. Ask whether it could be rewritten in a way that is more obviously not wrong. Then do it.

Certainly I'm not the first to have this idea. But I see small examples of this kind of thing so often that it's worth telling this story once in a while to make sure others have paused to consider it. Which brings us to a quote that I've come to appreciate more and more over time. Print it out and stick it on the water cooler at work:

"There are two ways of constructing a software design: One way is to make it so simple that there are obviously no deficiencies and the other way is to make it so complicated that there are no obvious deficiencies."

(I will admit that this quote is a bit clever and therefore not a sterling example of making a statement easy to check for correctness. But then again he is the one who got the Turing Award, so we'll allow some slack for clever wording in his acceptance essay.)

Monday, October 13, 2014

Consequences:
A poor safety culture dramatically elevates the risk of creating an unsafe product. If an organization cuts corners on safety, one should reasonably expect the result to be an unsafe outcome.

Accepted Practices:

Establish a positive safety culture in which all stakeholders put safety first, rigorous adherence to process is expected, and all developers are incentivized to report and correct both process and product problems.

Discussion:
A “safety culture” is the set of attitudes and beliefs employees have to attaining safety. Key aspects of such a culture include a willingness to tell management that there are safety problems, and an insistence that all processes relevant to safety be followed rigorously.

Part of establishing a healthy safety culture in an organization is a commitment to improving processes and products over time. For example, when new practices become accepted in an industry (for example, the introduction of a new version of the MISRA C coding style, or the introduction of a new safety standard such as ISO 26262), the organization should evaluate and at least selectively adopt those practices while formally recording the rationale for excluding and/or slow-rolling the adoption of new practices. (In general, one expects substantially all new accepted practices in an industry to be adopted over time by a company, and it is simply a matter of how aggressively this is done and in what order.)

Ideally, organizations should identify practices that will improve safety proactively instead of reactively. But regardless, it is unacceptable for an organization building safety critical systems to ignore new safety-relevant accepted practices with an excuse such as “that way was good enough before, so there is no reason to improve” – especially in the absence of a compelling proof that the old practice really was “good enough.”

Another aspect of a healthy safety culture is aggressively pursuing every potential safety problem to root cause resolution. In a safety-critical system there is no such thing as a one-off failure. If a system is observed to behave incorrectly, then that behavior must be presumed to be something that will happen again (probably frequently) on a large deployed fleet. It is, however, acceptable to log faults in a hazard log and then prioritize their resolution based on risk analysis such as using a risk table (Koopman 2010, ch. 28).

Along these lines, blaming a person for a design defect is usually not an acceptable root cause. Since people (developers and system operators alike) make mistakes, saying something like “programmer X made a mistake, so we fired him and now the problem is fixed” is simply scapegoating. The new replacement programmer is similarly sure to make mistakes. Rather, if a bug makes it through a supposedly rigorous process, the fact that the process didn’t prevent, detect, and catch the bug is what is broken (for example, perhaps design reviews need to be modified to specifically look for the type of defect that escaped into the field). Similarly, it is all too easy to scapegoat operators when the real problem is a poor design or even when the real problem is a defective product. In short, blaming a person should be the last alternative when all other problems have been conclusively ruled out – not the first alternative to avoid fixing the problem with a broken process or broken safety culture.

Believing that certain classes of defects are impossible to the degree that there is no point even looking for them is a sure sign of a defective safety culture. For example, saying that software defects cannot possibly be responsible for safety problems and instead blaming problems on human operators (or claiming that repeated problems simply didn’t happen) is a sure sign of a defective safety culture. See, for example, the Therac 25 radiation accidents. No software is defect free (although good ones are nearly defect free to begin with, and are improved as soon as new hazards are identified). No system is perfectly safe under all possible operating conditions. An organization with a mature safety culture recognizes this and responds to an incident or accident in a manner that finds out what really happened (with no preconceptions as to whether it might be a software fault or not) so it can be truly fixed. It is important to note that both incidents and accidents must be addressed. A “near miss” must be sufficient to provoke corrective action. Waiting for people to die (or dozens of people to die) after multiple incidents have occurred and been ignored is unacceptable (for an example of this, consider the continual O-ring problems that preceded the Challenger space shuttle accident).

The creation of safe software requires adherence to a defined process with minimal deviation, and the only practical way to ensure this is by having a robust Software Quality Assurance (SQA) function. This is not the same as thorough testing, nor is it the same as manufacturing quality. Rather than being based on testing the product, SQA is based on defining and auditing how well the development process (and other aspects of ensuring system safety) have been followed. No matter how conscientious the workers, independent checks, balances, and quantifiable auditing results are required to ensure that the process is really being followed, and is being followed in a way that is producing the desired results. It is also necessary to make sure the SQA function itself is healthy and operational.

Selected Sources:
Making the transition from creating ordinary software to safety critical software is well known to require a cultural shift that typically involves a change from an all-testing approach to quality to one that has a balance of testing and process management. Achieving this state is typically referred to as having a “safety culture” and is necessary step in achieving safety. (Storey 1996, p. 107) Without a safety culture it is extremely difficult, if not impossible, to create safe software. The concept of a “safety culture” is borrowed from other, non-software fields, such as nuclear power safety and occupational safety.

MISRA Software Guidelines Section 3.1.4 Assessment recommends an independent assessor to ensure that required practices are being followed (i.e., an SQA function).

MISRA provides a section on “human error management” that includes: “it is recommended that a fear free but responsible culture is engendered for the reporting of issues and errors” (MISRA Software Guidelines p. 58) and “It is virtually impossible to prevent human errors from occurring, therefore provision should be made in the development process for effective error detection and correction; for example, reviews by individuals other than the authors.”

Abstract:
Investigations into potential causes of Unintended Acceleration (UA) for
Toyota vehicles have made news several times in the past few years. Some
blame has been placed on floor mats and sticky throttle pedals. But, a
jury trial verdict was based on expert opinions that defects in Toyota's
Electronic Throttle Control System (ETCS) software and safety
architecture caused a fatal mishap. This talk will outline key events
in the still-ongoing Toyota UA litigation process, and pull together the
technical issues that were discovered by NASA and other experts. The
results paint a picture that should inform future designers of safety
critical software in automobiles and other systems.

Bio:
Prof. Philip Koopman has served as a Plaintiff expert witness on
numerous cases in Toyota Unintended Acceleration litigation, and
testified in the 2013 Bookout trial. Dr. Koopman is a member of the ECE
faculty at Carnegie Mellon University, where he has worked in the broad
areas of wearable computers, software robustness, embedded networking,
dependable embedded computer systems, and autonomous vehicle safety.
Previously, he was a submarine officer in the US Navy, an embedded CPU
architect for Harris Semiconductor, and an embedded system researcher at
United Technologies. He is a senior member of IEEE, senior member of
the ACM, and a member of IFIP WG 10.4 on Dependable Computing and Fault
Tolerance. He has affiliations with the Carnegie Mellon Institute for
Software Research (ISR) and the National Robotics Engineering Center
(NREC).

I am getting an increasing number of requests to do this talk in person, both as a keynote speaker and for internal corporate audiences. Audiences tell me that while the video is nice, an in-person experience of both the presentation and small-group follow-up discussions has a lot more impact for organizations who need help in coming to terms with creating high quality software and safety critical systems. If you are interested please get in touch for details: koopman@cmu.edu

Other info:

Download copy of full-resolution video file set of talk, Box.com 340 MB .zip file of a web directory with interactive split-screen viewing format. Experts only! Please do not ask me for support -- it works for me but I don't have any details about this format beyond saying to unzip it and open Default.html in a web browser.)

One or more of these download sites might be blocked by company networks, so if you get an error message please try both links at home. If they still don't work, send me an e-mail and I'll see what I can do.

If you are planning on using the materials in a course or similar, I would appreciate it if you let me know so I can track adoption. If you need a variation from the CC BY 4.0 license (for example, to incorporate materials in a situation that is at odds with the license terms) please contact me and it can usually be arranged.

Monday, September 29, 2014

Testing alone is insufficient to ensure safety in critical systems. Other technical approaches and software development process management approaches must also be used to assure sufficient software integrity.

Consequences:
Relying upon just system functional testing to achieve safety can be expected to eventually lead to an unsafe situation in a widely released product. Even if system functional testing is completely representative of situations that will happen in practice, such testing normally won’t be long enough to see all of the infrequent events that will occur with a much larger fleet of vehicles deployed for a much longer period of time.

Accepted Practices:

Specifically identify and follow a process to design in safety rather than attempting to test it in after the product has already been built. The MISRA Guidelines describe an example of an automotive-specific process.

Include defined activities beyond hiring smart designers and performing extensive functional testing. While details might vary depending upon the project, as an example, an acceptable set of practices for critical software by the late 1990s would have included the following (assuming that MISRA Safety Integrity Level 3 were an appropriate categorization of the functions): precisely written functional specifications, use of a restricted language subset (e.g., MISRA C), a way of ensuring compilers produced correct code, configuration management, change management, automated build processes, automated configuration audits, unit testing to a defined level of coverage, stress testing, static analysis, a written safety case, deadlock analysis, justification/demonstration of test coverage, safety training of personnel, and availability of written documentation for assessment of safety (auditability of the process). (The required level of care today is, if anything, even more rigorous for such systems.)

Discussion:
There is a saying about quality: “You can’t test in quality; you have to design it in from the start.” It is well known that the same is true of safety.

Assuring safety requires more than just using capable designers and performing extensive testing (although those two factors are important). Even the best designers – like all humans – are imperfect, and even the most extensive system-level functional testing cannot hope to find everything that can go wrong in a large deployed fleet such as an automobile. It should be apparent than everyone can make a mistake, even careful designers. But beyond that, system level functional testing (e.g., driving a car around in a variety of circumstances) cannot be expected to find all the defects in software, because there are just too many situations that can occur to experience them all in testing. This is especially true if a combination of events that causes a software failure just happens to be one that the testers didn’t think of putting into the test plan. (Test plans have bugs and gaps too.) Therefore, it has long been recognized that creating safe software requires more than just trying hard to get the design right and trying really hard to test well.

Accepted practices require a holistic approach to safety, including executing a well-defined process, having a written plan to achieve safety, using techniques to ensuring safety such as fault tree analysis, and auditing the process to ensure all required steps are being performed.

An accepted way of ensuring that safety has been considered appropriately is to have a written document that argues why a system is safe (sometimes called a safety case or safety argument). The safety case should give quantitative arguments as to why safety is inherent in the system. An argument that says “we tested for X hours” would be insufficient – unless it also said “and that covered 99.999% of all anticipated operating scenarios as well as thoroughly exercising every line of code” or some other type of argument that testing was thorough. After all, running a car in circles around a track is not the same level of testing as a cross-country drive over mountains. Or one that goes to Alaska in the winter and Death Valley in the summer. Or one that does so with 1000 cars to catch situations in which things inside one of those many cars just happen to line up in just the wrong way to cause a system failure. But even with the significant level of testing done by automotive companies, the safety case must also include things such as the level of peer reviews conducted, whether fault tree analysis revealed single points of failure, and so on. In other words, it’s inadequate to say “we tried really hard” or “we are really smart” or “we spent a whole lot of time testing.” It is essential to also justify that broad coverage was achieved using a variety of relevant techniques.

Selected Sources:
Beatty, in a paper aimed at educating embedded system practitioners, explains that code inspections and testing aren’t sufficient to detect many common types of errors in complex embedded systems (Beatty 2003, pg. 36). He identifies five areas that require special attention: stack overflows, race conditions, deadlocks, timing problems, and reentrancy conditions. He states that “All of these issues are prevalent in systems that employ multitasking real-time designs.”

Lists of techniques that could be applied to ensure safety beyond just testing have been well known for many years, with a relatively comprehensive example being IEC 61508 Part 7.

Even if you could test everything (which you can’t), dealing with low-probability faults that can be expected to affect a huge deployed fleet of automobiles just takes too long. “It is impossible to gain confidence about a system reliability of 100,000 years by testing,” (written in reference specifically to drive-by-wire automobiles and their requirement for a mean-time-to-failure of 1 billion hours) (Kopetz 2004, p. 32, emphasis per original)

Butler and Finelli wrote the classical academic reference on this point, stating that attaining software needed for safety critical applications will “inevitably lead to a need for testing beyond what is practical” because the testing time must be longer than the acceptable catastrophic software failure rate. (Butler 1993, p. 3, paper entitled “The infeasibility of quantifying the reliability of life-critical real-time software.”))

Knutson gives an overview of software safety practices, and makes it clear that testing isn’t enough to create a safe system: “Even if we are wary of these dangerous assumptions, we still have to recognize the limitations inherent in testing as a means of bringing quality to a system. First of all, testing cannot prove correctness. In other words, testing can show the existence of a defect, but not the absence of faults. The only way to prove correctness via testing would be to hit all possible states, which as we’ve stated previously, is fundamentally intractable.” (Knutson 2000, pg. 34). Knutson suggests peer reviews as a technique beyond testing that will help.

Kendall presents a case study for an electronic throttle control (with mechanical fail-safes) using a two-CPU approach (a “sub Processor” and a “Main Processor”). The automotive supplier elected to follow the IEC 1508 draft standard (a draft of the IEC 61508 standard), also borrowing elements from the MISRA software guidelines. Steps that were performed include: preliminary hazard analysis with mapping to MISRA SILs, review of standards and procedures to ensure they were up to date with accepted practices; on-site audits of development processes; FMEA by an independent agency; FTA by an independent agency; Markov modeling (a technique for analyzing failure probabilities); independent documentation review; mathematical proofs of correctness; and safety validation testing. (Kendall 1996) Important points from this paper relevant to this case include: “it is well accepted that software cannot be shown to be suitable for [its] intended use by testing alone” (id. pg. 6); “Software robustness must be demonstrated by ensuring the process used to develop it is appropriate, and that this process is rigorously followed.” (id., pg. 6); “safety validation must consider the effect of the vehicle under as many failure conditions as is possible to generate.” (id., p. 7).

Roger Rivett from Rover Group wrote a paper in 1997 based on a collaborative government-sponsored research effort that specifically addresses how automotive manufacturers should proceed to ensure the safety of vehicles. He makes an important point that rigorous use of good software practice is required in addition to testing (Rivett 1997, pg. 3). He has four specific conclusions for achieving a level of “good practice” for safety: use a quality management system, use a safety integrity level approach; be compliant with a sector standard (e.g., MISRA Software Guidelines), and use a third party assessment to ensure that high-integrity levels have been achieved. (Rivett 1997, pg. 10).

MISRA Development Guidelines, section 3.6.1, provides a set of points that make it clear that testing is necessary, but not sufficient, to establish safety (MISRA Guidelines, pg. 49):

MISRA Testing Guidance (MISRA Software Guidelines, p. 49)

This last point of the MISRA Guidelines is key – testing can discover if something is unsafe, but testing alone cannot prove that a system is safe.

"Testing on its own is not adequate for assessing safety-related software." (MISRA report 2 pg. iv) In particular, system-level testing (such as at the vehicle level), cannot hope to uncover all the possible faults or exceptional situations can will result in mishaps.

Thursday, September 11, 2014

Some systems base their safety arguments on the presence of “fail-safe” behaviors. In other words, if a failure occurs, the argument is that the system will respond in a safe way, such as by shutting down in a safe manner. If you have fail-safe mechanisms, you need to test them with a full range of faults within the intended fault model to make sure they work properly.

Consequences:

Failing to specifically test for mitigation of single points of failure means that there is no way to be sure that the mitigation really works, putting safety of the system into doubt.

As an example, if a hardware watchdog timer is not turned on, it won’t reset the system, but there might be no way to tell whether the watchdog timer is on or not (or set to the wrong value, or otherwise used improperly) without specifically testing whether the watchdog works or not. Thus, you can’t take credit for having a watchdog timer unless you have actually tested that it works for each fault that matters (or, if there are many such faults, argue that you have attained sufficient coverage with the tests that are run).

Accepted Practices:

Each and every fail-safe mechanism and fault management mechanism must be tested, preferably on a fully integrated system. Such tests may be difficult to perform in normal functional testing and may require intentional fault injection from the outside of the system (e.g., breaking a sensor) or fault injection at test points inside the system (e.g., intentionally killing a task using special test support infrastructure).

Discussion:

Fault injection is the process of intentionally inducing a hardware or software fault and determining its effect upon the system.

Fault management mechanisms, and especially fail-safe mechanisms, are often the key points upon which an argument as to the safety of a system rests. As an example, a safety case based on a watchdog timer detecting task failures requires that the watchdog timer actually work. While it is of course important to make sure that the system has been designed properly, there is no substitute for testing whether the watchdog timer is actually turned on during system test. (To revisit a point on system testing made elsewhere in my postings – system testing is not sufficient to ensure safety, but thorough system testing is certainly an important thing to do.) It is similarly important to specifically test every fault mode that must be handled by the system to ensure fault handling is done correctly.

Some examples of fault tests that should be performed include: killing each task independently to ensure that the death of any task is caught by the watchdog (and, by extension, cannot cause an unsafe system state); overloading the system to ensure that it behaves safely in an unanticipated CPU overload situation; checking that diagnostic fail-safes detect the faults they are supposed to and react by putting the system into a safe state; disabling sensors; disabling actuators; and others.

Another perspective on this topic is that ensuring safety usually involves arguing that all single points of failure have been mitigated to make the system safe. To demonstrate that the reasoning is accurate, a system must have corresponding failures injected to make sure that the mitigation approaches actually work, since the system’s safety case rests upon that assumption. This might include intentionally corrupting bits in memory, corrupting computations that take place, corrupting stack contents, and so on.

It is important to note that ordinary system functional testing tends to do a poor job at exercising fault mitigation mechanisms. As an example, if a particular task is never supposed to die, and testing has been thorough, then that task won’t die during normal functional testing (if it did, the system would be defective!). The point of detecting task death is to handle situations you missed in testing. But that means the mechanism to detect task death and perform a restart hasn’t been tested by normal system-level functional tests. Therefore, testing fail-safe mechanisms requires special techniques that intentionally introduce faults into the system to activate those fail-safes.

Selected Sources:

Safety critical systems are deemed safe only if they can withstand the occurrence of any single point fault. But, there is no way to know if they will really do that unless testing includes actually injecting representative single point faults to see if the system will respond in a safe manner. You can’t know if a system is safe if you don’t actually test its safety capabilities, and doing so requires fault injection. For example, if you expect a watchdog to detect failed tasks, you need to kill each and every task in turn to see if the watchdog really works. Arlat correctly states that “physical fault injection will always be needed to test the actual implementation of a fault tolerant system” (Arlat 1990, pg. 180)

The need to actually test fail-safe mechanisms to see if they really work should be readily apparent to any engineer. Pullum discusses this topic by suggesting the use of fault injection (intentionally causing faults as a testing technique) in the context of “verification of integration of fault and error processing mechanisms” for creating dependable systems (Pullum 2001, pg. 93).

“Fault injection is important to evaluating the dependability of computer systems. … It is particularly hard to recreate a failure scenario for a large complex system.” (Hsueh et al., 1997 pg. 75, speaking about the need for fault injection as part of testing a system). Mariani refers to the IEC 61508 safety standard and concludes that “fault-injection will be mandatory for soft error sensitivity verification” for safety critical systems (Mariani03, pg. 60). “A fault-tolerant computer system’s dependability must be validated to ensure that its redundancy has been correctly implemented and the system will provide the desired level of reliable service. Fault injection – the deliberate insertion of faults into an operational system to determine its response – offers an effective solution to this problem.” (Clark 1995, pg. 47).

Fault injection must include all possible single-point faults, not just faults that can be conveniently injected via the pins or connectors of a component. Rimen et al. compared internal vs. external fault injection, and found that that only 9%-12% of bit flip faults that occur inside a microcontroller could be tested via external pin fault injection (Rimen et al. 1994, p. 76). In 1994, Karlsson reported on the effectiveness of using a radioactive isotope to inject faults into a microcontroller (Karlsson 1994). Later fault injection work by Karlsson’s research group was performed on automotive brake-by-wire applications, sponsored by Volvo (Aidemark 2002), clearly demonstrating the applicability of fault injection as a relevant technique for safety critical automotive systems. And other similar work found defects in a safety critical automotive network protocol. (Ademaj 2003)

A test specifically on an engine control program using fault injection caused “permanently locking the engine’s throttle at full speed.” (Vinter 2001).

There are numerous other scholarly works in this area. An early example is Bossen (1981). Some others include: Arlat et al. (1989), Barton et al. (1990), Benso et al. (1999), Han (1995), and Kanawati (1995). As a more recent example, Baumeister et al. performed fault injection on an automotive braking controller via irradiating it and measuring the errors, finding that unprotected SRAM and unprotected microcontroller paths were both sensitive to upsets (Baumeister 2012, pg. 5)

By the late 1990s fault injection tools had become quite sophisticated, and were capable of injecting faults while a system was running at full speed even if source code was not available (e.g., Carreira 1998).

An example of a testing approach along these lines is E-GAS (E-GAS), which includes numerous tests based on auto manufacturer experience to ensure that various faults will be handled safely.

It is important to note that while mitigation techniques such as watchdog timers are a good practice if implemented properly, they are not sufficient to guarantee safety in the face of random errors. For example, Gunneflo presents experimental evidence indicating that watchdog effectiveness is less than perfect, and depends heavily on the particular software being run. Gunneflo recommends: “To accurately estimate coverage and latency for watch-dog mechanisms in a specific system, fault injection experiments must be carried out with the final implementation of the system using the real software.” (Gunneflo 1989, pg. 347). In other words, even if you have a watchdog timer, you need to perform fault injection to understand whether there are holes in your fault tolerance approach.

Monday, September 1, 2014

Every line of critical embedded software should be peer reviewed via a process that includes a physical face-to-face meeting and that produces an auditable peer review report.

Consequences:
Failing to perform peer reviews can reasonably be expected to increase the defect rate in software for several reasons. All real-world projects have limited time and resources, so by skipping or skimping on peer reviews developers have missed an easy chance to eliminate defects. With inadequate reviews, developers are spread thin chasing down bugs found during testing. Additionally, peer reviews can find defects that are impractical to find in most types of testing, especially in cases of fault management or handling unexpected/infrequent operating scenarios.

Accepted Practices:

Every line of code must be reviewed by at least one independent, technically skilled person. That review must include actually reading the entirety of the code rather than just looking at selected portions.

Peer reviews must be documented so that it is possible to audit the fact that they took place and the effectiveness of the reviews. At a minimum this includes recording the name of the reviewer(s), the code reviewed, the date of the review, and the number of defects found. If no auditable documentation of software quality is available for incorporated components (e.g., safety certification or peer review reports), then new peer reviews must be performed on that third-party code.

Discussion:
Peer reviews involve having an independent person – other than the author – look at source code and other design documents. The main purposes of the review are to ensure that code conforms to style guidelines and to find defects missed by the author of the code. Running a static analysis tool is not a substitute for a peer review, and neither is an in-person discussion that solely discusses the output of a static analysis tool. A proper peer review requires having an independent person (or, strongly preferable, a small group of independent reviewers) read the code in its entirety to ensure quality. The everyday analogy to a peer review is having someone else proof-read something you’ve written. It is nearly impossible to see all our own mistakes whether we are writing software or writing English prose.

It is well known that more formal reviews provide more efficient and effective results, with the gold standard being what is known as a Fagan Style Inspection (a “code inspection”) that involves a pre-review, a formal meeting with defined roles, a written review report, and follow up actions. Regardless of the type of review, accepted practice is to record the results of reviews and audit them to make sure every single line of code has been reviewed when written, and re-reviewed when a module has been modified.

MISRA requires a “structure program review” for SIL 2 and above. (MISRA Report 2 p. ix). MISRA specifically lists “Fagan Inspection” as a type of review (MISRA Software Guidelines p. 12), and devotes two appendices of a report on verification and validation to “walkthroughs,” listing structured walkthroughs, code inspections, Fagan inspections, and peer reviews (MISRA Report 6 pp. 132-136). MISRA points out that walkthroughs (their general term for peer reviews) “are acknowledged to be an effective process for identifying errors in programs – indeed they can be more effective than computer-based testing for certain types of error.”

MISRA also points out that fixing a bug may make things worse instead of better, and says that code reviews and analysis should be used to validate bug fixes. (MISRA Report 5 p. 135)
494. Peer reviews are somewhat labor intensive, and might account for 10% of the effort on a project. However, it is common for good peer reviews to find 50% or more of the defects in a code base, and thus finding defects via peer review is much cheaper than finding them via testing. Ineffective reviews can be diagnosed by the fact that they find far fewer defects. Acceptable peer reviews normally find defects that would be missed by testing, especially in parts of the code that are difficult to test thoroughly (for example, exception and failure management code).

Selected Sources:
McConnell devotes Chapter 24 to a discussion of reviews and inspections (McConnell 1993). Boehm & Basili summarized best practices for reducing software defects, and included the following point relevant to peer reviews: “Peer reviews catch 60 percent of the defects.” (Boehm 2001, pg. 137).

Ganssle lists four steps that should be the first steps taken to improve software quality. They are: “1. Buy and use a version control system; 2. Institute a Firmware Standards Manual; 3. Start a program of Code Inspection; 4. Create a quiet environment conducive to thinking.” #3 is his term for peer reviews, indicating his recommendation for formal code inspections. He also says that he knows companies that have made all these changes to their software process in a single day. (Ganssle 2000, p. 13). (Ganssle’s #2 item is coding style, discussed in Section 8.6. ).

MISRA Software Guidelines list the following as techniques on a one-picture overview of the software lifecycle: “Walkthrough, Fagan Inspection, Code Inspection, Peer Review, Argument, etc.” (MISRA Guidelines 1994, pg. 20) indicating the importance of formal peer reviews in a safety critical software lifecycle. Integrity Level 2 (which is only somewhat safety critical) and higher integrity levels require a “structured program review” (pg. 29). That document also gives these rules: “3.5.2.2 Before dynamic testing begins the code should be reviewed in accordance with the software verification plan to ensure that it does conform to the design specification” (pg. 56) and “3.5.2.3 Code reviews and/or walkthroughs should be used to identify any inconsistencies with the specifications” (pg. 56) and “4.3.4.3 The communication of information regarding errors to design and development personnel should be as clear as possible. For example, errors found during reviews should be fully recorded at the point of detection.”

MISRA C rule 116 states: “All libraries used in production code shall be written to comply with the provisions of this document, and shall have been subject to appropriate validation” (MISRA C pg. 55). Within the context of embedded systems, an operating system such as OSEK would be expected to count as a “library” in that it is code included in the system that is relied upon for safety, and thus should have been subject to appropriate validation, which would be expected to include peer reviews. If there is no evidence of peer review or safety certification, the system designer should perform peer reviews on the OS code (which is an excellent reason to use a safety certified OS!)

Fagan-style inspections are a formal version of a “peer review,” which involves multiple software developers looking at software and other design artifacts to find defects. Fagan-style inspections originated at IBM (Fagan 1976). A later paper presented updated techniques, concluding that “inspections increase productivity and improve final program quality. Furthermore, improvements in process control and project management are enabled by inspections.” (Fagan 1986). It is widely recognized that Fagan-style inspections are a best practice, and that some sort of effective peer review technique is an accepted practice.

Fagan-style Formal Inspections are recommended by the FAA (FAA 2000, p. J-23). IEC 61503-3 highly recommends performing some sort of design review on all software at all SILs, and recommends Fagan inspections at SIL4. (p. 91).

Monday, August 25, 2014

Critical embedded software should use static checking tools with a defined and appropriate set of rules, and should have zero warnings from those tools.

Consequences:
While rigorous peer reviews can catch many defects, some misuses of language are easy for humans to miss but straightforward for a static checking tool to find. Failing to use a static checking tool exposes software to a needless risk of defects. Ignoring or accepting the presence of large numbers of warnings similarly exposes software to needless risk of defects.

Accepted Practices:

Using a static checking tool that has been configured to automatically check as many coding guideline violations as practicable. For automotive applications, following all or almost all (with defined and justified exceptions) of the MISRA C coding standard rules is an accepted practice.

Ensuring that code checks “clean,” meaning that there are no static checking violations.

In rare instances in which a coding rule violation has been formally approved, use pragmas to formally document the deviation and direct the static checking tool not to issue a warning.

Discussion:
Static checking tools look for suspicious coding structures and data use within a piece of software. Traditionally, they look for things that are “warnings” instead of errors. The distinction is that an error prevents the compiler from being able to generate code that will run. In contrast, a warning is an instance in which code can be compiled, but in which there is a substantial likelihood that the code the compiler generates will not actually do what the designer wants it to do. Reasons for a warning might include ambiguities in the language standard (the code does something, but it’s unclear whether what it does is what the language standard meant), gaps in the language standard (the code does something arbitrary because the language standard does not standardize behavior for this case), and dangerous coding practices (the code does something that is probably a bad idea to attempt). In other words, warnings point out potential code defects. Static analysis capabilities vary depending upon the tool, but in general are all designed to help find instances of poor use of a programming language and violations of coding rules.

An analogous example to a static checking tool is the Microsoft Word grammar assistant. It tells you when it thinks a phrase is incorrect or awkward, even if all the words are spelled correctly. This is a loose analogy because creativity in expression is important for some writing. But safety critical computer code (and English-language writing describing the details of how such systems work) is better off being methodical, regular, and precise, rather than creatively expressed but ambiguous.

Static checking tools are an important way of checking for coding style violations. They are particularly effective at finding language use that is ambiguous or dangerous. While not every instance of a static checking tool warning means that there is an actual software defect, each warning given means that there is the potential for a defect. Accepted practice for high quality software (especially safety critical software) is to eliminate all warnings so that the code checks “clean.” The reasons for this include the following. A warning may seem to be OK when examined, but might become a bug in the context of other changes made to the code later. A multitude of warnings that have been investigated and deemed acceptable may obscure the appearance of a new warning that indicates an actual bug. The reviewer may not understand some subtle language-dependent aspect of a warning, and thus think things are OK when they are actually not.

Selected Sources:
MISRA Guidelines require the use of “automatic static analysis” for SIL 3 automotive systems and above, which tend to be systems that can kill or severely injure at least one person if they fail (MISRA Guidelines, pg. 29). The guidelines also give this guidance: “3.5.2.6 Static analysis is effective in demonstrating that a program is well structured with respect to its control, data, and information flow. It can also assist in assessing its functional consistency with its specification.”

McConnell says: “Heed your compiler's warnings. Many modern compilers tell you when you have different numeric types in the same expression. Pay attention! Every programmer has been asked at one time or another to help someone track down a pesky error, only to find that the compiler had warned about the error all along. Top programmers fix their code to eliminate all compiler warnings. It's easier to let the compiler do the work than to do it yourself.” (McConnell, pg. 237, emphasis added).

References:

McConnell, Code Complete, Microsoft Press, 1993.

MISRA, (MISRA C), Guideline for the use of the C Language in Vehicle Based Software, April 1998.

MISRA, Development Guidelines for Vehicle Based Software, November 1994 (PDF version 1.1, January 2001).

Monday, August 11, 2014

Critical embedded software should follow a well-defined set of coding guidelines, enforced with comprehensive static checking tools, with essentially no deviations. MISRA C is an example of an accepted set of such coding guidelines.

Consequences:
Coding style guidelines exist to make it more difficult to make mistakes, and also to make it easier to detect when mistakes have been made. Failing to establish or follow formal, written coding guidelines makes it more difficult to understand code, leading to less effective code reviews and a reasonable expectation of increased levels of software defects. Failing to follow the language usage rules defined by a coding style guideline leads to using the language in a way that can normally be expected to result in poorly defined or incorrect software behaviors, increases the risk of software defects.

Accepted Practices:

All projects should follow a written coding style guideline document.

Coding guidelines should address formatting, commenting, name use, and other similar topics.

Coding guidelines should address good language usage practices to create understandable code and reduce the chance of introducing software defects.

Coding guidelines should specifically address which language features and usage patterns should be avoided as being error-prone, dangerous, or undefined by the language standard.

Coding guidelines should be followed with essentially no exception. Exceptions should require formal review with written approval and annotations in the source code. If guidelines are inappropriate, the guidelines should be changed.

Discussion:
Style in software can be considered analogous to style in writing. Compilers enforce some basic programming language construction rules that allow the source code to be compiled into executable software. Style, on the other hand, has more to do with how ideas are expressed within the constraints of the programming language used. Some style considerations have to do with variable naming conventions, indentation, and physical organization aspects of lines of code. Other style considerations have to do with the use of the programming language itself. Some constructs in a programming language are ambiguous or easily misunderstood. And some constructions in software, while correct in terms of language definition, are very likely to indicate a software defect.

A classic example of a subtle defect in the C programming language is:
“if (x = y) { …}” The programmer almost always means to compare “x” and “y” for equality, but the C programming language is defined such that this code instead copies the value of “y” into “x” and then tests to see if the result of that copy operation was non-zero. The correct code would be “if (x == y) { … }” which adds a second “=” to make the operation a comparison instead of an assignment operation. Using “=” instead of “==” in conditionals is a common mistake when creating C programs, easy to confuse visually, and is therefore prohibited by typical style guidelines even though the single “=” version of the code is a valid language construct. A loose analogy might be a prohibition against using multiple negatives in an English language sentence because it is too difficult to not not not not not not (sic) make a mistake with such a sentence even though the meaning is unambiguous if the sentence is carefully (and correctly) analyzed.

It is accepted practice to have a defined set of coding guidelines that cover all relevant aspects of programming language use. Such guidelines are typically tailored for each project, but once defined should be followed rigorously. Guidelines typically cover code formatting, commenting, use of names, language use conventions, and other relevant aspects. While guidelines can be tailored per project, there are nonetheless a number of generally accepted practices for reducing the chance of software defects (such as forbidding a single “=” in a conditional evaluation as just discussed).

Of particular concern in safety critical software are language use rules to avoid ambiguity and hazardous language constructs. It is a generally accepted practice to outright ban hazardous and error-prone language structures to avoid the chance of defects, even if doing so makes software a bit less convenient to write and those structures would otherwise be a legal use of the language. In other words, an essential aspect of coding style for safety critical systems is outlawing code structures that are technically valid but are too dangerous or error-prone to use. The result is a written document that defines coding style in general, and language usage rules in particular. These rules must be applied rigorously and with essentially no exceptions. (The “no exceptions” part is feasible because it is acceptable to tailor the rules to the particular project. So it is not a matter of applying arbitrary rules and making exceptions, but rather picking rules that make sense for the situation and then rigorously sticking to them.) In short, every safety critical software project must have a coding style guide and must follow it rigorously to achieve acceptable levels of safety.

It is often the case that it makes sense to adopt an existing set of language use rules rather than make up your own. The MISRA C set of coding rules was specifically created for safety critical automotive software, and is the most well known C programming language subset for safety critical software. A typical practice when writing safety critical C code is to start with MISRA C, create a defined set of which rules will be followed (usually this is almost all of them), and then follow those rules rigorously. Exceptions to adopted rules should be very few, granted only after a formal written review process, and documented in the code as to the type of exception and reason for granting it. Preferably, automated tools (widely available for MISRA C as discussed in Jones 2002, pg. 56) are used to enforce the rules in addition to a required peer review of code.

It is accepted practice to adopt new coding style rules when better practices come into use, and apply those coding style rules to existing code when that code is being updated and incorporated into a new product.

Selected Sources:
The MISRA C guidelines (MISRA C 1998) are specifically designed for safety critical systems at SIL 2 and above. (MISRA Report 2 p. ix) They consist of a list of rules about coding practices to use and practices to avoid. They concentrate primarily on language use rather than code formatting. While MISRA C was originally developed for automotive applications, it was being set forth as a more general standard for adoption in other domains by 2002 (e.g., Jones 2002). Over time, MISRA C has transition beyond just automotive applications to mainstream use for high integrity software in other areas. A predecessor of MISRA C is the list of rules in the book Safer C (Hatton, 1995).

More general coding style guidelines abound. It is easy to find a coding style guideline that can be adapted for the specifics of a project. Examples include chapter 18 of McConnell (1993).

NASA says that it is important that all levels of the project agree to the coding standards, and that they are enforced. (NASA 2004 pg. 146)

Enforcing coding style involves the use of static checking in addition to formal peer reviews. Beyond the general consensus in the software community that following a defined coding style is a good idea, Nagappan and Ball found that “there exists a strong positive correlation between the static analysis defect density and the pre-release defect density determined by testing. Further, the predicted pre-release defect density and the actual pre-release defect density are strongly correlated at a high degree of statistical significance.” (Nagappan 2005, abstract) In other words, modules that fail to follow a coding style as determined by static analysis have more bugs.

McConnell says: “Heed your compiler's warnings. Many modern compilers tell you when you have different numeric types in the same expression. Pay attention! Every programmer has been asked at one time or another to help someone track down a pesky error, only to find that the compiler had warned about the error all along. Top programmers fix their code to eliminate all compiler warnings. It's easier to let the compiler do the work than to do it yourself.” (McConnell 1993, pg. 237).

MISRA Software Guidelines require the use of “automatic static analysis” for SIL 3 automotive systems and above, which tend to be systems that can kill or severely injure at least one person if they fail (MISRA Guidelines 1994, pg. 29). The guidelines also give this guidance: “3.5.2.6 Static analysis is effective in demonstrating that a program is well structured with respect to its control, data, and information flow. It can also assist in assessing its functional consistency with its specification.” IEC 61508 highly recommends (which more or less means “requires” as an accepted practice) static analysis at SIL 2 and above (IEC 61508-3 pg. 83).

Finally, an automotive manufacturer has published data showing that they expect one “major bug” for every 30 coding rule violations (Kawana 2004):

(Figure from Kawana 2004)

Note: As with my other posts in the last few months this was written regarding practices about a decade ago. There are newer sources for coding style information available now such as an updated version of MISRA C. There is also the the ISO-26262 standard, which is intended to replace the MISRA software guidelines.. But we'll save discussing those for another time.

Wednesday, July 23, 2014

Somewhere in your embedded system is the stack (or several
of them for some multitasking systems). If you blow up the stack, your software
will crash. Or worse, especially if
you don’t have memory protection. For a critical system you need to make sure
the stack has some elbow room and make sure that you know when you have a stack
size problem.

Also, you shouldn’t ever use recursion in a safety critical
embedded system. (This shouldn’t even need to be said, but apparently it does.)

Consequences: The
consequences of not understanding maximum stack depth can be a seemingly random
memory corruption as the stack overwrites RAM variables. Whether or not this
actually causes a program crash depends upon a number of factors, including
chance, and in the worse case it can cause unsafe program behavior without a
crash.

Accepted
Practices:

Compute
worst case stack depth using a static analysis tool.

Include a
stack sentinel or a related technique in the supervisor task and perform a
graceful shutdown or reset prior to an actual overflow.

Avoid all
recursion to ensure that worst case stack depth is bounded.

Discussion:

The “stack” is an area in memory used for storing certain
types of data. For example, in the C programming language this is where
non-static local variables go. The stack gets bigger every time a subroutine is
called, usually holding the subroutine return address, parameter values that
have been passed, and local variables for each currently active subroutine. As
nested subroutines are called, the stack keeps getting bigger as more
information is added to the stack to perform each deeper and deeper call. When
each subroutine is completed, the stack information is returned to the stack
area for later use, un-nesting the layers of stack usage and shrinking the
stack size. Thus the maximum size of the stack is determined by how many
subroutines are called on top of each other (the depth of the subroutine call
graph), as well as the storage space needed by each of those subroutines, plus
additional area needed by any interrupt service routines that may be active at
the same time.

Some processors have a separate hardware stack for
subroutine return information and a different stack for parameters and
variables. And many operating systems have multiple stacks in support of
multiple tasks. But for the most part similar ideas apply to all embedded
controllers, and we’ll just discuss the single-stack case to keep things simple.

It is common for stacks to grow “downward” from high memory
locations to low locations, with global variables, Real Time Operating System
(RTOS) task information, or other values such as the C heap being allocated
from the low memory locations to high memory locations. There is an unused
memory space in between the stack and the globals. In other words, the stack
shares limited RAM space with other variables. Because RAM space is limited,
there is the possibility of the stack growing so large that it overlaps variable
storage memory addresses so that the stack corrupts other memory (and/or loads
and stores to global memory corrupt the stack). To avoid this, it is accepted
practice to determine the worst case maximum stack depth and ensure such
overlap is impossible.

Maximum stack depth can be determined by a number of means.
One way is to periodically sample the stack pointer while the program is
running and find the maximum observed stack size. This approach is a starting
point, but one should not expect that sampling will happen to catch the
absolute maximum stack depth. Catching the worst case can be quite difficult
because interrupt service routines also use the stack and run at times that are
generally unpredictable compared to program execution flow. (Moreover, if you
use a timer interrupt to sample the stack pointer value, you’ll never see the
stack pointer when the timer interrupt is masked, leaving a blind spot in this
technique.) So this is not how you should determine worst case stack depth.

A much better approach is to use static analysis tools specifically
designed to find the worst case set of subroutine calls in terms of stack
depth. This technique is effective unless the code has structures that confound
the analysis (e.g., if a program uses recursion, this technique generally
doesn’t work). When the technique does work, it has the virtue of giving an
absolute bound without having to rely upon whether testing happened to have
encountered the worst case. You should use this technique if at all possible.
If static analysis isn’t possible because of how you have written your code,
you should change your code so you actually can do static analysis for maximum
stack depth. In particular, this means you should never use recursion in a critical system both because it
makes static analysis of stack depth impossible. Moreover, recursive routines
are prone to stack overflows even if they are bug free, but happen to be fed an
exceptional input value that causes too many levels of recursion. (In general,
using recursion in any small-microcontroller embedded system is a bad idea for
these reasons even if the system is not safety critical.)

Beyond static analysis of stack depth (or if static analysis
isn’t possible and the system isn’t safety critical), an additional accepted
practice is to use a “stack sentinel” to find the “high water mark” of the
stack (I’ve also heard this called a “stack watermark” and a “stack guard”).
With this technique each memory location in the stack area of memory is
initialized to a predetermined known value or pattern of known values. As the
stack grows in size, it will overwrite these known sentinel values with other
values, leaving behind new patterns in memory that remain behind, like
footprints in fresh snow. Thus, it is easy to tell how big the stack has ever
been since the system was started by looking for how far into the unused memory
space the “footprints” of stack usage trample past the current stack size. This
both gives the maximum stack size during a particular program run, and the
overall system maximum stack size assuming that the worst case path has been
executed. (Actually identifying the worst case path need not be done – it is
sufficient to have run it at some time during program execution without knowing
exactly when it happened. It will leave its mark on the stack memory contents
when it does happen.)

To convert this idea into a run-time protection technique,
extra memory space between the stack and other data is set up as a sacrificial
memory area that is there to detect unexpected corruptions before the stack can
corrupt the globals or other RAM data. The program is run, and then memory
contents are examined periodically to see how many stack area memory words are
left unchanged. As part of the design validation process, designers compare the
observed worst case stack depth against the static analysis computed worst case
stack depth to ensure that they understand their system, have tested the worst
case paths, and have left adequate margin in stack memory to prevent stack
overflows if they missed some special situation that is even worse than the
predicted worst case.

This sentinel technique should also be used in testing and
in production systems to periodically check the sentinels in the sacrificial
memory area. If you have computed a worst case stack depth at design time and
you detect that the computed depth has been exceeded at run time, this is an
indication that you have a very serious problem.

If the stack overflows past the sacrificial memory space
into global variables, the system might crash. Or that might just corrupt
global variables or RTOS system state and have who-knows-what effect. (In a
safety critical system “who-knows-what” equals “unsafe.”) Naively assuming that you will always get a
system crash detectable by the watchdog on a stack overflow is a dangerous
practice. Sometimes the watchdog will catch a stack overflow. But there is no
guarantee this will always happen. Consider that a stack-smashing attack is the
security version of an intentional stack overflow, but is specifically designed
to take over a system, not just merely crash it. So a crash after a stack
overflow is by no means a sure thing.

Avoiding stack overflow problems is a matter of considering
program execution paths to avoid a deep sequence of calls, and accounting for
interrupts adding even more to the stack. And, again, even though we shouldn’t
have to say this – never using recursion.

While there isn’t a set number for how much margin to leave
in terms of extra memory past the computed maximum stack size, there are two
considerations. First, you want to leave enough room to have an ample
sacrificial area so that a problem with stack depth is unlikely to have enough
time to go all the way through the sacrificial area and touch the globals
before it is detected by a periodic timer tick checking sentinels. (Note: we
didn’t say check current stack depth; we said periodically check sentinels to
see if the stack has gotten too big between checks.) Also, you want to leave some extra margin to
account for the possibility you just never encountered the worst-case stack
depth in analysis or testing. I’ve heard designers say worst case stack depth
of 50% of available stack memory is a good idea. Above 90% use of stack memory
(10% sacrificial memory area set aside) is probably a bad idea. But the actual
number will depend on the details of your system.

Selected Sources:

Stack sentinels and avoidance of recursion are an entrenched
part of embedded systems folklore and accepted practices. Douglass mentions
watermarking in Section 9.8.6 (Douglass 2002).

NASA recommends using stack guards (essentially the same as
the technique that I call “stack sentinels”) to check for stack overflow or
corruption (NASA 2004, p. 93).

Stack overflow errors are well known to corrupt memory and
cause arbitrarily bad behavior of a running program. Regehr et al. provide an
overview of research in that area relevant to embedded microcontrollers and
ways to mitigate those problems (Regehr 2006). He notes that “the potential for
stack overflow in embedded systems is hard to detect by testing.” (id., p.
776), with the point being that it is reasonable to expect that a system which
has been tested but not thoroughly analyzed will have run-time stack overflows
that corrupt memory.

Monday, June 30, 2014

Accesses to variables shared among multiple threads of execution must be protected via disabling interrupts, using a mutex, or some other rigorously applied concurrency management approach.

Consequences:

Incorrect
or lax use of concurrency techniques can be expected to lead to concurrency
bugs. Such bugs are usually difficult to detect via testing and difficult to
reproduce once detected. A fraction of any such bugs can be reasonably expected
to make it past even extensive testing into production fleets.

Accepted Practices:

Aggressively minimize the use of globally shared variables. Every variable shared between tasks is a chance for a concurrency defect.

For every access to a shared variable, treat the entire time that a copy of the variable is “live” in the computation as a critical section, protecting access via masking interrupts or some other well defined technique.

Avoid concurrency management techniques that are “home brewed” or otherwise not part of well proven practice. Similarly, do not modify well known techniques to optimize efficiency or obtain other perceived benefits. Such techniques are extremely difficult to get right, and altering techniques or using non-standard techniques can be reasonably expected to introduce defects.

Declare every shared variable “volatile” to ensure that reads and writes do not result in stale data being used due to compiler optimization attempts to improve computation speed.

Keep critical sections as short as possible to minimize negative effects on scheduling. (The largest critical section forms a minimum length on blocking time for scheduling purposes).

Discussion:

One aspect of a modern real time embedded system is that it must appear to do many things at once. For example, an engine controller must look at many inputs, set throttle angle and perform diagnostic checks all at the same time. Typically many of these tasks are written as relatively independent pieces of software that must work together, and they must all appear to run at the same time.

In reality there is only one CPU, so tasks must take turns using that CPU, with that turn taking supervised by the RTOS. And, tasks often must share things such as memory locations and input/output devices. The turn-taking means that, unless software designers are extremely careful, on occasion shared resources can be left in undefined or incorrect states, resulting in concurrency bugs.

Avoiding and fixing concurrency bugs is a major source of design and testing effort on most embedded systems. In part this is because concurrency bugs can be quite subtle, and in part it is because they can be very difficult to activate during testing as well as difficult to isolate even when one is observed (if they are observed at all during testing). The difficulty of detecting and fixing concurrency defects, as well as the reasonable probability that they won’t be seen at all in testing, makes disciplined use of good practices essential.

A variety of techniques are available to avoid concurrency problems. A preferred approach is avoiding situations in which concurrency bugs are possible. For example, avoiding the use of shared global variables avoids associated concurrency defects (because the shared variables simply aren’t there to begin with). But, when sharing can’t be avoided, there are well defined basic techniques that work. Typically such techniques work by “locking” some resource so that other tasks cannot use it. As an analogy, consider a changing room at a clothing store. If you want to make sure that nobody else tries to use the room you are using, you need to “lock” the room when you enter (with an actual door lock, or maybe just by closing the door or curtain all the way), and then “unlock” the room when you leave. If you never lock the door it might be that nothing bad happens for dozens or hundreds of times you try on clothes. But eventually someone will wander into the room by mistake while you are there if you don’t lock the door.

Disabling interrupts is a concurrency design approach that can be thought of as a program “locking” the CPU so that no other task can use it when a shared variable is being accessed. The way this works is that any task wanting to, say, increment a variable first disables interrupts, then increments the variable, then re-enables interrupts. The period of time between the first read of a variable and when the variable is done being updated is known as a “critical section,” and is the time during which no other task can be permitted to access the variable. Disabling interrupts turns off the hardware’s ability to switch tasks or perform anything but the desired computation during the critical section. This ensures that no other task in the system can read or write the shared variable, because disabling interrupts prevents any other task from running. It is essential that every single access to a shared variable disable interrupts for the entire use of the variable for this to be guaranteed to work. If a local copy of the variable is kept and used outside the time during which interrupts are disabled, there are no guarantees as to how the system will behave when that local copy is subsequently used to update the variable. Other techniques are available to manage concurrency beyond disabling interrupts. But, this is a common technique in embedded systems.

Even using these techniques, special care must be used in accessing any shared resource. For example, the keyword volatile must be used for every shared resource to ensure that the most up to date copy is always accessed. (But, even this won’t help if that copy is updated at an unexpected time.)

Douglass
gives a pattern for a critical section in section 7.2, and in section 7.2.6
says that the most common way to prevent context switching during a critical
section is to disable interrupts. In section 7.2.5, Douglass says: “The
designers and programmers must show good discipline in ensuring that every
resource access locks the resource before performing any manipulation of the
source.” (Douglass 2002)

MISRA
recommends that developers “Use Test-and-Set instructions or a signaling
mechanism, such as Dekker/Dijkstra/Lamport Semaphores, to protect and mark as
‘in-use’ any common resources.” (MISRA Report 3 p. 26) In more modern
terminology, this is a recommendation to use a mutex or related semaphore-based
“lock” on data. MISRA also cautions that interrupt enable and disable
instructions must be used with care. (id.)

Sullivan presents results of a study of defect
types, concluding that 11% of high impact memory corruption errors are due to
concurrency defects. (Sullivan 1991, p. 6). This means that while most defects
are easier to track down, a few race conditions and other concurrency defects
can be expected to happen.

Concurrency defects are so difficult to find that specific testing and analysis tools have been developed to find them, prompting the creation of a benchmark suite to evaluate such tools (Jalbert 2011). A significant challenge to creating such a benchmark is the difficulty in reproducing such bugs even when the exact bug is completely understood. (id., first page)

Park et al. provide a summary of work on finding and fixing concurrency defects (Park 2010). Among other things they note that a concurrency defect was responsible in part for the 2003 Northeastern US electricity blackout that left 10 million people without power (id. p. 245), and that such bugs are difficult to reproduce (id.).

About Me

I've done embedded systems for big industry, the US military, startup companies, and now Carnegie Mellon University. I'm the author of the book Better Embedded System Software, which goes into more detail on most of the topics discussed in my corresponding blog.As with any blog, these posts often contain speculative and partially formed thoughts, and should not be interpreted as a fully considered opinion unless stated otherwise.Key pages:Academic home page at CMUEmbedded Software Blog Checksum and CRC Blog