Monday, October 13, 2014

Consequences:
A poor safety culture dramatically elevates the risk of creating an unsafe product. If an organization cuts corners on safety, one should reasonably expect the result to be an unsafe outcome.

Accepted Practices:

Establish a positive safety culture in which all stakeholders put safety first, rigorous adherence to process is expected, and all developers are incentivized to report and correct both process and product problems.

Discussion:
A “safety culture” is the set of attitudes and beliefs employees have toward attaining safety. Key aspects of such a culture include a willingness to tell management that there are safety problems, and an insistence that all processes relevant to safety be followed rigorously.

Part of establishing a healthy safety culture in an organization is a commitment to improving processes and products over time. When new practices become accepted in an industry (for example, the introduction of a new version of the MISRA C coding style, or the introduction of a new safety standard such as ISO 26262), the organization should evaluate and at least selectively adopt those practices while formally recording the rationale for excluding and/or slow-rolling the adoption of new practices. (In general, one expects substantially all new accepted practices in an industry to be adopted over time by a company, and it is simply a matter of how aggressively this is done and in what order.)

Ideally, organizations should identify practices that will improve safety proactively instead of reactively. But regardless, it is unacceptable for an organization building safety critical systems to ignore new safety-relevant accepted practices with an excuse such as “that way was good enough before, so there is no reason to improve” – especially in the absence of a compelling proof that the old practice really was “good enough.”

Another aspect of a healthy safety culture is aggressively pursuing every potential safety problem to root cause resolution. In a safety-critical system there is no such thing as a one-off failure. If a system is observed to behave incorrectly, then that behavior must be presumed to be something that will happen again (probably frequently) on a large deployed fleet. It is, however, acceptable to log faults in a hazard log and then prioritize their resolution based on risk analysis such as using a risk table (Koopman 2010, ch. 28).
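As a rough illustration of logging faults and prioritizing them with a risk table, here is a minimal Python sketch. The likelihood and severity scales, the score thresholds, and the names `risk_level`, `HazardLogEntry`, and `prioritize` are all hypothetical choices made for this example. They are not taken from Koopman 2010 or from any standard, and a real project would use the risk table defined in its own safety plan.

```python
from dataclasses import dataclass

# Hypothetical ordinal scales (assumptions for illustration only).
LIKELIHOOD = ["improbable", "remote", "occasional", "probable", "frequent"]
SEVERITY = ["negligible", "marginal", "critical", "catastrophic"]
RISK_ORDER = ["very high", "high", "medium", "low"]


def risk_level(likelihood: str, severity: str) -> str:
    """Look up a risk level from likelihood and severity.

    The table is approximated by multiplying the two ordinal ranks;
    the thresholds below are assumed, not authoritative.
    """
    score = (LIKELIHOOD.index(likelihood) + 1) * (SEVERITY.index(severity) + 1)
    if score >= 12:
        return "very high"
    if score >= 8:
        return "high"
    if score >= 4:
        return "medium"
    return "low"


@dataclass
class HazardLogEntry:
    ident: str
    description: str
    likelihood: str
    severity: str
    root_cause_resolved: bool = False  # stays open until root cause is fixed


def prioritize(log):
    """Return open entries, highest risk first (stable for equal risk)."""
    open_entries = [e for e in log if not e.root_cause_resolved]
    return sorted(
        open_entries,
        key=lambda e: RISK_ORDER.index(risk_level(e.likelihood, e.severity)),
    )


log = [
    HazardLogEntry("H-1", "Stale sensor value used once", "occasional", "marginal"),
    HazardLogEntry("H-2", "Task stops running, no reset", "remote", "catastrophic"),
    HazardLogEntry("H-3", "Cosmetic display glitch", "improbable", "negligible", True),
]

for entry in prioritize(log):
    print(entry.ident, risk_level(entry.likelihood, entry.severity))
```

Note that nothing is ever dropped from the log as a "one-off": an entry only leaves the open list once its root cause has been resolved, and a low risk ranking only affects the order in which entries are worked.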

Along these lines, blaming a person for a design defect is usually not an acceptable root cause. Since people (developers and system operators alike) make mistakes, saying something like “programmer X made a mistake, so we fired him and now the problem is fixed” is simply scapegoating. The new replacement programmer is similarly sure to make mistakes. Rather, if a bug makes it through a supposedly rigorous process, the fact that the process didn’t prevent, detect, and catch the bug is what is broken (for example, perhaps design reviews need to be modified to specifically look for the type of defect that escaped into the field). Similarly, it is all too easy to scapegoat operators when the real problem is a poor design or even when the real problem is a defective product. In short, blaming a person should be the last alternative when all other problems have been conclusively ruled out – not the first alternative to avoid fixing the problem with a broken process or broken safety culture.

Believing that certain classes of defects are impossible to the degree that there is no point even looking for them is a sure sign of a defective safety culture. For example, saying that software defects cannot possibly be responsible for safety problems and instead blaming problems on human operators (or claiming that repeated problems simply didn’t happen) is one such sign. See, for example, the Therac-25 radiation accidents. No software is defect free (although good software is nearly defect free to begin with, and is improved as soon as new hazards are identified). No system is perfectly safe under all possible operating conditions. An organization with a mature safety culture recognizes this and responds to an incident or accident in a manner that finds out what really happened (with no preconceptions as to whether it might be a software fault or not) so it can be truly fixed. It is important to note that both incidents and accidents must be addressed. A “near miss” must be sufficient to provoke corrective action. Waiting for people to die (or dozens of people to die) after multiple incidents have occurred and been ignored is unacceptable (for an example of this, consider the continual O-ring problems that preceded the Challenger space shuttle accident).

The creation of safe software requires adherence to a defined process with minimal deviation, and the only practical way to ensure this is by having a robust Software Quality Assurance (SQA) function. This is not the same as thorough testing, nor is it the same as manufacturing quality. Rather than being based on testing the product, SQA is based on defining the development process (and other aspects of ensuring system safety) and auditing how well it has been followed. No matter how conscientious the workers, independent checks, balances, and quantifiable auditing results are required to ensure that the process is really being followed, and is being followed in a way that is producing the desired results. It is also necessary to make sure the SQA function itself is healthy and operational.

Selected Sources:
Making the transition from creating ordinary software to safety critical software is well known to require a cultural shift that typically involves a change from an all-testing approach to quality to one that has a balance of testing and process management. Achieving this state is typically referred to as having a “safety culture” and is a necessary step in achieving safety. (Storey 1996, p. 107) Without a safety culture it is extremely difficult, if not impossible, to create safe software. The concept of a “safety culture” is borrowed from other, non-software fields, such as nuclear power safety and occupational safety.

MISRA Software Guidelines Section 3.1.4 Assessment recommends an independent assessor to ensure that required practices are being followed (i.e., an SQA function).

MISRA provides a section on “human error management” that includes: “it is recommended that a fear free but responsible culture is engendered for the reporting of issues and errors” (MISRA Software Guidelines p. 58) and “It is virtually impossible to prevent human errors from occurring, therefore provision should be made in the development process for effective error detection and correction; for example, reviews by individuals other than the authors.”

Abstract:
Investigations into potential causes of Unintended Acceleration (UA) for
Toyota vehicles have made news several times in the past few years. Some
blame has been placed on floor mats and sticky throttle pedals. But, a
jury trial verdict was based on expert opinions that defects in Toyota's
Electronic Throttle Control System (ETCS) software and safety
architecture caused a fatal mishap. This talk will outline key events
in the still-ongoing Toyota UA litigation process, and pull together the
technical issues that were discovered by NASA and other experts. The
results paint a picture that should inform future designers of safety
critical software in automobiles and other systems.

Bio:
Prof. Philip Koopman has served as a Plaintiff expert witness on
numerous cases in Toyota Unintended Acceleration litigation, and
testified in the 2013 Bookout trial. Dr. Koopman is a member of the ECE
faculty at Carnegie Mellon University, where he has worked in the broad
areas of wearable computers, software robustness, embedded networking,
dependable embedded computer systems, and autonomous vehicle safety.
Previously, he was a submarine officer in the US Navy, an embedded CPU
architect for Harris Semiconductor, and an embedded system researcher at
United Technologies. He is a senior member of IEEE, senior member of
the ACM, and a member of IFIP WG 10.4 on Dependable Computing and Fault
Tolerance. He has affiliations with the Carnegie Mellon Institute for
Software Research (ISR) and the National Robotics Engineering Center
(NREC).

I am getting an increasing number of requests to do this talk in person, both as a keynote speaker and for internal corporate audiences. Audiences tell me that while the video is nice, an in-person experience of both the presentation and small-group follow-up discussions has a lot more impact for organizations who need help in coming to terms with creating high quality software and safety critical systems. If you are interested please get in touch for details: koopman@cmu.edu

Other info:

Download copy of full-resolution video file set of talk (Box.com 340 MB .zip file of a web directory with interactive split-screen viewing format. Experts only! Please do not ask me for support -- it works for me but I don't have any details about this format beyond saying to unzip it and open Default.html in a web browser.)

One or more of these download sites might be blocked by company networks, so if you get an error message please try both links at home. If they still don't work, send me an e-mail and I'll see what I can do.

If you are planning on using the materials in a course or similar, I would appreciate it if you let me know so I can track adoption. If you need a variation from the CC BY 4.0 license (for example, to incorporate materials in a situation that is at odds with the license terms) please contact me and it can usually be arranged.

About Me

I've done embedded systems for big industry, the US military, startup companies, and now Carnegie Mellon University. I'm the author of the book Better Embedded System Software, which goes into more detail on most of the topics discussed in my corresponding blog. As with any blog, these posts often contain speculative and partially formed thoughts, and should not be interpreted as a fully considered opinion unless stated otherwise.

Key pages: Academic home page at CMU, Embedded Software Blog, Checksum and CRC Blog