Toyota Case: Single Bit Flip That Killed

MADISON, Wis. — Could bad code kill a person? It could, and it apparently did.

The Bookout v Toyota Motor Corp. case, which blamed sudden acceleration in a Toyota Camry for a wrongful death, touches the issue directly.

This case -- one of several hundred contending that Toyota's vehicles inadvertently accelerated -- was the first in which a jury heard the plaintiffs' attorneys supporting their argument with extensive testimony from embedded systems experts. That testimony focused on Toyota's electronic throttle control system -- specifically, its source code.

The plaintiffs' attorneys closed their argument by saying that the electronics throttle control system caused the sudden acceleration of a 2005 Camry in a September 2007 accident that killed one woman and seriously injured another on an Oklahoma highway off-ramp. It wasn't loose floor mats, a sticky pedal, or driver error.

An Oklahoma judge announced that a settlement to avoid punitive damages had been reached Thursday evening. This was announced shortly after an Oklahoma County jury found Toyota liable for the crash and awarded $1.5 million of compensation to Jean Bookout, the driver, who was injured in the crash, and $1.5 million to the family of Barbara Schwarz, who died.

During the trial, embedded systems experts who reviewed Toyota's electronic throttle source code testified that they found Toyota's source code defective, and that it contains bugs -- including bugs that can cause unintended acceleration.

"We've demonstrated how as little as a single bit flip can cause the driver to lose control of the engine speed in real cars due to software malfunction that is not reliably detected by any fail-safe," Michael Barr, CTO and co-founder of Barr Group, told us in an exclusive interview. Barr served as an expert witness in this case.

A core group of seven experts, including four from Barr Group, analyzed the Toyota case. Their analysis ultimately resulted in Barr's 800-plus-page report.

In Toyota's own view, though, the automaker had been already exonerated when the National Highway Traffic Safety Administration closed its probe of Toyota models in February 2011. The NHTSA decision came after NASA investigated Toyota's electronic throttle control system and found no electronic causes of unintended acceleration during a 10-month review.

But not everyone in the embedded systems industry thinks NASA had enough time to come up with a complete report. Perhaps more significantly, in its report, NASA itself did not rule out the possibility of software having caused unintended acceleration.

The group of seven experts was given the task of picking up where the NASA investigation left off.

It's certainly the case that tasks can die, and require a system reboot. That's why you have watchdog timers in control system software. In the description of the problem, it appears that several tasks died simultaneoiusly, although we don't know which tasks nor how simultaneous they were.

And it's also not clear whether individual task were monitored correctly, and whether it was the simultaneous nature of the failures that created a case where the reboots didn't occur.

Also, it looks like they found several potential mechanisms, not necessarily THE cause. One way to design around this sort of problem, although nothing will be 100 percent, is to have redundant processes do the same computations, and then compare the control signal at the output. If there's no match, you default to no acceleration.

The last safety measure is of course the driver. If unintended acceleration occurs, certaily in a 2005 car, put the car in neutral and shut off the engine!

Although the quote about the danger of a "single bit flip" seems to have been in the context of software bugs -- it's hard to tell just from the quotes in this interview -- Barr also mentions single event upset. Memory bit errors (so-called "soft error rate") are a more of a hardware & system design issue, at least to the extent that the design includes mirroring, error detection and/or correction or other fail-safe measures.

At modern VLSI geometries, the soft error rate of an SRAM bit cell being bombarded with cosmic radiation at ground level is not as inconsequential as one might think -- especially for critical safety systems.

It makes one wonder how blame can be attributed to software in a system in which the source of the error may have been a random SRAM bit that was flipped by an alpha particle or other natural radiation event. Is the failure being blamed on software, or is it an overall laxity of hardware plus software that failed to prevent all of those 16 million possible ways a software task can die? How much fail-safing & hardware redundancy is enough to adequately protect against these events? In the end, it is a probabalitic issue, and the probability of failure will never be zero.