Toyota Case: Vehicle Testing Confirms Fatal Flaws

MADISON, Wis. — Among the hundreds of cases brought by individuals across the United States claiming their Toyota vehicles accelerated without warning, only Bookout v. Toyota Motor, tried in Oklahoma County, Okla., resulted in a verdict against Toyota. This was also one of the first unintended acceleration cases to go to trial since the Japanese carmaker began recalling millions of vehicles in 2009 over this very issue.

The Oklahoma case was also the first in which plaintiffs' attorneys put the fault squarely on a flaw in the vehicle's electronic throttle control system. They dismissed arguments about floor mats and sticky pedals and focused on the software that controls the electronic throttle. The attorneys supported their argument with extensive testimony from embedded systems experts.

Similar testimony and extensive software analysis reports had been filed previously in other courts looking into unintended acceleration. But none of that material became public, because Toyota paid settlements and obtained gag orders before those cases went to trial. The public and the engineering community had to wait until the Oklahoma trial, where all testimony became public.

A dozen embedded systems experts were allowed to review Toyota's electronic throttle source code in a secure room in Maryland, described as the size of a small hotel room. The room, with a guard at the door, was disconnected from the Internet. No cellphones, paper, belts, or watches were allowed inside. The experts viewed Toyota's code on five computers in cubicles.

Having spent more than 18 months going in and out of the secure room to study Toyota's code, Michael Barr, CTO of the Barr Group, put together an 800-page report analyzing the 2005 Camry L4's software. On the witness stand, he walked a jury step by step through what the experts discovered in their source-code review. According to Barr's testimony, that review revealed:

- A multifunction, kitchen-sink Task X designed to execute everything from throttle control to cruise control and many of the fail-safes

- That all Task X functions, including the fail-safes, are designed to run on the main CPU in the Camry's electronic control module

- That the brake override that is supposed to save the day during an unintended acceleration is also in Task X

- The use of an operating system with no protection against hardware or software faults

- A number of other problems

Barr testified that the source-code review indicated "both that task could die by the memory corruption, and that also that one of side effects of that would be that this -- for example, that task died, that many of fail safes would be disabled." But is it possible to prove that the experts' discoveries in that cloak-and-dagger source-code room would manifest themselves in a moving vehicle? How do we know how a car might react to malfunctions or an outright failure in Task X?

The plaintiffs' attorneys noted that they did, in fact, conduct vehicle testing. Though Barr was not present for the tests, he testified that the simulations his group ran in the source-code room were replicated by a Mr. Louden using 2005 and 2008 Camry vehicles. The purpose was to repeat the testing and demonstration originally done in the source-code room and determine how the fail-safes in an actual vehicle would respond to task death.

Excerpts of the court transcript
EE Times is publishing a portion of the court transcript relevant to vehicle testing. The following Q&A was carried out when Benjamin E. Baker, Jr., representing the plaintiffs, called Barr to the stand.

I'm not a safety expert, but I've had to deal with some safety issues, especially SEMI S2.

The SEMI S2 safety standard requires an EMO button that turns off all power, except that required for safety and logging systems. The EMO circuit has to be entirely electrical: NO SOFTWARE! Even safety PLCs don't qualify. There's a lot to like about that approach.

On the other hand, my impression (I could be wrong here) is that the newer European safety standards are going away from this approach (for example allowing STO - safe torque off - and networked safety), and are allowing software into the loop, as long as it meets the appropriate SIL level standards, which are development process oriented. I'm a process skeptic.

To give an idea of how industrial safety can be done, the Banner Micro-Screen light curtains used dual MCUs with different architectures and software ("diverse redundant"). When you're using a light curtain to guard something like a hydraulic press that can crush somebody, this type of approach is crucial.

As far as the funky arrangement they had to access the code, that seems like a pretty common practice when outsiders need access to see critical code (at least from a lawyer's perspective on IP / non-disclosure protection).

As far as the error replication, if you don't know the root cause you are only guessing at it, which makes replication a real bear. I'm not surprised that they were not able to reproduce it. The theory they examined was a single-bit error, which can have many causes. And without ECC, it was unmitigated: the system design simply propagated the fault into a system failure, which was an unsafe end state.
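The propagation the commenter describes can be sketched in a few lines. This is a hypothetical illustration, not Toyota's actual code; the status word and flag name are invented. The point is only that without ECC or any integrity check, a single flipped bit in a trusted variable becomes a system failure:

```python
# Hypothetical illustration: a single-bit upset in an unprotected status
# word goes undetected and propagates straight to an unsafe end state,
# because nothing ever checks the word's integrity.

FAILSAFES_ENABLED = 0x01  # bit 0: fail-safe logic armed (invented name)

def failsafes_active(status_word: int) -> bool:
    """The system trusts the raw status word as-is."""
    return bool(status_word & FAILSAFES_ENABLED)

status = FAILSAFES_ENABLED       # healthy: fail-safes armed
status ^= FAILSAFES_ENABLED      # simulated single-bit RAM upset
# No ECC, no mirror, no sanity check: the corrupted value is trusted.
print(failsafes_active(status))  # prints False: fail-safes silently off
```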

Of course, as any practitioner of Ford's 8D problem-solving can tell you, there are really three errors here. In addition to the unmitigated single-bit error, they also have a test/validation process that failed to find it. And, third, they have a design process that failed to prevent it in the first place. The lawsuit will really only address the first error; it's incumbent on Toyota to address the other two. (And in my experience, Japanese companies tend to go after all three as a matter of course.)

One thing Toyota should also do is check their hardware vendor's design for metastability. This could be the actual root cause of the bad input/bit flip. With so many cars on the road, I would guess there is a certain chance of this happening. The risk can be modeled very accurately: predicting circuit behavior across all variations of process parameters, supply voltages, operating temperatures, and the increasingly important effects of circuit aging is a known technique. I think Blendics has the best tool I've seen: http://www.blendics.com/index.php/blendics-products/metaace .

Some of the bigger semiconductor companies have ad hoc programs, but nothing like this.

I think some of the cost should likely be borne by the hardware companies, as I've rarely seen much attention paid to this.

It would be good to interview Jerry Cox, the CEO of Blendics. He is a senior professor at WUSTL and also cofounded Growth Networks, which was acquired by Cisco. I would guess he is one of the top asynchronous-design experts in the world.

@Winderer: Agreed. In the Toyota case, what I understood from Michael Barr is:

Toyota's engineers sought to protect numerous variables against software- and hardware-caused corruption (for example, by "mirroring" their contents in a second location), but they failed to mirror several key critical variables.
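The mirroring defense mentioned in the testimony can be sketched roughly as follows. This is a minimal illustration of the general technique, not Toyota's implementation; the class, the one's-complement mirror, and the safe-default policy are all assumptions for the example:

```python
# Sketch of variable mirroring: each critical value is stored twice, the
# second copy bit-inverted. Reads verify the pair and fall back to a safe
# default on any mismatch, so a single-bit upset cannot be silently trusted.

class MirroredVar:
    def __init__(self, value, safe_default=0):
        self.safe_default = safe_default
        self.write(value)

    def write(self, value):
        self._val = value
        self._mirror = ~value & 0xFFFFFFFF  # one's-complement shadow copy

    def read(self):
        if (~self._val & 0xFFFFFFFF) == self._mirror:
            return self._val
        return self.safe_default  # corruption detected: use safe value

throttle_target = MirroredVar(1200)     # illustrative name only
throttle_target._val ^= 0x04            # simulate a single-bit RAM upset
print(throttle_target.read())           # prints 0: mismatch caught
```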

I worked in medical devices, and there was always a safety CPU, FPGA, or safety analog circuitry. Basically, they all worked the same: the input and output states were monitored, and if some illegal combination appeared, the device was put into a safe mode. I worked on safety analog circuits, which were fairly simple measurement circuits and comparators, with the advantage that analog circuits are conducive to single-point-failure analysis. It's hard to see how automotive gets away without any of these safety methods.
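The monitor pattern described above can be sketched in a few lines. The state names and the legal-combination table here are invented for illustration; the shape of the idea is just a whitelist check that forces safe mode on anything unexpected:

```python
# Sketch of an independent safety monitor: it watches input/output state
# pairs and forces a safe mode on any illegal combination. In practice
# this logic would live on a separate safety CPU, FPGA, or analog circuit.

LEGAL = {
    ("brake_pressed",  "throttle_closed"),
    ("brake_released", "throttle_open"),
    ("brake_released", "throttle_closed"),
}

def monitor(brake: str, throttle: str) -> str:
    """Return 'run' for a legal state pair, 'safe_mode' otherwise."""
    return "run" if (brake, throttle) in LEGAL else "safe_mode"

# Brake pressed while the throttle is open is not on the legal list:
print(monitor("brake_pressed", "throttle_open"))  # prints safe_mode
```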

Reading through the released court notes, it appears that they are only discussing a single point of control.

Since the single point of control in the code is the target of the discussion, I would assume (and you know what that does to all involved) that they implemented only a single point of control, even though a dual point of mechanical control exists in the process, as you have stated.

My comments are based on a failsafe system that does not rely on a single point of control, but rather on a duality of control with a monitoring unit, all separate devices, to ensure a failsafe control system.

I have found in the past that merely implementing fail-safe code on a single MPU/CPU control unit, such as a WDT or rolling codes, does not guarantee a failsafe system; it still leaves a single point of failure, as the court disclosures have shown in the articles I read.
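The commenter's point about a WDT on the same CPU can be sketched as follows. This is an invented toy, not any real implementation: a watchdog only proves the loop is still running, not that its state is sane, so corrupted data sails through while the WDT stays happy:

```python
# Toy watchdog: it resets the CPU only if the main loop stops kicking it.
# A loop that is alive but operating on corrupted state never trips it.

class Watchdog:
    def __init__(self, timeout_ticks=3):
        self.timeout = timeout_ticks
        self.counter = 0

    def kick(self):
        self.counter = 0           # main loop reports "I'm alive"

    def tick(self):
        self.counter += 1
        return self.counter <= self.timeout  # False would reset the CPU

wd = Watchdog()
failsafes_enabled = True
failsafes_enabled = False          # simulated corruption; the loop spins on
for _ in range(10):
    wd.kick()                      # loop is alive, so the WDT never fires...
    assert wd.tick()
print(failsafes_enabled)           # prints False: state is wrong, WDT happy
```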

They only discuss Task X as a single function responsible for all failsafe determinations, and only a single MPU/CPU controller (unless I missed something).

I would never design a system such as this in which life or limb were in danger.

Even though the system they designed was put through serious certification and testing, the error still exposed itself in real-world applications.

I would NOT want any of these engineers designing an aircraft or spacecraft that I would travel on in the future.

I find it odd that the reviewing engineers had to be sequestered to review the code and determine the possible issues.

I also find it odd that they did not set up a known failing system and test it until a failure was seen, to determine without a doubt what the root cause IS, rather than assuming the cause by inducing a most-probable failure.