Meeting Reliability For Automotive Applications With PCI Express

Automotive electronics such as powertrain and braking controls, Advanced Driver Assistance Systems (ADAS), and other vehicle operations platforms must meet stringent reliability standards. Even an automotive infotainment system is expected to perform flawlessly across a wide range of temperatures, humidity levels, and vibration. Reliability is a key component of functional safety and is critical to achieving the Automotive Safety Integrity Level (ASIL) certification required for most ADAS. System-on-chip (SoC) designers need to approach automotive reliability with even more concern than they would for a high-performance server operating in a traditional data center.

Today’s ADAS require compute power on par with that of a data center server. The architectures for many ADAS involving machine vision are similar to high performance cloud computing systems: an array of powerful processors connected by high bandwidth PCI Express® (PCIe) links. As a result, it is no surprise that PCI Express is becoming prevalent in automotive electronic systems. It is also common to find PCIe WiFi chips, PCIe GPUs, and PCIe ASIC-to-ASIC connections in infotainment systems.

Link integrity mechanisms
The PCI Express protocol includes a very robust link integrity scheme, but it has some reliability limitations which may not be immediately apparent.

Every application packet includes a link-level cyclic redundancy check (LCRC) which is verified immediately upon receipt. An Acknowledged/Not-Acknowledged (ACK/NAK) mechanism handles seamless retransmission of erroneous packets, and includes timeouts to ensure broken links do not go unnoticed. Perhaps the most obvious limitation is that the LCRC can only protect the data actually presented to the PCI Express interface logic; it has no way to confirm that the data was correct when it arrived there. More subtly, the retransmission of erroneous packets due to NAKs can hide signal integrity problems in the physical interconnect, since the application software and even upper-layer hardware are unlikely to be aware of the retransmissions. Whether due to a fundamental problem present at design/manufacturing time, or due to degradation over the product lifetime, all but the most severe PCI Express link errors will be largely invisible to software.
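The masking effect described above can be illustrated with a toy model. This is only a sketch under stated assumptions: the class ReplayLink, its fault model, and the replay limit are hypothetical, and Python's zlib.crc32 stands in for the real LCRC calculation, which it does not reproduce bit for bit.

```python
import zlib


class ReplayLink:
    """Toy link: CRC check at the receiver, NAK-triggered replay at the sender."""

    def __init__(self, corrupt_attempts):
        # Hypothetical fault model: indexes of transmission attempts
        # (counted across the life of the link) that are corrupted in transit.
        self.corrupt_attempts = set(corrupt_attempts)
        self.attempts = 0
        self.naks = 0

    def _transmit(self, payload, crc):
        wire = bytearray(payload)
        if self.attempts in self.corrupt_attempts:
            wire[0] ^= 0x01  # single-bit error on the wire
        self.attempts += 1
        return bytes(wire), crc

    def send(self, payload, max_replays=4):
        """Deliver a non-empty payload, replaying on NAK; raise if the link is broken."""
        crc = zlib.crc32(payload)  # computed over whatever the sender was handed
        for _ in range(max_replays + 1):
            rx_payload, rx_crc = self._transmit(payload, crc)
            if zlib.crc32(rx_payload) == rx_crc:
                return rx_payload  # receiver would ACK
            self.naks += 1  # receiver would NAK, sender replays
        raise TimeoutError("replay limit exceeded, link considered down")


link = ReplayLink(corrupt_attempts={0, 1})
delivered = link.send(b"brake")
print(delivered, link.attempts, link.naks)
```

The caller receives correct data and sees nothing unusual, even though two of the three attempts failed; only the naks counter reveals the marginal channel. Note also that if the payload had already been corrupted before send() was called, the CRC would be computed over the bad data and the packet would be delivered without any NAK.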

To address these shortcomings, SoC designers must do two things. First, they must ensure that on-chip data is reliable, so that known bad data is never sent out on the link and any bad data received on the link is never passed into the application logic. Second, they must make sure the link itself is reliable, remains available even when degraded, and alerts the application logic to any problems.

On-chip data reliability
There are two sub-areas of on-chip data protection: ‘data at rest’ and ‘data in-flight.’ Protecting data at rest requires some mechanism to ensure data stored in a memory array doesn’t change while ‘resting’ in that array. In the early days of on-chip SRAMs, failure rates and random error rates were high, so designers included protection mechanisms like parity and/or redundancy in attempts to guard against unintended data changes. As CMOS processes matured, these concerns lessened and designers in many markets chose to accept unprotected SRAMs to cut down on area overhead for protection against increasingly less likely error events.

However, with the rapid shrinking of silicon geometries and the change from planar to FinFET transistors, concern over such ‘soft’ or ‘random’ errors appears to be growing again. Fortunately, the increased gate counts possible with modern silicon processes make more advanced techniques such as Error Correcting Code (ECC) feasible – and for automotive applications, arguably mandatory as they provide much stronger protection against data corruption.

While precise details vary by the ECC chosen, today’s SoC designer should be able to get full Single Error Correct, Double Error Detect (SECDED) protection at a cost of around 8 bits of additional storage for every 64 bits of data. The additional logic complexity is outweighed by the system’s added ability to survive single-bit errors. It is particularly important for the automotive SoC designer to ensure that both correctable and uncorrectable errors are logged and reported to software. By logging both the failed data bit(s) and the SRAM line address, the design gives application or diagnostic software the information necessary to identify potentially failing hardware from patterns of even soft errors over time. In PCI Express designs, data at rest is generally in transit from one layer to the next, so there is no benefit in writing corrected values back into their originating memory: once the data has passed to the next layer, the original memory locations will be reused for a later packet.
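To make the 8-bits-per-64 figure concrete, here is a minimal sketch of one common SECDED construction: an extended Hamming code with 7 check bits plus one overall parity bit, giving a 72-bit codeword. The bit layout and the function names encode and decode are assumptions chosen for illustration; production memories typically use other, hardware-optimized codes with the same 72/64 footprint.

```python
# Codeword bit 0 is the overall parity bit; bits 1..71 form a Hamming code
# whose check bits sit at the power-of-two positions.
PARITY_POS = [1, 2, 4, 8, 16, 32, 64]
DATA_POS = [p for p in range(1, 72) if p not in PARITY_POS]  # 64 positions


def encode(data):
    """Return the 72-bit codeword (as an int) for a 64-bit data word."""
    bits = [0] * 72
    for i, pos in enumerate(DATA_POS):
        bits[pos] = (data >> i) & 1
    for p in PARITY_POS:
        # Each check bit makes the parity of its position group even.
        bits[p] = sum(bits[pos] for pos in range(1, 72) if pos & p) & 1
    bits[0] = sum(bits) & 1  # overall parity over bits 1..71
    return sum(b << i for i, b in enumerate(bits))


def decode(word):
    """Return (data, status, flipped_bit) for a 72-bit codeword.

    status is "ok", "corrected", or "uncorrectable"; flipped_bit is the
    codeword position that was repaired, suitable for an error log.
    """
    bits = [(word >> i) & 1 for i in range(72)]
    syndrome = 0
    for pos in range(1, 72):
        if bits[pos]:
            syndrome ^= pos
    overall_odd = sum(bits) & 1
    flipped = None
    if syndrome == 0 and not overall_odd:
        status = "ok"
    elif overall_odd and syndrome < 72:
        bits[syndrome] ^= 1  # syndrome 0 means the overall parity bit itself
        status, flipped = "corrected", syndrome
    else:
        return None, "uncorrectable", None
    data = sum(bits[pos] << i for i, pos in enumerate(DATA_POS))
    return data, status, flipped


word = encode(0xDEADBEEFCAFEF00D)
print(decode(word ^ (1 << 37)))               # one flipped bit
print(decode(word ^ (1 << 5) ^ (1 << 60))[1])  # two flipped bits
```

Returning the repaired bit position alongside the data mirrors the logging recommendation above: paired with the SRAM line address, it is what lets diagnostic software spot a recurring failure pattern.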

Protecting data in-flight is the process of ensuring correct data is carried through the various non-storage data paths of the SoC. For designers using ECC on their memories, carrying the ECC code along with the data certainly accomplishes the desired protection, but the additional ECC checks may not be desirable because of their area cost or their impact on timing closure. Given that even cutting-edge FinFET flip-flops are considered to be fairly reliable, the industry practice of carrying simple parity is likely sufficient, even in automotive applications.
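As a rough illustration of simple parity on a datapath, the sketch below carries one even-parity bit per byte and reports which byte lanes fail the check at the far end. The function names are hypothetical, and a hardware implementation would of course be an XOR tree rather than software.

```python
def byte_parity(data):
    """Return one even-parity bit per byte of data."""
    return [bin(b).count("1") & 1 for b in data]


def failing_lanes(data, parity):
    """Return the indexes of bytes whose parity no longer matches."""
    return [
        i
        for i, (b, p) in enumerate(zip(data, parity))
        if (bin(b).count("1") & 1) != p
    ]


sent = b"\x01\x03\xff"
parity = byte_parity(sent)

# A single flipped bit in byte 1 is caught...
print(failing_lanes(b"\x01\x02\xff", parity))
# ...but two flipped bits in the same byte are not.
print(failing_lanes(b"\x01\x00\xff", parity))
```

The second case shows the trade-off being accepted: parity detects any odd number of bit flips within a byte but cannot correct them and misses even-numbered flips, which is why it is paired with stronger protection on the memories.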

When uncorrectable errors are detected anywhere on the outbound path to the PCI Express link, SoC designers must implement some type of error recovery handshake with the application logic. Because packets are often pipelined, simply invalidating an outbound packet and notifying the application logic may not be able to prevent a subsequent packet from being transmitted. Worst case, that packet might indicate a higher-level protocol ‘successful completion’ message related to the corrupted data. Even though the bad packet was never transmitted, the system memory (intended to be updated by the now invalidated packet) will not have valid data, and so receiving a ‘success’ message would be catastrophic.
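One possible shape for such a handshake is sketched below. It is purely illustrative: the OutboundQueue class, the transaction identifiers, and the packet kinds are all hypothetical, and a real design would implement the equivalent in hardware between the controller and the application logic. The idea is that once a packet is invalidated, every later packet belonging to the same transaction is held back as well, and the failed transactions are reported so the application can recover.

```python
from collections import deque


class OutboundQueue:
    """Toy outbound pipeline that suppresses packets of failed transactions."""

    def __init__(self):
        self.pending = deque()
        self.failed_txns = set()
        self.sent = []

    def enqueue(self, txn_id, kind, ok=True):
        # ok=False models an uncorrectable error detected on the datapath.
        self.pending.append((txn_id, kind, ok))

    def drain(self):
        """Transmit what is safe; return the transactions needing recovery."""
        while self.pending:
            txn_id, kind, ok = self.pending.popleft()
            if not ok:
                self.failed_txns.add(txn_id)  # invalidate the bad packet
                continue
            if txn_id in self.failed_txns:
                continue  # never follow bad data with a 'success' message
            self.sent.append((txn_id, kind))
        return sorted(self.failed_txns)


q = OutboundQueue()
q.enqueue(1, "data", ok=False)   # corrupted write data
q.enqueue(1, "success")          # already pipelined behind it
q.enqueue(2, "data")
q.enqueue(2, "success")
print(q.drain(), q.sent)
```

Here transaction 1 is reported back for recovery and its pipelined 'success' message never reaches the link, while transaction 2 proceeds normally.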

PCI Express link reliability, availability
The PCI Express transport is inherently excellent at delivering correct data, so if the SoC designer can provide solid data protection up to the PCI Express controller, correct data transfer will be assured. The key area for improvement here is tracking reliability from the perspective of first-time error-free transfer. If every packet takes three attempts to deliver successfully, the link may be reliable in the sense of correct data delivery, but not in the sense of error-free transfers.
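The distinction between the two senses of reliability can be made concrete with a small calculation. In this sketch, the function link_metrics and its input format (the number of transmission attempts each delivered packet needed) are assumptions made for illustration.

```python
def link_metrics(attempts_per_packet):
    """Summarize delivery quality from per-packet attempt counts."""
    total = len(attempts_per_packet)
    if total == 0:
        raise ValueError("no packets observed")
    first_pass = sum(1 for a in attempts_per_packet if a == 1)
    return {
        "delivered": total,
        "first_pass_rate": first_pass / total,
        "avg_attempts": sum(attempts_per_packet) / total,
    }


# Every packet eventually arrives, but each needed three attempts.
print(link_metrics([3, 3, 3, 3]))
# A healthier link with one replayed packet.
print(link_metrics([1, 1, 2, 1]))
```

In the first case the link delivers every packet correctly yet has a first-pass rate of zero, which is exactly the condition that stays hidden unless the hardware tracks it.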

Long experience with PCI Express has shown that poor quality channels are the number one contributor to poor link reliability. Unfortunately, the channel design is usually out of the hands of the SoC designer, and automotive environments are notoriously harsh, with wide temperature swings and high levels of vibration. The SoC designer can track channel quality through a series of event counters and logging facilities. Of course, the internal data protection errors (both correctable and uncorrectable) should be tracked as previously noted. Figure 1 shows some of the key information which should be considered for tracking at the various layers of the PCI Express protocol. Some of this data may be best understood as a number of events per unit of time, while other data may make the most sense as simple event logs.

Figure 1: Key events to track at the various layers of the PCI Express protocol
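The two views mentioned above, events per unit of time and simple running totals, can be sketched as follows. The EventTracker class and the event name used in the example are hypothetical and are not taken from Figure 1; in silicon these would be hardware counters and timestamp registers rather than software structures.

```python
from collections import defaultdict, deque


class EventTracker:
    """Keeps a lifetime total and a sliding-window rate for each event type."""

    def __init__(self, window_s):
        self.window_s = window_s
        self.totals = defaultdict(int)
        self.recent = defaultdict(deque)

    def record(self, event, t):
        """Log one occurrence of event at time t (seconds)."""
        self.totals[event] += 1
        self.recent[event].append(t)

    def rate(self, event, now):
        """Events per second over the window ending at now."""
        q = self.recent[event]
        while q and q[0] <= now - self.window_s:
            q.popleft()  # discard occurrences older than the window
        return len(q) / self.window_s


tracker = EventTracker(window_s=10.0)
for t in (1.0, 2.0, 11.5):
    tracker.record("nak_sent", t)
print(tracker.totals["nak_sent"], tracker.rate("nak_sent", now=12.0))
```

A rising rate with a modest lifetime total suggests recent degradation, whereas a steady rate from the first power-up points to a marginal channel design.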

An important consideration here is non-volatile storage of the reliability data. At a minimum, the registers in question must survive an SoC reset, because if the link goes down, a system reset may be needed to bring it back up again. Ideally, the data would be preserved in a non-volatile storage medium so it could be accessed after a loss of power or after a long time has passed. Consider that the automobile in which the SoC resides might not be seen for system diagnostics for a year or more! It’s also useful to note that the same link quality data which reflects reliability can be invaluable during initial laboratory bring-up of the system, so it is important to consider an access mechanism that is independent of a working PCI Express link.

Examples of such an access mechanism include a logic analyzer connection, a USB interface for debugging, a processor In-Circuit-Emulator connection, or another proprietary means of reaching the SoC.

Error injection is another capability to consider. Automotive certifications tend to require extensive system testing, and creating the full gamut of potential PCI Express error events can be very difficult. By designing-in the ability to generate those events – both inbound as if they’d been detected on the PCI Express link and outbound by actually causing them to occur on the PCI Express link – SoC designers can greatly facilitate such testing. Furthermore, controlled error injection substantially eases the process of software testing (both embedded firmware and system drivers) and so a comprehensive system of error injection provides a huge benefit overall.
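The two injection directions can be sketched in software as follows. Everything here is an assumption for illustration: the ErrorInjector class, its method names, and the event string are hypothetical, and zlib.crc32 merely stands in for the link CRC. Inbound injection reports an event through the same path real detection logic would use, without anything happening on the link; outbound injection deliberately sends a packet that the far end must reject.

```python
import zlib


class ErrorInjector:
    """Toy test hook for forcing error events in both directions."""

    def __init__(self, on_error):
        # on_error is the callback real detection logic would also invoke.
        self.on_error = on_error
        self.bad_crc_budget = 0

    def inject_inbound(self, event):
        """Report an error as if it had been detected on the link."""
        self.on_error(event)

    def arm_outbound_bad_crc(self, count):
        """Corrupt the CRC of the next count outbound packets."""
        self.bad_crc_budget = count

    def frame(self, payload):
        """Return (payload, crc), with the CRC inverted while armed."""
        crc = zlib.crc32(payload)
        if self.bad_crc_budget > 0:
            self.bad_crc_budget -= 1
            crc ^= 0xFFFFFFFF  # guaranteed mismatch at the receiver
        return payload, crc


log = []
injector = ErrorInjector(on_error=log.append)
injector.inject_inbound("bad_packet_received")
injector.arm_outbound_bad_crc(1)
payload, crc = injector.frame(b"test")
print(log, zlib.crc32(payload) == crc)
```

Because the injected inbound event travels through the normal reporting callback, firmware and driver error handlers can be exercised without needing a physically marginal link.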

Conclusion
Today’s connected automobile contains compute platforms and architectures which closely resemble those used in data centers. Given PCI Express’ successful use in such cloud and data center applications, it is no surprise to see the protocol used in automotive applications. Automotive electronic systems must meet certain standards of reliability and safety, and the PCI Express protocol can fulfill those requirements with a combination of external and internal data protection and reliability features. SoC designers should provide data protection both at rest and in-flight, using the industry best practices of SECDED ECC on all memories and at least byte parity on datapaths. They should also design in link reliability measuring hardware, with non-volatile storage where appropriate, to enable comprehensive system diagnosis and to ease initial link bring-up efforts. SoC designers should also implement the corresponding error injection capabilities to best provide for system-wide reliability testing and software development. Synopsys’ DesignWare® IP for PCI Express supports these features and enables designs for automotive reliability.