Looking Everywhere for IC Reliability

At the nanometer scale, engineers must not only design functionally correct circuits that can be tested and manufactured but also take extra steps to guarantee short- and long-term reliability in the field.

It seems that over the last few years, significant and growing emphasis has been placed on reliability. We see articles and papers on it everywhere. The July/August 2013 issue of IEEE Micro has "Reliability-Aware Microarchitecture" on the front cover. The Global Semiconductor Alliance has a working group on the electrostatic discharge (ESD) structures required for 3D-ICs. And, of course, the IEEE ESD Association and International Reliability Physics Symposium (IRPS) groups are constantly weighing the benefits and design tradeoffs that must be made for such endeavors.

Shrinking design cycles, tighter integration, and faster turnaround times have led to a general increase in the use and reuse of IP from both internal and external sources, helping us put systems together at an accelerated rate, with ever-increasing transistor counts.

For me, one big question stands out above all: How do you know that you've assembled your system-on-chip correctly? What is "correctly," anyway? Each piece of IP has its own quirks, requirements, and history. Version incompatibilities may exist for some IP combinations, ranging from something as simple as the power domains and voltages that need to be hooked up to the complexities of a chip's power state table.

"Correct" may come in several different flavors. Our logic simulators help us a great deal with many of these issues, but not when it comes to the physical implementation. We build transistors in silicon, and ultimately, that's what we need to validate.

Traditional design rule check (DRC), layout versus schematic (LVS), and electrical rule check (ERC) tools have taken us a long way toward avoiding the more obvious manufacturing limitations we have for each process node. But, how do we adequately verify that thin-oxide gates have the correct bias? That high-voltage devices are not driving low-voltage devices past their rated thresholds? Or that the complex symmetry and orientation rules we have are dutifully obeyed?
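To make the idea of a topology-level check concrete, here is a minimal sketch, in Python, of a voltage-aware gate-bias check over a toy netlist: it flags thin-oxide devices whose gate nets carry a voltage above an assumed rating. The netlist structure, device names, voltage limits, and net annotations are all hypothetical; production reliability checkers work from extracted layout data and foundry rule decks rather than hand-built dictionaries.

```python
# Illustrative voltage-propagation check: flag thin-oxide gates biased
# beyond an assumed rating. The netlist format, device names, and limits
# are hypothetical; real checks run on extracted layout data.

THIN_OXIDE_VMAX = 1.2   # assumed rating for core (thin-oxide) devices, in volts
THICK_OXIDE_VMAX = 3.3  # assumed rating for I/O (thick-oxide) devices, in volts

# Toy netlist: each device records its oxide class and the net driving its gate.
devices = [
    {"name": "M1", "oxide": "thin",  "gate_net": "core_clk"},
    {"name": "M2", "oxide": "thin",  "gate_net": "io_data"},   # potential violation
    {"name": "M3", "oxide": "thick", "gate_net": "io_data"},
]

# Assumed static voltage annotation per net (in practice, derived from
# supply connectivity and level-shifter placement).
net_voltage = {"core_clk": 1.0, "io_data": 3.3}

def check_gate_bias(devices, net_voltage):
    """Return devices whose gate-net voltage exceeds their oxide rating."""
    limits = {"thin": THIN_OXIDE_VMAX, "thick": THICK_OXIDE_VMAX}
    violations = []
    for dev in devices:
        v = net_voltage.get(dev["gate_net"], 0.0)
        if v > limits[dev["oxide"]]:
            violations.append((dev["name"], dev["gate_net"], v))
    return violations

for name, net, v in check_gate_bias(devices, net_voltage):
    print(f"{name}: gate net '{net}' at {v} V exceeds oxide rating")
```

The same pattern extends naturally to the high-voltage-driving-low-voltage case: propagate a voltage per net, then compare it against each connected device's rating.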

This next level of verification, the level of reliability, focuses more on the subtle and longer-term effects that these circuits may experience. Detailed SPICE simulations can help immensely. But it is not possible to execute and resolve SPICE simulations across an entire complex digital IC. So SPICE can only help if we know there is a problem, where it is, and what input vectors are needed to properly stimulate that portion of the circuit.

Today's designs often contain elements that can be readily checked from a topology perspective, before we even get to SPICE simulations. One example that comes to mind is that of low-power design. In attempts to minimize both dynamic and static (leakage) power dissipation, lower voltages, thinner oxides, and, where appropriate, transistor stacking are all employed. In transistor stacking, a single transistor with high leakage is replaced by two stacked transistors, each half the width of the original; the result is a slight increase in signal delay, but a significant improvement in static leakage.
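As a rough back-of-the-envelope illustration of why that tradeoff works, the sketch below applies a textbook subthreshold current model, with assumed parameters, to a single off transistor of width W and to a stack of two W/2 devices. The intermediate node of the stack settles slightly above ground, giving the top device a negative gate-source voltage and the bottom device a small drain-source voltage, both of which suppress leakage. All numbers are illustrative, and the model ignores DIBL and body effect, which make the real reduction larger.

```python
# Back-of-the-envelope comparison of OFF-state leakage: one transistor of
# width W versus two stacked transistors of width W/2 (all gates at 0 V).
# Textbook subthreshold model with assumed, illustrative parameters.
import math

VDD, VTH, N, VT = 1.0, 0.3, 1.5, 0.026   # assumed supply, threshold, slope factor, kT/q
I0 = 1e-7                                 # assumed leakage prefactor per unit width (A/um)

def i_sub(w, vgs, vds):
    """Subthreshold current for a device of width w (um)."""
    return I0 * w * math.exp((vgs - VTH) / (N * VT)) * (1 - math.exp(-vds / VT))

def stack_leakage(w, vdd):
    """Find the intermediate node voltage Vx by bisection, return the stack current."""
    lo, hi = 0.0, vdd
    for _ in range(60):
        vx = 0.5 * (lo + hi)
        top = i_sub(w / 2, -vx, vdd - vx)   # top device: gate at 0, source at Vx
        bot = i_sub(w / 2, 0.0, vx)         # bottom device: gate and source at 0
        if top > bot:
            lo = vx                          # node will charge up further
        else:
            hi = vx
    return i_sub(w / 2, -vx, vdd - vx)

single = i_sub(1.0, 0.0, VDD)
stacked = stack_leakage(1.0, VDD)
print(f"single: {single:.3e} A, stacked: {stacked:.3e} A, ratio: {single / stacked:.1f}x")
```

With these assumed parameters the stack leaks a few times less than the single device; measured silicon typically shows a larger gap once DIBL and body effect are included.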

Isolating and confirming the presence of these design elements forms an integral part of an overall comprehensive reliability verification solution. So, too, does the validation of the bulk and contact locations of these transistors.

Much of the additional verification we do on our designs is to ensure robust operation over an extended operating period. What lengths do you go to for the extra validation of design robustness? Are point-to-point resistance, current density, and electromigration simulations on your "must do" list? Or does someone else in another group manage that for you? As the designer or verification engineer, what is driving the next thing you're looking to add to the reliability verification suite? Is it increased design complexity? Aspirations of greater design reliability? Or maybe even greater awareness of product failures and device returns?
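For the interconnect items on that list, one simple form such a check can take is comparing the DC current density in a wire segment against an electromigration limit, and using Black's equation to reason about relative lifetime versus temperature. The sketch below uses made-up geometry, currents, and limits purely for illustration; real limits come from the foundry's reliability documentation.

```python
# Illustrative current-density / electromigration check for one wire segment.
# Geometry, current, and limits are assumed values, not real foundry data.
import math

K_BOLTZMANN = 8.617e-5   # eV/K

def current_density(i_amps, width_um, thickness_um):
    """DC current density in A/cm^2 for a rectangular wire cross-section."""
    area_cm2 = (width_um * 1e-4) * (thickness_um * 1e-4)
    return i_amps / area_cm2

def black_mttf(j, temp_k, a=1.0, n=2.0, ea_ev=0.9):
    """Relative median time to failure from Black's equation: A * J^-n * exp(Ea/kT)."""
    return a * j ** (-n) * math.exp(ea_ev / (K_BOLTZMANN * temp_k))

j = current_density(i_amps=2e-3, width_um=0.5, thickness_um=0.2)   # 2 mA signal wire
j_limit = 2e6  # assumed EM limit in A/cm^2 at the operating temperature

print(f"J = {j:.2e} A/cm^2 ({'OK' if j <= j_limit else 'EM violation'})")
# Relative lifetime penalty of running at 125 C instead of 105 C at the same J:
print(f"MTTF ratio: {black_mttf(j, 398.15) / black_mttf(j, 378.15):.2f}")
```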

Given almost unlimited CPUs and verification cycles, where do you see things going? What's your next killer check going to be? How do you contribute to improving your designs' overall reliability?

— Matthew Hogan is a product marketing manager at Mentor Graphics. He is an active member of the ESD Association, involved with the EDA working group, the EOS/ESD Symposium technical program committee, and the International Electrostatic Discharge Workshop management committee.

My experience is that the many types of memory added to a microcontroller (for example) often require special screening tests that are so proprietary that there is nothing in the literature rating the effectiveness of the various screens. While IP cores are often robust for the particular flows in which they have already been tested, could the addition of more and more memory, and more memory types, be responsible for serious field failure issues? Any comments?

I suspect that memory issues hide behind more problems than we know. My PC, with original memory, has experienced memory errors that defy detection by the internal diagnostics. However, when the memory chips are reseated, the problems have gone away. Furthermore, years ago I experienced compatibility problems with "equivalent" additional memory. I think there is a gray area of performance issues that drag down speed and reliability without being obvious.

From the IC design viewpoint, reliability modeling leaves a lot to be desired, and first-silicon characterization requires step-stress testing to protect the end customer. That is what my comments were about: more and different types of memory being less than robust and almost impossible to model in a predictive manner.

So here is some more bad news, or expensive good news, just my opinions.

No easy answers.

So multiple testing, both at probe and final test, can lead to first-silicon learning. Then, if the mechanisms are studied for each process flow and design-rule node, decisions can be made, but that may often mean moving to a cleaner foundry rather than any layout or process specification change. Test temperatures, hot or cold, can help screen these weak devices, but that is costly. And not much up-front device-level testing prior to full-IC first silicon will uncover the impact of wafer fabrication aberrations.

But where are the case studies published for these types of issues? Many IEEE Reliability Physics papers are all about III-V and bleeding-edge silicon, not about simply making silicon CMOS memory devices more robust instead of using costly screens (but I will look at recent papers if someone can suggest a few).

So design for reliability unfortunately has to include post-test screening, such as raw parametric data analysis at probe, with outliers (relative to the rest of the same wafer) screened statistically (often weighting the outlier distances across multiple tests), or testing in package form at two temperatures and screening out outliers versus the rest of the lot. Then decisions can be made on whether the post-test outlier screen yield is poor for some foundries versus others, or some process tools versus others, and finally, if there are design changes, how the outlier yield compares before and after the change. This is an "automotive market" solution (costly).
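As a minimal sketch of the kind of per-wafer outlier screen described above (with assumed data, test name, and threshold), the snippet below flags die whose parametric reading sits far from the rest of the same wafer, using a robust z-score built from the median and MAD. A real flow would combine many tests and weight their distances, as noted.

```python
# Minimal sketch of a per-wafer parametric outlier screen: flag die whose
# reading sits far from the rest of the same wafer using a robust z-score
# (median / MAD). Data, test name, and threshold are illustrative only.
import statistics

def robust_z_scores(values):
    """Robust z-score for each value: distance from the wafer median in MAD units."""
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    scale = 1.4826 * mad or 1e-12          # 1.4826 makes MAD comparable to sigma
    return [(v - med) / scale for v in values]

# Hypothetical IDDQ-like readings (uA) for die on one wafer; die 7 is suspicious.
wafer_readings = [12.1, 11.8, 12.4, 12.0, 11.9, 12.2, 12.3, 19.7, 12.0, 11.7]

THRESHOLD = 4.0   # assumed cutoff in MAD units; set from historical yield/field data
for die, z in enumerate(robust_z_scores(wafer_readings)):
    if abs(z) > THRESHOLD:
        print(f"die {die}: reading {wafer_readings[die]} uA, robust z = {z:.1f} -> screen out")
```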

With super-high device counts, and ultra-clean foundries charging more than tier-2 foundries, commodity consumer products are at the mercy of end-product testing as the only screen for these new memory-intensive products, which have to assure long battery life but not necessarily a long product life or tolerance of adverse outdoor environments.

Supply chain war. Anyone have a decision matrix that shows which foundry tier may be dangerous for certain device types and counts for consumer markets?

"Design" includes process-device integration profound knowledge, as always. Modeling without failure statistics is perhaps not useful. Do we share real test results vs field results as memory device counts increase? Or as number of "must be matched" analog device counts increase? Fabless design shops are at serious risk, and IDM's who often get all this "profound knowledge" are not sharing. But that's what makes this industry interesting...and costly for investors at current rate of change. As far as consumer are concerned, staying one or two generations behind the "bleeding edge" may minimize surprises.

So the world's fastest game machine may also be the most vulnerable to single-bit failures, as well as to wearout issues from running very hot. And a flash drive beyond 256GB used for OS, applications, and storage may be better a few years from now, but maybe not today? Does anyone have real data? The automotive industry is VERY cautious, and medical device people avoid ICs that are not well screened and well understood. It's the consumer products that get the latest, fastest, and perhaps least robust ICs; you can always get a warranty, but expect to have to use it.