Killer Bugs

Hardware and software bugs are all around us. When an application suddenly dies or a smart phone freezes because of the unanticipated interaction between hardware and software blocks in a system on chip, most users aren’t even the least bit fazed. They usually just reboot and forget about it.

Bugs caused by power are an entirely different matter, however. First, they’re usually fatal. Second, they’re getting much, much harder to detect. And third, they’re harder to fix when they are detected.

“Debugging is getting much more difficult because when the lead generator is powered off, how do you find out that there’s a problem? You may have two power lines with a different Vdd because of connectivity and it will not work,” said Bhanu Kapoor, president of Mimasic, a consultancy focused on low power. “With power you used to have a single voltage. Now you have different supply lines, so you get new problems. Some of this can be detected in the netlist, but some of these problems also show up in the course of manufacturing.”

The problem is magnified by the addition of multiple power islands and multiple cores.

“Correct delivery of a power supply is at the core of many of the power issues, and traditional testing methods use a fault model that is based on wires erroneously connected to supply or ground,” Kapoor said. “And needless to say, incorrect delivery of power will result in fatal issues for proper operation of the chip. For example, an isolation cell at the output of a power domain ensures active regions receive meaningful signal when this domain is shut down. If the supply to the isolation cell itself is switched off due to either an incorrect wiring or improper placement of the isolation cell, then active regions will see some unknown values that will lead to failure of operation in this mode.”
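The isolation-cell failure Kapoor describes can be expressed as a simple rule: in any power mode where the source domain is off, the isolation cell's own supply must still be on. The following is a minimal sketch of that check, not a description of any vendor's tool. The domain names (PD_AON, PD_CPU), the cell names and the power-mode table are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IsolationCell:
    name: str
    source_domain: str  # domain whose outputs the cell clamps
    supply_domain: str  # domain that powers the cell itself


def find_unpowered_isolation(cells, power_modes):
    """Return (mode, cell name) pairs where the source domain is off
    and the isolation cell's own supply is off as well."""
    problems = []
    for mode_name, on_domains in power_modes.items():
        for cell in cells:
            source_off = cell.source_domain not in on_domains
            supply_off = cell.supply_domain not in on_domains
            if source_off and supply_off:
                problems.append((mode_name, cell.name))
    return problems


# Hypothetical design: one cell is correctly fed from an always-on
# domain, the other is mistakenly fed from the domain it isolates.
cells = [
    IsolationCell("iso_good", source_domain="PD_CPU", supply_domain="PD_AON"),
    IsolationCell("iso_bad", source_domain="PD_CPU", supply_domain="PD_CPU"),
]

# Hypothetical power modes: each maps to the set of domains that are on.
power_modes = {
    "run": {"PD_AON", "PD_CPU"},
    "sleep": {"PD_AON"},
}

problems = find_unpowered_isolation(cells, power_modes)
print(problems)  # [('sleep', 'iso_bad')]
```

Nothing is wrong with either cell while every domain is on. The miswired cell only shows up in the sleep mode, which is why this class of bug escapes checks that never switch a domain off.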

Similar things will happen if the power supply for a level-shifter has wiring issues. It may be worse here because, depending upon the voltage differences, the issue may show up only intermittently. There may also be sneaky, hard-to-find leakage paths that drain the battery much faster without any functional problem ever showing up. They will slip through testing methods and surface only as a fatal business issue.

Consider a real-world example: A major wireless chipmaker was recently headed to tapeout when it ran some additional tests and found eight bugs related to wrong implementations of power intent. “They would have caused catastrophic failures,” said Peter Hardee, director of solutions marketing at Cadence Design Systems. “Things are getting a lot more complex. You may have power domain ‘A’ physically separated from power domain ‘B,’ and at some point they need to talk. The problem is that the wires may run through power domain ‘C.’ Was ‘C’ on or off when you verified the chip?”
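Hardee's question about domain ‘C’ amounts to a check across power modes: whenever ‘A’ and ‘B’ are both on and need to talk, every domain the wire passes through must also be on. The sketch below illustrates that idea under a loud assumption, namely that any buffers along the route are powered by the domain they physically sit in. The net name, the routes and the mode table are hypothetical.

```python
def find_dead_feedthroughs(routes, power_modes):
    """Return (mode, net, off_domains) for every mode in which the source
    and destination are both on but a domain along the route is off.

    Assumption: buffers on the route are powered by the domain they sit in.
    """
    problems = []
    for mode_name, on_domains in power_modes.items():
        for net, (src, dst, through) in routes.items():
            if src in on_domains and dst in on_domains:
                off = [d for d in through if d not in on_domains]
                if off:
                    problems.append((mode_name, net, off))
    return problems


# Hypothetical net: driven from domain A, received in B, routed through C.
routes = {"a_to_b": ("A", "B", ["C"])}

# Hypothetical power modes: each maps to the set of domains that are on.
power_modes = {
    "all_on": {"A", "B", "C"},
    "c_off": {"A", "B"},
    "b_off": {"A", "C"},
}

dead = find_dead_feedthroughs(routes, power_modes)
print(dead)  # [('c_off', 'a_to_b', ['C'])]
```

A chip verified only in the all-on mode passes. The failure appears only when the mode with ‘C’ off is included in the sweep.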

It’s not that the wireless chipmaker didn’t understand all of these issues, either. Even at the most sophisticated chip companies, where power intent and design were part of the up-front architectural decisions, problems still surface late in the design cycle. A device may be functionally verifiable but have fatal errors. And there’s no magic button to push or even an integrated tool flow that solves everything.

“A lot of things that used to be secondary issues are now primary issues,” said Vic Kulkarni, general manager of the RTL business unit at Apache Design Automation. “In the past, you could just put a lot of margin into the design, but the voltage has to be high for that to work. Today, the margin is no longer there.”

Dueling priorities
Creating SoC designs has always been about making tradeoffs between area, power and performance. Before 90nm, however, power was more of an afterthought than part of the initial planning process. At 65nm and beyond, it is now an integral part of every chip, along with software and IP, which also were afterthoughts at older process nodes.

“The reality is if you have a performance issue or a power problem, it stems from the fact that you may have validated the hardware in isolation, but not in the context of the software application,” said Shabtay Matalon, ESL marketing manager in Mentor Graphics’ Design Creation Division. “There are ways to fix functionality in terms of software. But I’m not aware of one that can fix power or performance by fixing the software.”

IP is likewise a problem when it comes from multiple sources and when it involves multiple voltages. Big IP vendors are all emphasizing power-aware IP so that it can be re-used more easily. But the amount of IP inside all SoCs is growing steadily, in large part because there are too few engineers inside companies to re-invent that IP and still get a chip to market on time.

Not all of that IP runs at the same voltage, and not all of it is necessarily used in the manner the IP vendor intended. And while power formats such as UPF and CPF are supposed to account for that, some of it still slips through the cracks. In the best-case scenario, some of that can be fixed with software. There are plenty of cases that don’t fit that description, however.

“The fatal bugs are the ones that kill the company before the product ships,” said the CEO of MCCI Corp. “What causes those are mask spins. Behind those are system-level problems. You hook it up to a critical system and it doesn’t work. It’s down at the PHY level or the RTL level and it’s not accessible to software.”