Power Reduction Through Sequential Optimization

Dealing with power is a multifaceted challenge and is an equal-opportunity problem — everybody can contribute to the solution and at many levels of abstraction.

At the architectural or system level, fundamental tradeoffs are made: the engineering team decides how much memory the system needs, what type of processor to use, and what the performance, area and power targets are, among other things. Some teams use a language-based description, while others do spreadsheet-based analysis or even back-of-the-envelope calculations.

“But this is not really optimization. It’s more like a tradeoff analysis or a rough estimation,” said William Ruby, senior director of technical sales at Ansys-Apache.

Once the architecture is defined, some engineering teams may use high-level synthesis, which has been getting some good traction as of late. High-level synthesis has its place for certain types of designs, but for other designs, he pointed out, “You dive in and start writing RTL code — VHDL or Verilog — and that is where more rigorous power estimation and power analysis really comes in.”

For years, combinational analysis/optimization techniques have been used to realize power reduction. As Abhishek Ranjan, senior director of engineering at Calypto explained, the combinational changes that happened in the past mostly involved making changes within the flop boundaries and this restriction has been there for ages.

“It was common wisdom that whatever flop boundaries or whatever sequential boundaries were in the design were sacrosanct and could not be changed,” Ranjan said. “As a result, all of the synthesis tools, implementation tools and even power optimization tools were working with the assumption that the sequential boundaries cannot be changed. The biggest reason for having this restriction was verification. Once the architect for the chip had actually laid out the sequential boundaries, then the RTL designer and also the implementation tools had to honor those boundaries because the verification was done with these boundaries intact.”

That artificial restriction was driven to a large extent by the lack of proper verification, and it was very limiting. As he noted, when you restrict yourself to within the flop boundaries, the only things you can play with are the combinational gates, the sizing of the gates, the multi-threshold libraries and the data path architecture, and these were the only techniques available to work on power or timing. Over time synthesis tools became smarter and squeezed whatever they could out of combinational optimization. “They do a very good job of optimizing data path architectures, coming up with the right sizes for your gates and wires but there is nothing much that can be done for power just based on the conventional changes.”

Sequential vs. combinational changes
The ability to make sequential changes to the design had been talked about in academia for quite some time because it helps the designer explore other possibilities beyond what is possible by the combinational changes.

“The idea is very simple,” Ranjan said, “if you look beyond the flop boundaries, if you can see through cycles in advance and back in time — that lets you explore the flow of data much better. The technique has been used at the architecture level where people used to decide on the caches and where the memories should be. Then the architect, based on his knowledge or the rule of thumb, would decide what the sequential nature of the design should look like to provide some power savings. But at such a high level the accuracy of estimation is very, very low, so whatever choices you have made at that level are purely rule of thumb or spreadsheet-driven. Only at the RTL can you get that accuracy of power numbers, which can help you drive the optimization much more meaningfully.”

Check the registers
Koorosh Nazifi, engineering group director for low power and mixed signal initiatives at Cadence, says that sequential optimization involves optimization of register elements — anything that has to do with register elements as represented by flip-flops in a given circuit. Historically, it has been timing-driven, and optimization involves either managing the clock that is feeding those registers and/or managing the data that is feeding into the registers or data that is output from the registers. There are multiple techniques that are used to ensure that the timing, with regard to the arrival of the data or the data that is being output, is managed appropriately to deliver the required performance.

“There are many techniques that are used that basically involve either looking for opportunities to merge flip-flops or in some cases maybe add flip-flops,” said Nazifi. “There are techniques like pipelining or retiming where you are looking at multi-stage register banks and you try to again make sure that you balance the data arrival time across the different register banks. In most cases, it involves adding additional registers and so forth.”
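The balancing idea behind retiming can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the gate delays are hypothetical, there is a single movable register on a single path, and real tools work on a full timing graph with library data.

```python
def best_register_position(gate_delays):
    """Return (position, period) for one register placed in a chain of gates.

    Position p means the register sits after the first p gates, so stage one
    holds gate_delays[:p] and stage two holds gate_delays[p:]. The clock
    period is set by the slower of the two stages.
    """
    best = None
    for p in range(len(gate_delays) + 1):
        stage1 = sum(gate_delays[:p])
        stage2 = sum(gate_delays[p:])
        period = max(stage1, stage2)
        if best is None or period < best[1]:
            best = (p, period)
    return best


# Hypothetical gate delays in picoseconds along one path.
delays = [400, 300, 900, 200, 200]

# As written in the RTL, the register sits after the first gate.
original_period = max(sum(delays[:1]), sum(delays[1:]))

# Retiming moves the register to balance the two stages.
position, retimed_period = best_register_position(delays)
print(original_period, position, retimed_period)
```

With these made-up numbers the register moves from after the first gate to after the second, and the slower stage shrinks accordingly. The logic function of the path is unchanged; only the position of the sequential boundary moves.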

In the context of low power, sequential optimization involves managing the same set of information, in particular clocks, he noted. “The technique that is used is primarily clock gating: gating the clock that is feeding the registers in order to minimize the activity when the output data is not required in certain cycles. There are also techniques with regard to managing the output of those registers, which involve looking at the combinatorial logic between the register elements and maybe using sizing to reduce the timing slack between the register banks and so forth.”

What’s really interesting in sequential optimization, observed Sean Safarpour, technology director at Atrenta, is “we start going across flops … and by [doing that] you can still do all of the combinational optimizations plus really interesting analysis like find out that a register fanning out to other registers is, for example, not toggling. If you find out that this particular register isn’t toggling, you can use that information to turn off the clock of the subsequent registers. This kind of structure is actually very common in designs. You have registers feeding other registers through some clouds of logic. The effect of this is that you can turn off the combinational part, and you can turn off the switching of the sequential elements as well as the clock. The clock power is a major component of power consumption, so the more you can turn off the clock for some of the circuitry the more savings you get. Clock power and dynamic power are a big concern, and when using the traditional techniques you only get a few percent worth of reduction. Once you start using sequential you can get, depending on the circuit and the exact details, between 10% and 25% reduction.”
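The register-feeding-register structure described above can be modeled at the cycle level to show where the savings come from. The following Python sketch is an assumption-laden toy, not any vendor's algorithm: the logic function `f`, the 8-bit width, the reset values and the stimulus are all hypothetical.

```python
def simulate(enables, data, gate_downstream):
    """Cycle-level model of register A feeding register B through logic f.

    A loads data[t] only when enables[t] is set. B captures f(A). When
    gate_downstream is True, B is clocked only in the cycle after A loaded,
    because otherwise B would just recapture the value it already holds.
    Returns (trace of B values, number of clock edges delivered to B).
    """
    def f(x):  # hypothetical cloud of logic between the two registers
        return (x * 3 + 1) & 0xFF

    a, b = 0, 0
    a_loaded_last_cycle = True  # clock B once after reset so it holds f(A)
    b_trace, b_clock_edges = [], 0
    for en, d in zip(enables, data):
        clock_b = a_loaded_last_cycle or not gate_downstream
        next_b = f(a) if clock_b else b
        b_clock_edges += int(clock_b)
        next_a = d if en else a
        a, b = next_a, next_b
        a_loaded_last_cycle = bool(en)
        b_trace.append(b)
    return b_trace, b_clock_edges


# Hypothetical stimulus: A is enabled in only two of eight cycles.
enables = [1, 0, 0, 0, 1, 0, 0, 0]
data = [5, 6, 7, 8, 9, 1, 2, 3]

plain_trace, plain_edges = simulate(enables, data, gate_downstream=False)
gated_trace, gated_edges = simulate(enables, data, gate_downstream=True)
print(plain_edges, gated_edges, plain_trace == gated_trace)
```

In this toy run the downstream register sees far fewer clock edges while producing the same values every cycle. The gating condition is derived from the previous cycle's enable, which is exactly the kind of information that is invisible when analysis stops at the flop boundary.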

As with many advanced technologies, there is a tradeoff, pointed out Lawrence Loh, vice president of engineering at Jasper Design Automation. In this case, it is that verification becomes a challenge. “A lot of tools try to optimize power using gated clock logic. There are some ideas about where the clock can be disabled. The EDA tool provider also has a dilemma: do they want to deliver something very aggressive that potentially has some risk, meaning they cannot guarantee at all times the equivalence, and will the customer accept those changes unless they are proven 100% equivalent? If they go back to only something that they can prove, because most of the tools have some localized sequential equivalence to make sure that those constructs are correct, then the opportunity is limited. That’s the dilemma they have to face. The customer is facing the same problem. ‘I take something aggressively from a tool, but at the risk of having a bug in the design I cannot verify, or I take something that I can be sure of, but with sub-optimal optimization.’”

There are two ways to balance this, Loh explained. “One of them is to manually do some power optimizing combined with some low hanging fruit that’s kind of safe — knowing that it will not be the most optimal but at least they know what they are changing. The problem of getting the optimizations from the tool is that you don’t really know what’s happening so you don’t know how to assess the risk and how to verify this. More and more customers that are very aggressive are adopting sequential equivalence techniques.”

Just like when engineering teams started to use synthesis tools, he believes a lot of designers are experiencing anxiety here. “We used to do schematics and the next thing we know we are doing logic synthesis. The tool does something, but how do I know it’s correct? Then combinational equivalence came along and removed a tremendous amount of anxiety and reduced the effort of doing gate-level simulation. There’s a similar case happening when you have a tool that can tell you what the differences are and whether there even are differences. Ideally, if there are no differences, I am home-free. If there are differences, show me what the differences are. Then I can assess if that is something I care about or not. So having sequential equivalence tools to analyze if the designs are equivalent, and in which areas they are not equivalent, will give comfort to both use cases. One approach is where they take the low-hanging fruit but do some of the work by hand so they can better understand the situation. A second approach is to do an aggressive optimization and find a way to verify them. It could be painful, it could be a lot of gate-level simulation, and it could be a lot of regressions to test. Having a good sequential equivalence-checking tool can solve both problems: the comprehension part and the verification part.”
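The "are there differences, and if so show me" workflow can be illustrated with a small Python sketch. Note the swap: this is bounded, exhaustive simulation of two hypothetical cycle-level models, not the formal sequential equivalence checking that commercial tools perform. The 2-bit datapath, the step functions and the reset states are all assumptions made for the example.

```python
from itertools import product


def reference_step(state, inp):
    """Original design: A loads when en is set, B captures A + 1 every cycle."""
    a, b = state
    en, d = inp
    return (d if en else a, (a + 1) & 0x3), b  # (next state, output)


def gated_step(state, inp):
    """Optimized design: B is clocked only in the cycle after A loaded."""
    a, b, a_loaded = state
    en, d = inp
    next_b = ((a + 1) & 0x3) if a_loaded else b
    return (d if en else a, next_b, bool(en)), b


def first_mismatch(step_x, init_x, step_y, init_y, inputs, depth):
    """Drive both models with every input sequence of `depth` cycles.

    Returns the shortest prefix of the first sequence on which the outputs
    differ, or None if no difference is observed within the bound.
    """
    for seq in product(inputs, repeat=depth):
        sx, sy = init_x, init_y
        for cycle, inp in enumerate(seq):
            sx, out_x = step_x(sx, inp)
            sy, out_y = step_y(sy, inp)
            if out_x != out_y:
                return seq[:cycle + 1]
    return None


# All (enable, data) pairs for a 2-bit data input.
inputs = [(en, d) for en in (0, 1) for d in range(4)]

# Correct reset: B is clocked once after reset, so the designs agree.
ok = first_mismatch(reference_step, (0, 0), gated_step, (0, 0, True), inputs, 4)

# Buggy reset: B's clock is gated off from the start, so B never initializes.
bad = first_mismatch(reference_step, (0, 0), gated_step, (0, 0, False), inputs, 4)
print(ok, bad)
```

When the check returns a short input sequence rather than `None`, the designer has something concrete to inspect, which is the comprehension half of the problem. The verification half needs a real proof, since a bounded search like this one only covers sequences up to the chosen depth.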

Fundamentally, it is widely believed that at the RTL the engineer should still be in ultimate control of their design, and that while they should use anything and everything to achieve power reduction, they must be intelligent about it, Apache’s Ruby said. “They have to decide if it really makes sense to do this type of RTL change because the RTL change by itself is not guaranteed to preserve the timing and reduce power if there is no precise timing information available.”

Today, the name of the game is still fundamentally RTL design for low power, and the tools available from Apache, Atrenta, Cadence, Calypto, Jasper and others enable engineers to do just that in the context of the engineering team’s overall decision-making process for the project. It is still not an automatic process, but that could change in the future. Efforts are underway to streamline the power reduction process.

Part two will address some of the power reduction/optimization techniques and approaches being worked on in the industry today.