Guest editorial: Low power is everywhere

Design for power extends beyond mobile applications. This article shows what the industry is doing and where it is having the greatest problems...

Introduction

Meeting power budgets for most System-on-Chip (SoC) designs today is no longer a requirement for mobile applications only. Almost every market segment now has some concern with designing in low power features, although the driving factor differs among them. The primary impetus for low power design initially came from the mobile market and the need to extend battery life; other segments have their own reasons for making power a primary design requirement.

For example, the advent of the internet and social media heavily drives the server and networking market segments, where large server clouds and compute farms need to work reliably without overheating; their primary concern is reducing the amount of expensive energy required for operation and air conditioning. Other markets, such as the multimedia and set-top box segments, are plugged into the wall, but ‘green’ initiatives and the high cost of electricity have pushed them toward greater energy efficiency through low power techniques similar to those used in the mobile application space.

Power is now a primary requirement for all designs; it is no longer just about performance or area, and there are several factors that designers need to take into consideration to meet stringent low power requirements. Several key components comprise a low power design and offer methods for controlling power:

Optimization engines deliver rapid time-to-market and quality of results

Technology process

Even process technologies from semiconductor vendors have had to adapt. It used to be that a low-power (LP) process could be used in place of a generic (G) or high-speed/high-performance (HS/HP) process to provide significant static leakage savings. Back then, the LP process typically offered 20-30% lower performance than the standard process in exchange for 1.5X less dynamic and up to 50X less static power dissipation, which helped extend battery life in most portable designs.

Nowadays, at 28nm and below, there are more process variations targeted at low power to meet the various market demands. LP and HS/HP processes continue to be offered, where LP is still targeted at mobile applications and extending battery life. However, there are now process variations in between that offer both performance and lower power, depending on the application. For example, TSMC offers a 28nm HPL process using high-k metal gates that reduces both operating and standby power by about 40% (vs. HP), which is best suited for cellular, wireless and programmable logic devices. TSMC also offers a 28nm HPM process, which provides both high performance and low leakage targeted specifically at mobile consumer applications.

The choices of standard cell library architectures for a targeted process have also expanded quite a bit. There used to be only a single standard cell library architecture (fixed cell height) available per process node, one characterized for the different voltage threshold (Vt) points. Now, there are several standard cell height (number of grids) choices that offer performance, power and density tradeoffs. For example, Synopsys offers standard cell libraries with lower cell heights for consumer applications and taller cells for higher performance applications, with examples shown below.

In addition to the different cell architectures, channel length variants are also available, greatly multiplying the number of cell variants available in libraries. Library vendors, like Synopsys, are creating variations of cells with different channel lengths within each library. Generally, High-Vt (HVt) libraries are better for power and worse for timing, while Low-Vt (LVt) libraries are much better for timing but very leaky. With libraries containing multiple channel lengths, it is possible to achieve better timing and lower leakage with a Standard-Vt (SVt) cell with a longer channel than with an HVt cell of standard channel length. As shown below, for the 28nm HPM process, a shorter-length SVt cell would provide 17% lower performance and 30% lower leakage than a standard-length LVt cell, making it more compelling to use while also saving an extra mask layer.

Starting at 28nm and below, we are seeing the advent of other low power process variations such as fully depleted silicon-on-insulator (FD-SOI) and fin-based field effect transistors (FinFET). FD-SOI can provide high performance with approximately 35% lower power compared to traditional bulk CMOS technologies, according to STMicroelectronics. FinFET technology uses 3D transistors, which offer up to 50% power savings with about 35% better performance compared to traditional planar transistors at 22nm, according to Intel.

I think we're close on our definitions. I am thinking of any accesses to the stack carried out by code running on an ARM system which complies with the ABI. That covers parameters, spills, automatic variables, caller/callee-saved registers etc.
The ABI says that the stack pointer must be word aligned at all times (and doubleword-aligned at external boundaries). It doesn't actually say that you can't push/pop two halfwords at once in a pair of atomic operations but doing so would be impractically difficult while sticking to the ABI.
Yes, you can use halfword memory accesses indexed via SP, in the sense that the instruction set permits it. But it isn't possible (or at least practical) to do so in a way which doesn't violate the ABI.
The ABI for AArch64 specifies quadword alignment for SP at all times (whether externally visible or not) so, although instructions may exist for sub qword stack accesses, they aren't practically usable in this context.

I think we may have different definitions of spilling. I think of spilling as any moving of a register value into memory (e.g., due to register pressure). I am guessing that you may mean something else, perhaps saving callee-saved registers (where the callee cannot conveniently know the size of register contents nor whether the value already has a slot allocated in a previous stack frame--interprocedural optimization might be able to discover such).
I also do not understand your statement "All ARM stack accesses are 32-bit" since ARM provides LDRH/STRH using the stack pointer, which is just a GPR after all (I doubt even AArch64--which makes SP a non-GPR--prohibits sub-word accesses using SP). (Pushing and popping smaller values would be problematic in making SP unaligned.)
By the way, my gmail.com address is 'paaronclayton'.

Thanks for the response.
I was referring to any spilling of variables onto the stack. All ARM stack accesses are 32-bit so any spilled variable (or parameter, or variable allocated to the stack) takes up a full word.
To my knowledge, the register allocator does not take this into account when allocating registers to variables within procedures. If it is possible to save/spill a pair of variables using LDRD/STRD, that is sometimes down to serendipity as I understand it (some forms of these instructions require that the registers be a consecutive even/odd pair).
You are right that you don't need to stick to the ABI for internal functions. Not doing so is obviously potentially dangerous, as I'm sure you are aware!
Leaving the stack aligned to anything less than a word boundary when interrupts are enabled can be especially perilous.
Chris

I based my comment on the statement "Remember, too, that local variables, regardless of size, always take up an entire 32-bit register when held in the register bank and an entire 32-bit word in memory when spilled on to the stack." (page 4 of "Efficient C Code for ARM Devices")
If it meant callee spilling, I could understand the constraint. (This limitation could motivate a compiler optimization that would preferentially allocate 32-bit values into callee save registers.) I could also understand how such could make debugging easier. (Also on ARM, code density--or even performance as such has sometimes been implemented using paired word operations--goals might promote use of store/load multiple word.)
(The ABI forcing such expansion for function parameters may be a concession to simplify debuggers or perhaps compilers. In theory, one does not need to use the ABI, at least for internal functions.)

I am the author of the first of those papers which Brian cited. Glad you found it interesting.
I'm interested in your comment about local variables being expanded to 32-bit in the cache. Can you expand on that a bit more? I don't believe it has to be that way.

The former paper was somewhat interesting (I was surprised that 16-bit local variables would be expanded to 32-bit even in the cache) and points to some unfortunate limits of C and its compilers.
The latter article was more focused on the specific topic of exploiting the benefits of MIPS MT. I had already understood the principles, but the examples were interesting.
One problem seems to be that this information is scattered. Because the information content is vast and has complex interconnections, it seems that something like a wiki could be useful. Such a project would be outside the scope of EE Times (alone).
I do not know that such would be useful to anyone. Since I am just an information junkie, my feelings should have little weight.

Yes, it must be difficult for professionals to handle so much complexity (made worse by communication barriers even within organizations)--and with severe time limits and pressure to predict the result more than a year in advance. I am just a thinker (not even an academic), and even the limited complexity of which I am aware makes my head hurt (almost literally).

I did run two articles on software and power a couple of weeks ago:
Efficient C code for ARM devices http://eetimes.com/design/eda-design/4370230/EDADL-Efficient-C-code-for-ARM-devices?
and
Optimizing performance, power, and area in SoC designs using MIPS® multi-threaded processors http://eetimes.com/design/eda-design/4370392/Optimizing-performance--power--and-area-in-SoC-designs-using-MIPS--multi-threaded-processors?

While this article focuses on low-level techniques--as is reasonable coming from someone at Synopsys--there might be interest in overviews of higher level (architectural, microarchitectural, and software) techniques.
Techniques like approximate computation (mainly for audio/visual but also sometimes applicable to sensor data analysis) and analog computation (as in Lyric Semiconductor's error correction technology) seem to show some promise. (These can also apply to predictive structures like branch predictors.)
Asynchronous design, "Power Balanced Pipelines" (Sartori et al.), and other general microarchitectural techniques look interesting (at least to someone with an academic interest in computer architecture).
Techniques to improve performance can also improve power efficiency.
Software techniques can include optimizations to improve cache utilization (code density and code and data layout can help) and the scheduling of work to reduce the number of power transitions.
Software optimizations which improve performance can also improve power efficiency by avoiding unnecessary work and improving hurry-up-and-go-to-sleep effectiveness.
Even the little I have read in this area indicates that there are a lot of interesting techniques for managing power use.

I think what you are pointing out is that so many issues associated with complete product design are interrelated and that the consumption of power and the removal of the heat it generates impacts every facet of system design. Thanks for adding some of those dependencies.