Capturing Performance

The challenge of working out the best performance for a given power budget is not a new one, but in many power-sensitive applications, the balance is tricky and requires sophisticated techniques.

This is especially true in the media processor market where many systems companies are held back by power, energy and thermal issues.

“It’s really not a battery problem, it’s a thermal problem,” said Jem Davies, an ARM fellow and vice president of technology for the company’s media processing division. “If somebody produced a battery tomorrow that radically transformed your charging capabilities, it wouldn’t change my problem at all. My problem is still the thermal budget. For a huge number of our customers, they are thermally limited — different numbers for different classifications, but certainly at the high end, superphones, that’s a thermal problem.”

How to deliver extra performance within the same thermal budget is the question.

The media processor business is brutally competitive, requiring an arsenal of technical strategies to design the highest-performing GPUs at the lowest power. For ARM’s graphics processors, the techniques used by its design teams include tile-based rendering and multicore architectures.

“We still fundamentally believe in tile-based rendering and it’s more than just, strictly speaking, putting the tiler with the GPU. It’s a way of thinking about working on the data locally to keep everything as close as possible, and external memory bandwidth to a minimum,” he said. Case in point: you can perform on the order of 10,000 floating-point operations for the amount of energy it takes to reload an L1 cache from external memory.
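The locality idea behind tile-based rendering can be sketched in a few lines. This is a hypothetical toy model, not ARM's hardware: the frame is split into small tiles so each tile's working set stays in fast "local" storage, and the external frame buffer is written only once per tile. The tile size and shading function are illustrative assumptions.

```python
# Minimal sketch of tile-based rendering: work on one tile at a time so
# the working set fits in on-chip memory, and touch external memory only
# once per tile. Sizes and shading are hypothetical.

TILE = 16  # 16x16 pixel tiles, a typical size for tile-based GPUs

def render_tiled(width, height, shade):
    """Render a frame tile by tile; shade(x, y) returns a pixel value."""
    frame = [[0] * width for _ in range(height)]
    for ty in range(0, height, TILE):
        for tx in range(0, width, TILE):
            # Shade one tile entirely in "local" memory...
            tile = [[shade(x, y)
                     for x in range(tx, min(tx + TILE, width))]
                    for y in range(ty, min(ty + TILE, height))]
            # ...then write it back to the frame buffer in one pass,
            # keeping external-memory traffic to a minimum.
            for dy, row in enumerate(tile):
                frame[ty + dy][tx:tx + len(row)] = row
    return frame
```

The point of the structure is that all intermediate reads and writes for a tile hit the small `tile` buffer; only the final result goes out to `frame`, which stands in for external DRAM.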

That dictates the way in which ARM designs its GPUs. “We try and keep stuff locally. And actually what ARM deems ‘local’ has been redefined. It used to be local as in ‘on chip/off chip,’ but now with the change in silicon geometries we’ve got in terms of the way in which transistors are scaling but the wires aren’t — the way that power in the transistor is scaling versus the way the power in the wires isn’t,” Davies noted. “The way that scaling equation is now not scaling uniformly means that if you’re not careful, if you try and do something centrally and have everything talking to it, you end up with all wires and the chip doesn’t shrink properly.”

And that drives everything ARM does because, he pointed out, “somebody’s actually got to build this. And chances are, because we’re in high volume in these newer products in incredibly high volume markets, these are the people using the new, modern processes.”

When it comes to multicore, Davies said it’s all about scalability — and learning from partners. “They license stuff from you and you think you know what they’re going to do with it, and then they go off and do something completely different with it, which is great because often times you sell them one license and they go off and build five different chips. What they’re trying to do is hit different use cases, different performance points — and oftentimes when they are buying, they don’t actually know what they’re looking for. It’s a hugely attractive concept for them. ‘I can scale from this performance point right up to this performance point. I can scale from this area to that area. I can hit this device market and that device market.’ We don’t want to make 20 GPUs a year — that would be hard — so we’re looking to introduce scalability to everything we do. Scalability in number of cores, scalability in the way in which devices get implemented.”

Davies recommends the following approaches for optimizing an architecture for energy efficiency:

–Try not to do things twice, where doing things once is better.
–Try not to do things that it turns out you don’t need to do. Graphics is all about throwing a ridiculous amount of content at the GPU — only half of which ever appears on the screen.
–Apply intelligence to the ways in which you do things — it pays extremely good dividends, and it often requires reordering the way that you do things.

“If you don’t do it until the last possible moment that you have to do it, chances are by the time you get there, you’ve found out you don’t have to do it at all. For example, forward pixel kill buffer. We do a whole bunch of stuff, put it in a pipeline, it drops out the other end and it gets done. But actually, enlarge the size of that buffer, spend some silicon on increasing that buffer and at the end of it, as you’re pulling it out, say, is there something in the buffer that’s actually going to override this? Yes. Alright, don’t do it. There, you would actually spend some silicon to save on energy, and oftentimes, that’s the tradeoff you’re making,” Davies explained.
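The forward pixel kill idea Davies describes can be sketched roughly as follows. This is a hypothetical software model, not ARM's actual hardware: fragments wait in a small buffer before being shaded, and if a later opaque fragment for the same pixel arrives while an earlier one is still queued, the earlier one is killed, so its (expensive) shading work is never done. The buffer depth and fragment fields are illustrative assumptions.

```python
from collections import deque

# Hypothetical sketch of a "forward pixel kill" buffer: queued fragment
# work is discarded if a later opaque fragment covers the same pixel
# before the earlier fragment has been shaded. Spending silicon on a
# deeper buffer increases the chance of catching such overdraw.

class KillBuffer:
    def __init__(self, depth=8):
        self.queue = deque()   # pending [pixel, color, opaque, dead] entries
        self.depth = depth
        self.shaded = 0        # fragments actually shaded (energy spent)
        self.killed = 0        # fragments killed before shading (energy saved)

    def submit(self, pixel, color, opaque):
        if opaque:
            # A new opaque fragment overrides any queued work for this pixel.
            for frag in self.queue:
                if frag[0] == pixel and not frag[3]:
                    frag[3] = True       # mark as killed
                    self.killed += 1
        self.queue.append([pixel, color, opaque, False])
        if len(self.queue) > self.depth:
            self._retire()

    def _retire(self):
        pixel, color, opaque, dead = self.queue.popleft()
        if not dead:
            self.shaded += 1   # only now do the expensive shading work

    def flush(self):
        while self.queue:
            self._retire()
```

For example, submitting a translucent fragment and then an opaque fragment for the same pixel leaves only one fragment to shade: the deferred check found out the first one "didn't have to be done at all."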

The easiest one is caches. Spend silicon on cache, reduce memory bandwidth, and memory bandwidth is power. On LPDDR4 it’s roughly 100mW per gigabyte per second, so anything you can shave off that pays dividends.
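The arithmetic behind that rule of thumb is simple enough to write down. Using the ~100mW per GB/s figure quoted above (a rough rule of thumb, not a datasheet value), the DRAM power cost of a given traffic level, and the saving from shaving bandwidth off, fall straight out:

```python
# Back-of-envelope cost of memory bandwidth, using the ~100 mW per GB/s
# figure for LPDDR4 quoted in the article (approximate, not a spec value).

MW_PER_GBPS = 100.0  # mW per GB/s on LPDDR4 (rule of thumb)

def memory_power_mw(bandwidth_gbps):
    return bandwidth_gbps * MW_PER_GBPS

# A GPU streaming 6 GB/s spends roughly 600 mW on DRAM traffic alone;
# shaving 1 GB/s off via better caching saves on the order of 100 mW.
print(memory_power_mw(6.0))                           # → 600.0
print(memory_power_mw(6.0) - memory_power_mw(5.0))    # → 100.0
```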

He noted that ARM is now looking at not just shaving memory traffic off, but also sending out better quality memory traffic. “If you send out your memory transactions in a way that the DRAM is better able to turn into LPDDR bursts, then that will actually save power.”

Memory is central in the power equation, and not just for ARM. Frank Ferro, senior director of product marketing at Rambus, noted the key is to keep the memory in the lowest power state possible, especially DDR. That includes the ability to switch that memory on and off, and keep it off when you’re not using it.

“We follow the controllers’ lead in terms of the ability to put the memory into a low power state,” Ferro said. “We’re going to provide all the hooks and the support for the different power states that are required for the memory, especially in LP DDR, but we also have in DDR multiple power states that we support. We allow the memory to be throttled into various power states based on the commands from the system.”

Those kinds of decisions used to be made entirely by the CPU, but increasingly they are being scattered throughout the design. “On one hand, within the memory subsystem it’s ultimately the CPU, which gets handled by the memory controller, that is going to be putting the PHY in and out of those power states,” Ferro explained. “Those decisions do come at the system level. On the other hand, at the subsystem level, there are things that we can do to help facilitate the speed of those transitions.”

Further, Loren Shalinsky, strategic development director at Rambus, pointed out that the DRAM and the PHY both have to support these various power modes. The controller, or the CPU acting through the controller, then has to actually drive out the commands that put the PHY into the power mode and pass it on to the DRAM. The hardware has to support each mode, and then something has to control when it is entered.

“Depending on the interface that’s between the PHY and the DRAM itself, some of the interfaces are more efficient in terms of power — either in switching or in the amount of power that gets used for the signal in itself,” Shalinsky said. “If you look at some of our proprietary interfaces, they do really try to tackle that aspect of having a more efficient signaling technology. But if you are looking at industry standard PHYs where you’re connecting the PHY to just an industry standard DRAM, then you really don’t have any tricks up your sleeve. You’ve got to adhere to the standard, like the rest of the subsystem.”

In addition, Marc Greenberg, director of product marketing for DDR Controller IP at Synopsys, said that because memory can be 25% of the total power budget, using the memory effectively — both for performance and power — becomes a very interesting topic. “How do you address it properly so that you’re getting the performance that you need out of it but also not dialing up the power too high?” On the DRAM side specifically, Synopsys has techniques within its Platform Architect and DDR Explorer tools that allow for things like mapping logical to physical addresses in an optimal way that gets the performance out of the device. They also allow analysis of different mappings for different workloads, to get the best performance out of the DRAM device without thrashing it so much that power climbs too high.
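A toy model shows why logical-to-physical address mapping matters. The sketch below is purely illustrative — the field widths are made up, and it is not how the Synopsys tools work internally — but it captures the trade: placing the bank bits below the row bits lets sequential addresses interleave across banks, so streaming traffic does not keep closing and reopening DRAM rows (each activation costs energy).

```python
# Hypothetical logical-to-physical DRAM address mapping. Field widths
# are illustrative, not those of any specific DRAM part.

COL_BITS, BANK_BITS, ROW_BITS = 10, 3, 15

def map_address(addr):
    """Split a linear address into (row, bank, column)."""
    col = addr & ((1 << COL_BITS) - 1)
    bank = (addr >> COL_BITS) & ((1 << BANK_BITS) - 1)
    row = (addr >> (COL_BITS + BANK_BITS)) & ((1 << ROW_BITS) - 1)
    return row, bank, col

def count_row_switches(addresses):
    """Count row activations: a rough energy proxy for a mapping/workload pair."""
    open_rows = {}  # bank -> currently open row
    switches = 0
    for a in addresses:
        row, bank, _ = map_address(a)
        if open_rows.get(bank) != row:
            switches += 1       # row activation: costs time and energy
            open_rows[bank] = row
    return switches
```

Running a streaming workload (sequential cache-line addresses) through `count_row_switches` with different bit layouts is the same kind of mapping-versus-workload analysis the article describes, reduced to a counting exercise.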

“Also, as we are looking at it in the same way, we’ll look at how much work is getting done over unit time and be able to optimize the clock frequency so that we’re not clocking faster than we really need to, and trying to keep the power down that way. Both of those techniques are done using system level models available under those tools that allow people to do this analysis and optimization to help find the right combination of settings for the DDR subsystem that will get the best use of the DRAM that’s out there,” he said.

Optimizing energy
From an overarching design perspective, the main objective is to optimize energy rather than power. The reason, according to Bernard Murphy, CTO at Atrenta, is that it provides more flexibility to increase performance for some of the time, as long as integrated power over time is within budget.

“The most basic architectural method is called ‘Run Fast Then Stop,’” Murphy said. “You do some high-performance computing for a short time, then shut off. If time in the run-state is significantly less than time in the off-state, integrated energy can be much lower than computing at lower performance for a longer time. ARM’s big.LITTLE is a more refined version of this approach. You use a high-power CPU for the short period you need high power, then gate that CPU when you don’t need that high-speed, and switch to a lower power CPU.”
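The comparison Murphy describes is just power integrated over time. The numbers below are purely illustrative (not measured figures for any real core), but they show how a short burst on a high-power core can beat a long stretch on a lower-power one:

```python
# "Run fast then stop" compares integrated energy, not instantaneous
# power. All numbers are illustrative, not real core figures.

def energy_mj(power_mw, time_s):
    return power_mw * time_s  # mW x s = mJ

WORK = 1000.0  # abstract work units to complete

# Fast core: 800 mW at 1000 work/s, then power-gated to ~0 mW when done.
fast_energy = energy_mj(800.0, WORK / 1000.0)

# Slow core: 150 mW sustained at 125 work/s for the whole job.
slow_energy = energy_mj(150.0, WORK / 125.0)

print(fast_energy)  # → 800.0 mJ
print(slow_energy)  # → 1200.0 mJ
```

With these (made-up) operating points, the fast core spends more than five times the power but finishes eight times sooner, so its integrated energy is a third lower — which is the whole argument for run-fast-then-stop and, in refined form, for big.LITTLE.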

A general principle here is that a lot of functions only need relatively short bursts of computation followed by longer idle periods: this would often be the case for IoT devices.

“Jim Kardach [director of integrated products at FINsix], who knows a thing or two about architectural power management from his time at Intel Corp., observed recently that devices should be designed to do work efficiently, but also to do nothing efficiently,” he said. “For example, polling architectures burn significant energy needlessly when nothing is happening, where an interrupt architecture would be more efficient (seems obvious, but this is a real problem in USB). So power-aware design of communications protocols enables high performance when needed, but still within a power budget.”
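The polling-versus-interrupt point can be made concrete with a small sketch. This uses Python threads as a stand-in for hardware — an analogy, not a hardware model: the polling loop wakes up repeatedly while nothing is happening (each wakeup is wasted energy), whereas the event wait blocks until there is actual work.

```python
import threading
import time

# Polling vs. interrupt-style waiting, modeled with threads: the poller
# wakes up many times while idle; the event waiter sleeps until signaled.

def poll_for(flag, interval=0.005):
    wakeups = 0
    while not flag["set"]:
        wakeups += 1          # wasted wakeup: nothing to do yet
        time.sleep(interval)
    return wakeups

def wait_for(event):
    event.wait()              # blocks with zero wakeups until signaled
    return 0

flag, event = {"set": False}, threading.Event()

def producer():
    time.sleep(0.05)          # "nothing happens" for a while...
    flag["set"] = True        # ...then work arrives
    event.set()

t = threading.Thread(target=producer)
t.start()
polled = poll_for(flag)       # wakes up repeatedly while idle
waited = wait_for(event)      # wakes up exactly once, when signaled
t.join()
print(polled, waited)
```

In hardware the asymmetry is the same but larger: every polling wakeup powers up logic and possibly a bus transaction, while an interrupt lets the block sit in a low-power state until the event fires.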

Clearly, more and more design teams are looking to reduce both energy and power. “Especially in the last year, in every customer meeting, people have talked about reducing energy, not just power,” said Vic Kulkarni, senior vice president and general manager of the RTL power business at Ansys-Apache. “ARM’s big.LITTLE strategy got everybody fascinated by that in terms of a run fast and stop strategy, as opposed to a continuous low power strategy. Something to consider here is the impact on localized heating on the chip. Medium speed continuous clocking will increase the substrate temperature and lead to increased leakage. A burst of computation followed by latent period could allow the chip to cool. This is one strategy being used mainly in handset applications.”

Kulkarni observed that RTL power analysis appears to have become a linchpin between the system world and the physical world so that things can be connected and brought up.

Another approach includes abstracting RTL power models in order to make OS and transaction level decisions. Here, Ansys is working with Docea Power to enable abstracting of RTL power models from Ansys tools into Docea’s Aceplorer tool. “Power policies” can be decided by the system architect for power profiling use case models at the system level.

Heterogeneous architectures
While virtual platform developer Imperas Software Limited has no position on low power architectures, its tools are being used for power estimation, to enable dynamic analysis of the impact of the complete software stack — OS, firmware, applications — on the power consumption of SoCs and systems, according to CEO Simon Davidmann. “We have seen various architectures used by our customers with power constraints. What most of these architectures have in common is heterogeneity: using the right processor for the appropriate task.”

He sees the next step in this progression of system architectures as enabling optimized sharing of processing resources. “We have seen two basic approaches. The first is an architecture approach to heterogeneous computing, such as that being developed by the Heterogeneous System Architecture (HSA) Foundation, which enables the easier programming of heterogeneous systems. The second is the use of hypervisors for controlling resource allocations on a SoC. This approach has been significantly enhanced in the last two years by the introduction of hardware virtualization instructions to the ARM and MIPS architectures, to enable hypervisors to operate with much lower performance and power overhead than previous generations of hypervisors that did not rely on the underlying hardware. The two approaches are more complementary than competitive.”

Along these lines, Krishna Balachandran, product management director at Cadence, said another common solution is pipelining. “You figure out the depth of a pipeline and how many stages you want in the pipeline because you increase the throughput. You don’t increase the performance per instruction, but you end up having to push more instructions through this pipeline. Therefore your output measured over a period of time is much higher, so that translates to performance. That’s a technique that’s been successfully used and continuing to be used in terms of eking out a power budget. The advantage of pipelining is that some of the pipeline stages are inactive, or not consuming power, so that’s a power-efficient technique by definition.”

Summing things up, Pat Sheridan, director of product marketing for virtual prototyping at Synopsys, asserted that the methodologies to do all of this at the system level already exist. “Do architecture simulation earlier in the development cycle where you can look at performance tradeoffs of an architecture. There are also things that have been added to these methods in the last year or so that provide the ability to overlay power models for components in that system model, define power states and the amount of energy that’s consumed when this simulation is resident in a component state — this is all possible to do with architecture prototyping.”