Performance Plus Lower Power

By Pallab Chatterjee
Power and performance often have been seen as something of a tradeoff. Chipmakers focus on one or the other, or they extract a little improvement in both at each new process node.

That way of thinking is changing, though. At the recent Linley processor conference, the central theme for both standalone and embedded processors was that architectures have to be optimized for both power management and performance. Historically, performance and application-code execution were the two lead design parameters. All of the processors shown now include, as a primary design constraint, a power management method and a design partitioning that supports selective block power-down.

One of the most anticipated presentations at the show came from Tilera, which presented its new architectural fabric for dramatically improved multicore processor designs. The technology features a bus and interconnect architecture for connecting tens to thousands of cores on a single die, and the new processor family is optimized for power efficiency on a performance-per-watt metric. In these designs, the number of cores is the new megahertz.

The power efficiency of Tilera’s design (up to 200Tbps on-chip with its 2-D mesh network) comes from short wires and locally optimized CV²f. Designs in the Tile-Gx family exploit locally available, distributed L1 and L2 caches and distributed memories. In addition, the use of custom OSes allows for localized power-up/down of not only the processor cores but also their unused local memory blocks. This method provides close-to-linear scaling of both performance and power consumption with the workload sent to the device.
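The CV²f argument above can be made concrete with the classic CMOS dynamic-power formula, P = C·V²·f. The sketch below uses illustrative numbers (not Tilera's actual figures) to show why shorter wires (lower C) compound with locally lowered voltage and frequency into large savings, since power scales with the square of voltage:

```python
def dynamic_power(c_farads, v_volts, f_hertz):
    """Classic CMOS dynamic switching power estimate, P = C * V^2 * f, in watts."""
    return c_farads * v_volts**2 * f_hertz

# A busy tile: 1 nF effective switched capacitance, 1.0 V, 1 GHz
busy = dynamic_power(1e-9, 1.0, 1e9)       # 1.0 W

# The same tile with shorter wires (half the switched capacitance)
# and voltage/frequency locally dialed down to 0.8 V / 500 MHz
light = dynamic_power(0.5e-9, 0.8, 0.5e9)  # ~0.16 W

print(f"busy: {busy:.2f} W, light: {light:.2f} W")
```

Halving capacitance and halving frequency each cut power linearly, while the 20% voltage drop cuts it quadratically, which is why per-tile voltage and frequency control pays off.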

Applied Micro presented its PACKETProc processor family, a high-speed network processor simultaneously optimized for security, concurrency, availability, power management and deterministic behavior. To maintain security in all states, the processor features distributed cores and localized state-machine functions. Power management includes standby power modes, a controller for the recently ratified IEEE 802.3az-2010 Energy Efficient Ethernet standard, dynamic frequency scaling with individual control over each core in the design, and smart I/O that supports “wake on LAN” and low-power polling/support for WoX, USB and GPIO. The architecture scales from one to many embedded cores on a single SoC.
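Per-core dynamic frequency scaling of the kind described above can be sketched as a simple policy loop. This is a hypothetical illustration, not Applied Micro's implementation; the frequency steps and headroom margin are assumptions:

```python
STANDBY_HZ = 100e6
FREQ_STEPS_HZ = [400e6, 800e6, 1.2e9, 1.6e9]  # assumed per-core P-states

def select_frequency(utilization):
    """Pick the lowest frequency step that covers a core's load (0.0-1.0,
    expressed as a fraction of capacity at the maximum clock)."""
    if utilization == 0.0:
        return STANDBY_HZ  # park the idle core at a standby clock
    for f in FREQ_STEPS_HZ:
        # Run at the slowest clock that still leaves ~20% headroom
        if utilization * FREQ_STEPS_HZ[-1] <= f * 0.8:
            return f
    return FREQ_STEPS_HZ[-1]

# Each core is scaled on its own, independent of its neighbors
core_loads = [0.0, 0.15, 0.5, 0.95]
core_freqs = [select_frequency(u) for u in core_loads]
```

The point of individual control is visible in the result: the idle core drops to standby while the heavily loaded core stays at full clock, rather than one global frequency serving the worst case.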

Netronome clarified the new paradigm for application software and hardware processing of data traffic. As traffic increases due to the prevalence of video and mobile data, peer-to-peer will no longer be the most voluminous data source. This high-volume traffic (video is based on sustained packet flow, not single-burst point-to-point data passing) is driving the server community from its base 10G infrastructure to 40G and 100G. These higher-bandwidth systems are based on “flows” rather than “packets.”

A flow is defined as a unidirectional sequence of packets that all share a set of common packet-header values. Those common criteria are generally grouped as 2-tuple, 3-tuple, 5-tuple, 7-tuple and 10-tuple keys; the 10-tuple form is the basis of the OpenFlow specification. These flow-based processors require a different power-management methodology because the cache-flush cycles, and hence the power-down cycles, differ from those of packet-based processing.
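The n-tuple keys described above can be sketched as follows. The 2-tuple and classic 5-tuple are shown; the packet field names are illustrative, not from any vendor's API:

```python
from collections import namedtuple

# Minimal stand-in for a parsed packet header (illustrative fields)
Packet = namedtuple("Packet", "src_ip dst_ip protocol src_port dst_port")

def flow_key_2(pkt):
    """2-tuple key: source and destination IP."""
    return (pkt.src_ip, pkt.dst_ip)

def flow_key_5(pkt):
    """Classic 5-tuple key: adds protocol and source/destination ports."""
    return (pkt.src_ip, pkt.dst_ip, pkt.protocol, pkt.src_port, pkt.dst_port)

p1 = Packet("10.0.0.1", "10.0.0.2", "TCP", 49152, 443)
p2 = Packet("10.0.0.1", "10.0.0.2", "TCP", 49152, 443)

# Packets with identical header values belong to the same flow
assert flow_key_5(p1) == flow_key_5(p2)
```

The wider 7- and 10-tuple forms extend the same idea with further header fields, narrowing what counts as "the same flow."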

In a flow-based design, the data is not split between a general-purpose CPU and its cache memory. Instead, it is distributed over the group of cores and caches via a load balancer. This removes the memory-latency issues and stalls caused by cache and CPU misses when cores are powered down between packets. Flow processors are aware of upcoming data strings due to the header commonality, and adjust power management accordingly to minimize memory and data latency. This method also allows for multichip threading in addition to in-die multithreading.
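A minimal sketch of flow-aware load balancing of this kind: hashing the flow key pins every packet of a flow to the same core, so that core's cache stays warm for the flow while unneeded cores can be powered down. The core count and key shape here are illustrative assumptions, not any vendor's design:

```python
NUM_CORES = 8  # assumed core count for illustration

def core_for_flow(flow_key, num_cores=NUM_CORES):
    """Map a flow key (an n-tuple of header values) to one core.
    A stable hash keeps per-flow state and cached data on one core."""
    return hash(flow_key) % num_cores

# Example 5-tuple flow key
flow = ("10.0.0.1", "10.0.0.2", "TCP", 49152, 443)

# Every packet carrying the same header tuple lands on the same core
first = core_for_flow(flow)
assert all(core_for_flow(flow) == first for _ in range(100))
```

Because the mapping is deterministic per flow, the balancer never scatters one flow's sustained packet stream across cores, which is what avoids the cache-miss stalls described above.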