big.LITTLE Processing

ARM big.LITTLE™ processing is an energy saving technology where the highest performance ARM CPUs are combined with the most efficient ARM CPUs in a combined processor subsystem to deliver greater performance at lower power than today's best-in-class systems. With big.LITTLE processing, software workloads are dynamically and instantly transitioned to the appropriate CPU based on performance needs. This software load balancing is so fast that it is completely seamless to the user. By selecting the optimum processor for each task, big.LITTLE can reduce energy consumption in the processor by 70% or more on light workloads and background tasks, and by 50% for moderately intense work, while still delivering the peak performance of the high performance cores.

Background

The performance demanded of current smartphones and tablets is increasing at a much faster rate than the capacity of batteries or the power savings from semiconductor process advances. At the same time, users are demanding longer battery life within roughly the same form factor. This conflicting set of demands requires innovations in mobile SoC design beyond what process technology and traditional power management techniques can deliver.

The usage pattern for smartphones and tablets is dynamic: periods of high processing intensity tasks, such as gaming and web browsing, alternate with typically longer periods of low processing intensity tasks, such as texting, e-mail and audio.

Innovative power savings techniques are required to sustain the dramatic pace of performance increases in mobile platforms while preserving and increasing the power efficiency and battery life.

big.LITTLE Processing

ARM big.LITTLE processing is designed to deliver the vision of the right processor for the right job. In current big.LITTLE system implementations, a ‘big’ ARM Cortex™-A15 processor is paired with a ‘LITTLE’ Cortex™-A7 processor to create a system that can accomplish both high intensity and low intensity tasks in the most energy efficient manner. For example, the performance capabilities of the Cortex-A15 processor can be utilized for heavy workloads, while the Cortex-A7 can take over to process the majority of smartphone workloads most efficiently. These include operating system activities, user interface and other always-on, always-connected tasks.

By coherently connecting the Cortex-A15 and Cortex-A7 processors via the CoreLink™ CCI-400 coherent interconnect, the system is flexible enough to support a variety of big.LITTLE use models, which can be tailored to the processing requirements of the tasks.

The central tenet of big.LITTLE is that the processors are architecturally identical. Both Cortex-A15 and Cortex-A7 implement the full ARMv7-A architecture, including the Virtualization and Large Physical Address Extensions. Accordingly, all instructions will execute in an architecturally consistent way on both Cortex-A15 and Cortex-A7, albeit with different performance. The implementation-defined feature set of Cortex-A15 and Cortex-A7 is also similar. Both processors can be configured to have between one and four cores, and both integrate a level-2 cache inside the processing cluster. Additionally, each processor implements a single AMBA® 4 coherent interface that can be connected to a coherent interconnect such as CoreLink CCI-400.

Future Implementations

In a similar fashion, the ARMv8 architecture-based Cortex-A53 and Cortex-A57 processors can also be implemented in a big.LITTLE configuration. In this case, the processors will be connected by the CoreLink CCN-504 coherent interconnect, which enables a fully coherent, high-performance many-core solution that supports up to 16 cores on the same silicon die.

Real World Performance Metrics

Energy savings of 50 percent for moderately intense workloads like web browsing, and savings of up to 70 percent for background workloads like mp3 audio playback have been measured. These measurements compare the average power consumption of a big.LITTLE system with a system with only the big processor, under full DVFS power management and core idle policies in each case.
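
The comparison described above can be expressed as a simple ratio of average power. The following Python sketch is illustrative only: the function names average_power and energy_saving are hypothetical, and any sample values passed to them are placeholders, not measurements from ARM or partner silicon.

```python
from statistics import fmean


def average_power(samples):
    """Return the mean of a sequence of power samples (any consistent unit)."""
    if not samples:
        raise ValueError("at least one power sample is required")
    return fmean(samples)


def energy_saving(big_little_avg, big_only_avg):
    """Fractional saving of a big.LITTLE system relative to a big-only baseline.

    Over the same workload and duration, the energy ratio equals the
    ratio of average power, so saving = 1 - P_bigLITTLE / P_big_only.
    """
    if big_only_avg <= 0:
        raise ValueError("baseline average power must be positive")
    return 1.0 - big_little_avg / big_only_avg
```

For example, energy_saving(average_power([48.0, 52.0]), 100.0) evaluates the saving for a hypothetical run (arbitrary units) whose big.LITTLE samples average half of the big-only baseline. Both systems are assumed to run the same workload for the same duration, with DVFS and idle policies active in each case, as the text describes.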

These results were initially measured on test silicon and have recently been replicated on partner silicon across a range of typical mobile workloads. The software changes to take advantage of big.LITTLE are typically done in the OS kernel scheduler and are completely transparent for the application running on that OS.

Hardware Requirements

For big.LITTLE processing to be invisible to software and fast enough to migrate execution opportunistically to the right-sized core, the big and LITTLE processors being paired must be fully architecturally compatible: they must run all the same instructions and support the same extensions, such as virtualization and large physical addressing.

The first such pairing is between the Cortex-A15 and the Cortex-A7 processors, where the big cluster and the LITTLE cluster can each contain one to four CPUs, enabling eight-core big.LITTLE designs, smart quad-core designs with two of each processor type, or an asymmetric mix such as four LITTLE cores and two big cores.
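
The range of cluster configurations described above can be enumerated with a short Python sketch. The class BigLittleConfig and the helper all_configs are hypothetical names introduced for illustration; the only constraint taken from the text is that each cluster holds between one and four CPUs.

```python
from dataclasses import dataclass

# Per the text: each cluster (big or LITTLE) contains one to four CPUs.
MIN_CORES = 1
MAX_CORES = 4


@dataclass(frozen=True)
class BigLittleConfig:
    big_cores: int
    little_cores: int

    def __post_init__(self):
        # Reject core counts outside the one-to-four range for either cluster.
        for name, count in (("big", self.big_cores), ("LITTLE", self.little_cores)):
            if not MIN_CORES <= count <= MAX_CORES:
                raise ValueError(
                    f"{name} cluster must have {MIN_CORES}-{MAX_CORES} cores, got {count}"
                )

    @property
    def total_cores(self):
        return self.big_cores + self.little_cores


def all_configs():
    """Enumerate every valid (big, LITTLE) core-count combination."""
    counts = range(MIN_CORES, MAX_CORES + 1)
    return [BigLittleConfig(b, l) for b in counts for l in counts]
```

Under these assumptions, BigLittleConfig(4, 4) corresponds to the eight-core design, BigLittleConfig(2, 2) to the quad-core design, and BigLittleConfig(2, 4) to the asymmetric mix of two big and four LITTLE cores mentioned in the text.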

big.LITTLE System Diagram

Both the Cortex-A15 and Cortex-A7 processors are available to partners now and are in production separately, with the first big.LITTLE silicon now being demonstrated by lead licensees. The second big.LITTLE pairing is between the Cortex-A57 and the Cortex-A53 processors, successors to the Cortex-A15 and Cortex-A7 processors respectively. These cores, announced in 2012, will be available to ARM lead licensees in mid-2013, and can be combined over ARM CoreLink™ CCI-400 or another cache coherent interconnect in the same way. They both increase performance while retaining the same power efficiency as their predecessors, and both introduce 64-bit support via the ARMv8 architecture, in addition to full backwards compatibility with the 32-bit ARMv7 architecture, including the virtualization and large addressability extensions of the latest version of ARMv7.

Future ARM cores will also be capable of combining with these first four in big.LITTLE processor SoCs.

Software

Software can control the allocation of threads of execution to the appropriate core or, in some versions of the software, simply move the whole processor context up to big or down to LITTLE based on measured load. There are two software approaches to handling the CPU selection decision, described below. In both approaches, cache coherence is required to enable the software to quickly move execution from LITTLE to big and from big to LITTLE as appropriate. Cache coherence allows one CPU cluster to look up data in the caches of the other CPU cluster, and full hardware cache coherence between the two clusters is key to making big.LITTLE software fast and transparent. Cache coherence can be provided by the ARM CCI-400 cache coherent interconnect or any interconnect that follows the AMBA 4 ACE protocol.

In a big.LITTLE SoC, the OS kernel dynamically and seamlessly moves tasks between the 'big' and 'LITTLE' CPUs. In reality this is an extension of the operating system power management software in wide use today on mobile phone SoCs.

Most OS kernels already support Symmetric Multiprocessing (SMP), and those techniques can easily be extended to support big.LITTLE systems. There are two main variants of big.LITTLE software scheduling.

big.LITTLE CPU Migration
In CPU migration, the whole workload of a CPU is moved to a different CPU once the OS detects that it requires more or less performance. This builds on generic techniques in an OS to wake up and put to sleep CPUs in an SMP system. The key extension is the detection that a CPU is running at maximum frequency while still requesting further performance, and thus that the workload needs to be moved to a ‘bigger’ CPU. Once the workload has reduced, it can be moved back to a ‘smaller’ CPU.
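
The migration decision described above can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not the Linaro implementation: the CpuState type, the next_cluster function and both threshold values are hypothetical, and a real kernel would act on frequency-governor requests rather than a single load figure.

```python
from dataclasses import dataclass

# Hypothetical thresholds, chosen only for illustration.
UP_THRESHOLD = 0.9    # load at which more performance is still being requested
DOWN_THRESHOLD = 0.3  # load below which the workload counts as reduced


@dataclass
class CpuState:
    cluster: str       # "LITTLE" or "big"
    freq_mhz: int      # current operating frequency
    max_freq_mhz: int  # highest frequency this CPU supports
    load: float        # utilization at the current frequency, 0.0 to 1.0


def next_cluster(state):
    """Decide which cluster should run this CPU's whole workload next."""
    at_max_freq = state.freq_mhz >= state.max_freq_mhz
    # Migrate up only when the LITTLE CPU is already at its maximum
    # frequency and still wants more performance.
    if state.cluster == "LITTLE" and at_max_freq and state.load >= UP_THRESHOLD:
        return "big"
    # Migrate back down once the workload has reduced.
    if state.cluster == "big" and state.load <= DOWN_THRESHOLD:
        return "LITTLE"
    return state.cluster
```

In this sketch, a heavily loaded LITTLE CPU that has not yet reached its maximum frequency stays where it is, on the assumption that frequency scaling is tried before migration.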

This CPU migration software is available today from Linaro, and is being actively developed by multiple ARM partners.

big.LITTLE MP
Task migration (also known as big.LITTLE MP) detects a high intensity task and schedules it onto a ‘big’ CPU. Similarly, it detects a low intensity task and moves it back to a ‘LITTLE’ CPU.

The advantage of task migration over CPU migration is that a system can benefit from all its CPUs at the same time if the processing demands are extremely high. For example, in a 2x ‘big’ + 2x ‘LITTLE’ system all 4 CPUs can be used at peak demand times, whereas CPU migration would only be able to use 2 CPUs.
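
The difference between the two models can be illustrated with a small Python sketch. The names place_tasks and usable_cpus, the 0.5 intensity threshold, and the idea of a single per-task load figure are all assumptions made for illustration. The CPU migration count additionally assumes that each big CPU is paired with one LITTLE CPU and that only one CPU of a pair runs at a time, which reproduces the 2+2 example in the text.

```python
def place_tasks(task_loads, threshold=0.5):
    """Assign each task to a cluster by intensity (task migration model).

    task_loads maps a task name to a load estimate between 0.0 and 1.0.
    Tasks at or above the threshold go to 'big', the rest to 'LITTLE'.
    """
    placement = {"big": [], "LITTLE": []}
    for name, load in task_loads.items():
        cluster = "big" if load >= threshold else "LITTLE"
        placement[cluster].append(name)
    return placement


def usable_cpus(big, little, model):
    """Number of CPUs that can run simultaneously at peak demand."""
    if model == "task_migration":
        # big.LITTLE MP: every CPU in both clusters is available at once.
        return big + little
    if model == "cpu_migration":
        # Assumption: CPUs form big/LITTLE pairs, one active per pair.
        return min(big, little)
    raise ValueError(f"unknown model: {model}")
```

For the 2x ‘big’ + 2x ‘LITTLE’ system in the text, usable_cpus(2, 2, "task_migration") counts all four CPUs, while usable_cpus(2, 2, "cpu_migration") counts two.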

ARM and Linaro have been developing Linux support for both migration models.

Related Technology

CoreLink Cache Coherent Interconnect (CCI-400)

The ARM CoreLink™ CCI-400 Cache Coherent Interconnect provides full cache coherency between two clusters of multi-core CPUs, such as the ARM Cortex-A15 and Cortex-A7 processors, enabling big.LITTLE.

The CoreLink CCI-400 enables system coherency in heterogeneous multicore and multi-cluster CPU/GPU systems, such as those required for the networking and high-performance computation markets, by enabling each processor in the system to access the other processor caches. This reduces the need to access off-chip memory, saving time and energy, which is a key enabler in systems based on ARM big.LITTLE™ processing.

CoreLink CCN-504 can deliver up to one Terabit of usable system bandwidth per second. It will enable designers to provide high-performance, cache coherent interconnect for ‘many-core’ enterprise solutions built using the ARM Cortex-A15 MPCore processor and the latest ARM Cortex-A50 series processors with 64-bit support.

ARM Development Studio 5 (DS-5)

The ARM Development Studio 5 (DS-5™) toolchain is a suite of professional software development tools for ARM processors that extends its capabilities to performance analysis and debug of big.LITTLE systems.

ARM Fast Models

ARM Fast Models provide the necessary models for constructing virtual platforms of ARM big.LITTLE processing-based systems along with templates of popular configurations. Customization of model content and configuration of items such as memory map and interrupt map, and the ability to export the platform to SystemC/TLM environments are supported.

Fast Models are available for the Cortex-A15 and Cortex-A7 processors and the CoreLink CCI-400.