The Future of Mobile CPUs

In the first part of our series, we explored the major trends that will influence the mobile system-on-a-chip (SoC) market over the next five to ten years. That sets the backdrop for looking at the architecture of future SoCs and the specific players within this market, both critical IP vendors and the actual SoC vendors. For the most part, this article focuses on mid-range to high-end devices rather than the lowest-end smartphones and tablets, which means that some SoC vendors have been omitted for the sake of clarity and brevity.

SoCs today

The vast majority of smartphones today use single- and dual-core SoCs, with a smattering of quad-cores at the very high end. The same is mostly true of tablets, although the larger power budget means that the processors tend to skew toward higher core counts. The CPU cores are clocked at around 1GHz, and the more advanced ones feature out-of-order execution and modest superscalar issue, typically two to three RISC instructions per cycle at peak. Simpler cores for more power-constrained systems tend to be in-order and issue one to two instructions per cycle. This level of complexity is generally on par with the CPU cores found in the early to mid 1990s.

Realistically, it is hard to see much benefit from quad-cores in mobile devices. The majority of PCs today ship with dual-core CPUs, and that is a reflection of the state of software: multithreading is hard, and most applications are single-threaded. Software for mobile devices is even more primitive and less amenable to threading. Comparing a quad-core to a dual-core at the same power budget, the dual-core should be able to reach about 25 percent higher frequencies, since power scales roughly with the cube of frequency. For the vast majority of workloads, the faster dual-core will deliver better performance. Despite this, there appears to be some marketing value in quad-core SoCs, even if the delivered value is minimal.
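The cube-law arithmetic behind that 25 percent figure can be sketched as follows. This is an illustrative back-of-the-envelope calculation, not vendor data, and it assumes the rough rule of thumb that supply voltage scales linearly with frequency:

```python
# Back-of-the-envelope cube-law arithmetic (illustrative only).
# Dynamic power goes roughly as P ~ C * f * V^2, and in this regime
# supply voltage V scales roughly linearly with f, so P ~ f^3.

def max_freq_ratio(per_core_power_ratio: float) -> float:
    """Frequency headroom when each core's power budget grows by the given ratio."""
    # f scales with the cube root of per-core power.
    return per_core_power_ratio ** (1 / 3)

# Dual-core vs. quad-core at the same total power: each of the two cores
# gets twice the per-core power of the four cores.
print(f"{max_freq_ratio(4 / 2):.2f}x")  # -> 1.26x, i.e. ~25% higher frequency
```

The cube root of two is about 1.26, which is where the article's "about 25 percent" comes from.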

One reflection of the divergence between smartphones and tablets is graphics. Tablets have higher-performance graphics to drive their larger, higher-resolution displays and to make use of the greater power envelope. The actual GPU cores are usually the same, but tablets use more of them at higher frequencies. Comparing the iPhone 5 and iPad 4, the latter's GPU has about 3X the shader throughput measured in FLOP/s (~100 vs. ~30 GFLOP/s). In terms of performance, the iPhone 5 is roughly equivalent to a very low-end discrete DX10 GPU from 2007, while the iPad 4 resembles a mid-range model.

The other significant blocks in a mobile SoC are the wireless modem, which is often discrete for high-end phones and tablets (i.e., LTE devices), along with dedicated hardware for video encode/decode and image processing for the camera.

Power management ties together all these blocks and is particularly vital, since performance is limited by both the battery life and skin temperature (i.e., how hot the case gets). Simply put, there isn’t enough power or cooling for every block to be in a high-performance mode simultaneously. For example, when running a strenuous game, the display and GPU will draw much of the power; the CPU will actually have to reduce frequency and voltage to deliver the best overall performance. This becomes even more complex if there is significant wireless traffic as well.

SoCs of the future

Looking out 5-10 years, Moore’s Law means that transistors will become even cheaper. However, battery technology improves slowly, and the maximum skin temperature is constant. Consequently, power will be even more of a limiting factor in the future than it is today, so techniques that spend transistors (or area) to reduce power will become increasingly attractive.

While change is slow, mobile developers will eventually be able to take advantage of multiple cores. At that point, quad-cores can be more efficient than dual-cores by running at reduced frequency and voltage, as the PC industry has shown. Most workloads will still be single-threaded and need high frequencies, so the SoC must be able to deliver both aggregate throughput and single-core performance efficiently. Eventually, almost all mobile SoCs will move to quad-core to handle the few cases of properly parallelized code.

The CPU cores will also become more sophisticated, improving single-core performance through both frequency and instructions per cycle (IPC). However, this evolution will be slow and steady, because CPU performance is non-linearly expensive (in terms of both area and power) beyond a certain point, and many workloads simply cannot reach high IPC because of the nature of the code. One way the industry has looked to get around this issue is heterogeneous cores, which ARM bills as “big.LITTLE.” This approach pairs a small, efficient core with a larger, more complex core and switches between the two. The challenge again is power: the big core can only be active one to five percent of the time, which limits the potential performance gains, and the switching penalty is an issue. There seems to be some initial interest in this solution, but it is unclear whether it will be a long-term answer for most vendors.

Graphics are an entirely different story because the workload is inherently data-parallel. While there are limits, desktop GPUs have shown that performance scales nicely up to at least 1-4 TFLOP/s if memory bandwidth increases commensurately (to roughly 200-250 GB/s). To a large extent, this performance will be used to deliver higher-quality graphics for 3D applications or better energy efficiency. Display resolutions may increase, but at a relatively slow pace considering today’s high-density displays and the slow rate of change for TVs and other external displays. Given the benefits of Moore’s Law, this means that GPUs will consume more and more die area, while keeping frequency and voltage relatively low to improve performance and energy efficiency. This is also one of the greatest motivators for any form of memory integration (whether in-package, 2.5D, or 3D), as there is simply no other way to provide enough bandwidth given the power constraints.
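The bandwidth-to-compute ratios implied by those round numbers can be checked directly. This is a sketch using the illustrative figures quoted above (1-4 TFLOP/s, 200-250 GB/s), not measured GPU data:

```python
# Bytes of memory bandwidth available per FLOP at the scaling points
# quoted in the text (illustrative arithmetic only).
scaling_points = [
    (1_000, 200),   # 1 TFLOP/s (expressed in GFLOP/s), 200 GB/s
    (4_000, 250),   # 4 TFLOP/s, 250 GB/s
]

for gflops, gbs in scaling_points:
    ratio = gbs / gflops  # GB/s divided by GFLOP/s = bytes per FLOP
    print(f"{gflops / 1000:.0f} TFLOP/s at {gbs} GB/s -> {ratio:.2f} bytes/FLOP")
```

The ratio falls from about 0.20 to about 0.06 bytes per FLOP across that range, which illustrates why the article points to in-package, 2.5D, or 3D memory integration: feeding the shaders from conventional external DRAM becomes the harder half of the problem under mobile power constraints.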

Image signal processors (ISPs) are exquisitely parallel, just like GPUs, but their main driver is enhancing still and video images. Current cameras are limited more by the low quality and compact physical dimensions of mobile optical lenses than by sensor resolution. As a result, ISP performance will grow slowly, motivated by more sophisticated filters rather than higher resolutions. However, an array camera could work around the limits of a single lens and motivate much more robust ISPs in the future.

The video-encoding and -decoding blocks are typically fixed function and will be upgraded to take advantage of the emerging High Efficiency Video Coding (HEVC) standard.

Over this time frame, the wireless landscape will be relatively stable. The industry is currently undergoing the transition to LTE, although the various 2G and 3G protocols will be crucial for backwards compatibility in areas with spotty coverage. LTE will certainly progress to higher speeds, but there is no replacement on the horizon for the next ten years or so. Some high-end phones, and most tablets, may continue to use discrete LTE modems for performance and flexibility, especially for vendors without internal wireless expertise. However, most smartphones will integrate the various modems into the SoC, reducing cost and power.

Of course, these guidelines are not absolute, and SoCs will vary to cover the full range of the market. Devices like the Kindle e-reader hardly need a lot of graphics performance, and budget devices may continue to use single or dual cores for many years.

Licensed CPUs

The most pervasive mobile IP company is unquestionably ARM. ARM is particularly well-known for licensing the eponymous instruction set (e.g., ARMv7 and v8), the Cortex cores (e.g., A7) that implement it, and other SoC components such as the AMBA interconnect. Nearly every company in the mobile ecosystem is an ARM customer in one fashion or another.

One big trend we mentioned earlier that impacts ARM is the shift toward vertically integrating IP. Today, ARM has a large number of customers that license the Cortex A-series for mobile devices, including Broadcom, Mediatek, Nvidia, Texas Instruments, and Samsung. In contrast, the larger SoC vendors such as Apple and Qualcomm prefer to license the instruction set and design their own CPU cores. The latter approach requires more engineering talent, but ultimately costs less in terms of royalties; essentially it is a trade-off between fixed and variable costs.

Long-term, companies with sufficient volume will shift from licensing CPU cores to licensing the ISA and designing the cores. ARM’s cores are by necessity somewhat generic, since they must be attractive to all customers and compatible at all the major foundries (TSMC, GlobalFoundries, UMC, and Samsung). In addition to cost advantages, a custom core can be carefully optimized for the target applications and the underlying manufacturing.

Another issue is the divergence between tablets and smartphones. It is very hard to design a single CPU core that is optimal across radically different power limits, and at some point the tablet market may grow large enough to merit a separately optimized design. The sweet spot for tablet SoCs is around 2-6W, versus 0.5-1.5W for a smartphone. It may prove more efficient to have two different cores spanning the full 0.5-6W range rather than stretching a single design.