The Future of Mobile CPUs

In the first part of our series, we explored the major trends that will influence the mobile system-on-a-chip (SoC) market over the next five to ten years. That sets the backdrop for looking at the architecture of future SoCs and the specific players within this market, both critical IP players and the actual SoC vendors. For the most part, this focuses on mid-range to high-end devices rather than the lowest-end smartphones and tablets, which means that some SoC vendors have been omitted for the sake of clarity and brevity.

SoCs today

The vast majority of smartphones today use single- or dual-core SoCs. At the very high end, there is a smattering of quad-cores. The same is mostly true of tablets, although the larger power budget means that the processors tend to skew toward higher core counts. The CPU cores are clocked at around 1GHz, and the more advanced ones feature out-of-order execution and modest superscalar issue, typically two to three RISC instructions per cycle at peak. Simpler cores for more power-constrained systems tend to be in-order and issue one to two instructions per cycle. This level of complexity is generally on par with the CPU cores found in the early to mid 1990s.

Realistically, it is hard to see any benefits from quad-cores in mobile devices. The majority of PCs today sell with dual-core CPUs, and that is a reflection of the state of software; multithreading is hard and most applications are single threaded. Software for mobile devices is even more primitive and less amenable to threading. Comparing a quad-core to a dual-core at the same power, the dual-core should be able to reach about 25 percent higher frequencies (power scales roughly with frequency cubed). For the vast majority of workloads, a faster dual-core CPU will have better performance. Despite this fact, there appears to be some marketing value for quad-core SoCs, even if the delivered value is minimal.
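The 25 percent figure falls out of the cube-law approximation. A quick back-of-envelope sketch, assuming power scales as frequency cubed under dynamic voltage and frequency scaling (an idealization; real power curves are messier):

```python
# Back-of-envelope check of the dual- vs. quad-core trade-off.
# Assumes DVFS, where voltage tracks frequency, so core power ~ f^3.

def freq_at_power(power_per_core, k=1.0):
    """Frequency a core can sustain at a given per-core power budget."""
    return (power_per_core / k) ** (1.0 / 3.0)

total_power = 4.0                              # arbitrary units
quad_freq = freq_at_power(total_power / 4)     # four cores share the budget
dual_freq = freq_at_power(total_power / 2)     # two cores share the same budget

print(f"dual-core frequency advantage: {dual_freq / quad_freq - 1:.0%}")
# 2^(1/3) - 1 is about 26%, matching the article's ~25 percent figure
```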

One reflection of the divergence of smartphones and tablets is the graphics for these devices. Tablets have higher-performance graphics to drive the larger and higher-resolution displays and to make use of the greater power envelope. The actual GPU cores are usually the same, but with more cores and higher frequency for tablets. Looking at the iPhone 5 and iPad 4, the latter GPU has about 3X the shader throughput measured in FLOP/s (~100 vs. ~30 GFLOP/s). In terms of performance, the iPhone 5 is roughly the equivalent of a very low-end discrete DX10 GPU from 2007, while the iPad 4 resembles a mid-range model.
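The throughput gap can be sanity-checked with peak-FLOP arithmetic: shader lanes times clock times two operations per cycle (for a fused multiply-add). The lane counts and clocks below are illustrative assumptions, not figures from the article:

```python
def gflops(alu_lanes, clock_mhz, flops_per_lane_per_cycle=2):
    """Peak shader throughput: lanes x clock x FLOPs/cycle (2 for an FMA)."""
    return alu_lanes * clock_mhz * flops_per_lane_per_cycle / 1e3

# Illustrative configurations only; real ALU counts and clocks are assumptions.
print(gflops(32, 500))   # ~32 GFLOP/s, the ballpark cited for the iPhone 5
print(gflops(128, 400))  # ~102 GFLOP/s, the ballpark cited for the iPad 4
```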

The other significant blocks in a mobile SoC are the wireless modem, which is often discrete for high-end phones and tablets (i.e., LTE devices), along with dedicated hardware for video encode/decode and image processing for the camera.

Power management ties together all these blocks and is particularly vital, since performance is limited by both the battery life and skin temperature (i.e., how hot the case gets). Simply put, there isn’t enough power or cooling for every block to be in a high-performance mode simultaneously. For example, when running a strenuous game, the display and GPU will draw much of the power; the CPU will actually have to reduce frequency and voltage to deliver the best overall performance. This becomes even more complex if there is significant wireless traffic as well.
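The arbitration described above can be sketched as a toy power governor that scales every block's request down to fit a shared SoC-level cap. The block names and wattages are illustrative assumptions, not measurements of any real device:

```python
# Toy power-budget arbiter: blocks request power; if the total exceeds the
# SoC cap, every grant is scaled down proportionally. Real governors use
# priorities and thermal feedback; this only illustrates the shared cap.

def arbitrate(requests, budget_w):
    """Scale per-block power requests to fit a total budget (in watts)."""
    total = sum(requests.values())
    if total <= budget_w:
        return dict(requests)
    scale = budget_w / total
    return {block: watts * scale for block, watts in requests.items()}

# During a strenuous game, display and GPU dominate, so the CPU is scaled back.
grants = arbitrate({"display": 1.5, "gpu": 1.5, "cpu": 1.0, "modem": 0.5},
                   budget_w=3.0)
print(grants)
```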

SoCs of the future

Looking out 5-10 years, Moore’s Law means that transistors will be even cheaper. However, battery technology improves slowly and the maximum skin temperature is constant. Consequently, power will be even more of a limiting factor in the future than it is today, so techniques that spend transistors (or area) to reduce power will be increasingly attractive.

While change is slow, eventually mobile developers will be able to take advantage of multiple cores. At this point, quad-cores can be more efficient by reducing frequency and voltage, as the PC industry has shown. Most workloads will still be single threaded and need high frequencies, so the SoC must be able to efficiently deliver both aggregate throughput and single-core performance. Eventually, almost all mobile SoCs will move to quad-core to handle the few cases of properly parallelized code.

The CPU cores will also become more sophisticated, improving single-core performance through both frequency and instructions per cycle (IPC). However, this evolution will be slow and steady because CPU performance is non-linearly expensive (in terms of both area and power) beyond a certain point. Many workloads simply cannot reach high IPC because of the nature of the code. One way the industry has looked to get around this issue is with heterogeneous cores, which ARM bills as “big.LITTLE.” This method pairs a small and efficient core with a larger and more complex core and switches between the two. The challenge again is power; the big cores can only be active one to five percent of the time, which limits the potential performance gains, and the switching penalty is an issue. Initially, there seems to be some interest in this solution, but it is unclear whether it will be a long-term approach for most vendors.
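The duty-cycle constraint suggests a simple switching policy: migrate to the big core only when the little core saturates and the big core's thermal budget allows it. A hypothetical sketch of such a policy; the thresholds are illustrative assumptions, not ARM's actual governor logic:

```python
def pick_core(load, big_duty_used, little_max_load=0.9, big_duty_budget=0.05):
    """Choose a core cluster: stay on the little core unless it is saturated
    and the big core still has duty-cycle budget left (1-5% per the article).
    Thresholds are illustrative, not from any shipping scheduler."""
    if load > little_max_load and big_duty_used < big_duty_budget:
        return "big"
    return "LITTLE"

print(pick_core(0.5, 0.01))   # light load: stay on the LITTLE core
print(pick_core(0.95, 0.01))  # saturated, budget remains: migrate to big
print(pick_core(0.95, 0.06))  # budget exhausted: stuck on LITTLE
```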

Graphics are an entirely different story because the workload is inherently data-parallel. While there are limits, desktop GPUs have shown that performance scales nicely up to at least 1-4 TFLOP/s if memory bandwidth increases commensurately (to roughly 200-250 GB/s). To a large extent, this performance will be used to deliver higher-quality graphics for 3D applications or better energy efficiency. Display resolutions may increase, but at a relatively slow pace considering today’s high-density displays and the slow rate of change for TVs and other external displays. Given the benefits of Moore’s Law, this means that GPUs will consume more and more die area, while keeping frequency and voltage relatively low to improve performance and energy efficiency. This is also one of the greatest motivators for any form of memory integration (whether in-package, 2.5D, or 3D), as there is simply no other way to provide enough bandwidth given the power constraints.
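Pairing 1-4 TFLOP/s with 200-250 GB/s implies a roughly constant compute-to-bandwidth balance, which is why bandwidth must scale commensurately. A quick check of the FLOP-per-byte ratio implied by those figures:

```python
def flops_per_byte(tflops, gbps):
    """Arithmetic intensity a GPU must sustain to stay compute-bound."""
    return tflops * 1e12 / (gbps * 1e9)

# The article's desktop-GPU scaling points: 1-4 TFLOP/s against 200-250 GB/s.
print(flops_per_byte(1, 200))   # 5 FLOP/byte at the low end
print(flops_per_byte(4, 250))   # 16 FLOP/byte at the high end
```

Workloads below that ratio stall on memory, which is why in-package, 2.5D, or 3D integration is the only way forward once external DRAM buses hit their power limits.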

Image signal processors (ISPs) are also exquisitely parallel, just like GPUs, but the main driver is enhancing still and video images. Current cameras are strongly limited by the low quality and compact physical dimensions of optical lenses in mobile devices rather than the sensor resolution. In such a scenario, ISP performance will grow slowly, motivated by more sophisticated filters rather than higher resolution. However, an array camera could improve lens quality and motivate much more robust ISPs in the future.

The video-encoding and -decoding blocks are typically fixed function and will be upgraded to take advantage of the emerging High Efficiency Video Coding (HEVC) standard.

Over this time frame, the wireless landscape will be relatively stable. The industry is currently undergoing the transition to LTE, although the various 2G and 3G protocols will be crucial for backwards compatibility in areas with spotty coverage. LTE will certainly progress to higher speeds, but there is no replacement on the horizon for the next ten years or so. Some high-end phones, and most tablets, may continue to use discrete LTE modems for performance and flexibility, especially for vendors without internal wireless expertise. However, most smartphones will integrate the various modems into the SoC, reducing cost and power.

Of course, these guidelines are not absolute, and SoCs will vary to cover the full range of the market. Devices like the Kindle e-reader hardly need a lot of graphics performance, and budget devices may continue to use single or dual cores for many years.

Licensed CPUs

The most pervasive mobile IP company is unquestionably ARM. ARM is particularly well-known for licensing the eponymous instruction set (e.g., ARMv7 and v8), the Cortex cores (e.g., A7) that implement it, and other SoC components such as the AMBA interconnect. Nearly every company in the mobile ecosystem is an ARM customer in one fashion or another.

One big trend we mentioned earlier that impacts ARM is the shift toward vertically integrating IP. Today, ARM has a large number of customers that license the Cortex A-series for mobile devices, including Broadcom, Mediatek, Nvidia, Texas Instruments, and Samsung. In contrast, the larger SoC vendors such as Apple and Qualcomm prefer to license the instruction set and design their own CPU cores. The latter approach requires more engineering talent, but ultimately costs less in terms of royalties; essentially it is a trade-off between fixed and variable costs.

Long-term, companies with sufficient volume will shift from licensing CPU cores to licensing the ISA and designing the cores. ARM’s cores are by necessity somewhat generic, since they must be attractive to all customers and compatible at all the major foundries (TSMC, GlobalFoundries, UMC, and Samsung). In addition to cost advantages, a custom core can be carefully optimized for the target applications and the underlying manufacturing.

Another issue is the divergence between tablets and smartphones. It is very hard to design an optimal CPU core for radically different power limits, and at some point the tablet market may grow to be large enough to merit a more carefully optimized design. The sweet spot for tablet SoCs is around 2-6W, versus 0.5-1.5W for a smartphone. It may prove more efficient to have two different cores spanning the full range from 0.5-6W rather than using a single design.

109 Reader Comments

The old Wintel - Intel and MS WOS - is dying. Intel must learn how to make cheap SoCs, and MS WOS perhaps should change its kernel to a *nix one to achieve better performance on cheap computers, as Apple did some time ago.

ARM64 and Ubuntu are not even considered in this great article.

*nix kernels have been proved far better than MS WOS ones, as every techie knows, and of course far better on these less powerful SoCs.

Chrome OS is shining, as are preinstalled Ubuntu machines in Asia.

The Ubuntu pocket computer concept is even better than the original iPhone, and Google and Apple will make their own approaches; we will probably see pocket computer OSs from both brands.

ARM currently is no match for x86 computers. People may be buying them in droves, but seriously - try running a modern game, or even just a more complex productivity program, on ARM. They will have their place, but I have no expectation that every PC in the near future will be SoC. I own an Android tablet myself, but I'm not throwing away my PC any time soon.

As for Microsoft using a *nix kernel - I want whatever you've been smoking man! I'm not advocating any OS, but that statement just blows my mind. Microsoft has kept its position in the business world for its pretty well-regarded NT kernel, which AFAIK is now the same for server, business and private use Windows versions. On top of that, they already have the WinRT kernel running on ARM. Why would they suddenly drop all this and turn to *nix? What are the advantages? You speak of performance, but is that seriously even an issue any more? I haven't heard many people complain about Windows 7 hogging their systems... let alone Windows 8, but that's pretty fresh so I'm not even considering it fully.

"Realistically, it is hard to see any benefits from quad-cores in mobile devices. The majority of PCs today sell with dual-core CPUs, and that is a reflection of the state of software; multithreading is hard and most applications are single threaded."

It's important to note that this is not just David's opinion. Work done in 2011 looked at the state of threading across a wide variety of desktop apps and showed that, while they were aggressively threaded, very few threads actually ran at the same time. Basically, the state of the art in 2011 on the desktop was that there was minor benefit from a third CPU and substantial benefit from a second. The only exception was what you would expect: video encoding. In particular, both image processing and games still make little use of aggressive SIMULTANEOUS threading. (That is, they have plenty of threads, but most of those threads are sitting around waiting, handling things like async IO.)

For example, two of the obvious sloths on current devices (mobile and, to a lesser extent desktop) are PDF viewing and the browser, and while ideas about how to parallelize these have been floating around for years, they're yet to become real. In particular the most obvious way to parallelize web sites, running different tabs/windows in different processes, is of even less interest and value on mobile, with its small screen, than on the desktop.

What does this suggest? I suspect we may see a bifurcation between vendors driven by specs (Octacore...) and vendors driven by actual use cases (obviously Apple, maybe Qualcomm). Detailed examination of real mobile apps, for example, shows that they are currently most throttled by small I-cache, small I-TLB, and small branch-prediction storage. It would make far more sense (even though it may not be as sexy) to devote a whole lot of extra space to those rather than to an extra cache. Second level cache versions of all of these are an obvious idea, but it may even be worth increasing cycle time if that's what's necessary to allow for larger such caches. (Again, lower MHz, not as sexy, but faster on real apps.) (It's interesting to note that right now D-cache size and D-TLB misses are much less of an issue.)

A second way to use silicon would be to beef up NEON by making the registers and units wider. Obviously a lot of effort is going into auto-vectorization on the Intel side, and the first fruits of this are starting to appear in LLVM. Most of the realistic auto-parallelization possibilities (eg polyhedral optimization) are as applicable to vectors as to multiple CPUs, with less overhead and less power usage. In particular, if NEON+ or whatever they call it could pick up scatter-gather capabilities you'd have something really awesome there.
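Scatter-gather, as the commenter describes it, is simply indexed vector loads and stores. A scalar Python sketch of the two operations, with each list element standing in for one SIMD lane:

```python
def gather(values, idx):
    """Vector gather: one indexed load per lane (values[idx[i]] for each lane)."""
    return [values[i] for i in idx]

def scatter(out, idx, src):
    """Vector scatter: one indexed store per lane (out[idx[i]] = src[i])."""
    for i, v in zip(idx, src):
        out[i] = v
    return out

values = [10.0, 20.0, 30.0, 40.0]
idx = [3, 0, 2]
g = gather(values, idx)            # [40.0, 10.0, 30.0]
print(scatter([0.0] * 4, idx, g))  # [10.0, 0.0, 30.0, 40.0]
```

In hardware, doing this in one instruction instead of a scalar loop is what makes irregular, pointer-chasing code vectorizable at all.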

All of which suggests to me that for Apple the near future will not be quad core. It may include a third low-power companion core (the value of such cores is that you can leave the phone "alive" for much longer, even though it's not doing very much. Think of e.g. those apps that monitor your sleep patterns so they wake you up when you're already in light sleep --- they work well, but drain battery, even though they require minimal real CPU.) It might also (perhaps it already does? How much do we know about the interior guts of Swift?) look rather more like a server CPU --- large I-cache, large I-TLB, lots of branch-prediction, and maybe (because it's easy and cheap in power and logic) SMT. If I were an Apple (or a Qualcomm, or an ARM) designer I'd be pushing, in order of priority: (a) the large I support structures I mentioned; (b) better NEON (scatter-gather as highest priority, wider as next priority); (c) SMT, with more cores (non-companion) as a far future possibility.

Then to top it off with trying to claim that the big daddy, Qualcomm, is the more vulnerable is laughable. Hint: neither Apple nor Samsung make true mobile SoCs,

Then who makes the Exynos SoC?

ARM does. Samsung is little more than a fab (read: silicon fabrication) for ARM's parts. They do a little side customization to the logic to facilitate implementation into their devices. This is why Samsung is used in the industry as the benchmark for the current ARM generation of products for performance (compute, power, etc.). Qualcomm and Apple are the only two that I know of that implement the ARM ISA independently. Benchmarks will show that either of these two companies' cores outperform the Exynos on a per-clock and (usually) per-watt basis.

I will technically agree with paul5ra here though. Qualcomm provides a top-to-bottom in-house SoC design, CPU, GPU, modem, peripheral, audio, etc.. Whereas Apple designs their own custom ARM core and probably licenses peripheral technologies and obviously don't produce their own modem or GPU. And again, Samsung is more or less ARM's fab.

I think it's important to note that Samsung is much more open to third party IP, to the point that they use NVIDIA, TI, and Qualcomm SoC besides their own. If Intel were to release a blockbuster mobile product I would expect Samsung to be one of the first to use it.

Quote:

Apple would need to develop or acquire a multi-mode modem (i.e., capable of 2G/3G/4G) and integrate it into the iPhone SoC.

Qualcomm is expected to ship a multi-mode modem this year; while it won't be integrated into the iPhone SoC, it will be available for them to use.

I find myself wondering what impact the DX9/10 issue has on Windows Phone performance, and thus uptake in the market. The fact that Microsoft insists on using an API that is not supported in anything like its modern form by any mobile GPU manufacturer seems like a real liability to the platform. And until there is a lot of uptake, likely no one will support it. The chicken/egg conundrum here seems like a bit of an anchor around the neck of Microsoft as they try to swim.

Realistically, it is hard to see any benefits from quad-cores in mobile devices. The majority of PCs today sell with dual-core CPUs, and that is a reflection of the state of software; multithreading is hard and most applications are single threaded. Software for mobile devices is even more primitive and less amenable to threading. Comparing a quad-core to a dual-core at the same power, the dual-core should be able to reach about 25 percent higher frequencies (power scales roughly with frequency cubed). For the vast majority of workloads, a faster dual-core CPU will have better performance. Despite this fact, there appears to be some marketing value for quad-core SoCs, even if the delivered value is minimal.

I fully and completely disagree with this section. While no one application may make efficient use of multiple cores, having additional cores is beneficial for system responsiveness in so many ways. If updates are being installed, that's a full core gone. If you've got a spreadsheet open that's crunching numbers, that's another core gone. Now your background apps have nothing to use for their tasks. (Nothing, of course, here referring to very constrained resources. The process scheduler will allow them to run at some point.)

I also believe that our modern programming languages are becoming more adept at multiprocessing, finally, and so we're seeing better use of multiple cores than we have in the past. Especially if I'm running Ubuntu Phone in a dock with the full desktop open, I'd rather have 4 fast cores, than 2 slightly faster cores.

Your post kind of makes the author's point for him. Sure, there are situations as you describe where many cores can be put to use, but those situations are contrived and rare. How often do you really have a spreadsheet using 100% of a core while an update uses 100% and you are doing something else? My Mac has 4 cores, 8 threads, and there are very few times when they are all saturated for any length of time. Even when running other apps in the background, usually they finish what they were doing long before I switch back to them and just sit there not doing anything. I'd venture to say that having one core that is twice as fast would produce a more responsive system than one with four half-speed cores 99% of the time or more.
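The commenter's intuition can be framed with Amdahl's law: a single core at twice the clock is a flat 2x on any workload, while a quad-core's speedup depends on the parallel fraction. A sketch, assuming ideal scaling on the parallel portion:

```python
def quad_speedup(p):
    """Amdahl's law: speedup of four base-speed cores over one base-speed core,
    where p is the fraction of work that parallelizes perfectly."""
    return 1.0 / ((1.0 - p) + p / 4.0)

# A single core at twice the clock is a flat 2.0x on any workload.
for p in (0.0, 0.5, 2 / 3, 0.9):
    print(f"parallel fraction {p:.2f}: quad = {quad_speedup(p):.2f}x vs. fast single = 2.00x")
# The quad only pulls ahead once more than ~2/3 of the work is parallel.
```

This ignores the power cost of doubling the clock, which the article's cube law makes severe, but it captures why mostly serial workloads favor fewer, faster cores.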

I expect Intel to gain some traction as tablets start to transition from "big smartphones" to something more akin to laptops to justify their price premium. Intel will get its x86 power envelopes down while maintaining superior performance to ARM. I recall reading that their JIT recompilers work extremely well, so handling ARM-centric code does not seem to be a problem on x86, in addition to the uncountable mountain of x86 code in the world.

For large machines, X86 is still the king. I have a 128GB iPad 4 and an OQO-02 for my mobile computing and an Ivy Bridge/2 GTX690s for home use. No matter how you cut it, the current crop of SOCs cannot come close to any high end desktop. (Not even the best laptop on the market will touch that desktop).

Of course Moore's Law will win in the end. I have personal experience there. My first "machine" was a Minivac 6010 in the mid 60s, I built an 8080 S-100 box (with an ASR-33 Teletype) in the 70s, 8086 in the early 80s, etc.

They don't really have one. The best low-power chips AMD has produced have struggled to reach the tablet power window, and they've got nothing at all in the phone space (<2W). Their newest chips are expected to be better fits for the low-power market (<6W), but I'm not holding my breath. Combine these with the fact that AMD has no capabilities in the wireless space, and they're simply not that relevant.

I think that if AMD delivers with Jaguar (their new low power core, expected Q2 '13), they will be worth looking at for tablets, but that's about the best that can be said.

Nice article, but I wondered a bit at, "Realistically, it is hard to see any benefits from quad-cores in mobile devices." I know it might seem like this at first glance given the typical one-app-at-a-time usage pattern, but... Are pricier quad-cores really getting design wins in tablets and even smartphones due to marketing reasons? Is that why Samsung is bringing out its 2 x 4 core design? I strongly suspect that even now these architectures are providing benefits for common usage such as web rendering. I realize this was a survey article, but it would have been nice to see a bit of substance behind the statement that quad-core is mostly marketing.

Then to top it off with trying to claim that the big daddy, Qualcomm, is the more vulnerable is laughable. Hint: neither Apple nor Samsung make true mobile SoCs,

Then who makes the Exynos SoC?

ARM does. Samsung is little more than a fab (read: silicon fabrication) for ARM's parts. They do a little side customization to the logic to facilitate implementation into their devices. This is why Samsung is used in the industry as the benchmark for the current ARM generation of products for performance (compute, power, etc.). Qualcomm and Apple are the only two that I know of that implement the ARM ISA independently. Benchmarks will show that either of these two companies' cores outperform the Exynos on a per-clock and (usually) per-watt basis.

Samsung uses standard ARM cores. However, there's a great deal beyond the CPU cores in an SoC (e.g. power controller, video encode/decode, ISP, etc.). Moreover, Samsung is responsible for the physical design...e.g. techniques like body biasing.

Quote:

I will technically agree with paul5ra here though. Qualcomm provides a top-to-bottom in-house SoC design, CPU, GPU, modem, peripheral, audio, etc.. Whereas Apple designs their own custom ARM core and probably licenses peripheral technologies and obviously don't produce their own modem or GPU. And again, Samsung is more or less ARM's fab.

Source: I work as a DVE in the industry

Samsung definitely owns a lot more IP than Apple does (e.g. they have a 4G modem). But Apple has a pretty nice software stack that seems to earn a lot of money.

Great job Ars Technica for leaving AMD out of both articles. I must commend David Kanter for his lack of knowledge. AMD's answer to the mobile space is Temash. It is an ultra-low-power variant of the Jaguar APU that is sitting inside the PS4 and Xbox 720. Here is an example of what it will do for the next incarnation of the Microsoft Surface: http://hexus.net/tech/news/systems/5200 ... echnology/

Once Temash is let loose on the mobile space, it will have all the performance of x86 with AMD's GPU technology, which is by far superior to anything Intel has or will ever have. And ARM competing with x86? Ha, keep dreaming. Temash will take the mobile market by storm, as Intel has nothing to compete with it till next year's 22nm Atom, and that will still have craptastic Intel graphics.

I simply didn't have the time to discuss AMD, especially since they don't have any viable phone offerings.

I do agree that AMD has a chance to make an impact in the tablet market, but you seem to have unrealistic expectations in that regard.

I just get annoyed at how companies think throwing more hardware at the situation will miraculously solve some problem when the software seems to be the problem.

Because that's the way things have worked in the desktop world for many many years, now. Optimization is seen as a bit of a waste of time when even the kludgiest code can run smoothly given a couple of years of faster hardware improvements.

Unfortunately, for mobile use, hardware isn't everything. You have to be power conscious, too, since battery life and thermal dissipation limits aren't evolving nearly as quickly as the number of transistors and cores are, so optimized, efficient programming has to come back into style now. Unfortunately, it hasn't been a universal adoption just yet.

This article, like the first one, sounds like it is written by a consumer rather than by someone who understands semiconductors. More nonsense about "vertical integration" and another nonsense statement of "Long-term, companies with sufficient volume will shift from licensing CPU cores to licensing the ISA and designing the cores." This will not happen, and standard ARM designs will prevail.

Why would you want to use a standard ARM design (no offense to my friends at ARM)?

1. They are more expensive than designing your own core if you have sufficient volume.
2. They are necessarily generic, designed for the lowest common implementation team.
3. They remove a huge number of degrees of freedom. E.g., if you can design a really fast L1 cache, you're still stuck with the architecture ARM chose...even if you could shave a cycle off.

The trend is clear, companies with high volume design their own cores. First only Intel did. Then QCOM got a license. Then Apple. Now Nvidia and probably Samsung. The numbers are going UP not DOWN.

Quote:

Then to top it off with trying to claim that the big daddy, Qualcomm, is the more vulnerable is laughable. Hint: neither Apple nor Samsung make true mobile SoCs, merely Application Engines. Both are completely reliant on horizontal integration for actual telecom capability. When it comes down to it, cost will decide everything, and that means greater integration of everything (i.e., including telecom capabilities into a single piece of silicon). Graphics IP is much simpler to buy in and integrate than telecom.

That's factually incorrect. Samsung has ~10% of the LTE modem market, they are the #2 player behind Qualcomm (which has about 85% share).

My point about Qualcomm is apparently a bit subtle. Right now, the company essentially has a monopoly on LTE. That will last for 1-2 years at best. I just saw a paper at ISSCC from Renesas showing a dual-core AP with integrated LTE. Samsung already has LTE and will probably integrate it soon. Intel plans to have integrated LTE in 2014. The trend is clear: more companies will have LTE in the near future, which removes a unique advantage of Qualcomm SoCs.

They will continue to be the largest merchant vendor for quite some time...but their actual marketshare will probably decrease.

Quote:

If I were to look at the future, I would see Mediatek and other upcoming Asian companies like Spreadtrum as having a bigger influence. Intel, while certainly becoming a big player, was also nowhere in the mobile SoC business before the Infineon acquisition, which it is odd not to mention in their history. Broadcom and ST-Ericsson should also both have had a look-in.

Yeah I didn't have the time/space to really discuss Mediatek or Broadcom (those were the next two on my list). ST-Ericsson doesn't seem to be going anywhere...