Diamond Member

When I started this piece, the goal I set out to reach was to either confirm or debunk how useful homogeneous 8-core designs would be in the real world. The fact that Chrome, and to a lesser extent Samsung's stock browser, were able to consistently load up to 6-8 concurrent processes while loading a page suddenly gives a lot of credence to these 8-core designs, which we would otherwise not have thought able to fully use their designed CPU configurations.
[...]What we see in the use-case analysis is that the number of use-cases where an application is visibly limited by single-threaded performance seems to be very limited. In fact, in a large number of the analyzed scenarios our test device's Cortex A57 cores would rarely need to ramp up to their full frequency beyond short bursts (thermal throttling was not a factor in any of the tests). On the other hand, scenarios where we'd find 3-4 high-load threads seem not to be particularly hard to find, and actually appear to be a pretty common occurrence. For mobile, the choice seems obvious due to the power curve implications. In scenarios where the loads aren't so small that it becomes not worthwhile to spend the energy to bring a secondary core out of its idle state, one could generalize that if one is able to spread the load over multiple CPUs, it will always be preferable and more efficient to do so. [...]In the end what we should take away from this analysis is that Android devices can make much better use of multi-threading than initially expected. There's very solid evidence that not only are 4+4 big.LITTLE designs validated, but we also find practical benefits of using 8-core "little" designs over similar single-cluster 4-core SoCs. For the foreseeable future it seems that vendors who rely on ARM's CPU designs will be well served by continued use of 4+4 b.L designs.

Diamond Member

I wonder if Intel will follow suit and also introduce 8+ core variants with a big.LITTLE approach on mobile? I'm not sure Intel has any cores that are suitable for such a design, though. What uArches would they use for the big vs. LITTLE cores? Do they have anything similar to ARM's A53 + A57?

Golden Member

Definitely an interesting article. Makes sense that more emphasis would be placed upon multi-threading both OS and applications given that the raw single-thread performance simply isn't there and that using it burns more power. Here's to hoping that the PC space follows suit at some point...

As for big.LITTLE, it is an interesting question. There's certainly an argument to be made for it in the smartphone space as those perf/w curves imply that the SoC is spanning a ~100x power range and the rule of thumb is that you can only make an architecture scale efficiently in a ~10x power range... though how far Intel's actually able to stretch it is up for debate. But even then, how much of a difference does that actually end up making?

Senior member

Definitely an interesting article. Makes sense that more emphasis would be placed upon multi-threading both OS and applications given that the raw single-thread performance simply isn't there and that using it burns more power. Here's to hoping that the PC space follows suit at some point...

Nah, it's fairly flawed. For example, the reason the update thread counts are so high is that Android devices do the FTL in software instead of offloading it to dedicated hardware, which inflates the number of threads massively.

Nor can we actually make any statement that having 8 active cores provides any meaningful performance in any of these cases, especially considering the CPU power states alongside the number of threads used.

Diamond Member

Nah, it's fairly flawed. For example, the reason the update thread counts are so high is that Android devices do the FTL in software instead of offloading it to dedicated hardware, which inflates the number of threads massively.

I've seen a lot of people claiming this now and I'd like to see some more real information behind it.

For one thing, what FTL? My Nexus 4 (not by any means a recent phone) says its storage is ext4. That should mean that it presents a block interface and flash translation is handled by hardware. Someone correct me if I'm wrong here. This should also be true for storage like SD cards. Flash based file systems should apply for raw NAND, but these days most Android devices use eMMC for internal storage.
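For reference, a quick way to check this yourself is to look at what /proc/mounts reports. The snippet below is only a sketch (the sample mount lines, device paths, and partition names are made up for illustration): it classifies mountpoints by whether the filesystem type implies a software FTL on raw NAND, or a plain block filesystem whose FTL lives in the eMMC controller firmware.

```python
# Sketch: classify mountpoints by whether the filesystem type implies a
# software FTL (raw-NAND filesystems) or a hardware FTL (block filesystems
# on eMMC). Sample data below is illustrative, not from a real device.

SOFTWARE_FTL_FS = {"yaffs2", "ubifs", "jffs2"}  # raw-NAND filesystems

def classify_mounts(mounts_text):
    """Map mountpoint -> (fstype, 'software FTL' or 'hardware FTL')."""
    result = {}
    for line in mounts_text.splitlines():
        parts = line.split()
        if len(parts) < 3:
            continue
        _dev, mountpoint, fstype = parts[:3]
        kind = "software FTL" if fstype in SOFTWARE_FTL_FS else "hardware FTL"
        result[mountpoint] = (fstype, kind)
    return result

# Hypothetical /proc/mounts excerpt; on a real device you'd read the file.
sample = """\
/dev/block/mmcblk0p23 /data ext4 rw,nosuid,nodev 0 0
/dev/block/mtdblock4 /persist yaffs2 rw 0 0"""

print(classify_mounts(sample))
```

On-device you'd feed it `open("/proc/mounts").read()` instead of the sample string.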

But let's say an FTL really is being used. Why assume that FTL threads are spending any meaningful and persistent amount of CPU time in these tests? They shouldn't be dominated by flash reading. The gaming tests especially show a pretty reliable steady-state load that doesn't look at all like what you'd expect if the threading were taken up by flash accesses (which would be much more intermittent). And the real work of flash translation is in wear leveling, which happens during writing. I doubt any of the tests run in this article do much flash writing at all.

So if you're going to say that FTL skews the results here and that it'd be a lot different with hardware acceleration for flash, put up something to actually substantiate it.

Performance isn't really the point. Andrei says in the article several times that running tasks on more cores in parallel allows lower clock speeds (or better use of the little cores), which can be more power efficient. There are a lot of people saying that in many of these cases all of the threads could be running on a single big core, and that's true, but it misses the point: that would probably be less power efficient.

For a while now many people have said that these cores in mobile devices go almost completely unused, and that big.LITTLE in particular is a waste because phones just run everything on the little cores (or the big cores?) This article shows empirically that this is not the case. So now people are saying that the Linux kernel scheduler is just plain fundamentally wrong and that it should really be distributing this work among far fewer cores. I guess it's going to take someone like Andrei doing tests with cores disabled and comparing power consumption to see if the schedulers are really totally wrong.

Golden Member

Yep, that's the thing that is really missing and that I really hope they expand upon: how does this affect power consumption?

Especially on XDA, you see lots of posts about people disabling cores, lowering clock speeds, etc. While this certainly lowers peak usage, I've never seen much gain from these mods because race to sleep actually seems to work.

Also, I don't think the writing is on the wall w.r.t. big.LITTLE. Both Intel and Qualcomm seem to think that one set of optimized cores is better.

Diamond Member

Nah, it's fairly flawed. For example, the reason the update thread counts are so high is that Android devices do the FTL in software instead of offloading it to dedicated hardware, which inflates the number of threads massively.

Nor can we actually make any statement that having 8 active cores provides any meaningful performance in any of these cases, especially considering the CPU power states alongside the number of threads used.

It was such a terrible article. It did nothing to address the fact that one of the fastest devices on the market has only two cores, while some of those 8- and 10-core devices are slower than dog poo.

Diamond Member

Yep, that's the thing that is really missing and that I really hope they expand upon: how does this affect power consumption?

Especially on XDA, you see lots of posts about people disabling cores, lowering clock speeds, etc. While this certainly lowers peak usage, I've never seen much gain from these mods because race to sleep actually seems to work.

But a pure race to sleep isn't what the scheduler is doing; if that were the case you would only have two states: full clock speed (on a big core) and idle. Instead running at a lower clock speed (and on a little core) is allowed depending on how frequently the thread goes idle.

Better for them, anyway. That doesn't mean they wouldn't benefit from an asymmetric setup with smaller cores, but developing those smaller cores is a big overhead. And they might feel that time and money are better spent on other techniques.

ARM, on the other hand, was going to do the little cores regardless of their big core strategy because there's a big market for devices only running those cores. This availability could have influenced design decisions for their big cores that were different from their competitors.

Here's what we do know: just about everyone doing SoCs for mobile platforms agrees that ARM's big cores should be paired with a little core. And Qualcomm has enough faith in this strategy to continue using A72 + A53 for the entire mid-range of their next gen lineup, having only one high end model with their custom core. If this design was much worse I think they would have held out and pushed their core further down, even if it meant backporting it to 28nm or selling lower end 16nm models.

Senior member

Better for them, anyway. That doesn't mean they wouldn't benefit from an asymmetric setup with smaller cores, but developing those smaller cores is a big overhead. And they might feel that time and money are better spent on other techniques.

ARM, on the other hand, was going to do the little cores regardless of their big core strategy because there's a big market for devices only running those cores. This availability could have influenced design decisions for their big cores that were different from their competitors.

It's kind of the opposite, really: ARM had no real choice but to do b.L because they don't control the end-point designs. Most of the advanced power management features of, say, Intel or Qcom rely on understanding the underlying process and customizing and accounting for it. In stark contrast, ARM's designs basically have to be process-neutral. Qcom/Intel can do things like fine-grained power gating because they control the entire design stack from RTL to silicon; ARM only really controls the RTL.

Here's what we do know: just about everyone doing SoCs for mobile platforms agrees that ARM's big cores should be paired with a little core. And Qualcomm has enough faith in this strategy to continue using A72 + A53 for the entire mid-range of their next gen lineup, having only one high end model with their custom core. If this design was much worse I think they would have held out and pushed their core further down, even if it meant backporting it to 28nm or selling lower end 16nm models.

Senior member

I've seen a lot of people claiming this now and I'd like to see some more real information behind it.

For one thing, what FTL? My Nexus 4 (not by any means a recent phone) says its storage is ext4. That should mean that it presents a block interface and flash translation is handled by hardware. Someone correct me if I'm wrong here. This should also be true for storage like SD cards. Flash based file systems should apply for raw NAND, but these days most Android devices use eMMC for internal storage.

But let's say an FTL really is being used. Why assume that FTL threads are spending any meaningful and persistent amount of CPU time in these tests? They shouldn't be dominated by flash reading. The gaming tests especially show a pretty reliable steady-state load that doesn't look at all like what you'd expect if the threading were taken up by flash accesses (which would be much more intermittent). And the real work of flash translation is in wear leveling, which happens during writing. I doubt any of the tests run in this article do much flash writing at all.

So if you're going to say that FTL skews the results here and that it'd be a lot different with hardware acceleration for flash, put up something to actually substantiate it.

There is no other viable explanation for why a simple copy/replace operation would burn so many CPU cycles. Hell, I can run a ZFS RAID doing 100+ MB/s (which is massively more bandwidth than in a phone) with fewer CPU cycles!

Performance isn't really the point. Andrei says in the article several times that running tasks on more cores in parallel allows lower clock speeds (or better use of the little cores), which can be more power efficient. There are a lot of people saying that in many of these cases all of the threads could be running on a single big core, and that's true, but it misses the point: that would probably be less power efficient.

It requires more cores to be online, each eating up more power. Looking at the data, it's trying to clock-gate or power-gate the cores the majority of the time. Going with a 2+2 solution would likely provide the same level of performance AND save power.

For a while now many people have said that these cores in mobile devices go almost completely unused, and that big.LITTLE in particular is a waste because phones just run everything on the little cores (or the big cores?) This article shows empirically that this is not the case. So now people are saying that the Linux kernel scheduler is just plain fundamentally wrong and that it should really be distributing this work among far fewer cores. I guess it's going to take someone like Andrei doing tests with cores disabled and comparing power consumption to see if the schedulers are really totally wrong.

Even the data presented points to the cores basically going unused: sure, it's running a lot of threads, but they're idle or powered off 50+% of the time, or operating at horrible points for power/efficiency (e.g. 4 cores running at 400 MHz, well below Fmax at Vmin). Basically, you pretty much never want multiple cores running at Fmin @ Vmin, as you are just wasting leakage power, which will dominate. Ideally, you want to run at Fmax @ Vmin on as few cores as possible and then go to sleep.
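The race-to-sleep argument here can be put in rough numbers. This is only a back-of-the-envelope sketch with made-up figures (WORK, F_MAX, F_MIN, P_DYN_PER_HZ, and P_LEAK are all hypothetical), but it shows the structure of the claim: at a fixed voltage, dynamic energy per cycle is the same at any frequency, so the longer runtime of slow cores just accumulates leakage energy.

```python
# Back-of-the-envelope energy comparison for a fixed amount of work at a
# fixed voltage (Vmin). All numbers are made up for illustration; the point
# is the structure: dynamic power scales with f, leakage does not.

WORK = 4e9             # total cycles of work to retire
F_MAX = 1.6e9          # max frequency sustainable at Vmin (Hz)
F_MIN = 0.4e9          # min DVFS frequency (Hz)
P_DYN_PER_HZ = 0.5e-9  # dynamic power per Hz at Vmin (W/Hz)
P_LEAK = 0.15          # leakage per powered-on core (W)

def energy(n_cores, freq):
    """Joules to finish WORK on n_cores running at freq, then sleep."""
    t = WORK / (n_cores * freq)                   # seconds until done
    p = n_cores * (P_DYN_PER_HZ * freq + P_LEAK)  # total power while active
    return p * t

race = energy(1, F_MAX)  # 1 core at Fmax@Vmin, then sleep
slow = energy(4, F_MIN)  # 4 cores crawling at Fmin@Vmin
print(f"race-to-sleep: {race:.3f} J, 4 cores at Fmin: {slow:.3f} J")
```

With these assumed numbers the single fast core wins purely because the slow cores spend the same dynamic energy per cycle but leak four times over for the same 2.5 s runtime.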

Diamond Member

But that assumes it is possible to create a single CPU uArch that has ideal perf/watt across a very wide performance range. Is that really possible? Can't you always do better if you have one uArch handling the low performance range, and another one handling the high performance range? I.e. as the graph from the AT article shows:

As I understand it, even Intel could do better with such a solution, assuming they have suitable uArches available for the big vs. LITTLE cores. But I don't think they have suitable cores available at the moment, unless there are some uArches I'm not aware of.

Sure, cost should also be weighed into the equation. But in the case of ARM, the LITTLE cores occupy so small die area that the added cost of them is not very high.

Also, note that big.LITTLE can be configured to run both the big and LITTLE cores concurrently, when the system is under heavy load. So then you don't get the problem where one of them is always inactive, resulting in "dead or idle silicon".

Senior member

But that assumes it is possible to create a single CPU uArch that has ideal perf/watt across a very wide performance range. Is that really possible? Can't you always do better if you have one uArch handling the low performance range, and another one handling the high performance range? I.e. as the graph from the AT article shows:

b.L, like all core hopping, has a significant issue with scheduling to even break even. After all, 'it's just a simple matter of software'.

So yes, it really is possible. After all, we have an actual example of it working in Apple's designs.

As I understand it, even Intel could do better with such a solution, assuming they have suitable uArches available for the big vs. LITTLE cores. But I don't think they have suitable cores available at the moment, unless there are some uArches I'm not aware of.

In theory they could combine Atom + Core for low power laptop/tablet markets, but the reality is it would present more problems than it would solve.

b.L is almost entirely a compromised solution due to the limitations that ARM has to work with. If it provided such an advantage, you would see Apple/QCom doing it with their own designs; after all, they also have easy access to A53s, but instead they are going with single core designs that scale.

Senior member

Sure, cost should also be weighed into the equation. But in the case of ARM, the LITTLE cores occupy so small die area that the added cost of them is not very high.

Also, note that big.LITTLE can be configured to run both the big and LITTLE cores concurrently, when the system is under heavy load. So then you don't get the problem where one of them is always inactive, resulting in "dead or idle silicon".

There is no viable phone workload that could possibly use 8 cores that isn't simply horrible software design. The whole example of the update threads is purely an example of something fubar in the software; there is no need for more than a single core to handle sub-25 MB/s storage I/O.

Diamond Member

There is no other viable explanation for why a simple copy/replace operation would burn so many CPU cycles. Hell, I can run a ZFS RAID doing 100+ MB/s (which is massively more bandwidth than in a phone) with fewer CPU cycles!

It requires more cores to be online, each eating up more power. Looking at the data, it's trying to clock-gate or power-gate the cores the majority of the time. Going with a 2+2 solution would likely provide the same level of performance AND save power.

Even the data presented points to the cores basically going unused: sure, it's running a lot of threads, but they're idle or powered off 50+% of the time, or operating at horrible points for power/efficiency (e.g. 4 cores running at 400 MHz, well below Fmax at Vmin). Basically, you pretty much never want multiple cores running at Fmin @ Vmin, as you are just wasting leakage power, which will dominate. Ideally, you want to run at Fmax @ Vmin on as few cores as possible and then go to sleep.

Either you have a strange definition of Vmin or you don't understand that you can't run Fmax at Vmin. Either way you're basically saying that DVFS is a farce despite the fact that it's basically been ubiquitous for several years.

Lower frequencies have better perf/W so long as the static power consumption does not overtake this. That's because dynamic power consumption is proportional to the square of voltage and lower frequencies can run at lower voltages. You say static leakage dominates but on modern processes w/HKMG leakage has gone down and having relatively small cores brings it down further. Meanwhile, the full range of dynamic power consumption has gone up tremendously in phones over the past several years. This isn't a mere hypothesis, this is demonstrated in the graphs at the end of the article which are from empirical measurements.
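The V-squared part of that argument can be sketched numerically. The operating points below are made up (loosely shaped like a mobile DVFS table, not taken from the article), but with dynamic power modeled as C*V^2*f, perf/W per cycle works out to 1/(C*V^2), so it improves at every step down as long as voltage can keep scaling with frequency:

```python
# Illustrative sketch: above the Vmin knee, voltage scales down with
# frequency, so lower DVFS operating points deliver more work per joule.
# C and the OPP table are hypothetical numbers, not the article's data.

C = 0.3e-9  # effective switched capacitance (F), assumed

# (frequency in Hz, voltage in V) pairs; voltage drops with frequency
opp_table = [(1.9e9, 1.20), (1.5e9, 1.05), (1.1e9, 0.95), (0.8e9, 0.90)]

def dynamic_power(f, v):
    """Classic CMOS dynamic power: P = C * V^2 * f (watts)."""
    return C * v * v * f

def perf_per_watt(f, v):
    """Cycles per second per watt; algebraically this is 1/(C*V^2)."""
    return f / dynamic_power(f, v)

for f, v in opp_table:
    print(f"{f / 1e9:.1f} GHz @ {v:.2f} V: "
          f"{dynamic_power(f, v):.2f} W, perf/W = {perf_per_watt(f, v):.3e}")
```

The sketch stops where the shmoo argument in the replies takes over: below the lowest voltage the table supports, frequency reductions no longer buy a voltage reduction and the perf/W gain disappears.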

Switching to a core that has a lower peak performance target increases perf/W further, because you stop paying for the extra dynamic range: a lower performance target means less power burned on speculation, pipeline stages, OoOE, branch prediction, caching, prefetching, and so on. All things you can see in A53's design vs A57's. That's the benefit of big.LITTLE. It's a very fundamental concept.

This is utterly and completely false. It depends where on the Vmax-Vmin scale you are operating. Based on the frequencies shown in the article, the cores are generally at Fmin@Vmin. Comparing 2 cores at Fmax/2 @ Vmin vs 1 core at Fmax @ Vmin, the Fmax@Vmin core would burn less power.

Either you have a strange definition of Vmin or you don't understand that you can't run Fmax at Vmin. Either way you're basically saying that DVFS is a farce despite the fact that it's basically been ubiquitous for several years.

Of course you can run Fmax@Vmin. You do understand that at each Vx there is a range from Fmax to Fmin, right? Have you never seen a shmoo plot before? And where am I saying DVFS is a farce? It seems you just don't understand semiconductor parametrics.

Lower frequencies have better perf/W so long as the static power consumption does not overtake this. That's because dynamic power consumption is proportional to the square of voltage and lower frequencies can run at lower voltages. You say static leakage dominates but on modern processes w/HKMG leakage has gone down and having relatively small cores brings it down further. Meanwhile, the full range of dynamic power consumption has gone up tremendously in phones over the past several years. This isn't a mere hypothesis, this is demonstrated in the graphs at the end of the article which are from empirical measurements.

Lower frequencies only have better perf/W until you hit Vmin. Once you hit Vmin and reduce below Fmax@Vmin, actual perf/W drops.

Switching to a core that has a lower peak performance target increases perf/W further, because you stop paying for the extra dynamic range: a lower performance target means less power burned on speculation, pipeline stages, OoOE, branch prediction, caching, prefetching, and so on. All things you can see in A53's design vs A57's. That's the benefit of big.LITTLE. It's a very fundamental concept.

No, it doesn't increase perf/W. It only increases perf/W if the performance*power*time saving is large enough to win out over the context switch latency, the scheduling inefficiencies, and the additional cost of having both cores active while performing the switch. It also depends on the cache residency of the workload, as moving from a core with a hot cache to a core with a cold cache can burn significant power reloading the cold cache.

Sure, in a perfect world where caches didn't exist, context switches were instant and free, and schedulers were oracles, b.L would be a definite win. Unfortunately, we don't live in that world.
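The break-even both sides are arguing about fits in a one-line model. Everything here is a hypothetical figure (the migration cost and the power saved on the little core are assumptions, not measurements): migrating only pays off when the power saved over the thread's residency on the new core exceeds the one-off cost of the switch plus refilling a cold cache.

```python
# Sketch of the cluster-migration break-even. Both constants are assumed
# round numbers, chosen only to show the shape of the trade-off.

MIGRATION_ENERGY = 0.005  # J: switch overhead + cold-cache refill, assumed
POWER_SAVED = 0.3         # W saved by running on the little core, assumed

def migration_pays_off(residency_seconds):
    """True if the thread stays long enough to amortize the move."""
    return POWER_SAVED * residency_seconds > MIGRATION_ENERGY

print(migration_pays_off(0.001))  # 1 ms burst: not worth moving
print(migration_pays_off(0.5))    # 500 ms steady phase: clearly worth it
```

Under these assumptions the break-even residency is about 17 ms, which is why both the "steady-state loads stay put" and the "short bursts shouldn't hop" positions can be simultaneously true.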

But you're saying they would be better off in all ways with fewer cores, so the scheduler could just keep the other cores powered off.

But it doesn't.

All scheduling is based on analyzing repetitive behaviors; it always has been. Saying that it'd need an oracle to be effective is meaningless without actual data showing that it's a loss with subpar scheduling.

This is utterly and completely false. It depends where on the Vmax-Vmin scale you are operating. Based on the frequencies shown in the article, the cores are generally at Fmin@Vmin. Comparing 2 cores at Fmax/2 @ Vmin vs 1 core at Fmax @ Vmin, the Fmax@Vmin core would burn less power.

Of course you can run Fmax@Vmin. You do understand that at each Vx there is a range from Fmax to Fmin, right? Have you never seen a shmoo plot before? And where am I saying DVFS is a farce? It seems you just don't understand semiconductor parametrics.

Okay, so when you say Fmax @ Vmin you mean the maximum frequency a particular voltage can support, and Fmin the lowest frequency that voltage can support (and in this case Fmin should be the same for all voltages). Usually when I've seen Vmin/Vmax, it refers to global limits irrespective of frequency.

What in the article leads you to think that they are ever running at a higher voltage than the chip's binning and power management allows for that frequency?

Fmin for the big cores is 800 MHz, and if you look at the frequency tables, voltage keeps scaling all the way down to that point, which is why it switches to the little cores below it; those have their Vmin at a substantially lower frequency. Yes, there's no point decreasing frequency below where you can still scale voltage. What makes you think that's happening here?

Again: the graph at the end. I don't know why you keep ignoring it. It shows perf/W increasing with lower perf over a huge dynamic range.

No, it doesn't increase perf/W. It only increases perf/W if the performance*power*time saving is large enough to win out over the context switch latency, the scheduling inefficiencies, and the additional cost of having both cores active while performing the switch. It also depends on the cache residency of the workload, as moving from a core with a hot cache to a core with a cold cache can burn significant power reloading the cold cache.

Sure, in a perfect world where caches didn't exist, context switches were instant and free, and schedulers were oracles, b.L would be a definite win. Unfortunately, we don't live in that world.

You're making broad statements about the costs of context switches and cache flushes, but the fact is that this only applies insofar as you actually switch clusters. And a lot of loads hit steady states where they don't switch for a long time, if ever. This is evident in the data.