95 Comments

I just picked up an ASUS MeMO Pad HD 7 for my wife, and it has a quad-core A7 @ 1.2 GHz. I have been pretty impressed with how well it runs. Great little device. Looks like the A53 will be quite a bit more awesome.

I do wish that table above included some high performance comparisons as well. It's nice to see the A9, but how about a Krait variant or two, and perhaps the A15?

Qualcomm really can't claim any benefit to 64 bits without drawing attention to a glaringly obvious hole in their higher end lineup. Another announcement pending? Wouldn't think so, since they just announced the 805.

The 805 will ship in 1H 2014, the 410 in 2H 2014. It stands to reason that the 805 is a stopgap before the 64-bit follow-up to Krait is finished. If they're executing as well as they have been since Krait's introduction, they would have this 'krait2_64' ready either for the holiday season 2014 (unlikely, since that doesn't line up well with HW releases) or for the new top devices of H1 2015 (S6, HTC One something). It could be that they make it for LG's next top device, perhaps the Nexus 10 for 2H 2014.

Awesome! If this is the Snapdragon 410, and on the high end we only have an 805, I'll be looking forward to the Snapdragon 810 soon. Qualcomm's naming structure has improved lately, but it's still a little bit complicated sometimes.

You're here because the 64-bit processors also have improved performance. Forget the 64-bitness and concentrate instead on the generation: the old generation of processors is slower than the next generation, as per the norm; it just happens that the next generation of processors is also 64-bit. You get more registers, better IPC, wider execution units, and lower power all mixed together. 64-bit is just frosting at that point.

Registers can be thought of as a very fast cache. The more registers a CPU has, the less the compiler has to move data back and forth to/from memory.

You seem to have confused quantity of registers with width of registers, and then conflated the power savings proposition with the 64 bitness. In fact, the power saving comes (in part) because the 32 bit ARM ISA evolved over time and has numerous tweaks for backwards compatibility, requiring more transistors and more power. The 64 bit ISA wipes that slate clean and implements only what is required. It's more efficient because it ditches those tweaks *and* is designed with learnings from the past decade in mind.

64 bit isn't better in some abstract sense. 64 bit ARM is both higher performing and lower power than 32 bit ARM. And, as luck would have it, that's what we're talking about here.

No, the amount of logic involved in doing the actual addition is a small proportion of the total involved in the execution of a single instruction. So a 64-bit addition might use maybe 5% more power than a 32-bit addition, not twice as much.

Dan, most CPU logic is not in math but in a lot of other components: CPU cache (which is sometimes more than half of the entire CPU), branch prediction, the memory addressing unit, etc. Also, even on 64-bit CPUs the code still mostly uses 32-bit integers, keeping the transistor count the same. Without knowing the full specifics, most 64-bit integer operations can be implemented using 32-bit integer math, so the extra added logic can be reduced even further (as an uncommon path).

Are more registers faster? Oh, yeah, by a large amount, because registers run something like 4 times faster than L1 cache (or even more), 10-20 times faster than L2 cache, and L2 cache is typically 10x faster than a memory access. A compiler that has 2x more registers on the target CPU will likely not give code that is 4 times faster, but a 30-50% speedup is doable in a lot of real code. LLVM (the main backend optimizer) reported that a roughly 10% improvement in register allocation gave speedups of up to 20%: http://bit.ly/1d7B3aw

That would sound nice, but you miss the point that ARMv8 has a 32-bit mode that is compatible with ARMv7 (and transitively with older ARM ISAs). So they cannot "wipe that slate clean" at all; everything has to be there.

More registers are generally better indeed, however the gain from 14 to 31 is not that large - studies indicated around 20-24 is optimal. Note there are drawbacks as well to having more registers such as a slower process switch.

The A53 includes all 32-bit instructions, so it can run all existing binaries. So nothing has been ditched at all. The power savings are not due to being 64-bit, and not due to the new ISA either. The efficiency improvements are simply due to it being newer and better than its predecessors (if it had been 32-bit, the gains would be the same).

64-bit code will often run a little faster than 32-bit but not hugely so. While the 64-bit ISA allows for power savings in decoding, 64-bit pointers and registers increase power slightly, so which effect is larger will depend on each particular application.

Do you have links to any studies outside of the ones AMD did when evaluating x86-64? Those are good for a single data point, but they're a bit limited given that they were specific to x86-64 and a relatively wide OoO uarch. In-order uarchs, in comparison, benefit from code that's more aggressively scheduled to hide latency, which increases register pressure.

I was thinking of the original RISC studies for MIPS and SPARC. They are old now but I confirmed those results for ARM - basically the benefit of each extra register goes down exponentially. If you have a good pressure-aware scheduler (few compilers get it right...) then you only need a few extra registers.

Thanks for the clarification. I'd say that even if the sweet spot is 20-24, that justifies 31 GPRs (plus SP). I'd also argue that research done with the original MIPS and SPARC isn't perfectly representative of something like Cortex-A53 either. In my experience, going from hand coding ARM9 to Cortex-A8 assembly presented a lot of new challenges in scheduling which absolutely increased register pressure. Dual issue means you have to hide more instructions in a similar latency, and generally more latencies were added, like for address generation or shifts. The original 5-stage RISC CPUs like the first MIPS uarchs would be a lot closer to ARM9 than Cortex-A8. Cortex-A53 probably doesn't have as many interlock conditions as A8, but it should still be substantially worse than ARM9.

One particular application I know I'd appreciate having 31 GPRs for is emulating another ISA with 16 GPRs, like x86-64.

Do you know how computers work? Registers actively store the work in progress of the CPU. Adding two numbers takes three registers. Adding 12 sets of numbers in parallel takes 36 registers.

Increasing your register count 10 fold allows you 10x improvement in performance, assuming 10 available execution units to do work. The way register files work, you can also work on more bits at a time too! Instead of adding 2 ints you can add 20 ints, 10 doubles, or 5 floats at a time.

You don't get 10x performance from 10x registers. If you are very, very lucky, you can maybe cut main memory usage to 1/10th. That is faster, but even if your program is 50% memory operations (unrealistically high) and 50% other, then 10x registers take execution time from 50%+50% to 5%+50%; that is a 1.8x speedup, all things being super optimal. In exchange, the registers themselves use more power.

You get 10x performance from 10x execution units. That will cost you *more than* 10x power (due to how pipelines work) too. Even Haswell has only 7 execution units.
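For what it's worth, the ~1.8x figure works out with Amdahl's-law style arithmetic (the 50% memory fraction and the 10x reduction are the hypothetical numbers from the comment, not measurements):

```python
# Hypothetical workload split, as stated above:
mem_fraction = 0.5     # time spent on memory operations
other_fraction = 0.5   # everything else

# Best case: 10x the registers cut main-memory traffic to 1/10th,
# leaving the non-memory half of the work untouched.
new_time = mem_fraction / 10 + other_fraction   # 0.05 + 0.50 = 0.55

speedup = 1.0 / new_time
print(f"speedup = {speedup:.2f}x")  # ~1.82x, nowhere near 10x
```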

I explained my post perfectly. 10x registers and 10 available execution units = 10x improvement in performance. I apologize if my hyperbole threw you for a loop, I was trying to explain a concept.

10 execution units with only 3 registers = 1 add per clock. 10 execution units with 30 registers = 10 adds per clock. I also hinted at parallel processing; if you have 5 floats that need to be added to 5 floats (or multiplied, or accumulated, or whatever), you can do that in one clock cycle now.

AES saw a huge improvement because AArch64 has instructions specifically to assist AES acceleration which Geekbench is leveraging. If DGEMM uses double precision then it'd have seen a big improvement due to AArch64 adding support for double precision SIMD. The smaller improvements (and one notable regression) in the integer tests could be from the increased register count but possibly also from other factors, like for example if Cyclone is more efficient with conditional select in AArch64 than predication in AArch32.

Actually "race to sleep" uses more power because you are running the CPU at a higher frequency and voltage. It's always better to spread tasks across multiple CPUs and run at a lower frequency and voltage, even if that means it takes longer to complete.

Um, no. Not in the slightest. Race to sleep is the best fit we've yet come up with given our current technologies (i.e., constant running power and fixed performance-to-sleep transition requirements). See:
www.cs.berkeley.edu/~krioukov/realityCheck.pdf
www.usenix.org/event/atc11/tech/final_files/LeSueur.pdf
for some examples.

LOL. Not all CPUs are 130W extreme edition i7's on an extremely leaky process!!!

A <5W mobile core on a modern low power process has very little leakage (unlike the i7), so it is always better to scale the clock and voltage down as much as possible to reduce power consumption. big.LITTLE takes that one step further by moving onto a slower, even more efficient core. Running as fast as possible on the fastest core is the only sure way to run your batteries down fast.

That link doesn't say anything about "race to idle". Basically, if you look at the first graph it shows that the 3 devices have different idle consumption simply due to using different hardware (the newest hardware wins, as you'd expect). Anand concludes the device with the lowest idle consumption uses less energy over a long enough timeframe even though it may use far more power when active. True of course, but that has nothing to do with "race to idle".

Look at the right side of the graph, "Heterogeneous CPU operation". That shows performance versus the amount of power consumed. As you can see it is not linear at all, and the more performance you require, the more the graph curves to the right (which means less efficient). To paraphrase Anand: "Based on this graph, it looks like it takes more than 3x the power to get 2x the performance of the A7 cluster using the Cortex A15s." So if you did "race to idle" on the A15, you'd use at least 50% more energy to execute the same task than on the A7. Of course the A7 runs slower and so returns to idle later than the A15, but it still uses less energy overall.
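The 50% figure follows from energy = power x time (the 3x power and 2x performance ratios are the approximate readings from the graph, as quoted above):

```python
# A15 cluster vs A7 cluster, approximate figures read off the graph:
power_ratio = 3.0   # A15 draws ~3x the power of the A7 cluster
perf_ratio = 2.0    # for ~2x the performance

# Energy for a fixed task = power * time, and time scales as 1/performance.
energy_ratio = power_ratio / perf_ratio
print(f"A15 energy / A7 energy = {energy_ratio:.1f}")  # 1.5, i.e. 50% more
```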

Yes it does: "If we extended the timeline for the iPhone 4 significantly beyond the end of its benchmark run we'd see the 4S eventually come out ahead in battery life as it was able to race to sleep quicker."

No. The 4S uses significantly more power than the iPhone 4 when actually running. You can clearly see that the power consumption above the idle level for the 4S is exactly twice that of the 4, but it is only 75% faster. That means it used about 15% more energy to complete the benchmark. So clearly "race to idle" uses more energy than running a bit slower. If you lowered the maximum clock frequency of the 4S, it would become as efficient as the 4.
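The ~15% number comes from the same power x time arithmetic (the 2x power and 75% speedup are this comment's readings of the chart):

```python
# iPhone 4S vs iPhone 4, active power above idle, per the comment's readings:
power_ratio = 2.0    # 4S draws twice the power of the 4 while running
perf_ratio = 1.75    # but finishes the benchmark only 75% faster

# Energy for a fixed task = power * time, and time scales as 1/performance.
energy_ratio = power_ratio / perf_ratio
print(f"4S energy / 4 energy = {energy_ratio:.3f}")  # ~1.143, i.e. ~15% more
```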

If race to sleep were always the best solution CPUs wouldn't have DVFS at all. You'd run it at the highest clock always then power gate when idle. But that's not how things are done at all. Dynamic clocking schemes actively try to minimize the amount of time the CPU spends idle, at least until it hits some minimum clock speed.

More registers are better because they are the fastest memory available to the CPU. The fewer you have, the more time is spent waiting on slower memory. Also, a RISC architecture has no concept of, say, "add RAM_location_1 to RAM_location_2"; RISC can only move data from RAM to registers and perform operations on registers. Therefore, more registers are pretty much vital on a RISC system.

There are also specific security and ease-of-programming benefits to 64-bit. The Linux kernel used in Android supports address space randomization; it's much easier to pick an unexpected address range when you have 64 bits to play with. And the POSIX mmap() functionality can work on much larger files with a 64-bit address space, making it easier to write performant apps that work with large data sets (high-MP photos, for instance).
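As a small illustration of the mmap() point: mapping a file gives byte-addressable access without explicit read() calls, and a 64-bit address space is what makes this practical for multi-gigabyte files. The file below is tiny just to keep the sketch runnable:

```python
import mmap
import os
import tempfile

# Create a small scratch file; a real use case would be a multi-gigabyte
# photo library or database that cannot fit in a 32-bit address space.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"large data set" * 1024)

with open(path, "rb") as f, mmap.mmap(f.fileno(), 0,
                                      access=mmap.ACCESS_READ) as m:
    # The whole file is byte-addressable, like one big bytes object.
    first = m[:14]
    size = len(m)

os.remove(path)
print(first, size)
```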

Having 64-bit applications is even a disadvantage when you have little memory (say 1 GB or less). As mentioned here on AnandTech (http://www.anandtech.com/show/7460/apple-ipad-air-...), 64-bit apps use 20-30% more memory, so 1 GB of memory on a 64-bit phone/tablet is equivalent to ~0.75 GB on a 32-bit device. Meaning that you see browser tabs and apps reloading sooner.

Estimates are that going to 64 bit can bring a 15% increase in performance by itself. There is a multiplication factor to that when you include the other improvements made.

But for a number of mathematical calculations, such as those used in encryption, estimates are that there could easily be a 10x improvement. The touch sensor would benefit. Perhaps that's why it's so fast.

Yes and no. Yes in terms of testing the demand with low-end 64-bit chips to allow for easier and faster adoption later, then hitting the market with high-end fast chips once it mushrooms. Secondly, to put a stake in the ground that Qualcomm has a production 64-bit ARM chip, showing it can do amazing 64-bit chips later when demand shoots up in the market. Qualcomm has been able to move fast, so that is a real advantage over competitors like Nvidia, who announces while OEMs wait and wait. Then waited some more...

I think that in first approximation you're right: using a 64-bit CPU to run 32-bit code is substantially useless. It will go a bit faster in some cases and consume a bit more, all things being equal. I think though that the point of a 64-bit CPU is mid/long term preparation. It is inevitable that we will see mobile systems with 4 GB or more of memory: already the tablets, especially 10" tablets, could (should?) be used to multitask and to run some relatively intense computational programs. I think what Qualcomm is trying to avoid is what happened in the PC market, where the need for 4GB+ of RAM was there while the vast majority of PCs still used 32-bit CPUs. For cellphones, especially the ones in the ranges targeted by the 410, I really don't see any significant benefit, other than making the architecture future-proof and having 64-bit-only devices across the board.

In terms of the iPhone 5S, the 64-bit ARM cores actually are a step up in performance, due to operating system optimisations for the hardware, as well as instruction set changes that improve performance, and larger/more registers (the obvious 64-bit benefit). I believe there was an article on Ars explaining the operating system enhancements that allow iOS 7 to take advantage of the 64-bit architecture more than a straight 64-bit port of an OS normally would.

A couple of months ago there was an interesting discussion on realworldtech.com about the benefits of 64-bit. Linus Torvalds (if I remember correctly) mentioned that 64-bit is already beneficial when physical memory exceeds 896 MB. The reason is that above 896 MB it is not possible to address all physical memory from the 32-bit kernel virtual memory space (which is limited to 1 GB). At that point managing memory becomes significantly less efficient because of the need to frequently remap virtual memory space to different chunks of physical memory. Unfortunately, I was not able to locate the thread anymore.
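The 896 MB number isn't arbitrary; it falls out of the classic 32-bit Linux address-space split (the figures below are the standard 3G/1G layout with the usual ~128 MB vmalloc reserve, not anything device-specific):

```python
# Classic 32-bit Linux virtual address-space layout:
user_space_gb = 3          # 0x00000000-0xBFFFFFFF: per-process user mappings
kernel_window_mb = 1024    # 0xC0000000-0xFFFFFFFF: the kernel's 1 GB window

# The kernel reserves roughly 128 MB of its window for vmalloc, fixmaps, etc.
vmalloc_reserve_mb = 128

# What remains is "lowmem": physical memory the kernel can map permanently.
# Anything beyond it is "highmem" and needs the temporary remapping the
# comment describes.
lowmem_mb = kernel_window_mb - vmalloc_reserve_mb
print(f"directly mappable physical memory: {lowmem_mb} MB")  # 896 MB
```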

Anand, if you would truly "personally much rather see two higher clocked A53s", and if that applies to quad core versus dual core in general, you should start using your influence through this site and with direct industry connections to put that out there. I don't recall reading anything significant talking down quad cores versus dual cores in your articles when it may have been applicable (not Apple because they don't have a quad core anyway), and if it was even mentioned it was so minor that it was easy to overlook.

I've been doing this for the past year (in fact I literally just did this last week). I am going to start campaigning for this more aggressively though. It'll take a while before we see any impact given how long it takes to see these things come to fruition though.

Why not campaign for better dynamic clock control/turbo/etc. To me, that seems like the best solution going forward. For example, back in the day of the Core 2 Duo vs Core 2 Quad, you had to make the compromise, max clock (Dual) or Max Multi Thread Perf (Quad). Nowadays with turbo mode and whatnot you can essentially have the best of both worlds. A quadcore chip that shuts down 2 cores and can run as a fast dualcore if needed.

For a lot of reasons I think it will be very difficult convincing people, OEMs, etc. that moving from 4 cores to 2 cores is a good thing, and in some ways it really IS moving backwards. (You are still making that same compromise.)

Also another part of what I (and a few others) have been campaigning for over the past year as well. Power management and opportunistic turbo is still largely a mess in mobile. Thankfully there are improvements coming along this vector.

Agreed, a longer pipeline will reduce power efficiency considerably, making it less suitable for big.LITTLE. However further frequency gains are likely, the ARM website says 2GHz is expected (no mention of process, so I guess on 16nm in 2015).

We already have 1.8GHz quad-core A15 phones today. ARM claims that A57 is actually more power efficient, so I don't see why there would be an issue with using A57 in a phone besides the fact that 64-bit seems a bit unnecessary. The 20nm process will be used next year as well, improving power and performance further.

Dual vs quad is a lost cause already, especially since we're moving towards 8 cores (4+4 in big.LITTLE). The die-size cost is low enough that the performance gain in the cases where you can use all CPUs is worth it.

Very true, I don't know the die size of a 28nm A53 core (and the L2 cache it will need), but on a modern ~60mm^2 "economy SoC" there's probably not much difference between two and four cores (<5 mm^2).

The difference between 2 and 4 Cortex-A7 cores is less than 1mm^2 on 28nm, so why go for dual core when quad core is almost free? A53 will be a bit larger than A7 of course, but the same will happen.

I don't understand why Anand is against quad cores. If you only use 1 or 2 cores, then having a quad core doesn't cost you anything (neither in power nor in frequency). If you can use more than 2 cores, you get the benefit of having a quad core. So you never lose out.

The problem is China. If you want to eventually sell your phone there it needs to be quad core, for some reason. The marketing push for quad core chips was too effective and the market was too big, at least if Qualcomm wants to eventually compete in Asia.

I think I read about it at Techpinions, or maybe I was listening to Ben Thompson.

Certainly it appears that current Android apps rarely use more than two (meaty) threads, so it is common to see two cores out of four sleeping most of the time in quad-core SoCs.

It would be nice to see some form of turbo in these quad-cores. E.g., 4x A53 @ 1.2 GHz, 2x A53 @ 1.4 GHz, 1x A53 @ 1.6 GHz. It would certainly help in those single threaded JavaScript benchmarks that everyone uses to test single threaded performance, as if that's even meaningful unless the systems are running the same OS and browser.

Just because Anand is calling it out doesn't mean that Qualcomm disagrees. The market for this processor (the Asian market) demands more cores because the OEMs want to market as many cores as possible. It doesn't matter if they are slow or not (I think I saw someone wants to make an 8 core Cortex-A7 chip), that's what the OEMs want.

Being able to market the first 64-bit Android devices will be a huge selling point even though the devices won't have more than 1 GB of memory and the users will never use encryption, because that's how the market works over there. Nobody realizes that Intel makes dual core parts that are several times faster than these.

Yeah, Asians are weird. I recently read that many Koreans believe you die if you run a fan in a closed room. This includes people like physics students. WTF? Back on topic: quad cores make no sense in phones. I admit, mine has a quad too. For once I have to applaud Apple (I actually "hate" Apple). Their A6 and A7 are very well thought-out designs. However, that's not enough to charge 2-3 times the price of comparable phones.

Why would you go backwards unless there is a clear sign of disadvantages? Would you say the same about memory, storage, or screen resolution? It makes no sense when these days web browsers can make use of as many cores as available. (and they have been doing so for some time)

Try scrolling the HOME PAGES on iPad mini (2013) with a few Safari tabs open. Tell me what it looks and feels like. I thought I was dealing with Touchwiz.

Web page rendering isn't multi-threaded at all, not even in desktop browsers. I would say most stuff on smartphones isn't. But that does not matter. Even on a desktop, 2 fast cores will always be better than 4 slow ones. However, with Intel's turbo in quads that is not an issue anymore, but it still is in phones (read: ARM).

Makes no sense to want higher clocks for the A53; we would be better off with 2+2 A57 and A53, or even 1+2. Would be nice to see some very thin devices, below 5 mm, with 4x A53 clocked low. Of course this one is on 28nm, so we can hope for more soon when 20nm parts show up.

I'm fine with Qualcomm adopting this, as I thought they would since they were already using the A7 in the S400, but it is rather embarrassing for Qualcomm.

I mean how the hell is Qualcomm's first ARMv8 chip an ARM one?! I thought the whole point of licensing the architecture and building the core yourself was that you got to release it EARLIER than ARM themselves. I expected Nvidia to not release an ARMv8 chip in 2014, but Qualcomm seems to have handled their transition to ARMv8 just as poorly.

In the end, this is good, though. It just means Qualcomm has less of a monopoly, and perhaps for once Samsung will do something with their chips and try to steal customers away from Qualcomm by giving their chips to others. I don't know what the hell the chip business of Samsung is thinking. They had so many opportunities to push Exynos chips as competitors to Qualcomm's chips, especially since Apple wants to give up on having them make their chips, and they never took advantage of them. They even have their own fab. Heck, they aren't even using them for their own chips, let alone selling them to others. If Samsung will be the first with a 64-bit chip in 2014, this would be a good opportunity for them to start doing that.

I think that Qualcomm had a good reason to develop its own core with Krait (Cortex A15 having problematic power efficiency), but now with the A53/A57, I wouldn't be surprised if we never actually see a successor to the Krait core.

And using stock ARM won't make them any less competitive. Qualcomm is the market leader in whole-SoC design, and that has always been their greatest strength. The CPU architecture is just a small part of their puzzle.

You're expecting too much. Had Apple not released their 64 bit A7 then neither NVIDIA nor Qualcomm (or Samsung) would have been 'late'.

You also overestimate Samsung. Samsung's initial chip designs were via Intrinsity, which Apple bought out from under them. Without that they have to rely on ARM and their own in-house talent, which has demonstrably been experiencing growing pains as they try to DIY. Likewise, Apple has never 'given up on them', though they may be shopping around. Apple still purchases the bulk of their SoCs from Samsung, which deprives Samsung of the capacity necessary to fulfill their own orders, much less anyone else's. In that light, this is why Samsung has used and continues to use third party chips like Snapdragon and Tegra... and third party fabs like TSMC!

So relax; it doesn't matter who is second with a 64 bit chip, it only matters that you have sufficient competition to get a phone/tablet you like.

It does matter who is #2, because it influences the Android market, and thus competition. Also, both Qualcomm and Samsung are late. That's just a fact.

Krysto is right that Qualcomm missed the 64-bit train, which is why they demoted their CMO when he started to trash Apple's A7. If they are forced to use ARM designs for their first 64-bit chip, it is essentially an admission of failure to predict where the market would be. (Whether you actually need 64-bit SoCs in phones in 2014 is up for debate; I think it's highly debatable.)

But the fact is that the market has moved here in preparation for future phones and tablets, and Qualcomm is late. Simple as that. But it's not going to be very serious. If they can get their 64-bit Krait out before the year is over, or very early next year, not much is lost. But still: embarrassing to be caught so flatfooted, and kudos to Apple for driving the ecosystem (once again).

Maybe it's my Google skills, but I can't find anything resembling a Qualcomm or Samsung roadmap talking about 64-bit dates prior to the A7 announcement, and as you know Qualcomm's CMO immediately wrote 64-bit off as a "gimmick", which does imply a thing or two about Qualcomm's priorities at the time of the announcement.

The 410 is simply a low-end reference design, if I'm not mistaken, which they can get to market much faster than a 64-bit Krait, and there's no Qualcomm-published roadmap with a 64-bit Krait in it as far as I'm aware.

That's the irony, isn't it? Apple beat ARM to market with a custom design, while Qualcomm is relying on a reference design to be first to market. Yes, Qualcomm never published a roadmap, but ARM did. I just assumed that everyone would release their parts after ARM did.

I am sure Qualcomm has a 64-bit Krait design, but it might be late, hence the use of the A53 reference design as the way to "beat the 2013" time frame and show the market it has some 64-bit cred. This buys time for a real Krait 64 sometime in Q2 or next year, when things will begin to heat up as the other ARM players show their iterations of the reference A57 designs. Here is where Krait can still work its magic to capture the bulk of the market the way it did with the S600.

There is no pressure in the Android world to switch to 64-bit earlier (Android would still run using the ARMv7 ISA). Apple, on the other hand, does have advantages in switching to 64-bit on its own terms, in addition to the speedups. For instance, I expect Apple's 64-bit transition for their ARM-powered devices to be very smooth sailing: by the time their devices hit the 4 GB barrier, the whole iOS ecosystem will have moved to 64-bit long ago.

In the Android world, having a "64-bit CPU" is just a marketing checkbox at this point. However, the advantages of the A53 (low power consumption, higher performance) remain even if it is run in 32-bit mode.

Well thank goodness that everyone that "thinks" they know what's going on said that 64-bit on a phone is a waste right??? Last thing we want is for the fandroid community to once-again be proven wrong right?

"I'm really excited to see what ARM's Cortex A53 can do." Then we have a marketing table from Qualcomm with projected numbers presented as fact. I'll be holding my breath all next year in anticipation.

If the Cortex A53 is 43 percent faster than the Cortex A7 at the same clock speed, then Qualcomm's Snapdragon 410 at 1.4 GHz should be about 65 percent faster than the Snapdragon 400. So a Moto G2 could be that much faster than the original. Not a bad improvement in 12 months' time.

The GPU, however, seems a little disappointing. I'm not sure how much faster it is compared to the one in the Snapdragon 400, but overall it seems it should be less than Adreno 225/Adreno 305 level? I hope I'm wrong. I'd expect to see Adreno 320-level performance in such devices in late 2014/early 2015.
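For what it's worth, the arithmetic behind that estimate can be written out (the 43% per-clock figure is the one quoted above; the 1.2 GHz clock for the Snapdragon 400's Cortex A7 cores, as in the original Moto G, is an assumption):

```python
# Per-clock uplift as quoted above; clock speeds are assumptions.
ipc_gain = 1.43    # A53 vs A7 at the same clock, per the comment above
clock_400 = 1.2    # GHz, assumed Snapdragon 400 A7 clock (e.g. Moto G)
clock_410 = 1.4    # GHz, Snapdragon 410 as announced

total = ipc_gain * clock_410 / clock_400
print(f"estimated speedup: {total:.2f}x")  # ~1.67x, in line with the ~65% estimate
```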