The Move to 64-bit

Prior to the iPhone 5s launch, I heard a rumor that Apple would move to a 64-bit architecture with its A7 SoC. I initially discounted the rumor given the pain of moving to 64-bit from a validation standpoint and the upside not being worth it. Obviously, I was wrong.

In the PC world, most users are familiar with the 64-bit transition as something AMD started in the mid-2000s. The primary motivation back then was to enable greater memory addressability by moving from 32-bit addresses (2^32 or 4GB) to 64-bit addresses (2^64 or 16EB). Supporting up to 16 exabytes of memory from the get go seemed a little unnecessary, so AMD’s x86-64 ISA only uses 48-bits for unique memory addresses (256TB of memory). Along with the move from x86 to x86-64 came some small performance enhancements thanks to more available general purpose registers in 64-bit mode.

In the ARM world, the move to 64-bit is motivated primarily by the same factor: a desire for more memory. Remember that ARM and its partners have high hopes of eating into Intel’s high margin server business, and you really can’t play there without 64-bit support. ARM has already announced its first two 64-bit architectures: the Cortex A57 and Cortex A53. The ISA itself is referred to as ARMv8, a logical successor to the present day 32-bit ARMv7.

Unlike the 64-bit x86 transition, ARM’s move to 64-bit comes with a new ISA rather than an extension of the old one. The new instruction set is referred to as A64, while a largely backwards compatible 32-bit format is called A32. Both ISAs can be supported by a single microprocessor design, as ARMv8 features two architectural states: AArch32 and AArch64. Designs that implement both states can switch/interleave between the two states on exception boundaries. In other words, despite A64 being a new ISA you’ll still be able to run old code alongside it. As always, in order to support both you need an OS with support for A64. You can’t run A64 code on an A32 OS. It is also possible to do an A64/AArch64-only design, which is something some server players are considering where backwards compatibility isn’t such a big deal.

Cyclone is a full implementation of ARMv8 with both AArch32 and AArch64 states. Given Apple’s desire to maintain backwards compatibility with existing iOS apps and not unnecessarily fragment the ARM ecosystem, simply embracing ARMv8 makes a lot of sense.

The motivation for Apple to go 64-bit isn’t necessarily one of needing more address space immediately. A look at Apple’s historical scaling of memory capacity tells us everything we need to know:

At best Apple doubled memory capacity between generations, and at worst it took two generations before doubling. The iPhone 5s ships with 1GB of LPDDR3, keeping memory capacity the same as the iPhone 5, iPad 3 and iPad 4. It’s pretty safe to assume that Apple will go to 2GB with the iPhone 6 (and perhaps iPad 5), and then either stay there for the 6s or double again to 4GB. The soonest Apple would need 64-bit from a memory addressability standpoint in an iOS device would be 2015, and the latest would be 2016. Moving to 64-bit now preempts Apple’s hardware needs by 2 full years.

The more I think about it, the more the timing actually makes a lot of sense. The latest Xcode beta and LLVM compiler are both ARMv8 aware. Presumably all apps built starting with the official iOS 7 release and going forward could be built 64-bit aware. By the time 2015/2016 rolls around and Apple starts bumping into 32-bit addressability concerns, not only will it have navigated the OS transition but a huge number of apps will already be built for 64-bit. Apple tends to do well with these sorts of transitions, so starting early like this isn’t unusual. The rest of the ARM ecosystem is expected to begin moving to ARMv8 next year.

Apple isn’t very focused on delivering a larger memory address space today however. As A64 is a brand new ISA, there are other benefits that come along with the move. Similar to the x86-64 transition, the move to A64 comes with an increase in the number of general purpose registers. ARMv7 had 15 general purpose registers (and 1 register for the program counter), while ARMv8/A64 now has 31 that are each 64-bits wide. All 31 registers are accessible at all times. Increasing the number of architectural registers decreases register pressure and can directly impact performance. The doubling of the register space with x86-64 was responsible for up to a 10% increase in performance.

The original ARM architecture made all instructions conditional, which had a huge impact on the instruction space. The number of conditional instructions is far more limited in ARMv8/A64.

The move to ARMv8 also doubles the number of FP/NEON registers (from 16 to 32) as well as widens all of them registers to 128-bits (up from 64-bits). Support for 128-bit registers can go a long way in improving SIMD performance. Whereas simply doubling register count can provide moderate increases in performance, doubling the size of each register can be far more significant given the right workload. There are also new advanced SIMD instructions that are a part of ARMv8. Double precision SIMD FP math is now supported among other things.

ARMv8 also adds some new cryptographic instructions for hardware acceleration of AES and SHA1/SHA256 algorithms. These hardware AES/SHA instructions have the potential for huge increases in performance, just like we saw with the introduction of AES-NI on Intel CPUs a few years back. Both the new advanced SIMD instructions and AES/SHA instructions are really designed to enable a new wave of iOS apps.

Many A64 instructions mode can also work with 32-bit operands, with properly implemented designs simply power gating unused bits. The A32 implementation in ARMv8 also adds some new instructions, so it’s possible to compile AArch32 apps in ARMv8 that aren’t backwards compatible. All existing ARMv7 and 32-bit Thumb code should work just fine however.

On the software side, iOS 7 as well as all first party apps ship already compiled for AArch64 operation. In fact, at boot, there isn’t a single AArch32 process running on the iPhone 5s:

Safari, Mail, everything all made the move to 64-bit right away. Given the popularity of these first party apps, it’s not just the hardware that’s 64-bit ready but much of the software is as well. The industry often speaks about Apple’s vertically integrated advantage, this is quite possibly the best example of that advantage. In many ways it reminds me of the Retina Display transition on OS X.

Running A32 and A64 applications in parallel is seamless. On the phone itself, it’s impossible to tell when you’re running in a mixed environment or when everything you’re running is 64-bit. It all just works.

I didn’t run into any backwards compatibility issues with existing 32-bit ARMv7 apps either. From an end user perspective, navigating the 64-bit transition is as simple as buying an iPhone 5s.

64-bit Performance Gains

Geekbench 3 was among the first apps to be updated with ARMv8 support. There are some minor changes between the new version of Geekbench 3 and its predecessor (3.1/3.0), however the tests themselves (except for the memory benchmarks) haven't changed. What this allows us to do is look at the impact of the new ARMv8 A64 instructions and larger register space. We'll start with a look at integer performance:

Apple A7 - AArch64 vs. AArch32 Performance Comparison

32-bit A32

64-bit A64

% Advantage

AES

91.5 MB/s

846.2 MB/s

825%

AES MT

180.2 MB/s

1640.0 MB/s

810%

Twofish

59.9 MB/s

55.6 MB/s

-8%

Twofish MT

119.1 MB/s

110.2 MB/s

-8%

SHA1

138.0 MB/s

477.3 MB/s

245%

SHA1 MT

275.7 MB/s

948.9 MB/s

244%

SHA2

86.1 MB/s

102.2 MB/s

18%

SHA2 MT

171.3 MB/s

203.7 MB/s

18%

BZip2 Compress

4.36 MB/s

4.52 MB/s

3%

BZip2 Compress MT

8.57 MB/s

8.86 MB/s

3%

BZip2 Decompress

5.94 MB/s

7.56 MB/s

27%

BZip2 Decompress MT

11.7 MB/s

15.0 MB/s

28%

JPEG Compress

15.5 MPixels/s

16.8 MPixels/s

8%

JPEG Compress MT

30.8 MPixels/s

33.3 MPixels/s

8%

JPEG Decompress

36.0 MPixels/s

40.3 MPixels/s

11%

JPEG Decompress MT

71.3 MPixels/s

78.1 MPixels/s

9%

PNG Compress

0.84 MPixels/s

1.14 MPixels/s

35%

PNG Compress MT

1.67 MPixels/s

2.26 MPixels/s

35%

PNG Decompress

13.9 MPixels/s

15.2 MPixels/s

9%

PNG Decompress MT

27.4 MPixels/s

29.8 MPixels/s

8%

Sobel

59.3 MPixels/s

58.0 MPixels/s

-3%

Sobel MT

116.6 MPixels/s

114.6 MPixels/s

-2%

Lua

1.25 MB/s

1.33 MB/s

6%

Lua MT

2.47 MB/s

2.49 MB/s

0%

Dijkstra

5.35 MPairs/s

4.05 MPairs/s

-25%

Dijkstra MT

9.67 MPairs/s

7.26 MPairs/s

-25%

The AES and SHA1 gains are a direct result of the new cryptographic instructions that are a part of ARMv8. The AES test in particular shows nearly an order of magnitude performance improvement. This is similar to what we saw in the PC space with the introduction of Intel's AES-NI support in Westmere. The Dijkstra workload is the only real regression. That test in particular appears to be very pointer heavy, and the increase in pointer size from 32 to 64-bit increases cache pressure and causes the reduction in performance. The rest of the gains are much smaller, but still fairly significant if you take into account the fact that we're just looking at what you get from a recompile. Add these gains to the ones you're about to see over Apple's A6 SoC and A7 is looking really good from a performance standpoint.

If the integer results looked good, the FP results are even better:

Apple A7 - AArch64 vs. AArch32 Performance Comparison

32-bit A32

64-bit A64

% Advantage

BlackScholes

4.73 MNodes/s

5.92 MNodes/s

25%

BlackScholes MT

9.57 MNodes/s

12.0 MNodes/s

25%

Mandelbrot

930.2 MFLOPS

929.9 MFLOPS

0%

Mandelbrot

1840 MFLOPS

1850 MFLOPS

0%

Sharpen Filter

805.1 MFLOPS

857 MFLOPS

6%

Sharpen Filter MT

1610 MFLOPS

1710 MFLOPS

6%

Blur Filter

1.08 GFLOPS

1.26 GFLOPS

16%

Blur Filter MT

2.15 GFLOPS

2.47 GFLOPS

14%

SGEMM

3.09 GFLOPS

3.34 GFLOPS

8%

SGEMM MT

6.08 GFLOPS

6.56 GFLOPS

7%

DGEMM

0.56 GFLOPS

1.66 GFLOPS

195%

DGEMM MT

1.11 GFLOPS

3.24 GFLOPS

191%

SFFT

0.72 GFLOPS

1.59 GFLOPS

119%

SFFT MT

1.44 GFLOPS

3.17 GFLOPS

120%

DFFT

1.41 GFLOPS

1.47 GFLOPS

4%

DFFT MT

2.78 GFLOPS

2.91 GFLOPS

4%

N-Body

460.8 KPairs/s

582.6 KPairs/s

26%

N-Body MT

917.6 KPairs/s

1160.0 KPairs/s

26%

Ray Trace

1.52 MPixels/s

2.31 MPixels/s

51%

Ray Trace MT

3.04 MPixels/s

4.64 MPixels/s

52%

The DGEMM operations aren't vectorized under ARMv7, but they are under ARMv8 thanks to DP SIMD support so you get huge speedups there from the recompile. The SFFT workload benefits handsomely from the increased register space, significantly reducing the number of loads and stores (there's something like a 30% reduction in instructions for the A64 codepath compared to the A32 codepath here). The conclusion? There are definitely reasons outside of needing more memory to go 64-bit.

A7 and OS X

Before I spent time with the A7 I assumed the only reason Apple would go 64-bit in mobile is to prepare for eventually deploying these chips into larger machines. A couple of years ago, when the Apple/Intel relationship was at its rockiest I would've definitely said that's what was going on. Today, I'm far less convinced.

Apple continues to build its own SoCs and invest in them because honestly, no one else seems up to the job. Only recently do we have GPUs competitive with what Apple has been shipping, and with the A7 Apple nearly equals Intel's performance with Bay Trail on the CPU side. As far as Macs go though, there's still a big gap between the A7 and where Intel is at with Haswell. The deficiency that Intel had in the ultra mobile space simply doesn't translate to its position with the big Core chips. I don't see Apple bridging that gap anytime soon. On top of that, the Apple/Intel relationship is very good at this point.

Although Apple could conceivably keep innovating to the point where an A-series chip ends up powering a Mac, I don't think that's in the cards today.

Post Your Comment

466 Comments

If all you can do is name calling then you clearly haven't got a clue or any evidence to prove your point. Either come up with real evidence or leave the debate to the experts. Do you even understand what IPC means?

For example in your link a low clocked Jaguar is keeping up with a much higher clocked Bay Trail (yes it boosts to 2.4GHz during the benchmark run), so the obvious conclusion is that Jaguar has far higher IPC than Bay Trail. For example Jaguar has 28% higher IPC than BT in the 7-zip test. Just like I said.

Now show me a single benchmark where BT gets better IPC than Jaguar. Put up or shut up.Reply

The point that BT Beats Jaguar, especially at performance per watt, clearly proved the point given!

And insisting as you are on your original assessment is a characteristic of acting like a Troll... So you're not going to convince anyone by simply insisting on being right... especially when we can point to Anandtech pointing out multiple benchmarks in this article that showed the Kabini performing lower than bother BT and the A7!

So either learn to read what these reviews actually post or accept getting labeled a Troll... either way, you're not winning this argument!Reply

No, Bob's claim was that Bay Trail was faster clock for clock than Jaguar, when the link he gave to prove it clearly showed that is false. BT may well beat Jaguar on perf/watt, but that's not at all what we were discussing.

So next time try to understand what people are discussing before jumping in and calling people a Troll. And yes I stand by my characterization of various microarchitectures, precisely because it's based on actual benchmark results.Reply

IPC as a comparison point made a lot of sense when we were arguing about which 130 watt desktop processor had the better architecture. It seems largely irrelevant for mobile where we care about performance per watt. Your argument is continually that the ARM/AMD designs are 'faster' based on Geekbench. If Jaguar has a 28% higher IPC than Bay Trail, do you honestly think it matters if Bay Trail is still the faster chip @ 1/3 (or less) of the power requirements? If someone came up with a crazy design that needed 5x the clocks to have a 2x performance advantage of their competitor, but did so with half the power budget, they'd still be racking up design wins (assuming parity for all other aspects like price). That's a two way street. If ARM designs a desktop/server focused chip that needs higher clocks than Intel to reach performance parity or be faster than Haswell, but does so with significantly less power it's still a huge win for them.Reply

IPC matters as you can compare different microarchitectures and make predictions on performance at different clock speeds. I'm sure you know many CPUs come in a confusing variation of clockspeeds (and even different base/turbo frequencies for Intel parts), but the underlying microarchitecture always remains the same. You can't make claims like "Bay Trail is faster than Jaguar" when such a claim would only valid at very specific frequencies. However we can say that Jaguar has better IPC than BT and that will remain true irrespectively of the frequency. So that is the purpose of the list of microarchitectures I posted.

I was originally talking about the performance of Apple A7 and Bay Trail in Geekbench. You may not like Geekbench, but it represents close to actual CPU performance (not rubbish JavaScript, tuned benchmarks, cheating - remember AnTuTu? - or unfair compiler tricks).

Now you're right that besides absolute performance, perf/W is also important. Unfortunately there is almost no detailed info on power consumption, let alone energy to do a certain task for various CPUs. While TDP (in the rare cases it is known!) can give some indication, different feature sets, methodologies, "dial-a-TDP" and turbo features makes them hard to compare. What we can say in general is that high-frequency designs tend to be less efficient and use more power than lower frequency, higher IPC designs. In that sense I would not be surprised if the A7 also shows a very good perf/Watt. How it compares with BT is not clear until BT phones appear.Reply

Your point about benchmarks is actually what surprises me the most nowadays. The biggest thing every in-depth review of a new ARM design brings to light is how freaking piss poor the state of mobile benchmarking is from a software standpoint. I didn't expect magic by the time we got to A9 designs, but it's a little ridiculous that we're still in a state of infancy for mobile benchmarking tools over half a decade after the market really started heating up.Reply

Yes, mobile benchmarking is an absolute disgrace. And that's why I'm always pointing out how screwed up Anand's benchmarking is - I'm hoping he'll understand one day. How anyone can conclude anything from JS benchmarks is a total mystery to me. Anand might as well just show AnTuTu results and be done with it, that may actually be more accurate!

Mobile benchmarks like EEMBC, CoreMark etc are far worse than the benchmarks they try to replace (eg. Dhrystone). And SPEC is useless as well. Ignoring the fact it is really a server benchmark, the main issue is that it ended up being a compiler trick contest than a fair CPU benchmark. Of course Geekbench isn't perfect either, but at the moment it's the best and fairest CPU bench: because it uses precompiled binaries you can't use compiler tricks to pretend your CPU is faster.Reply

SO.....what is it the 'crew' is supposed to 'do'? NOT provide ANY benchmarks? Anand and team are utilizing the benchmarks available right now. They're not building the software to bench these devices...they're reviewing them...with the tools available, currently, NOW---on the market. If you're so interested in better mobile benchmarking (still in it's infancy---it's only really been 5 years since we've had multiple devices to even test), why not pursue and build your own benchmarking software? Seems like it may be a lucrative project. Sounds like you know a bit about CPU/GPU and SoC architecture---put something together. Sunspider is ubiquitous, used on any and all platforms from desktops to laptops---tablets to phones, people 'get it'. As well, GeekBench is re-inventing their benchmarking software---as well, the Google Octane tests are fairly new...and many of the folks using these devices ARE interested in how fast their browser populates, how quick a single core is---speed of apps opening and launching, opening a PDF, FPS playing games, et al.Again---if you're not 'happy' with how Anand is reviewing gear (the best on the web IMHO), open your own site---build your own tools, and lets see how things turn out for ya!Give credit where credit is due....I'd much rather see the way Anand is approaching reviews in the mobile sector than a 1500 word essay without benchmarking results because current "mobile benchmarking is an absolute disgrace"YMMV as alwaysJ

Umm...I think you missed my point. I love the reviews here. That doesn't change the fact that mobile benchmarking software sucks compared to what we have available on the desktop. That isn't a slam against this site or any of the reviewers, and I fully expect them to use the (relatively crappy) software tools that are available. And they've even gone above and beyond and written some tools themselves to test specific performance aspects. I'm just surprised that with mobile being the fastest growing market, nobody has really stepped up to the plate to offer a good holistic benchmarking suite to measure cpu/gpu/memory/io performance across at least iOS/Android. And no, I don't expect anyone at Anandtech to write or pay someone to write such a tool.Reply