The most challenging part of last year's iPhone 5s review was piecing together details about Apple's A7 without any internal Apple assistance. I had less than a week to turn the review around and limited access to tools (much less time to develop them on my own) to figure out what Apple had done to double CPU performance without scaling frequency. The end result was an (incorrect) assumption that Apple had simply evolved its first ARMv7 architecture (codename: Swift). Based on the limited information I had at the time, I assumed Apple had simply addressed some low-hanging fruit (e.g. memory access latency) in building Cyclone, its first 64-bit ARMv8 core. By the time the iPad Air review rolled around, I had a better idea of what was under the hood:

As far as I can tell, peak issue width of Cyclone is 6 instructions. That’s at least 2x the width of Swift and Krait, and at best more than 3x the width depending on instruction mix. Limitations on co-issuing FP and integer math have also been lifted as you can run up to four integer adds and two FP adds in parallel. You can also perform up to two loads or stores per clock.

With Swift, I had the luxury of Apple committing LLVM changes that not only gave me the code name but also confirmed the size of the machine (3-wide OoO core, 2 ALUs, 1 load/store unit). With Cyclone, however, Apple held off on any public commits. Figuring out the codename and its architecture required a lot of digging.

Last week, the same reader who pointed me at the Swift details let me know that Apple revealed Cyclone microarchitectural details in LLVM commits made a few days ago (thanks again R!). Although I empirically verified many of Cyclone's features in advance of the iPad Air review last year, today we have some more concrete information on what Apple's first 64-bit ARMv8 architecture looks like.

Note that everything below is based on Apple's LLVM commits (and confirmed by my own testing where possible).

Apple Custom CPU Core Comparison

|                           | Apple A6             | Apple A7              |
|---------------------------|----------------------|-----------------------|
| CPU Codename              | Swift                | Cyclone               |
| ARM ISA                   | ARMv7-A (32-bit)     | ARMv8-A (32/64-bit)   |
| Issue Width               | 3 micro-ops          | 6 micro-ops           |
| Reorder Buffer Size       | 45 micro-ops         | 192 micro-ops         |
| Branch Mispredict Penalty | 14 cycles            | 16 cycles (14 - 19)   |
| Integer ALUs              | 2                    | 4                     |
| Load/Store Units          | 1                    | 2                     |
| Load Latency              | 3 cycles             | 4 cycles              |
| Branch Units              | 1                    | 2                     |
| Indirect Branch Units     | 0                    | 1                     |
| FP/NEON ALUs              | ?                    | 3                     |
| L1 Cache                  | 32KB I$ + 32KB D$    | 64KB I$ + 64KB D$     |
| L2 Cache                  | 1MB                  | 1MB                   |
| L3 Cache                  | -                    | 4MB                   |

As I mentioned in the iPad Air review, Cyclone is a wide machine. It can decode, issue, execute and retire up to 6 instructions/micro-ops per clock. I verified this during my iPad Air review by executing four integer adds and two FP adds in parallel. The same test on Swift actually yields fewer than 3 concurrent operations, likely because of an inability to issue to all integer and FP pipes in parallel. Similar limits exist with Krait.

I also noted an increase in overall machine size in my initial tinkering with Cyclone. Apple's LLVM commits indicate a massive 192 entry reorder buffer (coincidentally the same size as Haswell's ROB). Mispredict penalty goes up slightly compared to Swift, but Apple does present a range of values (14 - 19 cycles). This also happens to be the same range as Sandy Bridge and later Intel Core architectures (including Haswell). Given how much larger Cyclone is, a doubling of L1 cache sizes makes a lot of sense.

On the execution side Cyclone doubles the number of integer ALUs, load/store units and branch units. Cyclone also adds a unit for indirect branches and at least one more FP pipe. Cyclone can sustain three FP operations in parallel (including three FP/NEON adds). The third FP/NEON pipe handles div and sqrt operations; the machine can only execute two FP/NEON muls in parallel.

I also found references to buffer sizes for each unit, which I'm assuming are the number of micro-ops that feed each unit. I don't believe Cyclone has a unified scheduler ahead of all of its execution units and instead has statically partitioned buffers in front of each port. I've put all of this information into the crude diagram below:

Unfortunately I don't have enough data on Swift to really produce a decent comparison image. With six decoders and nine ports to execution units, Cyclone is big. As I mentioned before, it's bigger than anything else that goes in a phone. Apple didn't build a Krait/Silvermont competitor; it built something much closer to Intel's big cores. At the launch of the iPhone 5s, Apple referred to the A7 as being "desktop class" - it turns out that wasn't an exaggeration.

Cyclone is a bold move by Apple, but not one that is without its challenges. I still find that there are almost no applications on iOS that really take advantage of the CPU power underneath the hood. More than anything Apple needs first party software that really demonstrates what's possible. The challenge is that at full tilt a pair of Cyclone cores can consume quite a bit of power. So for now, Cyclone's performance is really used to exploit race to sleep and get the device into a low power state as quickly as possible. The other problem I see is that although Cyclone is incredibly forward looking, it launched in devices with only 1GB of RAM. It's very likely that you'll run into memory limits before you hit CPU performance limits if you plan on keeping your device for a long time.

The real question is: where does Apple go from here? By now we know to expect an "A8" branded Apple SoC in the iPhone 6 and iPad Air successors later this year. There's little benefit in going substantially wider than Cyclone, but there's still a ton of room to improve performance. One obvious lever is frequency scaling. Cyclone is clocked very conservatively (1.3GHz in the 5s/iPad mini with Retina Display and 1.4GHz in the iPad Air); assuming Apple moves to a 20nm process later this year, it should be able to buy some performance simply by raising clock speeds without a power penalty. I suspect Apple has more tricks up its sleeve than that, however. Swift and Cyclone were two tocks in a row by Intel's definition; a third in three years would be unusual but not impossible (Intel sort of committed to doing the same with Saltwell/Silvermont/Airmont in 2012 - 2014).

Looking at Cyclone makes one thing very clear: the rest of the players in the ultra mobile CPU space didn't aim high enough. I wonder what happens next round.

Comments

Especially because Apple insists on crippling their devices with insufficient RAM, which is something I've never understood. They're premium devices with premium prices, why cheap out on RAM? I get the feeling that many of the performance issues I suffer from on my iDevices are from lack of memory rather than lack of CPU performance.

Anyhow, because they're in that situation, they'd probably benefit from memory compression more than most. They're also in the unique situation of controlling the silicon, so they could implement a hardware memory compression engine if they wanted to...

Apple is very aggressive about power management. DRAM burns power, so Apple installs as little as they think they can get away with. We might quibble with where they've ended up drawing that line, but the principle of it is sound.

Regarding memory compression à la Mavericks, I'm not sure that would be very effective under iOS. iOS will suspend inactive apps if freeing up some RAM becomes necessary. When an app is suspended under iOS its RAM contents are written to 'disk.' Since that's actually quite fast Flash storage, from which there's not much practical penalty in re-fetching those contents, any gain from memory compression instead would seem to be negligible.

The issue is not so much performance as power. Which burns less power - compressing a page or reloading it? (Writes to flash are very expensive in power, but iOS doesn't swap out to flash, it only swaps in, so that's less of an issue.) I could well believe that on current HW, Apple has done the measurements and concluded that compressed RAM is less power efficient overall; but that with custom HW added to the SOC this changes.

Obviously, more RAM is better, but how exactly do you come to the conclusion that Apple is crippling their devices with insufficient RAM? What job is running slow because of this? What type of program has not been brought to market due to this type of limitation?

Comparing RAM requirements from different platforms like Android, etc. is missing the point. Apple doesn't have the overhead of a Dalvik virtual machine or the need for just-in-time compilers, etc. It doesn't need anti-malware services running, etc. The point being, just because 1GB doesn't cut it on Android doesn't mean it's a problem for iOS.

As others have mentioned, more memory also consumes more power. It's very clear from Apple's designs that they are trying to optimize power consumption for their mobile devices. That's why you can get that much power in such a small and lightweight device. The alternative is a much larger battery that makes for a heavier device and likely requires a larger screen as well.

iOS doesn't use a swapfile, so there are no RAM-related slowdowns. iOS will kill apps before getting into an OOM situation. (The OOM "errors" you see are iOS logging apps that it has killed to free up space.)

Guspaz, your post demonstrates a fundamental lack of understanding of how memory management works in iOS. No, iOS doesn't suffer slowdowns from memory-hungry apps. iOS doesn't page app memory out to disk, so there is no swapping, etc. If it doesn't have enough memory, it kills other programs that may have been in memory to make room for the app you are trying to use. All iOS apps are designed to handle this gracefully. The point being, the OS was designed to be very responsive in low-memory conditions. Further, when you switch back to an app that was killed, the state was saved, so the end user never even sees the difference.