How do you keep increasing performance in a power constrained environment like a smartphone without decreasing battery life? You can design more efficient microarchitectures, but at some point you’ll run out of steam there. You can transition to newer, more power efficient process technologies but even then progress is very difficult to come by. In the past you could rely on either one of these options to deliver lower power consumption, but these days you have to rely on both - and even then it’s potentially not enough. Heterogeneous multiprocessing is another option available - put a bunch of high performance cores alongside some low performance but low power cores and switch between them as necessary.

NVIDIA recently revealed it was doing something similar to this with its upcoming Tegra 3 (Kal-El) SoC. NVIDIA outfitted its next-generation SoC with five CPU cores, although only a maximum of four are visible to the OS. If you’re running light tasks (background checking for email, SMS/MMS, twitter updates while your phone is locked) then a single low power Cortex A9 core services those needs while the higher performance A9s remain power gated. Request more of the OS (e.g. unlock your phone and load a webpage) and the low power A9 goes to sleep and the 4 high performance cores wake up.

While NVIDIA’s solution uses identical cores simply built using different transistors (LP vs. G), the premise doesn’t change if you move to physically different cores. For NVIDIA, ARM didn’t really have a suitable low power core thus it settled on a lower power Cortex A9. Today, ARM is expanding the Cortex family to include a low power core that can either be used by itself or as an ISA-compatible companion core in Cortex A15 based SoCs. It’s called the ARM Cortex A7.

Architecture

Starting with the Cortex A9, ARM moved to an out-of-order execution core (instructions can be reordered around dependencies for improved parallelism) - a transition that we saw in the x86 space back in the days of the Pentium Pro. The Cortex A15 continues the trend as an OoO core but increases the width of the machine. The Cortex A7 however takes a step back and is another simple in-order core capable of issuing up to two instructions in parallel. This should sound a lot like the Cortex A8, however the A7 is different in a number of areas.

The A8 is a very old design with work originally beginning on the core in 2003. Although ARM offered easily synthesizable versions of the core, in order to hit higher clock speeds you needed to include a lot of custom logic. The custom design requirements on A8 not only lengthened time to market but also increased development costs, limiting the A8’s overall reach. The Cortex A7 on the other hand would have to be fully synthesizable while being able to deliver good performance. ARM could leverage process technology advancements over the past few years to deliver clock speed and competitive power consumption, but it needed a revised architecture to meet the cost and time to market requirements.

The Cortex A7 features an 8-stage integer pipeline and is capable of dual-issue. Unlike the Cortex A8 however, the A7 cannot dual-issue floating point or NEON instructions. There are other instructions that turn the A7 into a single-issue machine as well. The integer execution cluster is quite similar to the Cortex A8, although the FPU is fully pipelined and more compact than its older brother.

Limiting issue width for more complex instructions helps keep die size in check, which was a definite goal for the core. ARM claims a single Cortex A7 core will measure only 0.5mm2 on a 28nm process. On an equivalent process node ARM expects customers will be able to implement an A7 in 1/3 - 1/2 the die area of a Cortex A8. As a reference, an A9 core uses about the same (if not a little less) die area as an A8 while an A15 is a bit bigger than both.

Architecture Comparison

ARM11

ARM Cortex A7

ARM Cortex A8

ARM Cortex A9

Qualcomm Scorpion

Qualcomm Krait

Decode

single-issue

partial dual-issue

2-wide

2-wide

2-wide

3-wide

Pipeline Depth

8 stages

8 stages

13 stages

8 stages

10 stages

11 stages

Out of Order Execution

N

N

N

Y

Partial

Y

Pipelined FPU

Y

Y

N

Y

Y

Y

NEON

N/A

Y (64-bit wide)

Y (64-bit wide)

Optional MPE (64-bit wide)

Y (128-bit wide)

Y (128-bit wide)

Process Technology

90nm

40nm/28m

65nm/45nm

40nm

40nm

28nm

Typical Clock Speeds

412MHz

1.5GHz (28nm)

600MHz/1GHz

1.2GHz

1GHz

1.5GHz

Despite the limited dual issue capabilities, ARM is hoping for better performance per clock and better overall performance out of the Cortex A7 compared to the Cortex A8. Branch prediction performance is improved partly by using a more modern predictor, and partly because the shallower pipeline lessens the mispredict penalty. The Cortex A7 features better prefetching algorithms to help improve efficiency. ARM also includes a very low latency L2 cache (10 cycles) with its Cortex A7 design, although actual latency can be configured by the partner during implementation.

Note that in decoding bound scenarios, the Cortex A7 will offer the same if not lower performance than a Cortex A8 due to its limited dual-issue capabilities. The mildly useful DMIPS/MHz ratings of ARM’s various cores are below:

Estimated Core Performance

ARM11

ARM Cortex A7

ARM Cortex A8

ARM Cortex A9

Qualcomm Scorpion

Qualcomm Krait

DMIPS/MHz

1.25

1.9

2.0

2.5

2.1

3.3

The big news is the Cortex A7 is 100% ISA compatible with the Cortex A15, this includes the new virtualization instructions, integer divide support and 40-bit memory addressing. Any code running on an A15 can run on a Cortex A7, just slower. This is a very important feature as it enables SoC vendors to build chips with both Cortex A7 and Cortex A15 cores, switching between them depending on workload requirements. ARM calls this a big.LITTLE configuration.

They won't be able to implement a small, energy-efficient processor similar to Kingfisher (Cortex A7). While at performance/complexity levels of A15, the x86 decode penalty is relatively small, as you get down to ~100mW levels at the die area we're talking about, there simply isn't an x86 core out there that is feasible.Reply

Intel already has the CE4100, a SoC designed for TVs. It just isn't a main point they are focusing on right now, as they are competing against IBM RISC (some 200W per CPU is possible on those :eek:), while having to scale all the way down. Intel even has to focus on IGP, wireless, ethernet, etc. Intel is doing the best they can against an overwheling slew of competitors at all angles in all directions. However, I do not have doubts in their ability to compete when why need to. Right now, they have some big fish to fry (IBM), and it's not like Intel hasn't made ARM chips before (XScale).

Not to mention... they still hold their traditional fab advantage over everyone else.

But do I think Intel may lose out if they don't start making some serious pushes into the smartphone SoC market? Yes, I do. However, Google/Intel have already anounced all future versions of Android will be officiall support x86, in addition to ARM. This works for existing Android apps, too, due to the JIT nature of Android apps (Dalvik).Reply

"Intel is doing the best they can against an overwheling slew of competitors at all angles in all directions."

Actually Intel has made the deliberate decision that FULL x86 compatibility trumps everything else. This saddles Atom with a huge amount of basically useless baggage that ARM is not carrying. This baggage takes up space, uses power, and, most importantly, makes it that much more difficult to execute and validate a chip fast.

This is not "doing the best they can". It is a stupid, short-sighted decision made by execs who have drunk their own koolaid and can't imagine that some decisions made by Intel in the 1980s may not be optimal for today. Atom SHOULD have been a stripped down x86-64 CPU with EVERYTHING not essential to that mission jettisoned. One (modern) way of doing everything and the rest --- 286 mode, V x86 mode, PAE, SMM, x87, MMX etc etc tossed overboard. But that ship has sailed. Intel have made their bed, and now they will have to lie in it --- the same mindset that sank iTanic will sink Atom in its target market.Reply

x86 compatibility is not a significant burden on modern chips; transistor density is increasing far faster than the number of x86 instructions is, and Intel's chips have been effectively RISC since the Pentium Pro, when they started translating x86 into micro-ops internally. CISC on the outside, RISC on the inside.

In 2008, when Anand wrote his Atom architecture article (http://www.anandtech.com/show/2493/3), he pointed out AMD had told him that x86 decoding consumed only 10% of the transistor count of the K8, and current transistor densities are ~35x higher than they were then (130nm -> 22nm).

By that math, that means that x86 decoding consumes only about 0.3% of a modern desktop processor, almost inconsequential.Reply

Allow me to explain how English works to you. You see this statement "and, most importantly, makes it that much more difficult to execute and validate a chip fast."?That means if you wish to refute my point, you have to attack the claim that I said was MOST IMPORTANT.

Do you deny that Intel has an insanely complex task designing and validating their chips, vastly more so than ARM? Do you deny that the major reason for that complexity is all the x86 baggage?Reply

Like i wrote below, Atom and SandyBridge are different thing. Atom does not support SSE 4, Some Do not support Hyper Threading, Some dont have Intel 64, and it also lacks the new SandyBridge AVX.

I am not expert, but since Intel Atom and SB will have different Micro-Op Cache, Unless you write your software on the lowest common denominator, which is the Atom. You cant have the software works the same way as ARM has currently show with A7 and A9.

Yes, with some software tricks and profile i suppose the problem isn't Very hard to tackle. But in terms of software development ARM should be much easier. Reply