Cache and Memory Controller Comparison

Now that you know what parts to compare, let's drill a little deeper. Since cache is a major element separating Phenom II from its predecessor, let's start there.

Phenom II, like its predecessor, maintains a 3-cycle 64KB L1 cache. With Nehalem, Intel had to move to a 4-cycle cache, so Phenom II retains the hit rate and performance benefits of a larger, faster L1. The L2 cache latency is where Phenom II and Intel’s architectures really differ.

Phenom II, like the original, has a 512KB L2 cache per core, but the cache is a high latency 15 cycle cache. Compared to the Athlon X2’s 20 cycle L2, Phenom II looks pretty good, but now look at Penryn. Penryn’s 15 cycle L2 is the same speed as Phenom’s L2, but it’s 2-6x larger. Core i7 trumps them all with a very fast 11 cycle L2, although it achieves this by having the smallest L2 cache per core out of the bunch - only 256KB in size.

AMD asserts that Phenom II’s L3 cache is now 2-cycles faster than Phenom’s L3. At 3x the size but with improved access time, Phenom II’s L3 is closer to where it should have been in the first place. Everest measures Phenom II’s L3 as having a 55-cycle latency, while Core i7 has a 35 cycle L3. Sandra puts Core i7 and the original Phenom at 55 cycles, but Phenom II at 71 cycles. I checked with Intel and AMD, and it appears neither application is reporting the correct L3 access latencies for either processor. Intel confirmed Core i7’s L3 as a 42 cycle L3 and I’m still waiting to hear back from AMD on the time to access its cache, but I suspect it will be around 50 cycles.

Processor                            L1 Latency   L2 Latency   L3 Latency
AMD Phenom II X4 920 (2.80GHz)       3 cycles     15 cycles    AMD won't tell me
AMD Phenom @ 2.8GHz                  3 cycles     15 cycles    AMD won't tell me
Athlon X2 5400 (2.80GHz)             3 cycles     20 cycles    -
Intel Core 2 Quad QX9770 (3.2GHz)    3 cycles     15 cycles    -
Intel Core 2 Quad Q9400 (2.66GHz)    3 cycles     15 cycles    -
Intel Core i7-965 (3.2GHz)           4 cycles     11 cycles    42 cycles

Main memory access time is more telling. A trip down memory lane will cost you 107 ns on an original Phenom processor, 100 ns on an Athlon X2, and now only 95 ns on a Phenom II. The 11% improvement in memory access performance is due to improvements AMD made when it redesigned the memory controller to include support for DDR3.
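For anyone who wants to check the arithmetic, here is a minimal Python sketch. The helper names (`cycles_to_ns`, `improvement_pct`) are made up for illustration; the latencies and clock speeds are the ones quoted in this section, and the cycle-to-time conversion assumes the cache runs at the listed core clock.

```python
def cycles_to_ns(cycles, clock_ghz):
    """Convert a latency in clock cycles to nanoseconds (1 GHz = 1 cycle per ns)."""
    return cycles / clock_ghz


def improvement_pct(old_ns, new_ns):
    """Percentage reduction in latency going from old_ns to new_ns."""
    return (old_ns - new_ns) / old_ns * 100.0


# Main memory access times quoted in the text, in nanoseconds.
phenom_ns = 107.0
phenom_ii_ns = 95.0

# Cache latencies quoted in the text, converted to wall-clock time.
# Assumption: each cache is clocked at the core frequency listed.
phenom_ii_l2_ns = cycles_to_ns(15, 2.8)  # Phenom II X4 920 L2
core_i7_l2_ns = cycles_to_ns(11, 3.2)    # Core i7-965 L2
core_i7_l3_ns = cycles_to_ns(42, 3.2)    # Core i7-965 L3 (Intel's figure)

print(f"Memory latency, Phenom -> Phenom II: "
      f"{improvement_pct(phenom_ns, phenom_ii_ns):.1f}% lower")
print(f"Phenom II L2: {phenom_ii_l2_ns:.2f} ns")
print(f"Core i7 L2:   {core_i7_l2_ns:.2f} ns")
print(f"Core i7 L3:   {core_i7_l3_ns:.2f} ns")
```

The point of converting to nanoseconds is that cycle counts alone hide clock speed differences: the same number of cycles is a shorter wait on a higher-clocked part.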

L2: It’s the New L1

I think I finally get it. When Nehalem launched I spoke with lead architect Ronak Singhal at great length about its L2 cache being too small. I even made this graph to illustrate my point:

With only 256KB per core, Core i7’s L2 cache was a large step back. Ronak argued that its 11-cycle load latency was more important than size. But it took Phenom II for me to understand why.

The original Phenom suffered because not only did it have very little L2 cache per core (512KB compared to as much as 6MB with Penryn), but it also had a very small L3 cache. Four cores sharing a 2MB L3 cache just wasn’t enough. The problem is AMD was die constrained; Phenom needed more L3 cache but AMD needed to keep the die size manageable to avoid bankruptcy. Architecturally, Phenom was ahead of its time.

If we were to live in the dual-core era forever, Intel had the right design - two cores could easily sit behind one large shared L2 cache. Move to four cores and the shared L2 design stops making sense. In some situations you’ll have cores operating on independent threads with no spatial locality, and for these scenarios each core will need its own L2 cache. In other scenarios you’ll have multiple cores working on the same data, in which case you’ll need a large cache shared by all cores. Again, Phenom was the right quad-core design, it just didn’t have enough cache (not to mention its other shortcomings).

In a way, Intel recognized that Conroe and Penryn were designed to win the dual-core race - over the life of both CPUs less than 5% of its desktop shipments were quad-core chips. Intel’s last tick and tock dominated the dual-core market. Nehalem and Westmere on the other hand are more interested in winning the multi-core races.

Phenom II addresses the cache deficiency. At 6MB, its L3 cache is nearly the same size as Core i7’s. The L2 caches remain larger at 512KB per core, but I suspect that’s because AMD didn’t have the time/resources to redesign its cores for Phenom II. It takes 15 cycles to access AMD’s 512KB L2; that’s the same amount of time it takes to access Penryn’s 2x6MB L2. I’ll gladly wait 15 cycles if I have the hit rate of a 6MB cache, but not for a 512KB cache. AMD too will pursue a faster L2; that will most likely come in 2011 with Bulldozer (Orochi and Llano CPUs).

With a very large L3 cache, it no longer makes sense to have a large L2. Instead the L2 needs to be as fast as possible, acting as spillover from L1. Look at what happened to L1 cache sizes as CPUs got wider and faster. The L1 cache grew from 1KB to 8KB, then 16KB, and eventually up to 32KB and 64KB in today’s designs. However, L1 sizes haven’t increased beyond that point; instead we saw L2 caches grow and grow. Eventually they too hit a stopping point; for AMD that was Phenom, and for Intel that was Core i7.

With the number of cores growing, we need a large cache shared between all of the cores. Imagine a 12-core processor; would it have a massive 36MB shared L2 cache? Definitely not. It’d be too slow for starters, and the penalty for not finding something in L1 would be tremendous. Remember the point of the memory hierarchy: to hide latency between the software and the processor. A pyramid doesn’t work if the base fattens out too quickly. In the future, as we move to four, eight and more cores, L2 caches will have to be motherly figures to a core’s L1, feeding them individually, rather than a mess hall to feed everyone. That role will fall to the L3 cache. Carrying that further, we may even see future CPUs with more cores add a fourth level of cache.

With the role of the L2 cache redefined from service-all to service-one, it makes sense for it to be small and fast. The original Phenom had the right idea; it just needed a larger L3. Core i7 perfected that idea, and Phenom II took a step towards that. Cache sizes must continue to grow, but as they do, the number of levels of cache must increase as well to avoid a single, large penalty being paid as you go from one level of cache to the next.
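To put rough numbers on the small-and-fast argument, here is a back-of-the-envelope Python sketch. It asks how many percentage points of L2 miss rate a slower 15-cycle L2 would have to save to break even with a faster 11-cycle L2, with and without an L3 behind it. The function names and the 30% L3 miss rate are assumptions for illustration only, not measurements; the 11, 15 and 42 cycle latencies come from this article, and 266 cycles is simply 95 ns at 2.8GHz.

```python
def l2_miss_penalty(l3_latency, l3_miss_rate, memory_latency):
    """Average cycles an L2 miss costs when an L3 sits in front of memory."""
    return l3_latency + l3_miss_rate * memory_latency


def break_even_miss_delta(fast_l2_latency, slow_l2_latency, miss_penalty):
    """L2 miss-rate reduction the slower L2 needs just to match the faster one.

    Average L2 access time is latency + miss_rate * miss_penalty, so the two
    designs tie when the miss-rate gap equals the latency gap / miss_penalty.
    """
    return (slow_l2_latency - fast_l2_latency) / miss_penalty


MEMORY_CYCLES = 266.0        # 95 ns at 2.8GHz
L3_CYCLES = 42.0             # Intel's Core i7 figure, used here as a stand-in
ASSUMED_L3_MISS_RATE = 0.30  # hypothetical, chosen only for illustration

penalty_with_l3 = l2_miss_penalty(L3_CYCLES, ASSUMED_L3_MISS_RATE, MEMORY_CYCLES)
penalty_without_l3 = MEMORY_CYCLES  # every L2 miss goes straight to memory

with_l3 = break_even_miss_delta(11, 15, penalty_with_l3)
without_l3 = break_even_miss_delta(11, 15, penalty_without_l3)

print(f"Break-even miss-rate gap with L3:    {with_l3 * 100:.1f} points")
print(f"Break-even miss-rate gap without L3: {without_l3 * 100:.1f} points")
```

Under these assumptions, the better the L3 is at catching L2 misses, the more hit rate a big, slow L2 has to buy before its extra cycles pay for themselves. Real miss rates depend heavily on the workload, so treat this as a way to frame the tradeoff rather than a prediction.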

Performance isn't really any better, except in a couple of tests, than C2D chips that are 18 months old, so there's no reason to upgrade for a large chunk of us, and most of the rest will want i7.

They need a top end chip that compares to the top end i7 like the 4870 was to the GT280. And this is some way from that. It's like the 4870 was competitive to the 9800GT, and was the same price as well.

With no upgrade path this looks like one strictly for the fanboys at the moment.

Ha. I'm not quite sure whether I should try to respond to this, but sure...

It's a non-trivial task to completely redesign the cores themselves, and I'm not even sure whether they could, say, cut out the core and drop in a new one. It's easy for us sideliners to say they need to improve, and quick, but they need to design a new one that has a much better IPC, with speed, not haste.

How is this with no upgrade path? This provides an upgrade path for boards up to AM2, which is good enough. With the AM3 versions coming out, people could drop the AM3 version into an AM2/AM2+ board, wait if necessary until DDR3 prices fall some more, and swap to a newer board with DDR3. And now they've a spare computer.

Look at the i7 prices. Friend of mine just spent 2k for an i7. Sure, he's having fun compiling and playing games with impunity, but I don't think it's the best use of money. Also, C2D is dead. You can't put an i7 into a C2D board, and there's still a good amount of people with older boards that could have a drop in boost.

This is silly. If the CPU had come out in Summer '06, it would have been god-like. Quad core vs dual-core, higher clock speed, equal or better overclocking, very competitive clock-for-clock, and on a smaller and cooler process.

What you could say was that if it came out right as Penryn launched it would be a close race...but Penryn improved lots of stuff over Conroe, so it isn't fair to say AMD is 2 years behind.

I think that's what Atechie's getting at. Intel took the right path by just cobbling together two dual-core processors to make a quad, while AMD spent excessive time and God only knows how much cash to develop a "monolithic" quad. Which then rolled over and played dead.

Hopefully AMD has learned from its mistakes. Otherwise Intel may not have much competition in the near future. What's AMD trading at again, these days?

I don't know, this doesn't do it for me. I'm a massive AMD fan - I run a 5600+ right now, my previous CPU was an XP 2400+.

What is it now, 2 years since the Core 2 Duo was released? And AMD still can't match it in clock for clock performance? After the monumental flop that was Phenom, massive delays, poor performance, high power consumption, the TLB bug, patchy backwards compatibility (my MSI K9N mobo with the Nforce 570 SLI chipset can't run AM2+ chips, but the equivalent Asus can), they launch the Phenom II, and the best I can say about it is that it's acceptable. Acceptable. Not Phenomenal. Just acceptable. Price vs Performance wise, it gets the job done, mostly, sort of. Throw newer game engines at it and even the Q9600, that old workhorse, can beat it.

It's not that Phenom II is a terrible processor. It's not. It's just not what I expected AMD to launch, many months after the flop that was Phenom. I expected something that could at least beat a 65nm Core 2 Duo, if not a Nehalem.

As Anand hinted at, Intel is going to drop prices, which they can afford to, forcing AMD to do likewise, which they can't. AMD's die size is similar yet their margins are far smaller. Intel's next CPU will be the die shrink of Nehalem, what will AMD release? Will it even match Penryn? I can only hope.

Since the Phenom II was always known to just be a die shrink with some optimizations, you were setting your hopes way too high if you thought it was going to compete directly with the i7. AMD needed this launch to keep them in the game, and it looks like it's probably just good enough to be able to do that. We probably won't be seeing any big breakthroughs from AMD until Bulldozer, so we just have to hope that this architecture will have enough headroom in it to last that long.