How L1 and L2 CPU Caches Work, and Why They’re an Essential Part of Modern Chips


The development of caches and caching is one of the most significant events in the history of computing. Virtually every modern CPU core, from ultra-low-power chips like the ARM Cortex-A5 to the highest-end Intel Core i7, uses caches. Even higher-end microcontrollers often have small caches or offer them as options — the performance benefits are too large to ignore, even in ultra-low-power designs.

Caching was invented to solve a significant problem. In the early decades of computing, main memory was extremely slow and incredibly expensive — but CPUs weren’t particularly fast, either. Starting in the 1980s, the gap began to widen quickly. Microprocessor clock speeds took off, but memory access times improved far less dramatically. As this gap grew, it became increasingly clear that a new type of fast memory was needed to bridge it.

While this chart only runs up to 2000, the growing discrepancy of the 1980s led to the development of the first CPU caches.

How caching works

CPU caches are small pools of memory that store information the CPU is most likely to need next. Which information is loaded into cache depends on sophisticated algorithms and certain assumptions about programming code. The goal of the cache system is to ensure that the CPU has the next bit of data it will need already loaded into cache by the time it goes looking for it (also called a cache hit).

A cache miss, on the other hand, means the CPU has to go scampering off to find the data elsewhere. This is where the L2 cache comes into play — while it’s slower, it’s also much larger. Some processors use an inclusive cache design (meaning data stored in the L1 cache is also duplicated in the L2 cache) while others are exclusive (meaning the two caches never share data). If data can’t be found in the L2 cache, the CPU continues down the chain to L3 (typically still on-die), then L4 (if it exists) and main memory (DRAM).
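To make that trip down the hierarchy concrete, here is a minimal C sketch (illustrative only) of a lookup that pays each level’s latency until it finds a hit. The level names, hit rates, and latencies are invented round numbers, not measurements of any real CPU.

/* Illustrative only: walks a lookup down the cache hierarchy, paying each
   level's latency until it hits. Hit rates and latencies are made-up
   round numbers, not measurements of any particular CPU. */
#include <stdio.h>
#include <stdlib.h>

struct level { const char *name; double hit_rate; double latency_ns; };

int main(void)
{
    struct level levels[] = {
        { "L1",   0.95,   1.0 },
        { "L2",   0.90,   4.0 },
        { "L3",   0.80,  12.0 },
        { "DRAM", 1.00, 100.0 },   /* main memory always "hits" */
    };
    int n = sizeof levels / sizeof levels[0];
    int accesses = 100000;
    double total_ns = 0.0;

    srand(42);
    for (int a = 0; a < accesses; a++) {
        for (int i = 0; i < n; i++) {
            total_ns += levels[i].latency_ns;          /* cost of checking this level */
            if ((double)rand() / RAND_MAX < levels[i].hit_rate)
                break;                                 /* hit: stop descending */
            /* miss: fall through to the next, slower level */
        }
    }
    printf("average access time: %.2f ns\n", total_ns / accesses);
    return 0;
}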

This chart shows the relationship between an L1 cache with a constant hit rate and an increasingly large L2 cache. Note that the total hit rate goes up sharply as the size of the L2 increases. A larger, slower, cheaper L2 can provide all the benefits of a large L1 — but without the die size and power consumption penalty. Most modern L1 caches have hit rates far above the theoretical 50 percent shown here — Intel and AMD both typically field cache hit rates of 95 percent or higher.
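The arithmetic behind the chart is straightforward: the L2 only sees the accesses the L1 missed, so the overall hit rate is h1 + (1 - h1) * h2. A quick C illustration, using the chart’s theoretical 50 percent L1 and some arbitrary example values for the L2’s local hit rate:

/* Illustrative: how a fixed L1 hit rate combines with an L2 of varying
   effectiveness. Overall hit rate = h1 + (1 - h1) * h2. The 50% L1 figure
   mirrors the theoretical value in the chart; the L2 numbers are examples. */
#include <stdio.h>

int main(void)
{
    double l1 = 0.50;
    double l2_rates[] = { 0.50, 0.70, 0.80, 0.90 };

    for (int i = 0; i < 4; i++) {
        double overall = l1 + (1.0 - l1) * l2_rates[i];
        printf("L1 %.0f%% + L2 %.0f%% -> overall %.0f%%\n",
               l1 * 100, l2_rates[i] * 100, overall * 100);
    }
    return 0;
}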

The next important topic is set associativity. Every CPU contains a specific type of RAM called tag RAM. The tag RAM is a record of all the memory locations that can map to any given block of cache. If a cache is fully associative, it means that any block of RAM data can be stored in any block of cache. The advantage of such a system is that the hit rate is high, but the search time is extremely long — the CPU has to look through its entire cache to find out if the data is present before searching main memory.

At the opposite end of the spectrum we have direct-mapped caches. A direct-mapped cache is a cache where each block of main memory can be stored in one and only one cache block. This type of cache can be searched extremely quickly, but because many memory locations compete for the same cache block, it has a comparatively low hit rate. In between these two extremes are n-way associative caches. A 2-way associative cache (Piledriver’s L1 is 2-way) means that each main memory block can map to one of two cache blocks. An eight-way associative cache means that each block of main memory could be in one of eight cache blocks.
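As a rough sketch of how the mapping works, the following C snippet computes which set a given address lands in for a 1-way (direct-mapped), 2-way, and 8-way organization. The 32KB capacity and 64-byte line size are typical round numbers chosen for the example, not the parameters of any particular chip.

/* Sketch: how an address picks a cache set. Direct-mapped is the 1-way
   case; a fully associative cache is the case where there is only one set
   and every line must be checked. Sizes here are illustrative. */
#include <stdio.h>
#include <stdint.h>

#define CACHE_BYTES (32 * 1024)
#define LINE_BYTES  64

static unsigned set_index(uint64_t addr, unsigned ways)
{
    unsigned num_sets = CACHE_BYTES / (LINE_BYTES * ways);
    return (unsigned)((addr / LINE_BYTES) % num_sets);
}

int main(void)
{
    uint64_t addr = 0x1234F40;
    unsigned ways[] = { 1, 2, 8 };

    for (int i = 0; i < 3; i++) {
        unsigned num_sets = CACHE_BYTES / (LINE_BYTES * ways[i]);
        printf("%u-way: address 0x%llx maps to set %u of %u\n",
               ways[i], (unsigned long long)addr,
               set_index(addr, ways[i]), num_sets);
    }
    /* Within the chosen set, the hardware compares the stored tags of all
       n ways against the address tag to decide whether it has a hit;
       direct-mapped is simply the 1-way case. */
    return 0;
}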

The next two slides show how hit rate improves with set associativity. Keep in mind that hit rates are highly workload-dependent — different applications will have different hit rates.

Why CPU caches keep getting larger

So why add continually larger caches in the first place? Because each additional memory pool pushes back the need to access main memory and can improve performance in specific cases.

This chart from Anandtech’s Haswell review is useful because it actually illustrates the performance impact of adding a huge (128MB) L4 cache as well as the conventional L1/L2/L3 structures. Each stair step represents a new level of cache. The red line is the chip with an L4 — note that for large file sizes, it’s still almost twice as fast as the other two Intel chips.

It might seem logical, then, to devote huge amounts of on-die resources to cache — but it turns out there’s a diminishing marginal return to doing so. Larger caches are both slower and more expensive. At six transistors per bit of SRAM (6T), cache is also expensive (in terms of die size, and therefore dollar cost). Past a certain point, it makes more sense to spend the chip’s power budget and transistor count on more execution units, better branch prediction, or additional cores. At the top of the story you can see an image of the Pentium M (Centrino/Dothan) chip; the entire left side of the die is dedicated to a massive L2 cache.
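To put the 6T figure in perspective, here is a back-of-the-envelope C calculation of how many transistors various cache sizes would require at six transistors per bit. The sizes are just sample points for the arithmetic.

/* Rough arithmetic only: transistor cost of a cache built from 6T SRAM.
   bits = bytes * 8; transistors = bits * 6. */
#include <stdio.h>

int main(void)
{
    const double transistors_per_bit = 6.0;            /* classic 6T SRAM cell */
    const long sizes_kb[] = { 32, 256, 8192, 131072 }; /* 32KB, 256KB, 8MB, 128MB */

    for (int i = 0; i < 4; i++) {
        double bits = (double)sizes_kb[i] * 1024.0 * 8.0;
        printf("%8ld KB of cache -> roughly %.2f billion transistors\n",
               sizes_kb[i], bits * transistors_per_bit / 1e9);
    }
    return 0;
}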

How cache design impacts performance

The performance impact of adding a CPU cache is directly related to its efficiency or hit rate; repeated cache misses can have a catastrophic impact on CPU performance. The following example is vastly simplified but should serve to illustrate the point.

Imagine that a CPU has to load data from the L1 cache 100 times in a row. The L1 cache has a 1ns access latency and a 100% hit rate. It therefore takes our CPU 100 nanoseconds to perform this operation.

Haswell-E die shot (click to zoom in). The repetitive structures in the middle of the chip are 20MB of shared L3 cache.

Now, assume the cache has a 99 percent hit rate, but the data the CPU actually needs for its 100th access is sitting in L2, with a 10-cycle (10ns) access latency. That means it takes the CPU 99 nanoseconds to perform the first 99 reads and 10 nanoseconds to perform the 100th. A 1 percent reduction in hit rate has just slowed the CPU down by 9 percent.

In the real world, an L1 cache typically has a hit rate between 95 and 97 percent, but the performance impact of those two values in our simple example isn’t 2 percent — it’s 14 percent. Keep in mind, we’re assuming the missed data is always sitting in the L2 cache. If the data has been evicted from the cache and is sitting in main memory, with an access latency of 80-120ns, the gap between a 95 and 97 percent hit rate grows even larger; in this simple model, it adds roughly 50 percent to the total time needed to execute the code.
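The arithmetic behind those figures is simple enough to check directly. This small C program reproduces the simplified example: 100 accesses, a 1ns L1, a 10ns penalty when a miss is caught by the L2, and a 100ns penalty (one point in the 80-120ns range above) when the data has to come from main memory.

/* Reproduces the simplified example: each hit costs the L1 latency, each
   miss costs the latency of the level that actually holds the data. */
#include <stdio.h>

static double total_ns(double hit_rate, double l1_ns, double miss_ns)
{
    const double accesses = 100.0;
    return accesses * (hit_rate * l1_ns + (1.0 - hit_rate) * miss_ns);
}

int main(void)
{
    printf("100%% hits:                %.0f ns\n", total_ns(1.00, 1.0, 10.0));
    printf("99%% hits, misses to L2:   %.0f ns\n", total_ns(0.99, 1.0, 10.0));
    printf("95%% hits, misses to L2:   %.0f ns\n", total_ns(0.95, 1.0, 10.0));
    printf("97%% hits, misses to L2:   %.0f ns\n", total_ns(0.97, 1.0, 10.0));
    printf("95%% vs 97%% (L2 misses):   %.0f%% slower\n",
           100.0 * (total_ns(0.95, 1.0, 10.0) / total_ns(0.97, 1.0, 10.0) - 1.0));
    printf("95%% hits, misses to DRAM: %.0f ns\n", total_ns(0.95, 1.0, 100.0));
    printf("97%% hits, misses to DRAM: %.0f ns\n", total_ns(0.97, 1.0, 100.0));
    return 0;
}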

Back when AMD’s Bulldozer family was compared with Intel’s processors, the topic of cache design and performance impact came up a great deal. It’s not clear how much of Bulldozer’s lackluster performance could be blamed on its relatively slow cache subsystem — in addition to having relatively high latencies, the Bulldozer family also suffered from a high amount of cache contention. Each Bulldozer/Piledriver/Steamroller module shared its L1 instruction cache, as shown below:

A cache is contended when two different threads are writing and overwriting data in the same memory space. It hurts the performance of both threads — each core is forced to spend time writing its own preferred data into the L1, only for the other core to promptly overwrite that information. Steamroller still gets whacked by this problem, even though AMD increased the L1 code cache to 96KB and made it three-way associative instead of two-way.
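Here is a toy C model of that effect (an illustration under simplified assumptions, not a model of any real AMD module): a tiny direct-mapped cache shared by two "threads" whose working sets map to the same lines, so each one keeps evicting the other’s data.

/* Toy model of cache contention: two threads take turns touching their own
   working sets. When the sets collide in a tiny shared cache, each thread
   keeps evicting the other's lines. All numbers are illustrative. */
#include <stdio.h>
#include <stdint.h>

#define NUM_LINES 8   /* tiny direct-mapped cache for illustration */

static uint64_t cache[NUM_LINES];   /* tag (line address) stored per line */

static int access_line(uint64_t addr)
{
    unsigned idx = (unsigned)((addr / 64) % NUM_LINES);
    uint64_t tag = addr / 64;
    if (cache[idx] == tag) return 1; /* hit */
    cache[idx] = tag;                /* miss: evict whatever was there */
    return 0;
}

static void run(int shared, int *misses_a, int *misses_b)
{
    for (int i = 0; i < NUM_LINES; i++) cache[i] = UINT64_MAX;
    *misses_a = *misses_b = 0;
    for (int iter = 0; iter < 1000; iter++) {
        for (int i = 0; i < NUM_LINES; i++) {
            /* Thread A loops over the same 8 lines every iteration. */
            if (!access_line(0x10000 + 64u * i)) (*misses_a)++;
            /* Thread B's lines map to the same sets when it shares the cache. */
            if (shared && !access_line(0x20000 + 64u * i)) (*misses_b)++;
        }
    }
}

int main(void)
{
    int a, b;
    run(0, &a, &b);
    printf("thread A alone:          %d misses\n", a);
    run(1, &a, &b);
    printf("threads A and B sharing: A %d misses, B %d misses\n", a, b);
    return 0;
}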

This graph shows how the hit rate of the Opteron 6276 (an original Bulldozer processor) dropped off when both cores were active, in at least some tests. Clearly, however, cache contention isn’t the only problem — the 6276 historically struggled to outperform the 6174 even when both processors had equal hit rates.

Caching out

Cache structure and design are still being fine-tuned as researchers look for ways to squeeze higher performance out of smaller caches. There’s an old rule of thumb that we add roughly one level of cache every 10 years, and it appears to be holding true into the modern era — Intel’s Skylake chips offer certain SKUs with an enormous L4, thereby continuing the trend.

It’s an open question at this point whether AMD will ever go down this path. The company’s emphasis on HSA and shared execution resources appears to be taking it along a different route, and AMD chips don’t currently command the kind of premiums that would justify the expense.

Regardless, cache design and power consumption will remain critical to the performance of future processors, and substantive improvements to current designs could boost the status of whichever company can implement them.

Check out our ExtremeTech Explains series for more in-depth coverage of today’s hottest tech topics.



I guess AMD will use some kind of stacked memory (HBM) to implement an L4 cache. It wouldn’t be used to minimise latency, but to maximise bandwidth for the GPU. More cache means less GPU on an APU, so the balance will have to be reset in these new designs.

I think cache will get smaller and use fewer levels when we start using HMC and other faster memory types. Look at most consumer chips: they do not have an L3 cache because it doesn’t really offer more performance for the standard computer workload, while it helps on the server variant.

Please note that the L4 cache increases the latency for larger files compared to the L3-only variant; this might have a negative performance impact on some workloads.

Dozerman

Well first, AMD needs to get an L3 in their currently shipping processors.

massau

Actually, they don’t. It helps for their CPU workload, but it increases cost or replaces GPU units. L3 is good for a server workload, but it isn’t great when you look at the consumer low-to-midrange space, hence the new balance in modern APU processors.

The Broadwell L4 cache looks like it is on-package; it can be compared to the RAM that will be stacked in the future.

Dozerman

Apologies.

I had just skimmed through your comment and thought you were referring to some kind of L4, not shrinking caches with new memory types.

That makes more sense.

Joel Hruska

Massau,

I doubt that HBM will do anything to change the use of cache, though it’s certainly possible that AMD will choose to use HBM instead of an L4.

Look at the adoption of on-die memory controllers. When AMD implemented an IMC for Opteron / Athlon 64, it slashed memory latency and gave the chip’s performance a huge upward kick — but AMD didn’t fundamentally alter its caching policies because of it. At most, it helped them use a little less L2 than they might have otherwise.

HBM is great for die shrinking and bandwidth but its latency will still be much higher than conventional SRAM. But HBM might be a better fit for HSA applications than trying to integrate a giant L4, so in that sense yes — I agree that it could be a better road for AMD.

massau

May I turn the question around: what if the industry had not integrated the memory controller? Wouldn’t the caches be much larger and deeper than today?

Maybe HBM makes it possible to skip the L3, like they do today on their APUs. It would just be a rebalancing of the amount and depth of the cache.

Joel Hruska

I mean, they would have to be, yes.

It’s hard to imagine a future in which the controllers weren’t integrated, though. Intel proved that you could do quad-cores with good performance while relying on an old FSB (Core 2 Quad) but then Nehalem arrived and smoked C2Q performance pretty thoroughly.

I do not recall if the old Northbridge memory controllers had their own caches. I think they could do some speculative prefetching, but I’m not sure they had full cache structures.

massau

I looked at the POWER8; it is a modern design with a separate memory controller, and it has its own cache.


Dozerman

Why couldn’t a company use a very large L3 coupled with a very fast L2 and even faster L1 instead of having to scale out to a fourth level?

Joel Hruska

Well, they could — but remember, the point is cost savings. The EDRAM design that Intel uses for Haswell is a 1T (one-transistor) part. That saves both die space and power consumption.

Typical L3 uses 6 or 8 transistors. So to do an eight-transistor cache (128MB of cache like Crystal Well) you’d need to dedicate about 8.6 billion transistors to *just* the cache. Even if Intel used a 4T design you’re still talking about 4.2B transistors.

Doing it in eDRAM with a 1T design lets you get away with maybe a billion transistors. I’m not sure Intel has revealed anything more about exact transistor counts on Crystal Well.

massau

Maybe another reason for using a separate chip is to use a specialised fabrication process that is optimised for eDRAM, or a larger node, and thus lower the power consumption, improve density, or lower costs.

Joel Hruska

That’s also a good possibility.

massau

Too bad none of the reviewers have equipment like an electron microscope to look at the chip and determine which technology is used.

I remember this being noted in school when people asked why L1 and L2 running at the same speed as the core performed differently.

massau

Yeah, I learned this in school as the main reason, but our teachers are a bit behind the times and miss the cutting-edge stuff like power gating, clock gating, etc. used to improve speed or performance.

Matt Menezes

It’s amazing how an authority on tech such as a prof can be so behind the curve. I always thought I was more tech literate than my profs merely because I actually kept up with the latest where they just rehashed the same, old fundamentals. Not to say I’m smarter, merely more up to date…

massau

The latest tech is based on the old tech, but some parts need to be updated. We even saw circular registers, but I do not see any cores using them anymore. These small things make me suspicious.

But at least I will have the official piece of paper in one year.

For example, higher clocks give higher power consumption, it is true, but if you can finish your job faster and go to deep sleep, then you could use less energy than the slower core.

Antoine de Champlain

Thank you so much for clearing this up for me! I never fully understood how cache worked and the impact of cache in a CPU, but thanks to your awesome explanation, now I understand it! Thanks Joel!

jburt56

VRAM will change this.

Mark Nelson

The description of direct-mapped and n-way associative caches is wrong. In a direct-mapped cache a block of memory can only be stored in one particular cache block. In an n-way associative cache, a block of memory can only be stored in one of n cache blocks.

Jonathan Abbey

Absolutely.

This:

“A 2-way associative cache (Piledriver’s L1 is 2-way) means that each cache block can map to one of two memory locations. An 8-way associative cache means that each block of cache can contain information from eight different memory blocks.”

Should instead be:

“A 2-way associative cache (Piledriver’s L1 is 2-way) means that each memory location can be mapped to one of two cache blocks. An 8-way associative cache means that each location in memory could be cached in any one of eight cache blocks.”

It’s worth noting as well that high set associativity makes it slower to put new memory locations into the cache as well as to retrieve them. You can’t write to an associative cache line without making sure that the memory location isn’t already present in one of the n blocks.

Joel Hruska

You’re right. That was a flip-flop on my part. Fixing it.

Jonathan Abbey

You rock. Kudos++.

Joel Hruska

Just dropping a note to say this has been corrected.

bmwman91

Interesting article. It clears up a lot of questions I had in the back of my mind, like “if L1 is so critical to performance why has it remained relatively small?”

The “closest” I have gotten to the silicon is writing assembly code for PIC and DSP controllers. The Lx caches on modern processors sound a lot like the working-registers on the microcontrollers that I have used…you can execute operations directly on the data stored in them, rather than having to go fetch data from RAM (or program memory) and put it into the working-reg’s in order to manipulate them.

Curious

Hi,

First of all, thank you for this very informative article. It cleared up a lot of things I did not know.

I am not an electronics engineer and hence do not have any idea about it. But I am an engineer and do understand a bit of the tech stuff posted on here, and am curious to understand/know more. I might sound a bit naive or a bit “noobish” to ask this, but, can someone please explain to me how

How do you mean ‘translates’? Underneath that square in the middle of the package there is a die. The reason the package itself is so much larger is to accommodate all of the pins — there will be wires that connect from the die to the pins.

Thank you for the reply.
I said ‘translates’ since I did not know which part of the chip had that die or if the whole chip had the die underneath or somewhere. The link you gave to that image cleared that issue.

nty

The first picture is the die, the second is the electrical contact points, and the third is the die through a microscope, possibly a special one such as a scanning electron microscope.

Asdf Ghjk

More articles like this! Enjoyed this one even though I took a course on computer architecture last year.

I don’t see there being an L5 cache any time soon. It would be not much faster than the RAM itself.

My guess for the future is that:

1. L3 will probably get bigger
2. L4 will also become truly huge
3. HBM may make some changes in L4

If we are stuck with existing processes, with 28nm being the cheapest for the foreseeable future in cost per transistor (perhaps it’s 22nm for Intel?), then I could see very large dies being developed to try to squeeze out every last drop of performance. At that point, performance per mm^2 takes a back seat to absolute performance or performance per watt.

We might see larger, faster L3s on big dies, because 22 and 28nm ought to continue to mature, allowing for either higher yields of existing-sized chips or perhaps larger chips at comparable yields. Hmm … could a reticle-sized CPU be made (>700mm^2)?

Can L1 and L2 be made faster? How many transistors do they currently use?

massau

The L4 is on a different chip on the package, so in the future the L4 will be TSV memory stacked directly onto the die, so L4 becomes HBM.

Moore’s law will probably evolve to TSV and later on to real 3D vertical transistors, so first we have 1 die, then 2, etc., all stacked. But larger dies are not an option, because they decrease yields while increasing costs. Maybe the 450mm wafers could drop the cost of existing nodes, but that is still in its infancy.

A GPU makes it easier to harvest dies with bad cores; that’s the reason why they are larger.

P.S.: a 700mm^2 chip gives 76 dies per wafer @300mm.

Joel Hruska

Every cache save L1 began life as an off-die package. If you look back, the first consumer L2 caches were on the motherboard. L3 caches began life as on-motherboard (hence the term “Backside cache” and “Backside bus” to refer to the interface).

Now that doesn’t mean TSV and HBM interfaces couldn’t replace and function as L4, but the fact that first-generation L4 is integrated off-die doesn’t say anything about the likelihood of that outcome.

massau

It seems like I’m not old enough to know the history of the cache levels, but thanks for telling me :).

Rich

Actually, L1 did too, at least for x86. Intel had an 82385 memory controller, which could cache 32K of memory for the 386. That was the L1 cache, and it could be read in two clock cycles despite being external to the processor (ironically, that’s twice as fast as the miserably slow L1 cache in Haswell and Steamroller, despite those being internal). Other companies came up with other controllers for the 386; some did direct mapping. Others did none, and simply used interleaving or page-mode memory to lower the average wait states.

The difference was dramatic. I used a PS/2 Model 80, with a 20 MHz 386 with Page Mode Memory, and then got a Model 80 with 64K cache, and the difference was very noticeable, far more than one would expect from a 25% clock speed increase.

The 486 introduced the internal L1 cache, and also the L2 cache. The internal cache was initially only 8K and shared data/instruction, but it could be read in one clock cycle. Many machines only came with it (the processor cost 1K, and many considered it a bargain because you didn’t need the FPU or the cache controller/memory), but many companies soon started adding SRAM on the motherboard, giving it an L2 cache. I had a PS/2 Model 90, and the L2 cache could be plugged into the processor daughterboard. I didn’t notice the difference to near the extent I did with the SRAM cache on the Model 80.

When Intel went with the DX4 (typically running at 100/33), they increased the L1 cache to 16K and made it write-back. This helped quite a bit, as these chips were clock-tripled, and going through the slow memory bus carried a heavy penalty. AMD’s 133MHz 486 made this even more important.

Joel Hruska

Haswell and Steamroller have 1ns, 4-cycle L1 caches. Since one of the points of an OOOe engine is to hide latency, the difference between a four-cycle and three-cycle cache is minimal (according to AMD, building a 3-cycle L1 improved performance by only about 3%).

Latency hits only matter when they leave the processor with nothing to do. If the core can work on something else, it’s not a problem.

This is also why I prefer to measure my latencies in nanoseconds, not clock cycles. Two clock cycles at 25MHz is hundreds of clock cycles at 4.4GHz.

Good point on the 386, though. I hadn’t realized that there were chips that used an external L1; I thought the 486 was the first widespread chip with an L1 cache.

Rich

Intel also said 3%, and said they chose 4 clocks not because they had to, but because it would save energy over something timed more aggressively. I remember them saying the type of transistors needed was more power hungry.

I never trust blanket numbers though, because they always reflect certain scenarios, and if yours doesn’t fall into that, well, the number might be considerably different.

Even so, 3% is still something. If you consider how people get upset their Pentium goes to 4.6, while someone else’s reaches 4.8, you see it is something. I wonder how much more power Intel would have needed to do 3. I suspect AMD, being designed for high clock speeds, would not have been able to do so with the same size and associativity.

I’m still a bit amazed Willamette and Northwood had two clock latency, although the L1 was tiny. Still, they had some serious clock speeds for their processes. Itanium with 1 clock was pretty amazing too.

Jaguar is still three clocks.

Processors have always tried to hide latency; that’s nothing new. Even the 8086/88 had a pre-fetcher that worked independently.

Joel Hruska

Rich,

Sure, on blanket numbers, but I think the larger point is that if you can design a chip that’s better at hiding latency, you can also get away with designing different cores. That’s one reason HT was invented in the first place — but the P4’s design fundamentally depended on that low-latency L1 cache because it lacked some of the capabilities that Haswell has for keeping execution units filled.

It’s true that the P4’s two-cycle L1 latency was better than Haswell’s, even accounting for the difference in clock speed — but I suspect Intel would answer that the other improvements and adjustments it has made to core design allowed it to *use* a slower L1 design without a catastrophic impact on performance.

Intel’s new philosophy, put in place over the last few cycles, is that every core change has to pay for itself double. In the old days, the company pursued a 1:1 policy — a 1% performance improvement couldn’t draw more than 1% more power. Now, Intel has shifted to something along the lines of 2:1 — a change must improve performance by 2% in order to draw 1% more power. Or at least, that’s the high-minded design goal.

szatkus

Yes. Keyword: memristors

Bannerdog

” A direct-mapped cache is a cache where each cache block can contain one and only one block of main memory.”

That is INCORRECT.

With a direct-mapped cache, a block of main memory is associated with exactly one cache block.

In other words, with direct mapping, if a particular block of main memory is cached, the cache block to be used is determined solely by the memory address (there is only one possible cache block that may be utilized to cache that particular block of main memory).

You have your mapping transposed.

Joseph Taylor

Just gonna say you’ve been posting some good content lately Joel

Bert Tweetering

I really liked this article although I’m just beginning to scratch the surface for learning this complex topic. Re the part about cache contention, I’m thinking this may be more because of the shared L2 cache than the shared L1 instruction cache, where I wouldn’t expect much if any writing. The L1 data cache, where you expect plenty of writing, is not shared.

“…the Bulldozer family also suffers from a high amount of cache contention. Each Bulldozer/Piledriver/Steamroller module shares its L1 instruction cache, as shown below… A cache is contended when two different threads are writing and overwriting data in the same memory space. It hurts performance of both threads — each core is forced to spend time writing its own preferred data into the L1, only for the other core promptly overwrite that information.”

Maybe the L2 cache contention issue is an area that got improved greatly from Piledriver to Steamroller to Excavator. The latter had the size of its cache shrunk by 1/2, although the cache system as a whole is said to perform much better.

This reminds me of – actually, I not-so-fondly remember going over all this in a computer architecture class in undergrad. We had to run the math and figure out how long CPU instructions took from fetch to execute based on different cache assumptions.

Felix Gill

You could do a series of these types of articles:

RISC vs CISC
reverse engineering, decompilers, and you
J̶a̶v̶a̶ Python virtual Machine Hunh!, What is it good for…
Instruction sets and why they matter
My command line, your command line: the return to dumb terminal in the workplace
The risks of using open source code in the workplace.. warning EULAS ahead

Joseph Taylor

Pretty sceptical that hit rate doesn’t increase with size in an L1 cache; how is that even logically possible?

Joel Hruska

Joseph,

If you’re thinking about a cache as a 1:1 system in which one main memory location can only be mapped to one L1 cache location, you’d be absolutely right. The only way to improve the hit rate would be to increase the cache size. But the situation is more complex than that.

The first thing to know is that all hit rates depend on instruction mix. The goal when performing these kinds of evaluations is to try to capture a variety of instruction mixes and model a best-fit scenario. You can write code that will exhibit better or worse hit rates. So the graph above is an example of hit rates in one particular type of code, not a total example for every type of code.

There are multiple factors that determine the miss rate of a cache. The set associativity is a significant factor, as is the cache block size (how much data is fetched from cache at one time) and the total size of the cache. Smaller blocks mean that you take less advantage of spatial locality; larger blocks mean that you don’t have as many blocks within the cache in which data can be stored.

The goal of the L1 cache is to have the data the CPU is looking for 95-99% of the time. Many CPUs use architectures that push old data into the L2 or L3, so that data *not* found in the L1 can be found in one of the lower cache levels.

The longer answer to your question is that yes, there *are* ways to continue to push the hit rate of the L1 higher, including increased cache sizes — but if you can get 95% of the way there with a 64K L1 and you need a 1MB L1 to finish off the last 5%, it makes a lot more sense to use the 64KB option and then back it up with L2 caches.

Ninja Squirrel

I have a question.

Why does cache memory on CPUs use so much space to store just 8MB, while one small memory chip on a DRAM module or graphics card PCB can store 512MB?

SRAM used for on-chip caches consists of up to 6 transistors per bit (there are different designs), plus passive electrical components, plus additional logic for providing associativity.

In comparison, a DRAM cell only consists of a capacitor and a single transistor per cell – all other components are shared across whole blocks of DRAM cells.

DRAM comes at a cost over SRAM in return, as the only way of “reading” from a DRAM cell is to release the charge, translate the (varying) voltage levels into binary, and finally recharge the capacitor again. You can’t even read the same address twice in a row; if you tried, you would have to wait for the recharge to complete.

In addition, you always have to read or write a full row of cells in DRAM.

Reading from and writing to SRAM is trivial, as the output line of an SRAM cell is constantly active and can be accessed at will without side effects. Writing to an SRAM cell is also just a matter of applying a voltage to the SET or UNSET inputs.

Besides the access times and size, DRAM and SRAM also differ in power consumption. A DRAM cell does not require power to keep information stored (slow self-discharge from leakage currents aside), while an SRAM cell actually stores the information by keeping a bi-stable circuit constantly powered.

FSRed94

“A DRAM cell does not require power to keep information stored”

Umm, yes it does. Unless I’m not thinking right today.

Ext3h

It’s suffering from a constant discharge from leakage currents, but the information is stored in the charge itself – not by maintaining a constant flow of current or keeping an external voltage level applied.

So no, keeping(!) the information stored does not require power in an ideal system (assuming proper insulation).

In real-world systems, leakage currents, possibly even amplified by a row-hammer attack or something similar, obviously do require cyclic refreshes, but that’s just a result of having made a trade-off between density and ideal insulation.

wownwow

DRAM is volatile memory, not NVM.

Wussupi83

Does anyone else remember the Pentium M? Basically analogous IMO to a big battle that turned the tide of the war. It showed Intel the light they used to find their way out of the Pentium 4 furnace that was unfortunately-for-us-all being fueled by illegal business practices against AMD’s superior Athlon 64.

Joel Hruska

The Pentium M was the warning shot that AMD should’ve heeded, and didn’t. It was obvious by the winter of 2004 that Intel had a potent competitor on its hands — and if AMD had been paying attention to the problem, they could’ve had a response ready by the middle of 2006.

FSRed94

One of many mistakes made by AMD. That company seemed to suffer from quite a bit of bad management over the years.

I would like to think K7 was the warning shot Intel should’ve noticed, with K8 being probably the last major evolution of CPUs (x86-64, integrated memory controller, HyperTransport instead of FSB).

Joel Hruska

“I would like to think K7 was the warning shot Intel should’ve noticed.”

Intel *did* notice. That’s why they started putting the squeeze on vendors not to allocate shelf space to AMD. That’s why AMD had so little luck winning mobile SKUs. When Intel decided to kill the Pentium 4 altogether and switch to the Pentium M, it was because they knew AMD’s K8 was going to clean their clock for the next 12-18 months. They used their marketing muscle to do what their CPU couldn’t, and they locked AMD out of critical market share in the process.

With that said, however, AMD really did take its eye off the ball. I remember testing Dothan in a DFI desktop board. The chip was tiny and mounted a custom heatsink with a 40mm fan. Overclocking options were limited. And yes, on the whole, the Athlon 64 family was still much faster than the little Pentium M. Dothan wasn’t very good at multimedia encoding, for example.

The thing was, you could overclock that Pentium M up to around 2GHz — and once you did, there were benchmarks where it started matching the Athlon 64 processors at the same speed, while drawing a fraction of the heat.

I told my contacts at AMD that when Intel finished refining this design, it was going to cause them a serious headache if they didn’t have something better in the pipeline. 18 months later, Core 2 Duo debuted, and while AMD kept up reasonably well for awhile (Phenom II was C2D competitive), they never regained the lead again.

FSRed94

Marketing muscle, and some rather shady business practices as you are very well aware. I’m not sure what was going on at AMD. I’ve read they were working on some other design but canned it and improved on K8 to make K10. Or they were too busy cashing in on K8 to realize what Intel was doing with the Pentium M.

I also happen to think they were busy preparing to purchase ATI and perhaps didn’t have the resources they needed. Maybe they figured K8/K10 would be enough until they had their first “fusion” product after acquiring ATI. Who knows for sure what really happened, but they sure dropped the ball. I am very interested in seeing what becomes of Zen, as it may be the last exciting thing to happen in high end x86 for awhile.

I’m editing this as I just had another thought about the Pentium M. It was certainly excellent at integer code, but as I recall it was fairly modest at floating point math. AMD could have figured integer would be close enough but that they would still have a much better FPU. That turned out not to be the case at all.

Joel Hruska

I can tell you what I was told happened and what I’ve pieced together.

Once Hammer started rolling out, AMD started hitting its stride like never before. At first, Intel fans confidently predicted that Prescott would counter AMD by hitting insane new clock speeds. After all, the P4 was still beating the Athlon 64 in many professional workloads thanks to Hyper-Threading. HT also gave it an edge in desktop “smoothness” and multi-tasking.

Then Prescott actually dropped. It was a disaster. It melted plastic motherboard stands. It blew out small form factor power supplies that were supposedly Prescott-ready. I could walk by my testbeds and tell if a system was running Prescott or Northwood just by touching the power supply — that’s how large the difference was.

Meanwhile, AMD is launching Socket 939 and ramping up the Athlon 64. By mid-2005, the dual-core Athlon 64 4800+ has arrived on the scene and completely eaten Intel’s lunch. Intel is staggering on the ropes, AMD is sweeping the workstation workloads that were Intel’s last strength.

And then AMD gets distracted by ATI, like you said. They weren’t *wrong* to see combined graphics and CPU as the future of the company, but they end up paying 2x what they should’ve for ATI and it takes years to integrate the companies. They probably should’ve sucked it up, bought Nvidia, and let Jen-Hsun be CEO.

They do a 65nm die shrink on Athlon 64, but it doesn’t buy them as much clock speed as they would need to catch up to Core 2 Duo. This is where things start to go south.

If Phenom had been Phenom II, AMD would’ve had a serious competitive position established against Intel in 2007 and into 2008. But by then, they were dealing with the ATI acquisition and numerous other market challenges.

FSRed94

Why did AMD end up paying so much for ATI? Or at least the best guesses we have as to why they overpaid. Has there ever been any accurate speculation as to what they would have paid for NVIDIA?

NVIDIA probably would have made more sense anyway, as they had already developed several successful chipsets for K7/K8 (well, except maybe nforce 3). A rock solid, in house chipset may have helped in the corporate market. I have read AMD’s chipsets improved quite a bit as a result of acquiring ATI.

Llano came too late. I almost said Bobcat as well but that was very competitive even if late. I think it’s very possible they didn’t have the money they needed for R&D because of overpaying for ATI and Conroe eating into their sales.

GloFo didn’t help. I think it’s very likely Apple would have gone with Llano if they could have produced enough. If Zen is good enough, and if an APU variant can be produced in sufficient quantities soon enough (2017), I could see Apple taking another serious look at them. Or at least “looking” at them to get a better deal from Intel. I think that would be AMD’s best shot at making some money again. That’s a lot of “ifs” though, and I’ll believe it when I see it.

The Pentium M itself was just a tweaked Pentium III, whose lineage in turn goes all the way back to the P6 core (Pentium Pro). The 1.4GHz Tualatin-core Pentium IIIs would spank 1.8GHz Pentium 4s at most things people cared about (basically anything not heavily using SSE2).

wownwow

Thanks for the article. It would be nice to have a similar table for Intel’s caches instead of just showing AMD’s inferior caches.

About AMD’s L1 caches, SHARED, THREE-way, and 3-4 clock LATENCY, were they designed by engineers with brains that:

On the left side, there is nothing right, and on the right side, there is nothing left?

Joel Hruska

Intel hasn’t changed its cache structure since Nehalem, as far as I know. All Core processors since the Core 2 Duo use the following system:

And then a varying amount of L3 cache, 16-way associative if I recall.

The 3-4 cycle latency on L1 is not AMD’s problem. In fact, that’s firmly in-line with what Intel achieves on L1 latency. AMD’s L2 cache latencies are substantially higher than Intel’s, as is the L3 latency on Piledriver CPUs. This is a much larger problem.

Finally, I suggest you use nanoseconds to measure cache latency, not clock cycles. An L1 cache with a 3-cycle latency on a CPU clocked at 4GHz is faster than an L1 cache with a 3-cycle latency clocked at 3GHz. Nanoseconds are an absolute measure of time, clock cycles are relative.

wownwow

Joel, thanks again for the information. Per test results, at least 4-way is needed to be competitive. Will Zen have at least 4-way, or do you have any good link handy for the details of Zen?

Joel Hruska

The current rumor is that the L1 is 64KB split between 32KB instruction and 32KB data and eight-way associative. The cache sizes have been at least backed up by software patches that reference them, the associativity is not backed up, but is a reasonable “best guess.”

If Carrizo were inserted into the table above, it would look somewhat different. The L1 data cache increased to 32KB, from 16KB, and is eight-way associative, up from four-way.

AMD’s L2 and L3 caches have historically been 16-way associative. The problem with the BD line of chips is that the L2 latency wasn’t very good (this was somewhat fixed in Piledriver, but never completely addressed).

Zach Acox

Were this a post about specific SKUs and clock speeds, nanoseconds would be the appropriate way to describe relative speed. Although this post calls out several different CPU SKUs, it aptly defines latency in terms of clock cycles.

Example: A 5 GHz CPU has a clock cycle time of 200 picoseconds, while a 2.5 GHz CPU’s cycle time is 400 picoseconds, yet if the 5 GHz CPU has 10-cycle latency and the 2.5 GHz CPU has 5-cycle latency, both will access the cache in the same number of nanoseconds.

Joel Hruska

Zach,

Sure, I agree that nanoseconds are most important when comparing between CPUs, but wow&wow explicitly identified a 3-4 clock latency as AMD’s problem as compared to Intel’s superior caches.

Because the vast majority of the work I do when discussing these issues *is* comparative, I tend to prefer nanoseconds over cycles. For my purposes, nanoseconds tends to be more accurate.

The big picture, of course, is that both AMD and Intel have said that reducing L1 latency from 4-cycles to 3-cycles doesn’t get them very much performance.

FSRed94

I read somewhere, Anandtech I think, that Intel purposely moved from a 3 cycle to 4 cycle L1 as it allowed for a higher clock speed, which was of much more benefit than one cycle.

FSRed94

About the L1, is it being write-through a problem?

Joel Hruska

It was definitely a theoretical issue, in that it could slow down performance in certain cases. I am not certain if it was a real-world cause of BD’s low performance.

tektel

This article was already published about a year/year and a half ago.

Joel Hruska

That’s correct. We occasionally resurface stories like this after updating them if necessary. While the vast majority of what we publish is new content, stories like this one remain relevant even after several years.

wownwow

Thanks for resurfacing the useful information and updating it if needed. It still benefits people who didn’t read it before, like myself.

Tom

You could find out how greatly the cache affected performance pretty easily back in the day. Lots of computers let you disable the cache in the BIOS. It absolutely CRIPPLED performance. We’re talking Windows 95/98 going from booting in a minute or two to taking the better part of an hour. An astounding difference for what amounted to a couple hundred kilobytes of memory (if that) being switched on or off.

FSRed94

“In the real world, an L1 cache typically has a hit rate between 95% and 97%….”.

I’ve always wanted a little more elaboration on this. Are we talking instruction, data, or both? Instruction is easy enough to believe. Since we are talking about the relative speed of cache vs DRAM here though, it seems we must be talking about data.

L1 is very fast and low latency. It is also fairly small. I’m sure it can hold a good amount of what is needed (for loops come to mind). Seeing how far away main memory is though, the CPU must have a very good idea of what it will need next so it can load it into cache before it needs it. Is this where data prefetch comes into play?

Joel Hruska

Prefetch and branch prediction both come into play here, but I am honestly not sure how to characterize the precise difference between what information is stored in the L1D cache vs. what is stored in the L2 cache — meaning I am not certain if this is strictly a difference in *likelihood* of needing data or if the system preferentially puts different kinds of data in different locations. I’m fairly certain it’s the former.

One thing I do know is that the difference between 95% and 97% is huge because of the sheer number of cycles that a CPU performs per second. Those last few % points can dominate your overall performance because the penalty if you go out to main memory is just so huge compared to L2.

Bert Tweetering

Do SMT processors carry two sets of L1 instruction and data caches per physical core? (assuming dual thread like in the intel i-cores with HT)

yumri

Intel tried L4?
I guess 128MB is too little. Maybe if they had a better GPU and only used it for the GPU side, it would have done better. I still see many games that only need 32MB or 128MB of graphics RAM, but all of them are indie games.
Unless Intel wants to compete with AMD and nVidia for low-end gaming, they won’t try L4 again.

Joel Hruska

Intel continues to use L4. That’s what Crystal Well is — an L4 cache running at 1600MHz.

kritikl

HARDWARE REQUIREMENT QUESTION: What would be the least number of cores required on a gaming PC to play very fast games? Is 8 cores sufficient? Techs and extreme gamers are welcome to answer my question.

Seb

The “Hit Rate” chart is wrong. If you take the last datapoint, the chart says: 55 + 90 = 95. This makes me question the validity of the entire article.
Of course, if I’m wrong, please correct me.

