Addressing the Memory Bandwidth Problem

Integrated graphics solutions always bumped into a glass ceiling because they lacked the high-speed memory interfaces of their discrete counterparts. As Haswell is predominantly a mobile-focused architecture, designed to span the gamut from 10W to 84W TDPs, relying on a power-hungry high-speed external memory interface wasn’t going to cut it. Intel’s solution to the problem, like most of Intel’s solutions, involves custom silicon. As an owner of several bleeding edge foundries, would you expect anything less?

As we’ve been talking about for a while now, the highest end Haswell graphics configuration includes 128MB of eDRAM on-package. The eDRAM itself is a custom design by Intel and it’s built on a variant of Intel’s P1271 22nm SoC process (not P1270, the CPU process). Intel needed a set of low leakage 22nm transistors rather than the ability to drive very high frequencies, which is why it’s using the mobile SoC 22nm process variant here.

Despite its name, the eDRAM silicon is actually separate from the main microprocessor die - it’s simply housed on the same package. Intel’s reasoning here is obvious. By making Crystalwell (the codename for the eDRAM silicon) a discrete die, it’s easier to respond to changes in demand. If Crystalwell demand is lower than expected, Intel still has a lot of quad-core GT3 Haswell die that it can sell and vice versa.

Crystalwell Architecture

Unlike previous eDRAM implementations in game consoles, Crystalwell is a true 4th level cache in the memory hierarchy. It acts as a victim buffer to the L3 cache, meaning anything evicted from L3 cache immediately goes into the L4 cache. Both CPU and GPU requests are cached. The cache can dynamically allocate its partitioning between CPU and GPU use. If you don’t use the GPU at all (e.g. discrete GPU installed), Crystalwell will still work on caching CPU requests. That’s right, Haswell CPUs equipped with Crystalwell effectively have a 128MB L4 cache.
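To make the victim buffer idea concrete, here is a minimal Python sketch of a two-level hierarchy in which every line evicted from L3 drops into L4. Intel hasn't disclosed Crystalwell's replacement policy, associativity or whether it is strictly exclusive, so the LRU policy, the fully associative layout, the move-back-to-L3-on-hit behavior and all class names below are assumptions made purely for illustration.

```python
from collections import OrderedDict


class LRUCache:
    """Tiny fully associative LRU cache keyed by line address (illustrative only)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()

    def lookup(self, addr):
        """Return True on a hit and mark the line as most recently used."""
        if addr in self.lines:
            self.lines.move_to_end(addr)
            return True
        return False

    def remove(self, addr):
        """Drop a line if present."""
        self.lines.pop(addr, None)

    def insert(self, addr):
        """Insert a line; return the evicted address, or None if nothing was evicted."""
        self.lines[addr] = True
        self.lines.move_to_end(addr)
        if len(self.lines) > self.capacity:
            victim, _ = self.lines.popitem(last=False)  # least recently used
            return victim
        return None


class VictimHierarchy:
    """L3 backed by an L4 that is filled only by L3 evictions (assumed behavior)."""

    def __init__(self, l3_lines, l4_lines):
        self.l3 = LRUCache(l3_lines)
        self.l4 = LRUCache(l4_lines)

    def access(self, addr):
        """Return where the request was served from: 'L3', 'L4' or 'DRAM'."""
        if self.l3.lookup(addr):
            return "L3"
        if self.l4.lookup(addr):
            self.l4.remove(addr)  # assumption: the line moves back up to L3
            source = "L4"
        else:
            source = "DRAM"
        victim = self.l3.insert(addr)
        if victim is not None:
            self.l4.insert(victim)  # anything evicted from L3 goes into L4
        return source


if __name__ == "__main__":
    h = VictimHierarchy(l3_lines=2, l4_lines=4)
    for addr in (1, 2, 3, 1, 3):
        print(addr, h.access(addr))
```

With a two-line L3, the third access pushes line 1 out of L3 and into L4, so the later request for line 1 is served from L4 rather than main memory.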

Intel isn’t providing much detail on the connection to Crystalwell other than to say that it’s a narrow, double-pumped serial interface capable of delivering 50GB/s bi-directional bandwidth (100GB/s aggregate). Access latency after a miss in the L3 cache is 30 - 32ns, nicely in between an L3 and main memory access.

The eDRAM clock tops out at 1.6GHz.

There’s only a single size of eDRAM offered this generation: 128MB. Since it’s a cache and not a buffer (and a giant one at that), Intel found that hit rate rarely dropped below 95%. It turns out that for current workloads, Intel didn’t see much benefit beyond a 32MB eDRAM; however, it wanted the design to be future proof. Intel doubled the size to deal with any increases in game complexity, and doubled it again just to be sure. I believe the exact wording Intel’s Tom Piazza used when explaining the choice of 128MB was “go big or go home”. It’s very rare that we see Intel be so liberal with die area, which makes me think this 128MB design is going to stick around for a while.

The 32MB number is particularly interesting because it’s the same number Microsoft arrived at for the embedded SRAM on the Xbox One silicon. If you felt that I was hinting heavily at the Xbox One being ok if its eSRAM was indeed a cache, this is why. I’d also like to point out the difference in future proofing between the two designs.

The Crystalwell enabled graphics driver can choose to keep certain things out of the eDRAM. The frame buffer isn’t stored in eDRAM for example.

Peak Theoretical Memory Bandwidth

| | Memory Interface | Memory Frequency | Peak Theoretical Bandwidth |
|---|---|---|---|
| Intel Iris Pro 5200 | 128-bit DDR3 + eDRAM | 1600MHz + 1600MHz eDRAM | 25.6GB/s + 50GB/s eDRAM (bidirectional) |
| NVIDIA GeForce GT 650M | 128-bit GDDR5 | 5016MHz | 80.3GB/s |
| Intel HD 5100/4600/4000 | 128-bit DDR3 | 1600MHz | 25.6GB/s |
| Apple A6X | 128-bit LPDDR2 | 1066MHz | 17.1GB/s |
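The external memory figures in the table above follow from simple arithmetic: bus width in bytes multiplied by the transfer rate. The short Python sketch below reproduces them, assuming the listed frequencies are effective data rates and that 1GB/s means 10^9 bytes per second. The function name is my own, and the eDRAM link is left out because Intel hasn't disclosed its width.

```python
def peak_bandwidth_gbs(bus_width_bits, transfer_rate_mhz):
    """Peak theoretical bandwidth in GB/s (1 GB = 10^9 bytes).

    Assumes transfer_rate_mhz is the effective data rate, i.e. millions of
    transfers per second on each data pin.
    """
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * transfer_rate_mhz * 1e6 / 1e9


# External memory configurations taken from the table above.
configs = {
    "128-bit DDR3 @ 1600MHz": (128, 1600),
    "128-bit GDDR5 @ 5016MHz": (128, 5016),
    "128-bit LPDDR2 @ 1066MHz": (128, 1066),
}

if __name__ == "__main__":
    for name, (width, rate) in configs.items():
        print(f"{name}: {peak_bandwidth_gbs(width, rate):.1f} GB/s")
```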

Intel claims that it would take a 100 - 130GB/s GDDR memory interface to deliver similar effective performance to Crystalwell since the latter is a cache. Accessing the same data (e.g. texture reads) over and over again is greatly benefitted by having a large L4 cache on package.

I get the impression that the plan might be to keep the eDRAM on an n-1 process going forward. When Intel moves to 14nm with Broadwell, it’s entirely possible that Crystalwell will remain at 22nm. Doing so would help Intel put older fabs to use, especially if there’s no need for a near term increase in eDRAM size. I asked about the potential to integrate eDRAM on-die, but was told that it’s far too early for that discussion. Given the size of the 128MB eDRAM on 22nm (~84mm^2), I can understand why. Intel did float an interesting idea by me though. In the future it could integrate 16 - 32MB of eDRAM on-die for specific use cases (e.g. storing the frame buffer).
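For a rough sense of what a smaller on-die eDRAM might cost in area, the Python sketch below scales the ~84mm^2 figure linearly with capacity. That is a big assumption on my part: a real die also carries interface and control logic that wouldn't shrink proportionally, and a future process would change the density, so treat the output as a ballpark on 22nm rather than anything Intel has stated.

```python
EDRAM_SIZE_MB = 128      # Crystalwell capacity
EDRAM_AREA_MM2 = 84.0    # approximate die area on 22nm


def scaled_area_mm2(size_mb):
    """Estimate eDRAM area assuming area scales linearly with capacity.

    Ignores fixed overheads such as interface and control logic, so this is
    only a rough estimate.
    """
    return EDRAM_AREA_MM2 * size_mb / EDRAM_SIZE_MB


if __name__ == "__main__":
    print(f"Density: {EDRAM_SIZE_MB / EDRAM_AREA_MM2:.2f} MB/mm^2")
    for size_mb in (16, 32):
        print(f"{size_mb}MB: ~{scaled_area_mm2(size_mb):.1f} mm^2")
```

Under that assumption, 16 - 32MB works out to very roughly 10 - 21mm^2 on the same 22nm process.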

Intel settled on eDRAM because of its high bandwidth and low power characteristics. According to Intel, Crystalwell’s bandwidth curve is very flat - far more workload independent than GDDR5. The power consumption also sounds very good. At idle, simply refreshing whatever data is stored within, the Crystalwell die will consume between 0.5W and 1W. Under load, operating at full bandwidth, the power usage is 3.5 - 4.5W. The idle figures might sound a bit high, but do keep in mind that since Crystalwell caches both CPU and GPU memory it’s entirely possible to shut off the main memory controller and operate completely on-package depending on the workload. At the same time, I suspect there’s room for future power improvements especially as Crystalwell (or a lower power derivative) heads towards ultra mobile silicon.

Crystalwell is tracked by Haswell’s PCU (Power Control Unit) just like the CPU cores, GPU, L3 cache, etc. Paying attention to thermals, workload and even eDRAM hit rate, the PCU can shift power budget between the CPU, GPU and eDRAM.

Crystalwell is only offered alongside quad-core GT3 Haswell. Unlike previous generations of Intel graphics, high-end socketed desktop parts do not get Crystalwell. Only mobile H-SKUs and desktop (BGA-only) R-SKUs have Crystalwell at this point. Given the potential use as a very large CPU cache, it’s a bit insane that Intel won’t even offer a single K-series SKU with Crystalwell on-board.

As for why lower end parts don’t get it, they simply don’t have high enough memory bandwidth demands - particularly in GT1/GT2 graphics configurations. According to Intel, once you get to about 18W then GT3e starts to make sense but you run into die size constraints there. An Ultrabook SKU with Crystalwell would make a ton of sense, but given where Ultrabooks are headed (price-wise) I’m not sure Intel could get any takers.

177 Comments

This is useless at anything above 1366x768 for games (and even that is questionable as I don't think you were posting minimum fps here). It will also be facing Richland shortly, not AMD's aging Trinity. And the claims of catching a 650M...ROFL. Whatever Intel. I wouldn't touch a device today with less than 1600x900 and want to be able to output it to at least a 1080p when in house (if not higher, 22in or 24in). Discrete is here to stay clearly. I have a Dell i9300 (Geforce 6800) from ~2005 that is more potent and runs 1600x900 stuff fine, I think it has 256MB of memory. My dad has an i9200 (radeon 9700pro with 128mb I think) that this IRIS would have trouble with. Intel has a ways to go before they can claim to take out even the low-end discrete cards. You are NOT going to game on this crap and enjoy it never mind trying to use HDMI/DVI out to a higher res monitor at home. Good for perhaps the NICHE road warrior market, not much more.

But hey, at least it plays quite a bit of the GOG games catalog now...LOL. Icewind Dale and Baldur's Gate should run fine :)

Shimpi's guess as to what will go into the 15-inch rMBP is interesting, but I have a gut feeling that it will not be the case. Despite the huge gains that Iris Pro has over the existing HD 4000, it is still a step back from last year's GT 650M. I doubt Apple will be able to convince its customers to spend $2199 on a computer that has less graphics performance than last year's (now discounted) model. Despite its visual similarity to an Air, the rMBP still has performance as a priority, so my guess is that Apple will stick to discrete for the time-being.

That being said, I think Iris Pro opens up a huge opportunity to the 15-inch rMBP lineup, mainly a lower entry model that finally undercuts the $2000 barrier. In other words, while the $2199 price point may be too high to switch entirely to iGPU, Apple might be able to pull it off at $1799. Want a 15-inch Retina Display? Here's a more affordable model with decent performance. Want a discrete GPU? You can get that with the existing $2199 price point.

As far as the 13-inch version is concerned, my guesses are rather murky. I would agree with the others that a quad-core Haswell with Iris Pro is the best-case scenario for the 13-inch model, but it might be too high an expectation for Apple engineers to live up to. I think Apple's minimum target with the 13-inch rMBP should be dual-core Haswell with Iris 5100. This way, Apple can stick to a lower TDP via dual-core, and while Iris isn't as strong as Iris Pro, its gain over HD 4000 is enough to justify the upgrade. Of course, there's always the chance that Apple has temporary exclusivity on an unannounced dual-core Haswell with Iris Pro, the same way it had exclusivity with ULV Core 2 Duo years ago with MBA, but I prefer not to make Haswell models out of thin air.

You are assuming that the next MBP will have the same chassis size. If thin is in, the dGPU-less Iris Pro is EXTREMELY attractive for heat/power considerations.

More likely is the end of the thicker MBP and separate thin MBAir lines. Almost certainly, starting in two weeks we have just one line, MBP all with retina, all the thickness of MBAir. 11" up to 15".

So the one you pick is the worst of the bunch to show GPU power....jeez. You guys clearly have a CS6 suite lic so why not run Adobe Premiere which uses Cuda and run it vs the same vid render you use in Sony's Vegas? Surely you can rip the same vid in both to find out why you'd seek a CUDA enabled app to rip with. Handbrake looks like they're working on supporting Cuda also shortly. Or heck, try FREEMAKE (yes free with CUDA). Anything besides ignoring CUDA and acting like this is what a user would get at home. If I owned an NV card (and I don't in my desktop) I'd seek cuda for everything I did that I could find. Freemake just put out another update 5/29 a few days ago. http://www.tested.com/tech/windows/1574-handbrake-... 2.5yrs ago it was equal, my guess is they've improved Cuda use by now. You've gotta love Adam and Jamie... :) Glad they branched out past just the Mythbusters show.

I wouldn't call myself an expert on computer hardware, but isn't it possible that Iris Pro's bottleneck at 1600x900 resolutions could be attributed to insufficient video memory? Sure, that eDRAM is a screamer as far as latency is concerned, but if the game is running on higher resolutions and utilising HD textures, that 128MB would fill up really quickly, and the chip would be forced to swap often. Better to not have to keep loading and unloading stuff in memory, right?

Others note the similarity between Crystalwell and the Xbox One's 32MB Cache, but let's not forget that the Xbox One has its own video memory; Iris Pro does not, or put another way, it's only got 128 MB of it. In a time where PC games demand at least 512 MB of video RAM or more, shouldn't the bottleneck that would affect Iris Pro be obvious? 128 MB of RAM is sure as hell a lot more than 0, but if games demand at least four times as much memory, then wouldn't Iris Pro be forced to use regular RAM to compensate, still? This sounds to me like what's causing Iris Pro to choke at higher resolutions.

If I am at least right about Crystalwell, it is still very impressive that Iris Pro was able to get in reach of the GT 650M with so little memory to work with. It could also explain why Iris Pro does so much better in Crysis: Warhead, where the minimum requirements are more lenient with video memory (256 MB minimum). If I am wrong, however, somebody please correct me, and I would love to have more discussion on this matter.

The video memory is stored in main memory, be it 4GB and above (so the min specs of Crysis are clearly met). The point is bandwidth. The article says there is roughly 50GB/s when the cache runs at 1.6GHz. So ramping it up in the future makes the new Iris 5300, I suppose.

Uhh, what? Games can use far more than that, seeing them push past 2GB is common. But what matters is how much of that memory needs high bandwidth, and that's where 128MB of cache can be a good enough solution for most games.