Intel Haswell CPU: How it works

Haswell continues the 22nm production process of the previous-generation Ivy Bridge, with changes across three key areas: core CPU performance, iGPU improvements and power consumption reductions. We’ve included tables here covering the specific CPU models we’ve found, so rather than talk about specific chips, we’ll look at the architecture and mention CPU models as necessary along the way.

Low-level cache bandwidth

Haswell features the same low-level cache amounts as Ivy Bridge – a 32KB instruction/32KB data L1 cache and a 256KB unified L2 cache per core. The main difference over Sandy Bridge/Ivy Bridge is the bandwidth: the number of bytes that can be loaded and stored per clock cycle doubles from 32 and 16 to 64 and 32 respectively. The inter-cache bandwidth also jumps from 32 bytes per cycle to 64 bytes.
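To put those per-cycle figures into more familiar units, peak bandwidth is simply bytes per cycle multiplied by clock speed. Here’s a minimal sketch of that arithmetic; the 3.5GHz clock is an assumed example figure rather than one from Intel’s material, and the function name is our own.

```python
# Peak L1 bandwidth per core = bytes moved per cycle x clock speed.
# CLOCK_GHZ is an assumed example clock, not a figure quoted by Intel.
CLOCK_GHZ = 3.5


def peak_bandwidth_gb_s(bytes_per_cycle, clock_ghz=CLOCK_GHZ):
    """Return peak bandwidth in GB/s (bytes per cycle x GHz)."""
    return bytes_per_cycle * clock_ghz


# Per-cycle load/store figures quoted in the text.
ivy_bridge = {"load": 32, "store": 16}
haswell = {"load": 64, "store": 32}

for name, chip in (("Ivy Bridge", ivy_bridge), ("Haswell", haswell)):
    load = peak_bandwidth_gb_s(chip["load"])
    store = peak_bandwidth_gb_s(chip["store"])
    print(f"{name}: load {load:.0f} GB/s, store {store:.0f} GB/s")
```

These are theoretical peaks per core; real code rarely sustains them.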

What we used to call L3 cache is now known as last-level cache (LLC). While it’s not a unified cache like the L2, Intel says it scales according to the number of cores in operation, which basically means it simply switches on as each core does.

Haswell’s LLC isn’t unified but scales with the number of cores in use.

Advanced Vector Extensions 2

From an instruction viewpoint, today’s x86 CPUs are barely recognisable from the Pentium 4 chips we had 10 or so years ago, thanks in part to the continual evolution in instruction extensions (SSE, SSSE3 and other SIMD additions) added to wider and wider instruction paths. While there’s growing interest in parallel GPU computing, Intel continues to persist with CPU-only improvements, adding Advanced Vector Extensions 2 (AVX2) to the Haswell architecture. AVX2 delivers features including 256-bit integer vectors and fused multiply-add (FMA), which means it calculates the result of a multiply-add to full precision and rounds only once at the end.
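The rounding difference is easier to see with numbers. The sketch below doesn’t use the FMA hardware instruction at all; it emulates the single-rounding behaviour with Python’s exact fractions, and the function names and input values are our own, chosen so the intermediate rounding error is visible.

```python
from fractions import Fraction


def unfused_multiply_add(a, b, c):
    """Multiply and round to a double, then add and round again."""
    return a * b + c


def fused_multiply_add(a, b, c):
    """Emulate FMA: compute a*b+c exactly, then round once to a double."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)


# Inputs picked so the exact product (1 - 2**-54) cannot be held in a double.
a = 1.0 + 2.0 ** -27
b = 1.0 - 2.0 ** -27
c = -1.0

print("unfused:", unfused_multiply_add(a, b, c))
print("fused:  ", fused_multiply_add(a, b, c))
```

With separate rounding, the product is rounded to exactly 1.0 before the add, so the small remainder is lost; the single-rounding version keeps it.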

This has the effect of doubling single- and double-precision floating point operations (SP FLOPS and DP FLOPS) per clock cycle over the Sandy Bridge/Ivy Bridge architecture, from 16 to 32 and from 8 to 16 respectively. Intel expects you to gain the benefit in gaming and high-performance computing applications such as video processing.
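Those per-cycle numbers fall out of simple arithmetic on vector width and execution units. The sketch below assumes, from Intel’s public architecture descriptions rather than anything in the figures above, that Sandy Bridge/Ivy Bridge can issue one 256-bit multiply and one 256-bit add per cycle, while Haswell can issue two 256-bit FMAs, each counted as two operations.

```python
VECTOR_BITS = 256  # AVX/AVX2 register width


def flops_per_cycle(element_bits, units, ops_per_unit):
    """Peak floating point operations per core per clock cycle."""
    lanes = VECTOR_BITS // element_bits
    return lanes * units * ops_per_unit


# Assumption: one multiply unit plus one add unit, one operation each.
SANDY_IVY = {"units": 2, "ops_per_unit": 1}
# Assumption: two FMA units, each FMA counting as a multiply and an add.
HASWELL = {"units": 2, "ops_per_unit": 2}

for label, bits in (("SP", 32), ("DP", 64)):
    old = flops_per_cycle(bits, **SANDY_IVY)
    new = flops_per_cycle(bits, **HASWELL)
    print(f"{label} FLOPS per cycle: {old} -> {new}")
```

The doubling only shows up in code that can actually use FMA, which is why it’s a peak figure rather than a guaranteed speed-up.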

Overclocking

One of the disappointing aspects of the Sandy Bridge/Ivy Bridge era was Intel’s decision to unify all discrete system component clocks off the back of a single base clock (Bclk), which removed the fine-grained overclocking control that the frontside bus (FSB) clock rate gave on Intel’s earlier platforms. By also locking the multiplier setting on everything other than the premium-priced, top-drawer K-series CPUs, Intel effectively turned overclocking into a rich person’s sport overnight.

Intel delivered something of a compromise solution on its high-performance Sandy Bridge-E/LGA2011 platform, which it has repackaged and now added to Haswell.

Like LGA2011, Haswell provides a small +/-5 to 7% Bclk tweak range on what’s essentially now a three-speed gearbox called the PEG/DMI Ratio (PDR). PEG stands for PCI Express Graphics and DMI is the Direct Media Interface, the bus that links what used to be the northbridge and southbridge sections (now the CPU and the chipset).

Those ratios are 1x (5:5), 1.25x (5:4) and 1.66x (5:3). Provided your Haswell CPU allows it, you can use a combination of these three options to set up an overclock rate. Before you start ripping up the dance floor in unbridled joy, there’s some fine print – ‘only some processors enable part or all of these features’, says the Intel presentation slide. Some processors? Part of these features? It should be a given that all overclock options are on the table for the unlocked K-series chips. And as a bonus, Intel is also increasing the K-series’ allowable core ratio multiplier from 63 to 80. Throw in the 1.66x PDR, an extra 7% base clock tweak for good luck and, in theory, that’s a top clock of around 14GHz. No-one’s saying you’d ever get anywhere near that, but there’ll be a few liquid nitrogen hardheads who’ll have fun giving the current land speed record of 8.8GHz (an AMD FX-8150) a real nudge.
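For anyone wanting to check the theoretical ceiling, the sum is base clock x PDR x Bclk tweak x core multiplier. Here’s a rough sketch, assuming a nominal 100MHz base clock (our assumption, not stated on the slide) and using our own function name.

```python
from fractions import Fraction

BASE_CLOCK_MHZ = 100  # assumed nominal Bclk

# The three PEG/DMI ratio 'gears' quoted in the text.
PDR_RATIOS = {
    "1x": Fraction(5, 5),
    "1.25x": Fraction(5, 4),
    "1.66x": Fraction(5, 3),
}


def core_clock_ghz(pdr, bclk_tweak_percent, multiplier):
    """Theoretical core clock in GHz for a PDR gear, Bclk tweak and multiplier."""
    bclk_mhz = BASE_CLOCK_MHZ * PDR_RATIOS[pdr] * (1 + Fraction(bclk_tweak_percent, 100))
    return float(bclk_mhz * multiplier) / 1000


# Everything turned up to the maximum: 1.66x gear, +7% tweak, 80x multiplier.
print(f"{core_clock_ghz('1.66x', 7, 80):.2f} GHz")
# A stock-style setting for comparison: 1x gear, no tweak, 35x multiplier.
print(f"{core_clock_ghz('1x', 0, 35):.2f} GHz")
```

The first figure is purely a paper number; it says nothing about what the silicon will actually sustain.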

For everything else, we’re still in the dark about just how much overclocking non-K chips will allow. If PDR and Bclk settings are locked on everything but K-chips, it’s far less interesting, since K-chips already have an unlocked multiplier to play with. It’d be like having an everyday car with a single-gear straight-through transmission and a sports car with two manual gearboxes – neither would make much sense.

Sandy Bridge-E/LGA2011 overclocking comes to Haswell, but which chips?

Integrated voltage regulation

The other half of the overclocking equation is voltage control, and here Haswell gets one of the more significant changes in CPU history. Up until now, you’ve probably seen the small heatsinks that line the outer rim of the CPU socket. They cool the voltage regulator modules (VRMs) that deliver precise voltage levels to various parts of the CPU: the cores, the graphics module, the memory controller and so on.

For the first time, Haswell integrates those VRMs into the silicon, so instead of the multiple voltage inputs you had with Ivy Bridge, Sandy Bridge and every CPU before that, Haswell now has just one, called VCCIN (there’s another for the DRAM controller too, but that’s another story). You can still overvolt the CPU; in fact, integrated voltage regulation (iVR) makes overvolting more reliable because the physical track distances between the VRMs and the actual CPU circuitry are drastically reduced. Integrating VRMs will add to the CPU’s heat load, but you not only end up with a more compact board, the VRMs also now get active cooling under the CPU heatsink/fan combo. The main benefit from an overclocking viewpoint is ‘cleaner’ power. Because the physical VRM layout is no longer at the discretion of motherboard designers and is now buried inside the CPU, you get a consistency of power quality across all systems that you couldn’t have before.

You can push the cores and other sections up to 2V, while VCCIN, which should normally run between 1.8 and 2.3V, can go all the way to 3.04V. The bottom line is there are no artificial limitations on voltage control for Haswell overclocking, allowing you to push the chip as far as you have the guts to.

Reducing power consumption

In our view, Intel’s ‘tock’ chips are often better than its ‘tick’ models. Given the amount of work required to implement a new, smaller-scale lithographic process, Intel engineers have to spend time on a ‘tick’ just making sure the new scale works. And that’s come through here, with Intel claiming that on Haswell it has further refined the 22nm process with lower-power tri-gate transistors and made better use of its transistor budget per die.

Intel has used a number of standard power reduction techniques in the past that get a run here on Haswell. They include greater voltage/frequency scaling, where the core voltage is reduced in step with the CPU clock speed. A lower voltage means lower current, which equals significantly lower power consumption. Another trick CPU designers are increasingly using is what’s called gating, whereby unused cores and general processor logic are switched on and off as required. However, the gating mechanism needs to be lightning-quick in order to reduce propagation delays through the processing chain, and integrating the VRMs onto silicon goes a long way to reducing those delays. Even so, gating is a highly effective way of further reducing power consumption.

Haswell extends this by decoupling each core from the LLC to trim power consumption further, which Intel calls fine-grain control. Microsoft’s Windows 8.1 release is rumoured to link in with all of these features.

Haswell’s new power states will be used by the upcoming Windows 8.1 OS release.

New integrated GPU

You have to imagine Intel engineers also much prefer playing with ‘tock’ CPUs than ‘ticks’, as it’s here with the ‘tock’ models that they get to add more toys. And one of the big new pressies in Haswell is a vastly improved iGPU. In fact, there’ll be four variants: GT1, GT2, GT3 and GT3e. The GT3/3e options (GT3e includes embedded DRAM) will be known as the HD 5100/5200 and deliver 40 execution units (EUs). By comparison, Ivy Bridge’s HD 4000 had just 16 EUs. Disappointingly, though, most of the desktop chips will only include the GT2 option with its 20 EUs, with one exception: the new R-model desktop chips.

New sockets

This is a good time to break into our analysis and look at one of the more controversial side issues of Haswell. The new platform also introduces a number of new sockets, but if it happens as expected, there’ll be two package options for the desktop: the standard LGA1150 for the majority of CPUs, as well as a BGA/pre-soldered option for the new R-suffixed chips, the only ones that will see the new top-grade GT3 iGPU.

Among the rumours swirling at the end of last year was one that Intel was about to drop CPU sockets in favour of non-upgradable motherboards. It’s worth noting from the tables that while the R-series CPUs may have the best GT3 iGPUs, they’re slower/lower-powered parts with 65W TDPs. The Core i7-4770R, Core i5-4670R and i5-4570R parts are well down on clock speed compared with their top family parts. That makes them far more likely to end up in NUCs, compact mini-ITX boxes or all-in-one PC designs where TDPs need to be more carefully managed.

But the obvious question remains: why hasn’t Intel provided an LGA1150 option with its best GT3 iGPUs? Our initial guess is heat production. Cramming a 40-EU GT3 iGPU into a Core i7-4770K along with the new iVR could see the TDP climb well over the 100W mark.

iGPU performance

Intel is trying to appeal to everyone given its new inclusions – DirectX 11.1, OpenGL 4.0 and OpenCL 1.2. AVX2 might give the impression that there’s not much interest from Intel in parallel processing, but it’ll be interesting to see if Haswell’s iGPU now has enough performance to make OpenCL a serious option.

However, the iGPU API additions only tell part of the story. While the GT3/3e is expected to double the performance of the HD 4000, there are significant additions to other capabilities, particularly display and media coding.

Quick Sync is Intel’s iGPU-based hardware-accelerated MPEG-2 and AVC encoding. While it offers impressive encoding speed increases, some users have complained that the drop in encoding quality makes it less palatable. However, the chip giant says it has improved Quick Sync with full hardware encoding, as well as adding MJPEG and SVC encoding/decoding acceleration.

The official introduction of Thunderbolt into Lynx Point also means that multi-monitor setups are easier to achieve with a Haswell iGPU. It’ll handle up to 4K resolution (4096 x 2160 pixels) and up to three independent displays directly connected, or possibly more through DisplayPort ‘daisy-chaining’ or a DisplayPort hub.

iGPU power consumption

All of these tricks are easy to do on the desktop, but if Intel is to make its mark against ARM in the tablet market, it’ll need to do much of this with greatly reduced power consumption. It’s here that the chip maker is happy to give us an insight into the problems.

Intel says there are two main components of power consumption: dynamic power and what it calls leakage. Dynamic power goes up with clock speed and transistor count and, thanks to Ohm’s Law, goes up quickly with voltage (power = V²/R).

However, leakage goes up with transistor count and occurs regardless of clock speed, but apparently climbs even faster with voltage changes (something approaching V³). Without going into the circuit theory, Intel says you obtain your best power efficiency by running the maximum clock speed you can at the lowest voltage (Fmax@Vmin). That’s because once you increase beyond your minimum voltage, power consumption climbs at a faster rate.
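To see how quickly those two curves pull away, here’s a small sketch that applies the square and cube relationships quoted above to a voltage bump. It’s only the simplified model from the text, not a real silicon characterisation, and the function names are our own.

```python
def dynamic_power_factor(voltage_ratio):
    """Relative dynamic power for a voltage change, using the V squared model."""
    return voltage_ratio ** 2


def leakage_power_factor(voltage_ratio):
    """Relative leakage for a voltage change, using the 'approaching V cubed' model."""
    return voltage_ratio ** 3


# How much each component grows for a few voltage increases over Vmin.
for percent in (5, 10, 20):
    ratio = 1 + percent / 100
    dynamic = dynamic_power_factor(ratio)
    leakage = leakage_power_factor(ratio)
    print(f"+{percent}% voltage: dynamic x{dynamic:.2f}, leakage x{leakage:.2f}")
```

Under this model, a 10% voltage increase costs roughly 21% more dynamic power and about 33% more leakage before any clock speed change is counted.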

Another trick to reduce power demands is moving tasks from general-purpose execution engines to single- or fixed-function hardware (video processing is Intel’s example here). It also makes sense – purpose-designed hardware is generally more efficient than general-purpose units because you need fewer transistors, and it’s the number of transistors constantly switching that directly affects power consumption.

When it comes to the iGPU’s bread and butter – displaying video – Intel uses another technique called ‘race to halt’. If you’re displaying video that requires a fixed frame rate, there’s no benefit in displaying it at a higher rate, especially to power consumption. So Intel engineers use the idea of race to halt, which means maintaining the Fmax@Vmin performance level and switching off for micro-sleeps as soon as the required frame rate is achieved. The alternative is to drop the iGPU clock rate to a ‘just enough’ level, but Intel says that’s not as efficient as maintaining Fmax@Vmin performance and micro-sleeping.

This is an important concept because there’s a general view that you can achieve lower device power consumption by simply dropping the CPU clock frequency and running at a slower rate. What Intel is saying is that because of quiescent power draw (power used simply because transistors are there), there’s a better return if you stick to your most efficient clock rate and micro-sleep when you meet your time-dependent requirements.
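A toy energy budget shows why race to halt can come out ahead. The sketch below is our own simplified model with made-up wattages, not Intel’s figures: it assumes the voltage is already at Vmin, so cutting the clock only reduces dynamic power in proportion, while leakage is paid for as long as the logic stays powered.

```python
def race_to_halt_energy(p_dynamic, p_leakage, p_sleep, busy_fraction, frame_time):
    """Energy per frame when running at Fmax@Vmin, then micro-sleeping."""
    active_time = busy_fraction * frame_time
    sleep_time = frame_time - active_time
    return (p_dynamic + p_leakage) * active_time + p_sleep * sleep_time


def just_enough_energy(p_dynamic, p_leakage, busy_fraction, frame_time):
    """Energy per frame when the clock is cut so the work fills the whole frame.

    Voltage is already at Vmin, so only dynamic power falls (with frequency);
    leakage is drawn for the entire frame.
    """
    return (p_dynamic * busy_fraction + p_leakage) * frame_time


# Illustrative figures only (watts and seconds), not measurements.
FRAME_TIME = 1 / 30   # one frame of 30fps video
race = race_to_halt_energy(2.0, 1.0, 0.1, 0.25, FRAME_TIME)
slow = just_enough_energy(2.0, 1.0, 0.25, FRAME_TIME)

print(f"race to halt: {race * 1000:.1f} mJ per frame")
print(f"just enough:  {slow * 1000:.1f} mJ per frame")
```

In this model the dynamic energy is the same either way, since the same work gets done; the saving comes entirely from not paying leakage while asleep, so the advantage shrinks as sleep power rises or the busy fraction approaches 100%.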

You can be sure Intel doesn’t have a monopoly on these techniques – AMD and ARM will be using similar technology to reduce power consumption; in fact, the Cortex-M3, one of ARM’s smaller CPUs aimed at embedded applications, already has iVR. Doing it on a CPU the scale of Haswell is something different again, but it’s still worth understanding how CPU/iGPU power consumption can be optimised.

The one interesting side note to all of this, however, has been the increase in TDPs of the top-performing parts such as the Core i7-4770K and i5-4670K, up from 77W to 84W compared with their Ivy Bridge equivalents. That’s led to speculation that Haswell won’t have the power reductions Intel is claiming, something the chip maker vigorously denies. Given the VRMs are now integrated, TDPs will naturally go up, although on the other side of the ledger, Intel would also gain elsewhere through improved efficiencies. Those improvements will have to shine through, particularly on the ULV parts, if we’re to see anything like 9 hours-plus battery life. That said, you don’t need to go to ULV parts to see the benefits – look at the T-series desktop LGA1150 options that offer full core/thread counts at slightly lower clock speeds but with impressive 45W and even 35W TDPs.

The ULT version of Haswell even integrates PCIe onto the chip silicon.

What’s next?

In the enterprise arena, power consumption matters as much as performance, so expect to see even greater levels of integration and more system-on-a-chip (SoC) designs in the coming years.

Haswell isn’t the only show for Intel in 2013. The chip maker will launch Ivy Bridge-EP (Xeon E5) and EX (Xeon E7) server chips later this year. You can also expect to hear quite a bit about Avoton, the next-generation 64-bit Atom CPU. Atom might be a dead loss on the desktop, but it will compete hard against ARM in the burgeoning microserver market. Avoton is being built using the same 22nm process as Haswell and, with 64-bit instructions, Intel no doubt believes it’ll offer more features than ARM can deliver right now.