See you later, Sandy Bridge. Say hello to tablet-like power characteristics.

In the semiconductor world, integration is omnipresent, driven by Moore’s Law. Integration reduces power and cost while increasing performance. The latest realization of this trend is the System-on-a-Chip (SoC) approach pervasive among PCs, tablets, and smartphones. And the latest SoC is Haswell.

Haswell is the first new family of SoCs from Intel designed from the ground up for the 22nm FinFET process, which uses a non-planar transistor whose gate wraps around the channel on three sides. While Ivy Bridge was the first family of 22nm products, it was not fully optimized for the 22nm process; the CPU was a shrink of the 32nm Sandy Bridge rather than a new design.

The Haswell family encompasses a new CPU core, a new GPU, and numerous system-level changes. More importantly, it marks the beginning of Intel's more unified approach to SoCs. The Haswell family is really a set of building blocks that architects will assemble for specific markets. The Haswell CPU core is a step forward in performance (as is to be expected from Intel) but more importantly, it is a huge step down in power. This SoC should be viable for high-end tablets. Variants of Haswell are aimed as low as 10W, and future improvements may reduce this further. Intel’s 22nm FinFET node is necessary to achieve this wider range, but it's not sufficient. Haswell's architecture fully exploits the benefits of the new process technology in a way that Ivy Bridge never did. It ultimately yields higher performance and lower power, which will translate into PCs and tablets that run faster with a longer battery life.

At the instruction set level, the Haswell core supports four extensions that profoundly transform the x86 ISA. AVX2 widens integer SIMD (Single Instruction Multiple Data, a form of vector processing) to 256-bit vectors and adds a gather instruction for irregular memory access. The fused multiply-add (FMA) instructions improve performance for floating point (FP) workloads. For cryptography, networking, and certain search operations, there are new bit manipulation instructions. Lastly, Haswell is the first widely available product with transactional memory through the TSX extension. TSX is an incredibly powerful model for multi-threaded programming that improves performance and efficiency of software by better utilizing the underlying multicore hardware. Microarchitecturally, the Haswell core achieves even higher performance than Sandy Bridge. The improvements are mainly in out-of-order execution, especially the memory hierarchy. It all strengthens Haswell's case to be the basis of Intel's upcoming generation of products in everything from tablets to servers.

Haswell instruction set and front-end

Haswell introduces four families of new instructions. The first is AVX2, which is a 256-bit extension of existing integer SIMD. It's essentially the counterpart of the floating-point AVX instructions. AVX2 also adds vector permutes and shifts, as well as gather instructions for loading data from non-contiguous addresses. Gather is crucial for compilers to take advantage of wider SIMD instructions (e.g., a 256-bit AVX2 vector can hold 32 single-byte elements).

On the floating point side, Intel's new Fused Multiply Add (FMA) extension includes both 256-bit and 128-bit instructions. Compared to the conventional separate multiply and add instructions in SSE, FMA doubles the theoretical throughput. In addition, the fused instructions eliminate the intermediate rounding stage which can improve accuracy for some approximation algorithms.

The third extension is smaller and more focused: integer bit manipulation instructions (known as BMI) for use in cryptography and packet handling. As an aside, Haswell also adds a big-endian move instruction (MOVBE) that can convert to and from the traditional x86 little-endian format (big-endian data stores the most significant byte first, while little-endian stores the least significant byte first). MOVBE was originally introduced in Atom, and it was added to ensure full compatibility and improve performance for embedded applications.

The most significant ISA extension is TSX, which has been extensively discussed in a previous article on Haswell's transactional memory. In short, TSX separates performance from correctness for multi-threaded programs. Programmers can write simple code that is easier to debug, while the hardware extracts concurrency and performance.

Coarse-grained locking (e.g., locking an entire data structure) is easy to develop, especially when starting with single-threaded code. However, fine-grained locking (e.g., locking a portion of the data structure, such as a single node in a B-tree) almost always performs better. Hardware Lock Elision (HLE) uses hint prefixes to transparently provide the performance and throughput of fine-grained locking, even when programmers use coarse-grained locks.

Restricted Transactional Memory (RTM) is a new programming model that exposes explicit transactions through new instructions. These transactions can span complex data structures and be composed easily. However, it does require linking new libraries using RTM and possibly rewriting software to get the full benefits.
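Real RTM code uses the _xbegin/_xend intrinsics and needs TSX-capable hardware, so what follows is only a software analogy: a toy, single-threaded sketch of the optimistic "speculate, validate, commit or retry" pattern that hardware transactions automate. All names here are illustrative.

```c
#include <assert.h>

/* A cell guarded by a version counter: every committed write bumps the
 * version, so a reader can detect whether a writer intervened. */
typedef struct {
    int value;
    int version;
} Cell;

/* One "transaction" attempt: snapshot the version, do the work
 * speculatively, then commit only if the version is unchanged. In this
 * single-threaded toy the check always passes; with real RTM the hardware
 * performs the equivalent conflict detection on cache lines. */
static int try_increment(Cell *c) {
    int seen = c->version;       /* begin: record state      */
    int tmp = c->value + 1;      /* speculative work         */
    if (c->version != seen)      /* validate: any conflict?  */
        return 0;                /* abort                    */
    c->value = tmp;              /* commit                   */
    c->version++;
    return 1;
}

int increment_with_retry(Cell *c) {
    while (!try_increment(c))
        ;                        /* abort path: retry, as RTM fallback code would */
    return c->value;
}
```

The appeal of RTM is that this retry machinery, plus the conflict detection, happens in hardware at cache-line granularity, so the common no-conflict case pays almost nothing.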

The new instruction set extensions are clearly the biggest change in Haswell's front-end. At a high level, the instruction fetch and decode microarchitecture is largely similar to Sandy Bridge, but there are numerous subtle enhancements to note.

Branch prediction has improved in Haswell, although Intel was unwilling to share the specific details. The instruction cache is still 32KB and 8-way associative, dynamically shared by the two threads, and the instruction TLBs are the same capacity. The major changes are in handling instruction cache misses and prefetching to make better use of the existing resources. Instruction fetch from the instruction cache continues at 16B per cycle, but with more outstanding L1 misses and more timely prefetching.

The decoding for Haswell is largely identical to Sandy Bridge. There are four legacy decoders that take in x86 instructions and emit simpler uops: one complex decoder that can emit 1-4 fused uops and three simple decoders that can emit one fused uop each. Like Sandy Bridge, there is compare+jump fusion and stack pointer elimination. The Haswell uop cache is also identical, with 32 sets of eight cache lines, each line holding up to six uops.

The Haswell uop queue was redesigned to improve single threaded performance. Sandy Bridge had two 28 entry uop queues, one for each thread. However, in Ivy Bridge the uop queues were combined into a single 56 entry structure. The chief advantage is that when a single thread is executing on Ivy Bridge or Haswell, the entire 56 entry uop buffer is available for loop caching and queuing.

New instruction sets sound nice, but doesn't that mean software needs to be compiled specifically for Haswell? Are there any normal programs that use SSE4, for example? (It being already quite old, and still not used very much?)

Linux versions (and Windows probably too) are compiled for generic instruction sets as far as I know, so even though we have AMD64 versions, the rest of the instruction sets are not used.

Or is it only useful for highly optimized software like video encoders? I assume they then create different paths in their code to integrate it.

Generally you need to specifically target those vector instructions like SSE4 to make use of them. They're only useful if you have parallel math operations to do, like video encoding/decoding, so your average application won't use them.

This is exactly the reason that everyone who said Intel couldn't compete with ARM should be wary. Intel has a bad habit of playing down to the level of their competition... until they start getting beat. Once challenged in a segment the market deems important, they pull out all the stops and come out with truly impressive products. It will be interesting to see how far and wide they can successfully deploy this architecture.

New instruction sets sound nice, but doesn't that mean software needs to be compiled specifically for Haswell? Are there any normal programs that use SSE4, for example? (It being already quite old, and still not used very much?)

Linux versions (and Windows probably too) are compiled for generic instruction sets as far as I know, so even though we have AMD64 versions, the rest of the instruction sets are not used.

Or is it only useful for highly optimized software like video encoders? I assume they then create different paths in their code to integrate it.

Does anyone know more about how these instruction sets are used?

The answer is: Yes, programs need to be recompiled (and sometimes parts rewritten) to take advantage of new instructions in processors. That is one of the "potential" upsides of writing code in languages that compile to an intermediate representation and are recompiled before use on the platform (like Java, .NET, IronPython, etc.), because then the runtime can be updated and, "in theory", the rest of your code can immediately take advantage.

Why are you guys referring to Haswell as an SoC? I get that Intel wants to use that terminology to try to put Haswell up against various ARM SoCs, but until someone starts producing packages with integrated RAM and the like, it seems kind of inaccurate. Haswell is an architecture that has the possibility of being used in a SoC, but SoCs are individual designs, often bespoke.

Generally you need to specifically target those vector instructions like SSE4 to make use of them. They're only useful if you have parallel math operations to do, like video encoding/decoding, so your average application won't use them.

That's generally true, but SSE4 (SSE4.1, I assume?) was so useful that opportunities to use it are everywhere (and if you have a compiler with auto-vectorization, it likes to use these instructions and add a generic-x86 fallback). SSE4.1 is a pretty big thing, including "things that should have been in SSE2" like PMULLD and PMAXSD that are widely applicable. AVX2's variable shifts will also fill a gap that will enable many more loops to be (auto-)vectorized.

I'm not convinced that Intel is necessarily going to be faster than the tablet CPU competition. Yes, Intel is putting out some pretty impressive numbers, but Haswell is a future product, so while it may be faster than and power-equivalent to current ARM chips, it'd probably be no match power-wise (historically, anyway) for even lower-power ARM chips that do more on a given wattage (up to a point). Unless most of the advances that Intel touts aren't being used in future ARM chips to improve them... which I doubt. Nonetheless, it should make for a very bright future in the world of SoCs!

Intel won't be able to touch the smartphone ARM chips with Haswell (that's what the revamped Atom is for), but the high end ARM chips in tablets consume 4-8 watts, usually. Haswell will likely consume a fair bit more power than them (but maybe not enough to matter), but on the other hand comparing the performance is silly.

Even the highest end ARM chip is dog slow. People keep reading articles about ARM chips making such massive performance gains, and throwing out quad or 8 cores, but they don't seem to realize that even the mightiest ARM chip would be stomped by a several year old ULV Celeron. The performance difference between every ARM chip and traditional (even low voltage) x86 is absurd.

Remember how laughably slow Atom was 5+ years ago when it was introduced? That's the same Atom that Intel is pitting against ARM, just with some relatively minor tweaks. Intel has no risk of losing the notebook/desktop market to ARM anytime soon... the only risk is the notebook/desktop markets themselves going away, replaced by much more limited tablets/phones (though I think the Surface is the way things will ultimately go... merging the notebook and tablet markets).

Regardless, this should provide for a good show, and hopefully massive efficiency increases (on top of the massive increases we've already seen).

New instruction sets sound nice, but doesn't that mean software needs to be compiled specifically for Haswell? Are there any normal programs that use SSE4, for example? (It being already quite old, and still not used very much?)

Linux versions (and Windows probably too) are compiled for generic instruction sets as far as I know, so even though we have AMD64 versions, the rest of the instruction sets are not used.

Or is it only useful for highly optimized software like video encoders? I assume they then create different paths in their code to integrate it.

Does anyone know more about how these instruction sets are used?

I guess it depends on the compiler settings and the actual hardware. At work we use the Intel Fortran compiler, and you can compile code with it that won't work on older Intel hardware. When coding you don't actually sit and write assembler to make use of SSE4 or AVX2; under the right circumstances, the compiler will optimize the code for the correct extensions.

New instruction sets sound nice, but doesn't that mean software needs to be compiled specifically for Haswell? Are there any normal programs that use SSE4, for example? (It being already quite old, and still not used very much?)

Linux versions (and Windows probably too) are compiled for generic instruction sets as far as I know, so even though we have AMD64 versions, the rest of the instruction sets are not used.

Or is it only useful for highly optimized software like video encoders? I assume they then create different paths in their code to integrate it.

Does anyone know more about how these instruction sets are used?

CLR and VM implementations could take advantage of the new instruction sets and just release a new framework. Most likely it would be video libraries like DirectX and GL.

New instruction sets sound nice, but doesn't that mean software needs to be compiled specifically for Haswell? Are there any normal programs that use SSE4, for example? (It being already quite old, and still not used very much?)

Linux versions (and Windows probably too) are compiled for generic instruction sets as far as I know, so even though we have AMD64 versions, the rest of the instruction sets are not used.

Or is it only useful for highly optimized software like video encoders? I assume they then create different paths in their code to integrate it.

Does anyone know more about how these instruction sets are used?

On the Linux side you could, for example, recompile everything and tell the compiler that it should emit these new instructions. Distributions like Gentoo make this easy.

However, even a very advanced compiler which can auto-vectorize a lot of code (i.e., use these vector instructions for code that is not explicitly written that way) will not help you much. At least on the floating point side, the compiler would often have to rearrange terms in your code to vectorize it. A simple example in C:

float a = 1;
float b[8] = { ... };
for (i = 0; i < 8; ++i) {
    a *= b[i];
}

This seems trivial to vectorize, for example if you can do 4 multiplies simultaneously you can turn a into a vector of 4 elements a1, ..., a4, all initially 1, then do

  a1 a2 a3 a4
* b1 b2 b3 b4
* b5 b6 b7 b8

now a1 = b1*b5, a2 = b2*b6, etc., which used 2 vector multiplications; then at the end you multiply a1*a2*a3*a4 to get the final a. In total you used 5 multiplications instead of 8 (with larger b vectors you obviously get a better speedup, as you pay the fixed cost of 3 scalar multiplications only once).

However, no matter how smart your compiler is, it is not allowed to do this. The reason is that floating point math is not associative, i.e., (a*b)*c != a*(b*c), due to rounding. By the C standard, the code above must be evaluated as a = (((1*b1)*b2)*b3)*..., which is not the order the vector version uses. Unless you specifically allow your compiler to violate the standard, it won't be able to help you. The gcc option -ffast-math lets you tell it that you don't care about correct rounding (among other floating point trickery), but you have to ask for it explicitly. Enabling that option on code you have not examined, where you don't know whether it depends on standards-compliant rounding, is just madness.

PS: yeah... my arrays are 1 based. I'm a math graduate, they just come out that way . I just noticed, but can't be asked to fix it.

This is exactly the reason that everyone who said Intel couldn't compete with ARM should be wary. Intel has a bad habit of playing down to the level of their competition... until they start getting beat. Once challenged in a segment the market deems important, they pull out all the stops and come out with truly impressive products. It will be interesting to see how far and wide they can successfully deploy this architecture.

It's likely to come down to cost more than performance. There's little doubt that Intel can produce a chip with decent performance and a small power envelope but will that product have a place in a $150 tablet?

Haswell, by its nature is a relatively expensive, high performance part that will be well suited to premium tablets like the Surface Pro. Cost alone means that it wouldn't appear in a future iteration of the Nexus 7, even if it had low enough power consumption.

I realize I'm probably late to notice this, but this is the first CPU write-up I've seen on this site by Mr. Kanter. I had already taken to checking out realworldtech after Mr. Stokes left, to bring David over to Ars is fantastic.

This is exactly the reason that everyone who said Intel couldn't compete with ARM should be wary.

Sigh. No it isn't.

Quote:

Intel has a bad habit of playing down to the level of their competition... until they start getting beat. Once challenged in a segment the market deems important, they pull out all the stops and come out with truly impressive products. It will be interesting to see how far and wide they can successfully deploy this architecture.

From the article:

Quote:

Haswell will be the first high performance x86 core that can really fit in a tablet, albeit in high-powered models around the 10W mark rather than the 4W devices.

So this isn't suitable for phones or the vast majority of tablets. 10W, as best as I can gather, is the power budget for the SoC; most tablets on the market allocate at best 3W to the SoC and 6W to the screen. A 10W Haswell still has to power a 6W screen, making it much more comparable to a power-thrifty Surface Pro. Imagine the next-generation Surface Pro running for 7 hours instead of 4.

New instruction sets sound nice, but doesn't that mean software needs to be compiled specifically for Haswell? Are there any normal programs that use SSE4, for example? (It being already quite old, and still not used very much?)

Linux versions (and Windows probably too) are compiled for generic instruction sets as far as I know, so even though we have AMD64 versions, the rest of the instruction sets are not used.

Or is it only useful for highly optimized software like video encoders? I assume they then create different paths in their code to integrate it.

Does anyone know more about how these instruction sets are used?

There is a small but important set of applications and libraries which do. Overall, modern CPUs are quite overpowered, but for those applications where there is no such thing as a fast enough CPU, developers do tend to chase these new instructions quickly, if the new instructions actually help.

Ivy Bridge was a shrink that was optimized for time to market. They discussed in detail at ISSCC how and why this was done (risk and time to market). To really take advantage of 22nm you need a full blown redesign.

For consumers, Haswell will offer a heady combination of Windows 8 (and compatibility with the x86 software base) with excellent performance and tablet-like power characteristics.

That seems, well, a bit hyperbolic. "A heady combination" is a fairly poor word choice for a CPU, no matter what it's running. And when you consider that it's Win8, which, combined with pretty much anything, appears to be a real market turkey, it seems even more out of place.

There's definitely some interesting stuff in Haswell, but the one thing that almost everyone can use is conspicuously absent from the presentations so far: higher per-thread performance. Anyone who's compute-bound can use more of that, and it appears to be the one thing that none of the chip makers can make.

So, they're here to sell us what they can actually build, rather than what the market truly and deeply needs. Most of it will be useful, but these features seem to mostly cover fringe cases. Most seem to address specific pain points, rather than pushing the architecture into truly new territory.

The transactional memory is probably the most powerful new idea, but I suspect it'll take years to matter on desktops. Servers can use it now, but at present chip scale, it seems more a reliability feature. In the long haul, it may eventually end up being a critical factor in letting ordinary human intelligences scale their algorithms to super-multicore offerings, but I'm not sure it's that helpful right now.

Am I underestimating that feature? Will it make that much difference on our 2- and 4-processor machines?

This is exactly the reason that everyone who said Intel couldn't compete with ARM should be wary. Intel has a bad habit of playing down to the level of their competition... until they start getting beat. Once challenged in a segment the market deems important, they pull out all the stops and come out with truly impressive products. It will be interesting to see how far and wide they can successfully deploy this architecture.

It's likely to come down to cost more than performance. There's little doubt that Intel can produce a chip with decent performance and a small power envelope but will that product have a place in a $150 tablet?

Haswell, by its nature is a relatively expensive, high performance part that will be well suited to premium tablets like the Surface Pro. Cost alone means that it wouldn't appear in a future iteration of the Nexus 7, even if it had low enough power consumption.

Bay Trail, running Windows 8 or Android, and with the first fully redesigned Atom core, will compete in that market.

Even the highest end ARM chip is dog slow. People keep reading articles about ARM chips making such massive performance gains, and throwing out quad or 8 cores, but they don't seem to realize that even the mightiest ARM chip would be stomped by a several year old ULV Celeron. The performance difference between every ARM chip and traditional (even low voltage) x86 is absurd.

Even the highest end ARM chip has a fraction of the power consumption though, right? When accounting for that, is it still fair to call the ARM chip "dog slow"? Furthermore, the ARM chips are considerably less expensive, so all considered, I think Intel still has a ways to go. It will be interesting when the 64-bit server oriented ARM chips start appearing.

At any rate, I have no desire to support Intel, and welcome the competition in the ARM market. Thanks to that, the chips will trend away from "dog slow" while remaining not overpriced.

As far as Haswell, I'm wondering if Intel will continue to cripple the i3 series by excluding the crypto instructions and so forth. The way Intel handles instruction extensions is truly obnoxious.

Ivy Bridge was a shrink that was optimized for time to market. They discussed in detail at ISSCC how and why this was done (risk and time to market). To really take advantage of 22nm you need a full blown redesign.

David

I understand.

It was the statement "The CPU was a shrink of the 32nm Sandy Bridge rather than a new design" that left me thinking Haswell was the first to use the new FinFET tech. I now see you mean "geometry" as opposed to individual transistor design.

Imagine the next-generation Surface Pro running for 7 hours instead of 4.

Perfect! That's exactly what I want. Slightly smaller/cooler, slightly better battery life, and performance that utterly destroys anything that ARM makes. I don't need my tablet to blow away in a good stiff breeze, I don't mind if it's a little bigger/heavier. I just want to be able to use it as a real laptop replacement.

Ivy Bridge was a shrink that was optimized for time to market. They discussed in detail at ISSCC how and why this was done (risk and time to market). To really take advantage of 22nm you need a full blown redesign.

David

The article does get a little confusing at times though with the heavy comparisons to Sandy Bridge, yet bits like this as well,

Quote:

The Haswell uop queue was redesigned to improve single threaded performance. Sandy Bridge had two 28 entry uop queues, one for each thread. However, in Ivy Bridge the uop queues were combined into a single 56 entry structure.

which sounds like an Ivy Bridge redesign rather than a Haswell one. I'm guessing that with most of the rest of the article I could probably just swap Ivy Bridge in place of where it says Sandy Bridge, since they're similar in design, but it's not really clear that's the case.

I realize I'm probably late to notice this, but this is the first CPU write-up I've seen on this site by Mr. Kanter. I had already taken to checking out realworldtech after Mr. Stokes left, to bring David over to Ars is fantastic.

Glad to see ya here. =)

Thanks for the compliment and I'm glad you enjoy my site. The Ars folks and I are trying to work out something where I can write here from time to time. So if you like it, please ping Ken or Eric : )

The article does get a little confusing at times though with the heavy comparisons to Sandy Bridge, yet bits like this as well,

Quote:

The Haswell uop queue was redesigned to improve single threaded performance. Sandy Bridge had two 28 entry uop queues, one for each thread. However, in Ivy Bridge the uop queues were combined into a single 56 entry structure.

which sounds like an Ivy Bridge redesign rather than a Haswell one. I'm guessing that with most of the rest of the article I could probably just swap Ivy Bridge in place of where it says Sandy Bridge, since they're similar in design, but it's not really clear that's the case.

Correct, an IVB redesign rather than Haswell in this case, as the change to the decoder was made in IVB rather than SB. At the beginning of the paragraph you excerpted, David mentions that the decoder is largely unchanged from the SB one, except where the uop queues were merged.

I realize I'm probably late to notice this, but this is the first CPU write-up I've seen on this site by Mr. Kanter. I had already taken to checking out realworldtech after Mr. Stokes left, to bring David over to Ars is fantastic.

Glad to see ya here. =)

Thanks for the compliment and I'm glad you enjoy my site. The Ars folks and I are trying to work out something where I can write here from time to time. So if you like it, please ping Ken or Eric : )

David

Your site is the best current site I'm aware of for in-depth information about microarchitectures, so I definitely agree with DigitalMan here. It'd be great to see you make regular contributions to Ars.

The TSX (transactional memory) instructions have me really excited. This alone could dramatically increase scaling.

Currently, syncing contexts requires waiting for your value to be pushed out to L3, which is a lot of clock cycles. With TSX, you don't have to wait; you can just start a transaction, make your change, then commit the transaction.

If no conflict occurs, then you pay almost no penalty for locking. In many algorithms, there are rarely conflicts, and of those conflicts, many are caused by long lock times.

Example:

If your code spends 10% of its time locking, then no matter how many cores you throw at it, you will only ever scale up to 10 cores.

If you can cut that lock time by 10x, then you're only spending 1% of your time locking and you can now scale to 100 cores.

edit: a simplification of the issue, but still applies to many algorithms.

What about their new chipsets? I read a few posts that criticized the Atom line of products because, while the processor itself was low power, nothing was done to reduce the power consumption of the chipset, which itself added significant power draw.

Hopefully the X8X line of chipsets will also boast new features and power reductions.

I understand why you probably need a full chipset to handle a desktop, but for laptops (and tablets) that do not have 6 PCI-E slots, 4 USB 3.0 ports, and 4 RAM slots, couldn't Intel implement reduced functionality on the same die as their new processors?