New instruction sets sound nice, but doesn't that mean software needs to be compiled specifically for Haswell? Are there any normal programs that use SSE4, for example? (It's already quite old, and still not used very much.)

Linux distributions (and probably Windows too) are compiled for generic instruction sets as far as I know, so even though we have AMD64 versions, the rest of the instruction sets are not used.

Or is it only useful for highly optimized software like video encoders? I assume they then create different code paths to integrate it.

Generally you need to specifically target those vector instructions like SSE4 to make use of them. They're only useful if you have parallel math operations to do, like video encoding/decoding, so your average application won't use them.
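To the "different code paths" question: yes, that's typically how it's done. As a rough sketch (mine, not from the article), assuming GCC or Clang, which expose a __builtin_cpu_supports() check backed by CPUID; the scale_* routines are placeholders, and a real SSE4.1 version would use intrinsics:

#include <stddef.h>
#include <stdio.h>

/* Two versions of the same routine. A real build would implement the first
   with SSE4.1 intrinsics; both are plain C here to keep the sketch self-contained. */
static void scale_sse41(float *v, size_t n)  { for (size_t i = 0; i < n; ++i) v[i] *= 2.0f; }
static void scale_scalar(float *v, size_t n) { for (size_t i = 0; i < n; ++i) v[i] *= 2.0f; }

static void scale(float *v, size_t n)
{
    /* __builtin_cpu_supports (GCC/Clang) checks the CPUID feature bits at runtime. */
    if (__builtin_cpu_supports("sse4.1"))
        scale_sse41(v, n);
    else
        scale_scalar(v, n);
}

int main(void)
{
    float v[4] = { 1, 2, 3, 4 };
    scale(v, 4);
    printf("%g %g %g %g\n", v[0], v[1], v[2], v[3]);
    return 0;
}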

This is exactly the reason that everyone who said Intel couldn't compete with ARM should be wary. Intel has a bad habit of playing down to the level of their competition... until they start getting beat. Once challenged in a segment the market deems important, they pull out all the stops and come out with truly impressive products. It will be interesting to see how far and wide they can successfully deploy this architecture.

New instruction sets sound nice, but doesn't that mean software needs to be compiled specifically for Haswell? Are there any normal programs that use SSE4, for example? (It's already quite old, and still not used very much.)

Linux distributions (and probably Windows too) are compiled for generic instruction sets as far as I know, so even though we have AMD64 versions, the rest of the instruction sets are not used.

Or is it only useful for highly optimized software like video encoders? I assume they then create different code paths to integrate it.

Does anyone know more about how these instruction sets are used?

The answer is: yes, programs need to be recompiled (and sometimes parts rewritten) to take advantage of new instructions provided in processors. That is one of the "potential" upsides of writing code in languages that get compiled into an intermediate state and then recompiled before use on the platform (like Java, .NET, IronPython, etc.), because then the runtime can be updated and, in theory, the rest of your code can immediately take advantage of it.

Why are you guys referring to Haswell as an SoC? I get that Intel wants to use that terminology to try to put Haswell up against various ARM SoCs, but until someone starts producing packages with integrated RAM and the like, it seems kind of inaccurate. Haswell is an architecture that has the possibility of being used in a SoC, but SoCs are individual designs, often bespoke.

I'm not convinced that Intel is necessarily going to be faster than the tablet CPU competition. Yes, Intel is putting out some pretty impressive numbers, but this is a future product, so while it's likely faster than and power-equivalent to current ARM chips, it'd probably be no match power-wise (historically, anyway) for even lower-power ARM chips that do more on a given wattage (up to a point). Unless most of the advances that Intel touts can't also be used to improve future ARM chips... which I doubt. Nonetheless, it should make for a very bright future in the world of SoCs!

Generally you need to specifically target those vector instructions like SSE4 to make use of them. They're only useful if you have parallel math operations to do, like video encoding/decoding, so your average application won't use them.

That's generally true, but SSE4 (SSE4.1, I assume?) was so useful that opportunities to use it are everywhere (and if you have a compiler with auto-vectorization, it likes to use those instructions and add a generic-x86 fallback). SSE4.1 is a pretty big thing, including "things that should have been in SSE2" like PMULLD and PMAXSD that are widely applicable. AVX2's variable shifts will also fill a gap and enable many more loops to be (auto-)vectorized.
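For anyone curious what those two look like from C, here is a minimal sketch using the SSE4.1 intrinsics (my example, not from the article; assumes GCC or Clang built with -msse4.1):

#include <smmintrin.h>   /* SSE4.1 intrinsics */
#include <stdio.h>

int main(void)
{
    __m128i a = _mm_set_epi32(4, 3, 2, 1);   /* elements 0..3 are 1, 2, 3, 4 */
    __m128i b = _mm_set_epi32(8, 7, 6, 5);   /* elements 0..3 are 5, 6, 7, 8 */

    /* PMULLD: packed 32-bit multiply, keeping the low 32 bits of each product. */
    __m128i prod = _mm_mullo_epi32(a, b);
    /* PMAXSD: packed signed 32-bit maximum. */
    __m128i mx   = _mm_max_epi32(a, b);

    int p[4], m[4];
    _mm_storeu_si128((__m128i *)p, prod);
    _mm_storeu_si128((__m128i *)m, mx);
    printf("products: %d %d %d %d\n", p[0], p[1], p[2], p[3]);
    printf("maxima:   %d %d %d %d\n", m[0], m[1], m[2], m[3]);
    return 0;
}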

I'm not convinced that Intel is necessarily going to be faster than the tablet CPU competition. Yes, Intel is putting out some pretty impressive numbers, but this is a future product, so while it's likely faster than and power-equivalent to current ARM chips, it'd probably be no match power-wise (historically, anyway) for even lower-power ARM chips that do more on a given wattage (up to a point). Unless most of the advances that Intel touts can't also be used to improve future ARM chips... which I doubt. Nonetheless, it should make for a very bright future in the world of SoCs!

Intel won't be able to touch the smartphone ARM chips with Haswell (that's what the revamped Atom is for), but the high end ARM chips in tablets consume 4-8 watts, usually. Haswell will likely consume a fair bit more power than them (but maybe not enough to matter), but on the other hand comparing the performance is silly.

Even the highest end ARM chip is dog slow. People keep reading articles about ARM chips making such massive performance gains, and throwing out quad or 8 cores, but they don't seem to realize that even the mightiest ARM chip would be stomped by a several year old ULV Celeron. The performance difference between every ARM chip and traditional (even low voltage) x86 is absurd.

Remember how laughably slow Atom was 5+ years ago when it was introduced? That's the same Atom that Intel is pitting against ARM, just with some relatively minor tweaks. Intel has no risk of losing the notebook/desktop market to ARM anytime soon... the only risk is the notebook/desktop markets themselves going away, replaced by much more limited tablets/phones (though I think the Surface is the way things will ultimately go... merging the notebook and tablet markets).

Regardless, this should provide for a good show, and hopefully massive efficiency increases (on top of the massive increases we've already seen).

New instruction sets sound nice, but doesn't that mean software needs to be compiled specifically for Haswell? Are there any normal programs that use SSE4, for example? (It's already quite old, and still not used very much.)

Linux distributions (and probably Windows too) are compiled for generic instruction sets as far as I know, so even though we have AMD64 versions, the rest of the instruction sets are not used.

Or is it only useful for highly optimized software like video encoders? I assume they then create different code paths to integrate it.

Does anyone know more about how these instruction sets are used?

I guess it depends on the compiler settings and the actual hardware. At work we use the Intel Fortran compiler, and you can compile code with it that won't work on older Intel hardware. When coding you don't actually sit and write assembler to make use of SSE4 or AVX2; under the right circumstances the compiler will optimize the code for the correct extensions.
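Right, a simple, independent loop is often all the compiler needs. A small sketch in C (the GCC flags in the comment are illustrative examples, not the only way to do it; check your compiler's documentation):

/* A loop most auto-vectorizers will happily turn into SSE/AVX code, e.g.:
     gcc -O3 -march=native    saxpy.c   (target whatever the build machine supports)
     gcc -O3 -march=core-avx2 saxpy.c   (emit AVX2; the binary won't run on older CPUs)
   The restrict qualifiers tell the compiler the arrays don't overlap, which
   is often what makes vectorization legal in the first place. */
void saxpy(float *restrict y, const float *restrict x, float a, int n)
{
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}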

New instruction sets sound nice, but doesn't that mean software needs to be compiled specifically for Haswell? Are there any normal programs that use SSE4, for example? (It's already quite old, and still not used very much.)

Linux distributions (and probably Windows too) are compiled for generic instruction sets as far as I know, so even though we have AMD64 versions, the rest of the instruction sets are not used.

Or is it only useful for highly optimized software like video encoders? I assume they then create different code paths to integrate it.

Does anyone know more about how these instruction sets are used?

CLR and VM implementations could take advantage of the new instruction sets just by releasing a new runtime version. Most likely it would be graphics libraries like DirectX and OpenGL.

New instruction sets sound nice, but doesn't that mean software needs to be compiled specifically for Haswell? Are there any normal programs that use SSE4, for example? (It's already quite old, and still not used very much.)

Linux distributions (and probably Windows too) are compiled for generic instruction sets as far as I know, so even though we have AMD64 versions, the rest of the instruction sets are not used.

Or is it only useful for highly optimized software like video encoders? I assume they then create different code paths to integrate it.

Does anyone know more about how these instruction sets are used?

On the Linux side you could, for example, recompile everything and tell the compiler that it should emit these new instructions. Distributions like Gentoo make this easy.

However, even a very advanced compiler that can auto-vectorize a lot of code (i.e., use these vector instructions for code that is not written with them in mind) will not help you much. At least on the floating-point side, the compiler would often have to rearrange terms in your code to vectorize it. A simple example in C:

float a = 1;
float b[8] = { ... };
for (i = 0; i < 8; ++i) {
    a *= b[i];
}

This seems trivial to vectorize: for example, if you can do 4 multiplies simultaneously, you can turn a into a vector of 4 elements a1, ..., a4, all initially 1, then do

  a1 a2 a3 a4
* b1 b2 b3 b4
* b5 b6 b7 b8

Now a1 = b1*b5, a2 = b2*b6, etc., which took 2 vector multiplications; then at the end you multiply a1*a2*a3*a4 to get the final a. In total you used 5 multiplications instead of 8 (with larger b arrays you obviously get a better speedup, as you pay the fixed cost of the 3 final multiplications only once).

However, no matter how smart your compiler is, it is not allowed to do this. The reason is that floating-point math is not associative, i.e. (a*b)*c != a*(b*c) in general, due to rounding. By the C standard, the code above is to be interpreted as a = (((1*b1)*b2)*b3)*..., which is not the way the vector version does it. Unless you specifically allow your compiler to violate the standard, it won't be able to help you. The GCC option -ffast-math lets you tell it that you don't care about correct rounding (and enables other floating-point trickery), but you have to explicitly ask for it. Enabling that option on code you have not looked at, where you don't know whether it depends on standards-compliant rounding, is just madness.
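For illustration, here is roughly what that hand-vectorized reduction looks like with SSE intrinsics (my sketch, assuming the array length is a multiple of 4). Note that it multiplies in a different order than the scalar loop, which is exactly why a standards-compliant compiler won't do this for you:

#include <xmmintrin.h>   /* SSE intrinsics */

float product(const float *b, int n)   /* n assumed to be a multiple of 4 */
{
    __m128 acc = _mm_set1_ps(1.0f);                  /* a1..a4 all start at 1 */
    for (int i = 0; i < n; i += 4)
        acc = _mm_mul_ps(acc, _mm_loadu_ps(b + i));  /* 4 multiplies at once */

    float a[4];
    _mm_storeu_ps(a, acc);
    return a[0] * a[1] * a[2] * a[3];                /* final horizontal reduction */
}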

PS: yeah... my arrays are 1-based. I'm a math graduate; they just come out that way. I just noticed, but can't be bothered to fix it.

This is exactly the reason that everyone who said Intel couldn't compete with ARM should be wary. Intel has a bad habit of playing down to the level of their competition... until they start getting beat. Once challenged in a segment the market deems important, they pull out all the stops and come out with truly impressive products. It will be interesting to see how far and wide they can successfully deploy this architecture.

It's likely to come down to cost more than performance. There's little doubt that Intel can produce a chip with decent performance and a small power envelope, but will that product have a place in a $150 tablet?

Haswell, by its nature, is a relatively expensive, high-performance part that will be well suited to premium tablets like the Surface Pro. Cost alone means it wouldn't appear in a future iteration of the Nexus 7, even if it had low enough power consumption.

I realize I'm probably late to notice this, but this is the first CPU write-up I've seen on this site by Mr. Kanter. I had already taken to checking out realworldtech after Mr. Stokes left; bringing David over to Ars is fantastic.

This is exactly the reason that everyone who said Intel couldn't compete with ARM should be wary.

Sigh. No it isn't.

Quote:

Intel has a bad habit of playing down to the level of their competition... until they start getting beat. Once challenged in a segment the market deems important, they pull out all the stops and come out with truly impressive products. It will be interesting to see how far and wide they can successfully deploy this architecture.

From the article:

Quote:

Haswell will be the first high performance x86 core that can really fit in a tablet, albeit in high-powered models around the 10W mark rather than the 4W devices.

So this isn't suitable for phones or the vast majority of tablets. 10W, as best as I can gather, is the power budget for the SoC; most tablets on the market allocate at best 3W to the SoC and 6W to the screen. A 10W Haswell still has to power a 6W screen, making it much more comparable to a power-thrifty Surface Pro. Imagine the next-generation Surface Pro running for 7 hours instead of 4.

New instruction sets sound nice, but doesn't that mean software needs to be compiled specifically for Haswell? Are there any normal programs that use SSE4, for example? (It's already quite old, and still not used very much.)

Linux distributions (and probably Windows too) are compiled for generic instruction sets as far as I know, so even though we have AMD64 versions, the rest of the instruction sets are not used.

Or is it only useful for highly optimized software like video encoders? I assume they then create different code paths to integrate it.

Does anyone know more about how these instruction sets are used?

There is a small but important set of applications and libraries which do. Overall, modern CPUs are quite overpowered, but the applications for which there is no such thing as a fast enough CPU do tend to chase new instructions quickly, if the new instructions actually help.

Ivy Bridge was a shrink that was optimized for time to market. They discussed in detail at ISSCC how and why this was done (risk and time to market). To really take advantage of 22nm you need a full blown redesign.

For consumers, Haswell will offer a heady combination of Windows 8 (and compatibility with the x86 software base) with excellent performance and tablet-like power characteristics.

That seems, well, a bit hyperbolic. "A heady combination" is a fairly poor word choice for a CPU, no matter what it's running. And when you consider that it's Win8, which, combined with pretty much anything, appears to be a real market turkey, it seems even more out of place.

There's definitely some interesting stuff in Haswell, but the one thing that almost everyone can use is conspicuously absent from the presentations so far: higher per-thread performance. Anyone who's compute-bound can use more of that, and it appears to be the one thing that none of the chip makers can make.

So, they're here to sell us what they can actually build, rather than what the market truly and deeply needs. Most of it will be useful, but these features seem to mostly cover fringe cases. Most seem to address specific pain points, rather than pushing the architecture into truly new territory.

The transactional memory is probably the most powerful new idea, but I suspect it'll take years to matter on desktops. Servers can use it now, but at present chip scale, it seems more a reliability feature. In the long haul, it may eventually end up being a critical factor in letting ordinary human intelligences scale their algorithms to super-multicore offerings, but I'm not sure it's that helpful right now.

Am I underestimating that feature? Will it make that much difference on our 2- and 4-processor machines?

This is exactly the reason that everyone who said Intel couldn't compete with ARM should be wary. Intel has a bad habit of playing down to the level of their competition... until they start getting beat. Once challenged in a segment the market deems important, they pull out all the stops and come out with truly impressive products. It will be interesting to see how far and wide they can successfully deploy this architecture.

It's likely to come down to cost more than performance. There's little doubt that Intel can produce a chip with decent performance and a small power envelope, but will that product have a place in a $150 tablet?

Haswell, by its nature, is a relatively expensive, high-performance part that will be well suited to premium tablets like the Surface Pro. Cost alone means it wouldn't appear in a future iteration of the Nexus 7, even if it had low enough power consumption.

Bay Trail, running Windows 8 or Android, and with the first fully redesigned Atom, will compete in that market.

Even the highest end ARM chip is dog slow. People keep reading articles about ARM chips making such massive performance gains, and throwing out quad or 8 cores, but they don't seem to realize that even the mightiest ARM chip would be stomped by a several year old ULV Celeron. The performance difference between every ARM chip and traditional (even low voltage) x86 is absurd.

Even the highest-end ARM chip has a fraction of the power consumption, though, right? When accounting for that, is it still fair to call the ARM chip "dog slow"? Furthermore, the ARM chips are considerably less expensive, so all things considered, I think Intel still has a ways to go. It will be interesting when the 64-bit, server-oriented ARM chips start appearing.

At any rate, I have no desire to support Intel, and welcome the competition in the ARM market. Thanks to that competition, the chips will trend away from "dog slow" while remaining not overpriced.

As for Haswell, I'm wondering whether Intel will continue to cripple the i3 series by excluding the crypto instructions and so forth. The way Intel handles instruction extensions is truly obnoxious.

Ivy Bridge was a shrink that was optimized for time to market. They discussed in detail at ISSCC how and why this was done (risk and time to market). To really take advantage of 22nm you need a full blown redesign.

David

I understand.

It was this statement, "The CPU was a shrink of the 32nm Sandy Bridge rather than a new design," that left me thinking Haswell was the first to use the "new" FinFET tech. I now see you mean "geometry" as opposed to individual transistor design.

Imagine the next-generation Surface Pro running for 7 hours instead of 4.

Perfect! That's exactly what I want. Slightly smaller/cooler, slightly better battery life, and performance that utterly destroys anything that ARM makes. I don't need my tablet to blow away in a good stiff breeze, I don't mind if it's a little bigger/heavier. I just want to be able to use it as a real laptop replacement.

Ivy Bridge was a shrink that was optimized for time to market. They discussed in detail at ISSCC how and why this was done (risk and time to market). To really take advantage of 22nm you need a full blown redesign.

David

The article does get a little confusing at times, though, with the heavy comparisons to Sandy Bridge alongside bits like this:

Quote:

The Haswell uop queue was redesigned to improve single threaded performance. Sandy Bridge had two 28 entry uop queues, one for each thread. However, in Ivy Bridge the uop queues were combined into a single 56 entry structure.

which sounds like an Ivy Bridge redesign rather than a Haswell one. I'm guessing that with most of the rest of the article I could probably just swap Ivy Bridge in place of where it says Sandy Bridge, since they're similar in design, but it's not really clear that's the case.

I realize I'm probably late to notice this, but this is the first CPU write-up I've seen on this site by Mr. Kanter. I had already taken to checking out realworldtech after Mr. Stokes left; bringing David over to Ars is fantastic.

Glad to see ya here. =)

Thanks for the compliment and I'm glad you enjoy my site. The Ars folks and I are trying to work out something where I can write here from time to time. So if you like it, please ping Ken or Eric : )

The article does get a little confusing at times, though, with the heavy comparisons to Sandy Bridge alongside bits like this:

Quote:

The Haswell uop queue was redesigned to improve single threaded performance. Sandy Bridge had two 28 entry uop queues, one for each thread. However, in Ivy Bridge the uop queues were combined into a single 56 entry structure.

which sounds like an Ivy Bridge redesign rather than a Haswell one. I'm guessing that with most of the rest of the article I could probably just swap Ivy Bridge in place of where it says Sandy Bridge, since they're similar in design, but it's not really clear that's the case.

Correct, it's an IVB redesign rather than a Haswell one in this case, as the change to the decoder was done in IVB rather than SB. At the beginning of the paragraph you excerpted, David mentions that the decoder is largely unchanged from the SB one, except where the uop queues were merged.

I realize I'm probably late to notice this, but this is the first CPU write-up I've seen on this site by Mr. Kanter. I had already taken to checking out realworldtech after Mr. Stokes left; bringing David over to Ars is fantastic.

Glad to see ya here. =)

Thanks for the compliment and I'm glad you enjoy my site. The Ars folks and I are trying to work out something where I can write here from time to time. So if you like it, please ping Ken or Eric : )

David

Your site is the best current site I'm aware of for in-depth information about microarchitectures, so I definitely agree with DigitalMan here. It'd be great to see you make regular contributions to Ars.

The TSX (transactional) instructions have me really excited. This alone could dramatically increase scaling.

Currently, syncing between contexts requires waiting for your value to be pushed out to L3, which costs a lot of clock cycles. With TSX you don't have to wait; you can just start a transaction, make your change, then commit the transaction.

If no conflict occurs, then you pay almost no penalty for locking. In many algorithms, there are rarely conflicts, and of those conflicts, many are caused by long lock times.
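A rough sketch of that start/change/commit flow using the RTM intrinsics Intel has documented (_xbegin/_xend in immintrin.h, built with -mrtm). This is my example, not the article's; the shared counter and the spinlock fallback are placeholders, and production code would also have the transaction read the fallback lock so the two paths can't interleave silently:

#include <immintrin.h>   /* RTM intrinsics: _xbegin, _xend, _XBEGIN_STARTED */

static long counter;                 /* shared data */
static volatile int fallback_lock;   /* placeholder fallback lock */

void increment(void)
{
    if (_xbegin() == _XBEGIN_STARTED) {
        /* Inside the transaction: writes are buffered in the cache and become
           visible atomically at _xend(), or are thrown away if we abort. */
        counter++;
        _xend();
    } else {
        /* Transaction aborted (conflict, capacity, interrupt...): take a real lock. */
        while (__sync_lock_test_and_set(&fallback_lock, 1))
            ;  /* spin */
        counter++;
        __sync_lock_release(&fallback_lock);
    }
}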

Example:

If your code spends 10% of its time locking, then no matter how many cores you throw at it, you will only ever scale up to about 10 cores.

If you can cut that lock time by 10x, then you're only spending 1% of your time locking and you can now scale to 100 cores.

edit: a simplification of the issue, but still applies to many algorithms.
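Those ceilings are essentially Amdahl's law; a quick back-of-the-envelope check (my numbers, added for illustration):

#include <stdio.h>

/* Amdahl's law: with a serial (locked) fraction s, the speedup on n cores
   is 1 / (s + (1 - s) / n), and can never exceed 1 / s. */
static double speedup(double s, int n) { return 1.0 / (s + (1.0 - s) / n); }

int main(void)
{
    printf("10%% locked, 64 cores: %.1fx (ceiling 10x)\n",  speedup(0.10, 64));
    printf(" 1%% locked, 64 cores: %.1fx (ceiling 100x)\n", speedup(0.01, 64));
    return 0;
}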

What about their new chipsets? I read a few posts that criticized the Atom line of products because, while the processor itself was low power, nothing was done to reduce the power consumption of the chipset, which itself added significant power draw.

Hopefully the X8X line of chipsets will also boast new features and power reductions.

I understand why you probably need a chipset like that to handle a desktop, but for laptops (and tablets) that do not have 6 PCIe slots, 4 USB 3.0 ports, and 4 RAM slots, couldn't Intel implement reduced functionality on the same die as their new processors?