
43 Comments

I'm going to be very curious to see if Intel can do something to dramatically reduce the amount of power their iGPU uses under load. Currently, HD 4000 at 1.1GHz can use around 10-12W, while the CPU portion of an IVB ULV chip can use another 12-16W depending on the task. Put them together and you have to throttle part of the chip in order to stay within a 17W TDP. At 10W TDP, I can't see Intel getting much better than HD 4000 performance.

I'm also quite curious to see where it actually lands. In theory much of the reason for Intel's lackluster performance/power efficiency is that they're running a smaller number of EUs at higher frequency. By lowering operating frequency and hence voltage but increasing the number of EUs, performance can increase with lower power, but at the cost of greater die area.

In fact, there's an interesting little experiment that could be done here with IVB: run some basic benchmarks on HD 4000 with a maximum graphics frequency somewhere in the 600MHz-800MHz range and decreased graphics voltage, and compare against a stock HD 2500. That would give a good impression of the kind of power savings possible at the same performance by spending more die space and lowering frequency.

I don't think it's that simple in Intel's case. Their process is, compared to TSMC's, lower density, but it has much superior drive-current characteristics, which allow for higher frequency. It may be better for them to pursue higher clocks and a smaller die than it is for other manufacturers.

A low double-digit increase means something in the teens? If this is marginally faster than IVB, which was marginally faster than SNB, what's prompting people to upgrade if they already have socket 1155?

I don't think they are aiming at people with 1-2 year old computers. The target for this product should be PCs that are 3 or more years old. Compared to those, the improvements in performance, TDP, and iGPU are not so marginal.

... is there ANY possibility of Microsoft getting these in the Surface Pro? I assume it won't have Haswell or better until its successor, but damn I'd love if they ended up delaying Surface Pro just barely long enough to use one of these chips.

Why shouldn't the Surface Pro get the special 10W Ivy Bridge parts? It makes delaying that product all the more believable if it isn't just another run-of-the-mill 17W ULV Ivy Bridge CPU. And you know Microsoft has to get the battery life right to make that product really worth investing in.

There really is no difference between a 17W IB and a hypothetical 10W IB. With configurable TDP and P-states and all that other jazz, you should be able to just throttle down the 17W part so it fits in whatever envelope you need. A good Windows tablet would have a 50Wh battery.

It shouldn't be too tough to dissipate 13W without a fan. I consider 13W to be the sweet spot. Keep in mind that even the iPad 3 can easily burn up 10 watts while doing heavy gaming. (I'll bet I could push well past 10W if I ever got my hands on one.) So it shouldn't exactly be rocket science to simply dissipate 30% more heat than the iPad.

I long for smaller form factors: ITX, but even more streamlined. Smaller everything: PSUs, even cables. Also, the idea of cramming these babies into something super small, a nettop I could run a full OS on, is always so attractive to me.

With that excitement said, it seems that CPU power is taking a backseat, which makes sense as most things aren't CPU bound whatsoever. Are we entering a period where software needs to catch up? Or are we waiting for SoCs to catch up, in which case we'll start seeing more well-rounded growth in both hardware and software?

As a software engineer, I'm incredibly excited about Haswell's support for Hardware Transactional Memory. Please get all of the details you can on that. For the layman, HTM has the potential to radically simplify and speed up multi-threaded programs, since Transactional Memory is currently one of the most promising parallelism techniques and software implementations have a 2-4x slowdown on a per-thread basis.

Each of these comes with diminishing returns for programs with sequential dependences, and none of them is necessary for truly parallel programs.

For a software engineer, I would suggest exploring either better sequential algorithms for your problem (which are likely to yield much better performance gains than these HW techniques), or explore truly parallel algorithms (conflict/dependency free) that have the potential to scale on a lot of simple cores.

I must respectfully disagree. I don't see HTM as a hardware optimization like out of order execution or branch prediction. Rather, it is a low level tool for building compilers and programming languages in a way that was otherwise impractical or even impossible. Developers really need a general purpose tool for sharing memory between threads and transactional memory just might be that tool.

There is a lot of good work being explored in the area of implicit parallelism and lock-free data structures, but some domains just require sharing memory. A good example is game object simulation, where you have too many loosely interconnected objects to conceivably achieve good performance with copy-based or message-passing algorithms. Unfortunately, manual locking has proven so difficult for large systems that many game engines still do all of this work in a single thread.

Transactional memory isn't a magic bullet, but it offers a good balance of safety, expressiveness, and average performance. If hardware acceleration can drastically reduce the per-thread cost, it could usher in a new era of widespread threaded programming analogous to the development of virtual memory.

Or it could prove too limited or too complicated or too power expensive and never gain widespread adoption.

I think there are a lot of similarities of HTM to OOO and branch prediction: they all are forms of speculative execution (yes even OOO because of precise exceptions, etc). With HTM you speculate that a thread that enters a critical section will not update memory in a conflicting way with another thread before it leaves the critical section. You have to be able to save enough state to completely undo all of the effects of the transaction in case it fails. Finally, if it actually does fail, you have to discard all of the work associated with the transaction, and back off in a way that does not cause livelock. That is very similar to how branch prediction speculates that a prediction is correct. I view it as being a much harder form of speculation than branch prediction because you need all to all core communication to decide if a transaction failed, which is significantly more costly than waiting for a branch instruction to complete, and scales terribly with increasing core counts. Other techniques like HW-assisted compiler speculation, multi-scalar, and speculative threading are even harder.

I agree that manual locking is clearly a bad idea, but I think that HTM (and lock-free data structures for that matter) buys into the same flawed programming model as fine-grained locking. Fundamentally, frequent interacting updates to the same memory from multiple threads is a sequential operation that is spread across multiple threads. The fact that you have multiple threads doesn't make it a parallel operation.

Single threads running on a single core are typically better at this type of operation because they avoid all to all communication between threads. This isn't completely true for all programs and all architectures (it assumes that the cost of all to all communication is high and the rate of interacting updates is high), but software transactional memory does just fine if the rate of interaction is low enough, and the cost of all to all communication goes up superlinearly with thread count. HTM plays in the middle, when the rate of interaction is high enough that STM or coarse grained locking has too much overhead, but is still low enough that it is faster to have more than one thread.

People need to and are starting to explore new algorithms that aren't fundamentally sequential.

In your game example with interacting objects, I would argue that there are good parallel algorithms that don't require locks or transactions.

An example would take the following high level form: 1) Record all of the interactions about to be made in parallel by threads during a time step. 2) Organize (sort/group) them based on the objects they interact with. Updates that do not interact should be placed in separate groups. 3) Perform all updates that do not interact in parallel. 4) Perform all updates that do interact using parallel reductions.

Of course this only makes sense if you have a relatively large number of threads. Probably a lot more than the ~4-8 cores on most CPUs these days.
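That four-step scheme can be sketched in C++. This is only an illustration of the idea, not anyone's actual engine code: the names (`Update`, `group_updates`, `apply_updates`) are invented, and the grouping and per-group reductions run sequentially here where a real engine would fan the independent groups out to worker threads.

```cpp
#include <cassert>
#include <map>
#include <vector>

// Hypothetical update record: which object to touch and by how much (step 1
// would log these in parallel as threads decide on their interactions).
struct Update { int object_id; double delta; };

// Step 2: group the recorded updates by target object. Updates to different
// objects land in different groups, so the groups never conflict.
std::map<int, std::vector<double>> group_updates(const std::vector<Update>& log) {
    std::map<int, std::vector<double>> groups;
    for (const auto& u : log) groups[u.object_id].push_back(u.delta);
    return groups;
}

// Steps 3/4: apply each group as a reduction. Because the groups touch
// disjoint objects, a real engine could hand each one to its own thread
// with no locks or transactions at all.
void apply_updates(std::map<int, double>& objects,
                   const std::map<int, std::vector<double>>& groups) {
    for (const auto& [id, deltas] : groups) {
        double sum = 0.0;                  // parallel reduction in a real engine
        for (double d : deltas) sum += d;
        objects[id] += sum;
    }
}
```

The point of the restructuring is that conflict detection happens once, in the sort/group phase, instead of on every individual update as with locks or transactions.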

The space of sequential and parallel algorithms for many problems is actually quite well explored, and I would argue that for many non-trivial problems, the gain from using a better sequential algorithm can be much greater than the 2-4x gain you can get from a few cores. Work on parallel algorithms is less mature, especially for problems that aren't close to being data parallel, and many of them need a very large number of threads (tens or hundreds) before they start competing. I don't think that HTM has a place in the highly parallel space because of the all to all communication on each transaction.

You have to remember that Haswell includes two different kinds of HTM. Hardware Lock Elision functions much like you describe, optimistically granting critical-section access to multiple threads and then automatically retrying them sequentially if a collision occurs. Much more interesting is Restricted Transactional Memory, which provides an explicit mechanism for performing transactions analogous to familiar database transactions. These instructions could be used to implement an efficient TM system that is impractical on current processors.
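As a loose software analogue of that begin/validate/commit/abort cycle, here is a seqlock-style retry loop in C++. This is a sketch of the pattern only (the function names are invented): real RTM uses the XBEGIN/XEND/XABORT instructions and tracks read/write sets in the cache hardware rather than through an explicit version counter.

```cpp
#include <atomic>
#include <cassert>

// Version counter: even = no writer active, odd = writer mid-commit.
std::atomic<unsigned> version{0};

// Reader "transaction": speculate, then validate that no writer committed
// in the meantime; on conflict, discard the result and retry, much as an
// RTM transaction aborts and re-executes.
template <typename F>
auto transactional_read(F body) {
    for (;;) {
        unsigned v1 = version.load(std::memory_order_acquire);
        if (v1 & 1) continue;                 // writer in progress, spin
        auto result = body();                 // speculative read
        std::atomic_thread_fence(std::memory_order_acquire);
        if (version.load(std::memory_order_relaxed) == v1)
            return result;                    // validated: "commit"
        // version changed: conflict, retry from the top
    }
}

// Writer "transaction": bump the version on both sides of the update,
// invalidating any reader that overlapped with it.
template <typename F>
void transactional_write(F body) {
    version.fetch_add(1, std::memory_order_acq_rel);  // enter (odd)
    body();
    version.fetch_add(1, std::memory_order_release);  // commit (even)
}
```

The hardware version of this is attractive precisely because the validation and rollback that cost real work here come nearly for free from the cache-coherency machinery.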

I think the simple, unfortunate fact is that we have not yet discovered methods for extracting the kind of implicit parallelism necessary to implement something like a game engine in the manner you describe. You regularly have several thousand different objects that each *may* interact with about a hundred other objects on a given frame.

To "record all interactions made in parallel" is essentially an unsolvable problem, similar to deciding if an arbitrary program will terminate normally. Several research languages attempt to restrict programs in such a way that implicit parallelism can be extracted by the compiler, but they are all either too slow or too challenging/unproductive for the programmer.

Transactional Memory works very well for complex, loosely coupled code because thread conflicts are rare enough that you can achieve good performance. This approach works best when coupled with optimized, pure, and easily parallel libraries for computationally intense tasks like physics modelling, data compression, path-finding and the like.

This area of research is still young and very challenging. I am excited to solve these problems and I believe Haswell is an intriguing step in the right direction.

Multiple writer threads are sequential only in the worst case, where the writer thread takes as much or more time than the reader thread(s). Generally, the writer thread will generate data much faster than the reader thread can process it. Consider a writer thread that is reading images from a camera and pushing them into a shared FIFO buffer. Reader threads then pop one image at a time from the shared FIFO and do extensive image processing on them. Say the writer thread takes N ms to read an image and a reader takes 10N to process it. A single-threaded process will take 11N ms per image. A multithreaded process SHOULD scale nearly linearly up to 11 threads, and the per-image processing time is reduced to something much closer to N.

That sort of linear scaling is not achievable with a mutex due to the locking overhead. Using a lock-free approach has many benefits: it guarantees that at least one thread will proceed with work at all times and, if done properly with a compare-and-swap instruction, eliminates deadlock and livelock. It can be accomplished with the LOCK CMPXCHG instruction in user mode, but that is an expensive instruction, at least 22 cycles on a Core i7, and it has overhead of its own in that it causes cache coherency issues and negates OOOE.
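For concreteness, the compare-and-swap pattern being described looks roughly like this minimal Treiber-stack push/pop in C++, where `std::atomic`'s compare-exchange compiles down to LOCK CMPXCHG on x86. This is a sketch for illustration: production code would also need safe memory reclamation (hazard pointers or similar) to avoid the ABA problem noted in the comments.

```cpp
#include <atomic>
#include <cassert>

// Minimal Treiber stack: if our CAS fails, some other thread's CAS
// succeeded, so the system as a whole always makes progress
// (lock-freedom: no deadlock, no livelock).
struct Node { int value; Node* next; };
std::atomic<Node*> head{nullptr};

void push(int value) {
    Node* n = new Node{value, nullptr};
    n->next = head.load(std::memory_order_relaxed);
    // compare_exchange_weak is LOCK CMPXCHG on x86; on failure it
    // refreshes n->next with the current head, so we just loop.
    while (!head.compare_exchange_weak(n->next, n,
                                       std::memory_order_release,
                                       std::memory_order_relaxed)) {
    }
}

bool pop(int* out) {
    Node* old = head.load(std::memory_order_acquire);
    while (old && !head.compare_exchange_weak(old, old->next,
                                              std::memory_order_acquire,
                                              std::memory_order_relaxed)) {
    }
    if (!old) return false;
    *out = old->value;
    delete old;  // real code needs hazard pointers / epochs here (ABA)
    return true;
}
```

The retry loops are exactly the part HTM could cheapen: under contention every failed CAS still pays the full coherency round trip.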

HTM could appreciably reduce the overhead in lock-free algorithms, but more importantly, it makes lock-free programming far easier and less prone to programmer error. I agree that it still will not scale linearly with many cores, but it is better than what we have now. Ultimately, I think many-core designs will evolve to something like the Intel SCC, rather than SMP chips, and then functional programming will make more sense.

I think it's really interesting (and at the same time disappointing) where the whole industry is going at this point. When AMD essentially gave up competing with Intel in the high performance segment I was disappointed, since that meant an end to competition on the high end. But in retrospect I don't think AMD has given up much at all, since it looks like Intel has also thrown in the towel when it comes to the high performance market. I get that CPUs have become so powerful that just about any chip will satisfy the average consumer's needs, but power users like me have been abandoned. Now, to get the kind of system I want, I've either got to wait years for Xeon tech to catch up and end up paying exorbitant prices, or settle for more mediocre options from AMD Opterons. And I can't be the only would-be Intel customer now actually considering turning to Opterons for new systems. In a way, it's as if it wasn't AMD who stopped competing but Intel who decided to abandon that segment to its former rival. And where does that leave all the motherboard makers who are stuck developing ultra-premium boards for Intel chips that are years (and years) old?

I'm just hoping that ARM will catch up to the high end Intel and give them some competition. There's no reason why they or someone else shouldn't be able to do it.

The ARM A15 is something like 2-3x slower than Ivy Bridge at single-threaded integer code, and they or someone else can use the same techniques as Intel to catch up. Single-threaded perf isn't going anywhere fast barring a major revolution in device physics.

In the mid-range market you already have Power and SPARC. Xeons are way more reliable and faster than Itanium systems, for that matter, so too bad they killed off PA-RISC. But still, in the mid-range server market we have Oracle, IBM, Fujitsu, and, pretty much on par, Intel and AMD. With the players on the market today being totally different, service-oriented companies, and the market so much more consolidated than before, I don't see much room. ARM isn't even competing against Atom yet, at least if you're thinking servers or desktops. If you really want a small semi-embedded server system, you can still turn to PPC and MIPS; they still serve for high-end network processors. ARM might be killing it in consumer products, for multimedia. On the other hand, it replaces a plethora of custom VLIW, MIPS, x86, SuperH, Blackfin, ST20, and assorted DSP chips, above the basic microcontrollers that is. Those are not high-performance markets, even if they try to get there; it can't compare to 64-bit systems or systems with more than 4GB of RAM yet. Just because you can create a high-performance 16-core processor or something doesn't mean it will become a workstation CPU; network processors are way more powerful than an Intel quad-core chip, yet we won't see MIPS workstations again any time soon, and haven't in more than 10 years.

Memory interfaces and floating point are less competitive because of the market segment they are currently in (tablets/phones don't care enough to pay die area/power for them), not architecture reasons.

> Just because you can create a high-performance 16 core processor or something doesn't mean it will become a workstation CPU

Agreed, but that's not what I'm suggesting. I'm suggesting that ARM is on a path to rival Intel/AMD's single threaded perf in the next 3-5 years, and if it isn't ARM it will be someone else.

What are you doing that you think could be significantly improved with more CPU power? Frankly, the days of 50% generation-to-generation gains are long gone. Barring a fundamental change in computer architectures, there isn't much left to do to increase IPC without a gigantic cost in die space.

Intel could certainly be more aggressive with their clocks on retail products, but then there wouldn't be anything for enthusiasts to do ;-).

Games are more GPU dependent and largely tied to console hardware cycles anyway.

Video encoding is better served by GPUs once all the software is updated.

3D stacking is the wave of the future, but it won't help single threaded perf any time soon. Heat density degrades the performance of stacked high performance logic layers and vias are still huge (high capacitance) compared to metal layers.

The best we can hope for in the near term from 3D for single threaded codes is another level of cache.

In specific synthetic tests, definitely yes, even more. In normal applications... harder to say. This will be the same manufacturing node as Ivy Bridge, so not much from there, but the efficiency upgrade seems to be good, and that is the most important thing this time. The CPU is fast enough at this moment when compared to the competition.

That's the problem. I'm disappointed with CPU performance and clock speeds, and it's literally all because of AMD. If they actually made CPUs that were any good and could compete with Intel on performance, or just get close, then we would be seeing much faster stuff by now. Even on clock speeds alone, it's so obvious that Intel could be releasing WAY higher-clocked stuff, as proven by how well their CPUs overclock. I've got my i7 running 1.4GHz faster than its default speed without much effort.

When there was real competition from AMD (Athlon 64 days) then you would be lucky to overclock a high-end desktop Intel or AMD CPU by just 300MHz because the chips were being pushed to their limits already.

We could be seeing much higher-clocked and lower-priced CPUs. But obviously Intel has no reason to do this because of AMD's utter failure to compete on performance. This is exactly why competition is always needed.

I've been waiting for some Haswell news and this sounds great :) I'm sure many people are disappointed that HW isn't going to bring that 20%+ performance bump but, frankly, people have way more CPU than they need. Recompiling for modern ISAs and software optimization potentially brings far more gains than any IPC and clock speed gains Intel can manage to squeeze out. The fact that this isn't happening is another matter, though.

I'm interested to see what the performance of the new on-die GPU will bring. The current Ultrabooks are just abysmal, with only a few standouts, and that mainly due to the other goodies -- the Zenbook with its great IPS panel and the Samsung Series 9s with their great panel as well. I'm hoping HW is able to inject some more sense into the platform with less GPU throttling and better GPU performance. Maturation of the 22nm node and some price decreases wouldn't hurt either ;)

On the one hand it's a bit disappointing that Haswell won't be worth an upgrade for CPU speed addicts.

But let's face the current trend: people want ultraportable devices with long battery life, which ARM rules at the moment. Smartphones and tablets are a huge market, probably much larger than the PC business. So Intel has to compete in this low-power sector to stay alive.

People want high-resolution displays, but for that you need a powerful GPU, which Intel doesn't have. So Intel must improve GPU performance in order to drive high-resolution displays with the IGP.

More and more programs switch to the GPU for parallel processing tasks to speed them up by orders of magnitude, which is impossible with traditional CPU improvements. AMD understood this with their APUs; Intel must follow.

An increase in clock frequency causes higher power consumption. Multiple cores, which can be disabled individually, are however faster in sum and consume less power at idle. So it is better to integrate many smaller cores than a few large ones if power consumption is important, which it is.
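That trade-off can be made concrete with the standard dynamic-power relation (a back-of-the-envelope sketch; the cubic scaling assumes supply voltage must rise roughly linearly with frequency):

```latex
P_{\text{dyn}} = \alpha C V^2 f, \qquad V \propto f \;\Rightarrow\; P_{\text{dyn}} \propto f^3
```

Under that assumption, two cores at f/2 finish the same total work as one core at f for roughly 2(f/2)^3 / f^3 = 1/4 of the dynamic power, which is why many slower cores can beat one fast core on efficiency whenever the workload parallelizes.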

Thus Intel will most probably follow AMD, stop trying to release faster and faster CPUs, and focus on parallel computing and GPU acceleration. ARM, on the other hand, does everything at the same time: they have big.LITTLE tech to save power, the A15 to improve per-core performance, 64-bit cores in development, and constantly scaling GPU performance. ARM also has the advantage that many companies further develop ARM processors. ARM products get sold in the currently most active market, where Intel has no real competition. On the other hand, people won't upgrade their old computers because Haswell doesn't offer any visible advantage except reduced power consumption. If Win 8 on Intel tablets isn't a success, Intel will face a hard time.

I doubt it. Intel ships tons of processors all the time in new systems bought by people upgrading from much older systems. There have been rare times when Intel has released something that is such an upgrade that whole scads of enthusiasts abandoned their last-generation processor and adopted the new one.

Gains over most generations have been in the 10-20% range in the last few releases. In large part it has less to do with the lack of competition from AMD, though that helps, and more to do with focusing on things like power consumption and the iGPU.

Yeah, I want better CPU performance as well, as do most people, but CPU performance IS good enough for 90% of consumers out there, while iGPU performance is NOT good enough for 90% of consumers, especially in the mobile sector if you want even higher-resolution displays. As mentioned earlier, combine that with some of the things you can do with GPGPU computing alongside the CPU, and the way to go, for now, is to concentrate most of your effort on lowering package power consumption as well as improving the GPU. If you can improve CPU performance along the way, then great, but it isn't critical.

Besides, lower power consumption, increased GPU performance, and increased CPU performance sound good to me in the mobile space BIG time. For a desktop system, lower power consumption and increased CPU performance are nice (there's just about always a discrete GPU there). I can always use quieter and cooler-running desktop machines. Such is my life, where I must wait 2 or 3 years between CPU generations to get a "worthwhile" upgrade, which could still mean a 25-40% increase in performance over 2-3 generations as well as probably lower power consumption.

My games don't run slow because my 3570 running at 4GHz can't cope. PS is just as snappy as can be. My transcodes are pretty fast too. Oh, I'd love them to be faster, but going from 30s to 15s to export a dozen RAW files to JPEGs doesn't change my life significantly. Going from a 2hr BR->1080p transcode to 1hr is nice, but doesn't change my life that dramatically (not until re-encoding a video file takes only a few minutes might it change my life).

I went from a C2D E7500 at ~3.1GHz to a 3570 at 4GHz (4.2 turbo). Transcode performance improved 5x on average! Ballpark similar improvements in photo editing and some other stuff that is heavily multithreaded, too. The ultrabook with a 3517U in it manages about 1.4-1.8x the performance of my old C2D desktop as well! I am pretty happy overall with CPU performance. Sure, I want improvements, but they stopped being game changers for me.

Now, once 14nm is reached, let's start talking maybe hexa- or octo-core for mainstream desktops and laptops, please. That could be a bit of a game changer in heavily threaded stuff compared to current-day improvements.