They've been lacking some QA since the X58 chipset and Bloomfield/Lynnfield CPUs, their last really solid products. People who bought into those platforms 5-6 years ago still have competitive systems TODAY. Sure, they lack SATA 6Gbps and USB 3.0, but like the SSD 320 (also from that era) they are virtually flawless.

Since the 6-series chipset and the introduction of Sandy Bridge, there has been an unprecedented surge in errata. Most of it is irrelevant to the general consumer (even the P67 SATA bug wasn't an issue unless you used ports other than 0 and 1), but it shows the lapse in quality control at Intel. They're better than this!

Got a 4-year-old i7-980X X58 system and it still kicks major ass: 4.2 GHz OC on all 6 cores, 24 GB triple-channel RAM. The only reason I want to upgrade to X99 is the lack of SATA III and USB 3; both of those features would greatly help me. I could easily wait till Skylake-E and X99's successor (X119????) and not feel bad at all, though.

If it took this long to catch the issue, it sounds like it was not an easy bug to find. You make it sound like it's easy to be the first to implement some big new thing. The fact that there has only been one issue so far sounds pretty impressive to me.

This is interesting. I bought a base-model Fall 2013 rMBP 15" with the 2.0 GHz i7-4750HQ, and the processor doesn't have TSX enabled. I was kinda mad about this; I wanted it in hopes that some software might implement it. But I suppose now it's a non-issue.

This is reminiscent of the infamous Pentium FDIV bug, in the sense that it was made public by a non-Intel person well after the relevant products were released to the market. 20 years have passed since the FDIV bug, but Intel still drops the ball from time to time - they still can't do their testing right and in advance...

Yes it does. Every chip company takes risks, because there is only so much validation you can do before your competitor beats you to the punch. The article even mentions that there have been a bunch of these, and that their competitors have a bunch too.

Fortunately for Intel, most SNB-EP Xeons ended up being C2 stepping, since the bug was discovered while SNB-EP was still in C0/C1 qualification stage (SNB-E was not that lucky, since they were launched several months earlier).

This time, it does not seem that the staggered launch saved EP. EX, on the other hand, yes.

But in any case, the strategy Intel is employing is smart. There is a year between the first consumer and first EP SKUs, and two years between consumer and EX parts. During that time they do manage to kill lots of issues, which is especially important for the EX line, which is tailored for mission-critical operations.

Well, considering that Haswell EP is in mass production already, and final qualification-stage samples were out for months, the timing is bad.

I doubt Intel is going to stop the imminent release of HSW-EP, which is due in a few weeks anyway.

The SNB-EP situation was lucky, because VT-d was affected in the C0/C1 steppings only, which were mass-produced only for the HEDT (consumer) parts, while most of the production-grade Xeons were of the C2 stepping.

This time, the situation is different - if this bug affects the latest stepping of HSW-EP, the first Xeons sold to the public will also be affected.

Not good, since TSX is exactly the kind of feature you'd expect to see in heavy use in the market for EP SKUs.

The only upside for some people might be that the eBay market will soon be loaded with diverted HSW-EP engineering samples, since all OEMs will probably move rapidly to qualification of the next stepping (with the TSX fix).

Well, I'm sure you can do it better, so please apply to Intel right away.

All snark aside, c'mon, you have humans coming up with tests to try to confirm compatibility, and none of them are perfect. So if you think they can catch every esoteric situation, then I guess you'll have to adjust your viewpoint.

Well, in the past I happened to work in testing (software testing, however) for 5 years, full time and part time, and personally found more than 1500 bugs during that time, so I know what testing is, and that's why I'm so critical regarding bugs. OK, hardware testing is not the same thing, but nevertheless...

You make it seem like they don't know what they're doing, but your experience with software testing is much different. I have experience with hardware verification, and I can assure you that Intel does quite a bit of it. It's worth it for them to prevent these types of bugs; as you can imagine, a widespread issue like this could end up costing billions. Luckily it's just TSX, which probably doesn't affect too many people, and it's possible that less verification effort was devoted to that piece and they launched it anyway, planning a fix for the next stepping. It sucks, but it's not a showstopper.

The more interesting question is how many bugs did you miss which were found downstream from you. If you found 1500 bugs, assuming that you were 90% effective, which is an unreasonably high percentage, then there were 1666 bugs originally, and you missed 166 of them. Per Capers Jones, the average test efficiency is 30%, while 70% of the bugs are missed.
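As a sanity check on that arithmetic, here is a tiny sketch (the helper name is made up for illustration) that computes the implied original defect count from the bugs found and an assumed detection efficiency, rounding down as the figures above do:

```python
def implied_defects(found: int, efficiency: float) -> tuple:
    """Estimate (original total, missed) from bugs found and an
    assumed detection efficiency, rounding down as in the figures above."""
    total = int(found / efficiency)
    return total, total - found

# 1500 found at a (generously assumed) 90% efficiency
total, missed = implied_defects(1500, 0.9)      # -> (1666, 166)

# At Capers Jones's 30% average efficiency, the picture is far worse
total30, missed30 = implied_defects(1500, 0.3)  # -> (5000, 3500)
```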

Most hardware has some errata. I was part of a team that fought a bug for a couple of months, which turned out to be caused by an erratum (not Intel's, in this case). While the erratum was known, its applicability to our software was not immediately apparent. The problem occurred when there was a new batch of hardware, and it was originally thought to be caused by the new batch. The real reason was that enough hardware had been produced that the high-order bit in one byte of the (sequentially assigned) MAC address was set, and the way the MAC address was used exercised the erratum. It's incredibly expensive to fix an IC after production, even more so when the IC has been installed on a circuit board, and the worst case is when the circuit board has been shipped to many locations and is in use. Because of this, errata are usually addressed by a workaround, not by replacing the part, as long as most of the chip works.

I had a similar reaction when I first read the comment. The tools for functional verification have improved over the past 10 years (emergence of OVM/UVM), so we'd expect to see a decrease in the occurrence of this type of issue, but the fact is that reaching 100% coverage is difficult given limited time and compute resources, and highly dependent upon writing good testbenches. It's not an easy thing to do.

I agree with nbtech. I would just like to add that debugging and fixing a bug found in silicon is reeeally hard. Narrowing down the sequence of events to make the failure repeatable is an art. Remember, a 3 GHz CPU is launching instructions at roughly 3 billion per second (not even counting multi-core and multiple issue). Software-based simulators and even hardware-based emulators run orders of magnitude slower -- if you can't cause the failure in a couple of seconds, you have to debug on the silicon itself, which has limited visibility into the internal state.

It is highly likely that Haswell-EP 2S will get a new stepping (C2 for the high-core-count versions). I suppose 4S models will ship with the fixed stepping.

The question is whether Intel will also update the 1S Haswells, which are pretty much identical to the desktop versions with ECC support enabled.

Desktop/mobile Haswells, I don't think they'd bother with, but if they update the 1S server SKUs, I see no reason for Intel not to silently roll out updated steppings for desktop SKUs as well, since it is the same silicon as the 1S server SKUs.

TSX is essentially a convenience feature, to allow lock free code without as much work from a developer. The same thing can be accomplished by rewriting the code using compare and set instructions instead of blocking locks. So much like many new features, TSX doesn't enable any magic that couldn't be accomplished before, it just saves time on the development end.

From that perspective, I'm betting nobody will scream about it being missing, especially since the slightest inclination that it's unreliable would keep people from using it anyway.

When it's done and reliable, then release it. In the meantime, possibly allow an "at your own risk" feature toggle for people that live on the bleeding edge.

That's not an equal analogy, because hardware AES is a huge performance gain, and software AES isn't difficult (just use a peer-reviewed open-source library at the 128-bit block level and wrap it in some multiplexing code). There is no "hard" way to do AES in software that's as fast as the hardware instructions, AFAIK.

TSX is also a huge performance gain over the "easy" method of using blocking locks, but likely has comparable performance to the "hard" way of compare and set. So in that sense, it's less of a real gain, assuming of course that you're not the person paying the developers.

It is not a convenience feature. In fact, it is >easier< to write the code without it.

Have potentially contended data (between threads)? Just lock the sucker; that's the easiest way.

But that is not the most efficient way. Basically, TSX relies on CPU smarts to speed up multi-threaded code >without< having to resort to even more complex (and error-prone) lock-free programming. In that sense one can call TSX a "convenience", but in reality it does require additional work - just not that much additional work.

TSX is roughly comparable to, say, SSE instructions (though not nearly as useful in terms of potential applications). In order to use SSE with a decent speedup, you have to put a bit more effort into your code, so it is not really a "convenience", as it requires the developer to do more work in order to achieve faster code execution.
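The usage pattern being described - speculate first, and only fall back to a real lock when the transaction aborts - looks roughly like this. This is a behavioral sketch only: `try_transaction` stands in for an RTM transaction (`_xbegin`/`_xend` in real code), aborts are faked with a random roll, and all names are made up for illustration.

```python
import random
import threading

fallback_lock = threading.Lock()

def try_transaction(fn, abort_rate=0.0):
    """Stand-in for an RTM transaction. Real hardware would run fn
    speculatively and roll back on a memory conflict; here an 'abort'
    is simulated by a random roll, and an aborted fn never runs."""
    if random.random() < abort_rate:
        return False                    # aborted: no side effects
    fn()
    return True                         # committed

def elided_critical_section(fn, max_retries=3, abort_rate=0.0):
    """The TSX lock-elision pattern: a few speculative attempts,
    then give up and take the plain blocking lock."""
    for _ in range(max_retries):
        if try_transaction(fn, abort_rate):
            return                      # fast path: lock never touched
    with fallback_lock:                 # slow path: ordinary lock
        fn()
```

The extra work the comment mentions is exactly this: the developer still has to provide the fallback lock path and pick a retry policy, even though the fast path looks free.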

The issue of how to do multi-processor locks is complex. Until a decade ago, you used a semaphore or mutex and took the timing hit of marking the lock as uncacheable - hundreds of clocks even if you weren't modifying the lock. (Usually you would use an RMW read-modify-write instruction to put the identity of your thread in the lock, and check whether the lock was unreserved by comparing the returned value to zero, or -1, or whatever.) When you release the lock, you first have to check whether it still contains the id that you put in, then either release the lock or turn control over to a waiting thread. (I'm simplifying a lot, so don't shoot me.) Anyway, three main memory reads, or RMW cycles, in the best case.

Then along came Opteron. Opterons, with all IO passing through a CPU chip, and cache-coherency connections to all other CPU chips, meant that requests for uncacheable memory could be ignored. The locks worked as before, but now you could have a fast (possibly 3 CPU clock latency) read or RMW cycle. I never measured 100x performance improvements unless thrashing was involved, but that tells the real story. It was no longer about how slow uncached memory was, but about how much more work your database or other application could do before it hit thrashing. Many times I could wind the CPUs up to 100% load for minutes at a time without starting to thrash, which was a very good thing. (Some other CISC and RISC CPUs also support/supported IOMMUs. Most that didn't are now dead.)

When Intel added IOMMU support to their x86/x64 CPUs they didn't duplicate what AMD had done. (And it now works on all AMD CPUs, including single-socket desktop chips, and on ARM64 chips.) This is because AMD uses a MOESI protocol and Intel uses MESIF. Again, way too much detail for here, but it means that in certain locking cases you get a (relatively) slow ping-pong effect when two threads sequentially access a lock. (Think producer/consumer.) By treating the transaction as speculative, and never touching the lock if there is no conflict, overall transactions are sped up.

AMD proposed a similar instruction set extension (ASF) in 2009, but AFAIK the best locking code was not significantly improved (on AMD CPUs) and the proposal has languished. Will this bug kill TSX? Probably not. There is an awful lot of ancient history embedded in the x86 ISA; this will just be a bit more. But I expect the effect on potential users to be the same as AMD's ASF: leading programmers to better lock implementations.
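The lock protocol described a few paragraphs up (swap a value into the lock word with one RMW cycle, and you own the lock iff the old value said "free") can be sketched as a classic test-and-set spinlock. The inner `threading.Lock` only simulates the atomicity of a hardware XCHG, and the class name is made up:

```python
import threading

class SpinLock:
    """Test-and-set spinlock in the style described above:
    one atomic exchange to acquire, one to release."""
    def __init__(self):
        self._word = 0                   # 0 = free, 1 = held
        self._atomic = threading.Lock()  # simulates hardware atomicity

    def _exchange(self, new):
        with self._atomic:               # models one atomic XCHG cycle
            old, self._word = self._word, new
            return old

    def acquire(self):
        # Keep swapping 1 in; we own the lock only once the
        # previous value was 0 ('free').
        while self._exchange(1) == 1:
            pass                         # spin (real code would pause)

    def release(self):
        self._exchange(0)
```

Note that every failed acquire attempt is still a full RMW cycle on the lock word, which is exactly the kind of cache-line ping-pong described above that TSX-style speculation avoids.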

No, the reason Russia is designing a new chip is because they want to protect their market (protect, as in economic protection, not security).

Russia is not designing a new CPU architecture anyway, they will use ARM architecture. Now, if you think somebody can spot a deliberate security flaw in a CPU design consisting of hundreds of millions of blocks, yeah... good luck with that. The only way to be 100% sure is to design it by yourself from scratch, and even that does not guarantee it won't have flaws that can be silently exploited.

In any case, Intel's microcode has nothing whatsoever to do with this. You can prevent any theoretical possibility of somebody using Intel AMT against you by simply not giving it access to the public Internet.

Not to mention that Microcode updates are not persistent, and have to be applied after every power-on. If you do not allow BIOS upgrades and control the OS and do not allow public Internet access to the system firmware, there is simply no way somebody can exploit your CPU remotely.Reply

Linux ships with CPU microcode, so when Intel updates the microcode in Linux and the new kernel is installed, the new microcode is there. I expect Windows does the same thing. If you never patch your operating system or BIOS, TSX is likely to stay enabled.
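On Linux, whether the running kernel/microcode combination still exposes TSX shows up as the `hle` and `rtm` entries on the `flags` line of /proc/cpuinfo. A minimal sketch of a checker (it parses cpuinfo-style text, so it can also be fed a saved dump; the function name is made up):

```python
def tsx_flags(cpuinfo_text: str) -> set:
    """Return which TSX-related CPU flags ('hle', 'rtm') appear in a
    /proc/cpuinfo-style dump. After a microcode update that disables
    TSX, these flags vanish from the 'flags' line."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            flags = set(line.split(":", 1)[1].split())
            return flags & {"hle", "rtm"}
    return set()

# On a live Linux box:
# with open("/proc/cpuinfo") as f:
#     print(tsx_flags(f.read()) or "TSX not exposed")
```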

Got to suck for the ones that bought a 4770 instead of the K specifically because of this feature. But then, who cares? It wasn't available in most Intel CPUs anyway, so almost no consumer software will use it for the next decade. And servers get replaced on shorter cycles anyway.

I doubt Intel would initiate a recall just because the problem was found. In this case, it would be extremely expensive, since Intel would need to replace a year of production - now including early batches of Xeon EPs as well.

However, if the legal pressure mounts (lawsuits filed, etc.), they might do it. But I am sure Intel would try to fight this, or limit the exposure to only certain SKUs for which it can be demonstrated that TSX is in use.

In any case, unlike the FDIV bug, which basically ruined calculation results and could affect pretty much anybody using a Pentium CPU, this bug is less critical, since it requires running software that uses TSX (not very common yet, at least not on the desktop/mobile side, where the biggest volume of Haswells has been sold so far) and very specific conditions which are, presumably, hard to reproduce.

Damn! I think we have an answer as to why Broadwell desktops and laptops are delayed...

I'm guessing they believe (probably correctly) that they can ship Broadwell Y without TSX and no-one will much care. Still not clear why the gap between the laptop and quad-core chips' ship dates - maybe other reasons, or maybe they have reason to believe the problem is more easily fixed on dual-core chips.

Personally I'd score this (if this explanation is correct) as +1 for my earlier explanation for the delay of Broadwell - a consequence of the insane complexity of x86 becoming unsustainable and causing Intel real harm. Intel's only official comment is 'a complex set of internal timing conditions and system events … may result in unpredictable system behavior' and, yes, that COULD be a problem on any CPU - but it's a whole lot more likely to occur (IMHO) on x86.

It also adds fuel to my argument that Apple is probably losing patience with Intel. As I've said, the Broadwell delays have screwed up a year of their product plans; if TSX on Haswell is broken, that also delays by at least a year the plans I believe they have to introduce an innovative set of parallel programming constructs into Swift which require HW TM.

OK, so on reading further, I see that (a) this likely does not affect Apple because (near as I can understand Intel's maze of feature-differentiation details) the relevant parts do not ship in any Apple products. So much for that theory. (On the other hand, well done Intel - clearly the way to get developers to support a feature you expect to charge for/be a differentiator in the future is to limit it to a tiny fraction of your chips...)

(b) in turn this suggests that my theory for this delaying Broadwell is nonsense. Unless Broadwell WAS supposed to have TSX across the laptop and desktop and Intel are still hoping they can get there by delaying a few months, with a plan B of, if necessary, simply launching without the feature?

I've a system beside my bed for development, a laptop for convenience, and a new build for implementation, w/ the fun little Anniversary Edition serving as a 'place holder' for the Devil's Canyon that I no longer plan to buy.

SoOo ... to the pages of folks who consider the loss of TSX to be no big deal to consumers, or to have no potential effect beyond servers/workstations?

It is a big deal, and it affects us all.

We're talking increases in transactional throughput of not less than three times, and in excess of five times, ultimately with little (or possibly, at some point, no) effort on the developer's part.

To see this trivialized in reports/forums frustrates me nearly as much as Intel's disabling of this feature, which I don't believe (even for a second) was anything but absolutely necessary. Claiming Devil's Canyon would easily/consistently overclock beyond 4.x GHz on air? Now, that *was* a marketing ploy. But as for this scenario? I think everyone can safely remove the tin hats.

As for me? I reckon I'll try 'n cling to the hope that it may be enabled in some soon-to-be-released CPU that fits (or I've wasted even more of my limited resources/time )-;~