In 2015, Microsoft senior engineer Dan Luu forecast a bountiful harvest of chip bugs in the years ahead.

"We've seen at least two serious bugs in Intel CPUs in the last quarter, and it's almost certain there are more bugs lurking," he wrote. "There was a time when a CPU family might only have one bug per year, with serious bugs happening once every few years, or even once a decade, but we've moved past that."

Thanks to growing chip complexity, compounded by hardware virtualization, and reduced design validation efforts, Luu argued, the incidence of hardware problems could be expected to increase.

This kind of Microsoft spam is not needed on the main page. Save that shite for personal journals. We really don't gain anything from hearing from Microsoft's marketeers unless they are paying for product placement. But I don't see that anywhere in the summary. So until then leave out the spam. If the topic is based in reality then it will have been covered already elsewhere. Cite those sources instead.


Your premise is mistaken. TFA is an article on The Register. The Register article begins with a two-sentence quote from a Microsoft engineer, and a one-sentence summary of his point, but that's it. The rest is original reporting.

I'd object to this article because it's so darn elementary - yes, chips can have bugs, and Spectre/Meltdown aren't the only chip bugs out there. The article is a few quotes sprinkled with a list of a few recent flaws. But there's no interesting analysis. It's basically "here are some recent bugs." It would be awesome to have an article making a case WHY bugs might be more frequent now than in the past - other than the quote from Luu, the article offers no real support for that position. This article seems like it was written by someone who doesn't really understand the subject and has nothing really to say (some would argue that's hardly unique on El Reg) - when your best argument is a three-year old blog post from someone ELSE who might know what they're talking about, you should be asking why this article matters...

As a former Intel validation engineer (Score: 1, Interesting) by Anonymous Coward on Wednesday January 31 2018, @05:27PM

This is true. I did some work on x64 (Itanium) and later the Pentium D. Most of the problems in our labs were due to faulty hard disks (the spinning kind). I headed Linux validation for servers around 2000--hopefully that helped?

Re:As a former Intel validation engineer (Score: 0) by Anonymous Coward on Wednesday January 31 2018, @05:42PM

x64 is not marketing speak (Score: 3, Informative) by Anonymous Coward on Wednesday January 31 2018, @06:41PM

x86-64 was AMD's original name
x86_64 was chosen by Linux people, based on the above
IA-32e was Intel's fucked-up name for it, since IA-64 was already taken by Itanium
AMD64 was AMD's response to Intel's attempt to claim the architecture as IA-32e
x64 was Microsoft's attempt to pick something simple and neutral

Re:As a former Intel validation engineer (Score: 2) by JoeMerchant on Wednesday January 31 2018, @06:51PM

Race to the bottom? I mean, can we at least have a reasonable option for a validated processor that works, and works correctly, instead of one that runs 10% faster but has bugs? Put another way, if there were 2 notebook PCs at NewEgg, identical in every way except that one had 2.4GFlops effective throughput on a typical task load - with 99.999% validated design, and another with 1.8GFlops performance on the same test, but with 99.99999% validated design - isn't there a market for the more reliable machine?

It's nowhere near that simple. They paid for a lot more expensive people (like me) for Xeon and Itanium validation than consumer stuff. Try ECC I guess? Don't overclock (and especially don't over-volt) your stuff! I ran a lab at Intel that did high temperature, high voltage stress tests on consumer (Pentium D), and we saw lots of errors. They basically died over a few months.

Re:As a former Intel validation engineer (Score: 3, Insightful) by JoeMerchant on Wednesday January 31 2018, @10:09PM

Well, on the one hand, you (and I) are "expensive," but when that cost is spread out over millions of copies it's not nearly as much, and I guess what worries me the most is the dismantlement of the validation program, because those things are a lot harder to set up than they are to keep running.
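The amortization point can be made concrete with a back-of-the-envelope sketch (all dollar figures and unit counts below are invented for illustration, not Intel's actual numbers):

```python
# Hypothetical amortization of a fixed validation budget over unit volume.
# All numbers are made up to illustrate the argument, nothing more.
validation_cost = 50_000_000   # total validation effort, dollars
units_server = 1_000_000       # a low-volume server line
units_consumer = 100_000_000   # a high-volume consumer line

# The same budget lands very differently per chip:
per_unit_server = validation_cost / units_server      # dollars per server chip
per_unit_consumer = validation_cost / units_consumer  # dollars per consumer chip

print(per_unit_server)    # 50.0
print(per_unit_consumer)  # 0.5
```

The same budget that adds $50 to a server chip adds fifty cents to a consumer chip, which is the "spread out over millions of copies" point above.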

Re:As a former Intel validation engineer (Score: 0) by Anonymous Coward on Thursday February 01 2018, @01:27AM

Intel has beefed up validation after various issues--we didn't lack for money in the department. You mention spreading cost out--that's why server chips are so expensive. You have expensive people like me validating chip designs that are sold in fewer quantities than the latest Android.

Re:As a former Intel validation engineer (Score: 2) by JoeMerchant on Thursday February 01 2018, @12:56PM

that's why server chips are so expensive. You have expensive people like me validating chip designs that are sold in fewer quantities than the latest Android.

So, I get tiered marketing and that you need to sell some product at a higher price point, but... wouldn't it make a kind of sense to pour the heaviest validation onto the line that sells the most copies? Maybe not a marketing "juice 'em for maximal profits" kind of sense, but a "don't be dicks to the world" kind of sense?

Re:As a former Intel validation engineer (Score: 1) by khallow on Thursday February 01 2018, @02:35AM

I mean, can we at least have a reasonable option for a validated processor that works

How would validation catch the Spectre [wikipedia.org] bug? It's derived from subtle observation of memory caching and timing delays of the cache queues. Can't validate what you don't know you need to validate. Even if the CPU manufacturers fully fix this one, how will we validate all possible interactions of the internal components of the CPU?

Re:As a former Intel validation engineer (Score: 2) by JoeMerchant on Thursday February 01 2018, @04:03AM

In our industry we have a fancy acronym that means: get a bunch of people who know something about the issues, force them to sit in a room and seriously consider them at least long enough to write a report and file it. Lately, there's a lot of handwringing around cybersecurity, and I'm constantly pinged by the junior guys who get worried about X, Y, or Z - and 9 times out of 10 it's nothing, but once in a while they bring up a good point, and some of those good points are things like Spectre - things nobody had considered before. Our development process on a single product goes on for a couple of years, the process calls for these cybersecurity design reviews periodically throughout those years, and over that time people do actually come up with this stuff. So, our reports analyze X, Y, and Z, and either write them off as adequately handled, or shut down the project until they are.

The real problem is culture - like the Shuttle launch culture that couldn't be stopped for handwringing over ice in the O-rings, or a big corporate culture that doesn't want to pay its own engineers to discover vulnerabilities in the product early enough to fix them before the rest of the world.

I just gave a mini-speech today that included: "it needs to be tested, if we don't test it our customers will."

Can't validate what you don't know you need to validate.

No, you can't - but, as world leading experts in the field you should be able to figure out most of the things you need to validate before the world figures them out for you. In the case of processors that serve separate users partitioned by hypervisor, the industry could have (and likely did) think of this exploit before the hacker community. As soon as they thought of it, they should have (and likely did not) feed that knowledge back into the design process to work out effective fixes for the next generation of processors.

We show that it is possible to construct hardware-software systems whose implementations are verifiably free from all illegal information flows. This work is motivated by high assurance systems such as aircraft, automobiles, banks, and medical devices where secrets should never leak to unclassified outputs or untrusted programs should never affect critical information. Such systems are so complex that, prior to this work, formal statements about the absence of covert and timing channels could only be made about simplified models of a given system instead of the final system implementation.

That's just one IEEE paper - if you look at the home-page of one of the authors (Wei Hu [ucsd.edu]), you can see many other papers in pdf format, including the full text of the above IEEE reference [ucsd.edu]. There are plenty of references to earlier work listed in that paper.

Note that hardware can be messed with below the gate-level. Nonetheless, techniques for validating processors have been around for decades, they have 'simply' not been used in the general commercial market as they have been regarded as too time-consuming, expensive, or resource hungry. Military and aerospace markets have had different priorities. High Assurance, as a discipline, has been around for a very long time.
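The gate-level information-flow idea from the paper above can be sketched with the textbook shadow-logic rule for a two-input AND gate (a toy illustration of the concept, not the paper's actual tooling): the output is marked tainted only when a tainted input can actually change it.

```python
def and_glift(a: int, ta: int, b: int, tb: int) -> tuple[int, int]:
    """2-input AND gate with gate-level taint tracking.
    a, b are data bits; ta, tb are taint bits (1 = carries secret info).
    Output taint is 1 only when a tainted input can affect the output."""
    out = a & b
    # A tainted input influences the output only if the other input is 1
    # (or also tainted): e.g. secret AND 0 is always 0, so nothing leaks.
    t_out = (ta & b) | (tb & a) | (ta & tb)
    return out, t_out

# A secret bit ANDed with constant 0 never leaks:
print(and_glift(1, 1, 0, 0))  # (0, 0) - output untainted
# ANDed with constant 1, the secret flows straight through:
print(and_glift(1, 1, 1, 0))  # (1, 1) - output tainted
```

Composing this shadow logic across every gate in a netlist is what lets such tools make statements about information flow in the implementation rather than in a simplified model.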

Nonetheless, techniques for validating processors have been around for decades, they have 'simply' not been used in the general commercial market as they have been regarded as too time-consuming, expensive, or resource hungry.

This. The key issue is the sheer impracticality of it as a likely NP-complete problem, but there are other issues as well.

Note that hardware can be messed with below the gate-level.

Hardware can also be messed with above the gate-level. Gates are merely an approximation.

Finally, an important way to simplify a CPU and make it more efficient is to share various sorts of resources. But such sharing increases the number and complexity of interactions between components of the CPU.

This is not impossible, but I think the value of validation is being overplayed in this thread.

Thanks for the reply. I heartily recommend the first reference I gave. Give it a read - it is not overly technical.

You are likely right that the general problem is probably NP-complete: or at least difficult, if you assume things like unbounded memory and unbounded state-tables. However, if you place bounds on such things, the problem becomes tractable.
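The effect of bounding the problem can be shown with a deliberately tiny example (everything here is invented for illustration): once the input space is finite and small, "no illegal information flow" can be checked by brute-force enumeration rather than general proof.

```python
def device(secret: int, public: int) -> int:
    """A tiny two-input 'circuit' to verify: the observable output
    must not depend on the secret input (noninterference)."""
    return (public * 3 + 1) % 16  # deliberately ignores `secret`

def verify_noninterference(bits: int = 4) -> bool:
    """Bounded verification: enumerate the whole (finite) input space and
    check that varying the secret never changes the observable output."""
    for public in range(2 ** bits):
        outputs = {device(secret, public) for secret in range(2 ** bits)}
        if len(outputs) != 1:  # the secret influenced what is observable
            return False
    return True

print(verify_noninterference())  # True for this circuit
```

Real hardware has vastly larger state spaces, which is where the symbolic and compositional techniques in the literature come in; but the principle is the same: bounded state makes the question decidable.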

I put 'simply' in scare quotes because cost is a driver to the bottom as far as commercial business systems are concerned. If a business can make a short-term gain by ignoring security requirements, it will. You can keep the plates spinning for a while...

It is not impossible to produce formally-proven systems, merely difficult, and you have to be discerning about your axioms. As long as people choose cheapness over correctness, we will continue to have problems like Meltdown, Spectre, and multifarious side-channel attacks. It probably doesn't matter for most business systems, but aerospace will continue to provide a proving ground for such things, hopefully followed by medical applications (do you want your pacemaker to be hackable?). I hope that at some point in the future, the benefit of formally-proven systems will outweigh the cost-increment over the slapdash approach currently used. I don't think that time will come soon, unfortunately.

Re:As a former Intel validation engineer (Score: 2) by Wootery on Thursday February 01 2018, @10:11AM

How reliable do you want? Server hardware is pretty good, no? If you want near-perfection, there are CPUs out there rated for safety critical systems, but it'll likely cost you 50x the price, and the performance won't be anywhere close to that of a modern Intel CPU.

Fun fact: the RAD750 [wikipedia.org] radiation-resistant PowerPC chip clocked at 200MHz, from 2002. Its unit cost: around $200,000, back then when that was real money.

If that same $V effort were applied to the high-volume product line (Nhuge units), $V/Nhuge might come to 0.05x the price of the chips, or less. More importantly, it would also slow delivery of product by x months on average, which is a perceived competitive cost...

I say perceived cost because, often, I will buy a generation, or sometimes two, back from the bleeding edge just because they are the devils whose faces I know - Skylake was a clusterfuck, and only now am I starting to feel confident that we can deal with all of its quirks in a product. The performance gains of the next couple of generations are nice, but truly unnecessary for any application I have. Bugs, driver glitches, field patches - lack of those all matter much more to me.

Re:As a former Intel validation engineer (Score: 2) by JoeMerchant on Thursday February 01 2018, @09:21PM

Not talking about Facebook itself profiting, talking about the mass market electronics consumers of the world (Facebook users, among others) and their "collective wisdom" with respect to reliability, security, etc. For every Facebook server machine, there are hundreds of users who access it via multiple consumer gadgets each - that's the market that needs a nanny.

Re:As a former Intel validation engineer (Score: 2) by JoeMerchant on Thursday February 01 2018, @01:08PM

Nothing, of course, except that it's orders of magnitude better than 99.999%. When you're talking about catching the next Spectre before it's exploited in the wild, there are no metrics that mean anything, but effort invested in looking for the problems does pay off in proportion to the amount of effort invested.

It seems to be mostly an intel issue at this point. I really never had any opinion on cpus/gpus either way but seeing the recent PR attempt to muddy the waters has turned me anti-intel. Maybe people who matter to them aren't thinking the same, but it seems like a dangerous strategy. They are clearly not to be trusted.

First of all, Spectre and Meltdown are different. You can read details here [meltdownattack.com]

Spectre is a flaw where "speculative execution" can leak information (this is where a processor executes a branch of code that MIGHT be needed, but only in theory stores the result if it matters). The problem with speculative execution is that it's not checked whether a given command SHOULD be executed (for example, if the program has the right access level to execute the code). However, this security issue wasn't seen as a problem, because (in theory) the result of the speculatively executed code would be thrown away if it couldn't be used. So, it might be a mechanism to let untrusted code access core kernel memory (which is Very Bad), but it was thought to be acceptable because nobody could see the result. The problem is that CPU caching could "leak" those results and be visible to other code.
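The "cache leaks the thrown-away result" mechanism can be modeled in a few lines (a toy simulation of the idea, not a real exploit; no actual timing or speculation happens here - cache membership stands in for the "fast access" timing signal):

```python
# Toy model of the cache side channel described above.
SECRET = 42  # a value the victim code never returns architecturally

def victim_speculates(cache: set) -> None:
    """Models speculative execution: the architectural result is discarded,
    but the cache line indexed by the secret is left warm as a side effect."""
    cache.add(SECRET)  # the side effect survives even though the result doesn't

def attacker_probe(cache: set) -> int:
    """Models the timing probe: a 'fast' (cached) access reveals which
    line the victim's speculation touched."""
    for guess in range(256):
        if guess in cache:  # in reality: access time below a threshold
            return guess
    return -1

cache = set()             # attacker flushes the cache first
victim_speculates(cache)
recovered = attacker_probe(cache)
print(recovered)          # 42 - the secret leaks via cache state alone
```

The real attacks replace the `set` with measurements of memory access latency (flush the probe array, let the victim run, then time each line), but the information flow is the same.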

Spectre affects pretty much ALL manufacturers' chips - the official paper [spectreattack.com] explicitly references Intel, AMD, and ARM architectures as being affected.

Meltdown is different - it's a side-channel attack on kernel memory that relies on the side effects of certain legal, carefully crafted code and information about the location and layout of memory to "leak" information, including kernel memory. Meltdown exploits out-of-order execution rather than branch-prediction-based speculation.

The proof of concept attack for Meltdown detailed officially [meltdownattack.com] only works against Intel hardware, but the paper specifically cautions that there's no reason to expect that AMD wouldn't be susceptible to a similar attack.

All people really care about is meltdown since patching for spectre seems to have minimal impact on performance. It is to the point where meltdown mitigations are being needlessly enabled for amd processors just to not make intel look so bad[1]. AMD says:

No, Meltdown is not applicable to AMD processors. AMD has already stated they do bounds checking when userland asks to read kernel memory to prevent this sort of thing. Something Intel inexplicably didn't think of or totally screwed up.

Also, there is a "near zero" chance that Spectre variant 2 can be exploited on AMD processors. It sounds like both AMD and Intel are equally impacted regarding variant 1. Spectre is far more difficult to take advantage of in general.

Depends. Meltdown, the currently known dangerous one, is definitely Intel and possibly a few other Intel-designed chips. Spectre, the one that is *relatively* harmless, so far, is present in both Intel and AMD... except a few really low-end models.

Meltdown has currently known exploits that can work through the browser if you allow Javascript. It also has several other exploit modes. Spectre doesn't *yet* have any known useful exploits. But it almost certainly will.

P.S.: I'm not an expert here, there are several classes of Spectre, and I can't distinguish between them. If you're interested there's lots of info on the web, but unless you're working in the field distinguishing between them doesn't seem useful to me.

The reason to distinguish between them for the average person is the performance impact of the mitigation. Everyone expects a constant stream of bugs/vulns these days anyway, but not that patching for them will slow everything down to half speed or whatever. That is where intel has the main problem (according to what I've read).

Back when I was keeping track, and when both released Errata (functionally, the list of known bugs), AMD's errata list was generally about 3 times as long as Intel's. AMD dealt with this by not releasing any more errata lists.

The 12nm Zen+ is coming out this year, 7nm Zen 2 coming out next year presumably. Some were predicting that Spectre would be lingering in upcoming chip generations since it can just be addressed with a patch, but that's mostly not the case.

Intel is taking a lot of heat lately, but all the first-run Ryzen processors from AMD have a bug that causes random segfaults, especially when compiling under linux (a not uncommon occurrence if one likes to linux).

Here is an actual tech support letter I received from AMD. Some identifying information has been changed or obscured, otherwise it's 100% as I received it.

Your service request : SR #{ticketno:[######6680]} has been reviewed and updated.

Response and Service Request History:

Thank you for your email and background information about your issue. I’m sorry to hear that you’re experiencing stability issues with your system. Please be assured that I am here to help find a resolution to your problem

At this time, I would like focus on your system’s hardware configuration. I need to collect some more information about your system which can help with our troubleshooting.

Please provide the details of the following hardware components in your system:

Make and model of motherboard

Motherboard BIOS version

Make and model of RAM

Make and model of the power supply unit

Please could you let me know the current settings you have for the CPU VCORE, SOC, and RAM? It would be very helpful if you could provide with pictures of your BIOS screens with these settings.

In addition, through troubleshooting with other customers we have found that the layout of the components inside the system case have caused sub-optimal cooling of the CPU causing a variety of issues.

I would like to better understand your system cooling to rule out any thermal issues. Please could you provide a picture of the whole interior of your system showing the CPU cooler?

Also, could you let me know the reported CPU temperature during heavy load or when the errors occur?

In order to update this service request, please respond, leaving the service request reference intact.

Best regards,

Asok

AMD Global Customer Care

That's right, their answer was basically "pics or it didn't happen." I am working to comply with their request. Also, they sent me this before linux 4.15 was released, wanting to know what temperature was reported--and 4.15 is the first kernel version to feature Ryzen CPU temperature reporting.

The way I read their answer is: "we need accurate information to be able to figure out what the problem is". Which makes perfect sense. Many people are highly inaccurate when giving descriptions of things that went wrong (which is perfectly normal and nothing to blame them for), and in complex systems that can easily mean the problem solver keeps looking in the wrong places and won't come near pinpointing and solving the problem. You need to help the problem solver to help you by being accurate, and this problem solver is helping you to be accurate by asking for pictures. Just work together for the best result.