
Researchers working at Microsoft have analyzed the crash data sent back to Redmond from over a million PCs. You might think that research data on PC component failure rates would be abundant given how long these devices have been in-market and the sophisticated data analytics applied to the server market — but you’d be wrong. According to the authors, this study is one of the first to focus on consumer systems rather than datacenter deployments.

What they found is fascinating. The full study is well worth a read; we’re going to focus on the high points and central findings. There are two limitations to the data collected that we need to acknowledge. First, the data set we’re about to discuss is limited to hardware failures that actually led to a system crash. Failures that don’t lead to crashes are not cataloged. Second, the data presented here is limited to hardware crashes, with no information on the relative frequency of software to hardware crashes.

CPU overclocking, underclocking, and reliability

When it comes to baseline CPU reliability, the team found that the chance of a CPU crashing within 5 days of Total Accumulated CPU Time (TACT) over an eight-month period was relatively low, at 1:330. Machines with a TACT of 30 days over the same eight months of real time have a higher failure rate, of 1:190. Once a hardware fault has appeared, however, it’s 100x more likely to happen again, with 97% of machines crashing from the same cause within a month.
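To make the recurrence math concrete, here is a minimal sketch of how conditional failure probabilities of this kind (the Pr[1st], Pr[2nd|1], Pr[3rd|2] figures used in the chart below) could be estimated from per-machine crash counts. The counts and machine names are hypothetical, purely for illustration; this is not Microsoft’s telemetry format.

```python
from collections import Counter

# Hypothetical input: machine-check crashes observed per machine over the
# study window. Values invented for illustration only.
crash_counts = {"machine_001": 0, "machine_002": 3, "machine_003": 1,
                "machine_004": 0, "machine_005": 2}

total = len(crash_counts)
at_least = Counter()                     # at_least[k] = machines with >= k crashes
for crashes in crash_counts.values():
    for k in range(1, crashes + 1):
        at_least[k] += 1

pr_first = at_least[1] / total                      # Pr[1st]
pr_second_given_first = at_least[2] / at_least[1]   # Pr[2nd|1]
pr_third_given_second = at_least[3] / at_least[2]   # Pr[3rd|2]

print(pr_first, pr_second_given_first, pr_third_given_second)
```

The key point is that each probability after the first is conditioned on the previous failure having already occurred, which is why recurrence rates can sit so far above the baseline rate.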

Overclocking, underclocking, and the machine’s manufacturer all play a significant role in how likely a CPU crash is. Microsoft collected data on the behavior of CPUs built by Vendor A and Vendor B (no, they don’t identify which is which). Here’s the comparison chart, where Pr[1st] is the chance of the first crash, Pr[2nd|1] the chance of a second crash given that a first has occurred, and Pr[3rd|2] the chance of a third given a second. In this case, overclocking is defined as running the CPU more than 5% above stock.

Are Intel chips just as good as AMD chips? At stock speeds, the answer is yes. Once you start overclocking, however, the two diverge. CPU Vendor A’s chips are more than 20x more likely to crash at OC speeds than at stock, compared to CPU Vendor B’s processors, which are still 8x more likely to crash. The report notes that “After a failure occurs, all machines, irrespective of CPU vendor or overclocking, are significantly more likely to crash from additional machine check exceptions.” The team doesn’t break out overclocking failures by percentage above stock, but their methodology does prevent Turbo Boost/Turbo Mode from skewing results. Does overclocking hurt CPU reliability? Obviously, yes.
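Because that 5% threshold does a lot of work in the comparison, here is a minimal sketch of how a machine might be classified as overclocked. Comparing the observed clock against the part’s rated maximum (turbo) frequency is our assumption about one way to keep Turbo Boost from registering as an overclock; the paper’s exact mechanism isn’t spelled out here.

```python
def is_overclocked(observed_mhz: float, rated_max_mhz: float,
                   threshold: float = 1.05) -> bool:
    """Flag a machine as overclocked if its observed sustained clock exceeds
    the part's rated maximum (turbo) frequency by more than 5%. Using the
    rated maximum rather than the base clock keeps ordinary Turbo Boost
    operation from being counted as an overclock (our assumption)."""
    return observed_mhz > rated_max_mhz * threshold

# A 3.4GHz-rated part running at 3.8GHz sustained counts as overclocked;
# one running at 3.45GHz (within 5% of its rating) does not.
print(is_overclocked(3800, 3400))   # True
print(is_overclocked(3450, 3400))   # False
```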

So what about underclocking? Turns out, that has a significant impact on CPU failures as well.

As you can see, underclocking the CPU has a significant impact on failure rates. The impact on DRAM might seem puzzling — the researchers only reference CPU speed as a determinant of underclocking, rather than any changes to DRAM clock rate. Our guess is that the sizable impact on DRAM is caused by a slower CPU alone rather than any hand-tuning of RAM clock, RAM latency, or integrated memory controller (IMC) speed. IMC behavior varies depending on CPU manufacturer and product generation in any case, while the size of the study guarantees that a sizable number of Intel Core 2 Duo chips without IMCs would still have been part of the sample data.

Laptops vs. desktops, OEM vs. white box

Ask enthusiasts what they think about systems built by Dell, HP, or any other big brand manufacturer, and you aren’t likely to hear much good. Actual data proves that major vendors actually have fewer problems than the systems built by everyone else. The researchers identified the Top 20 computer OEMs as “brand names” and removed overclocked machines from the analysis of the data. Only failure rates within the first 30 days of TACT were considered among machines with at least 30 days of TACT. This is critical because brand name boxes have an average of 9% more TACT than white box systems, which implies that the computers are used longer before being replaced.
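As a rough illustration of that windowing, here is a minimal sketch of the 30-day TACT comparison; the per-machine records and field names are hypothetical, not the study’s actual data.

```python
# Hypothetical per-machine records: total accumulated CPU time (TACT, in days)
# and the TACT at which the first hardware crash occurred (None if no crash).
machines = [
    {"tact_days": 45.0, "first_crash_at": 12.0},
    {"tact_days": 31.0, "first_crash_at": None},
    {"tact_days": 28.0, "first_crash_at": 5.0},   # excluded: under 30 days of TACT
]

WINDOW = 30.0
eligible = [m for m in machines if m["tact_days"] >= WINDOW]
failed = [m for m in eligible
          if m["first_crash_at"] is not None and m["first_crash_at"] <= WINDOW]

rate = len(failed) / len(eligible) if eligible else 0.0
print(f"Failure rate within first {WINDOW:.0f} days of TACT: {rate:.3f}")
```

Restricting both groups to the same 30-day window is what keeps the extra usage on brand name machines from inflating their apparent failure rate.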

White box systems don’t come off looking very good in these comparisons. CPUs are significantly more likely to fail, as is RAM. Disk reliability remains unchanged.

How about laptops? The researchers admitted that they expected desktops to prove more reliable than laptops due to the rougher handling of mobile devices and the higher temperatures such systems must endure. What they found suggests that laptop hardware is actually more reliable than desktop equipment, despite the greater likelihood that mobile systems will be dropped, sat on, or eaten by a bear. Again, overclocked systems were omitted from the comparison.

Desktops don’t come off looking very good here despite their sedentary nature. The team theorizes that the higher tolerances engineered into mobile CPUs and DRAM, combined with better shock-absorbing capabilities in mobile hard drives, may be responsible for the lower failure rate. The difference between SSDs and HDDs was not documented.

More data needed

The limitations of the study are such that we can’t draw absolute conclusions from this data, but the findings suggest a need for better analysis tools and indicate that adopting certain technologies, like ECC memory, would help improve desktop reliability. It’s one thing to say that overclocking hurts CPU longevity; it’s something else to see that difference spelled out in data. The impact of underclocking was also quite surprising; this is the first study we’re aware of to demonstrate that running your CPU at a lower speed reduces the chance of a hardware error compared to stock.

The Microsoft team conducted the research as one step towards the goal of building operating systems and machines that are more tolerant of hardware faults. The fact that systems which throw these types of errors are far more likely to continue doing so strikes at the idea that such problems are random occurrences, as does much of the reliability information concerning DRAM.

The report throws doubt on a good deal of “conventional” wisdom and implies that PC hardware reliability is sorely lacking. More data is needed to determine why that is, and to correct the problem.


Comments

Bryan_S

Too little information to make a conclusion… Shatter the enthusiasts’ myth… I highly doubt those million were enthusiasts…

John Pombrio

Interesting read. Realize that the data came from errors reported in 2008. However, I would not take as strong an opinion as the authors that overclocking has that much effect on the crash rate. The paper states that only 2% of the machines that reported failures were overclocked at more than 5%. In trying to set my highest o/c, I could have several failures in a row until I found a reliable speed. This alone would create a huge “bubble” of failures that could easily skew the failure-rate data for o/c. Another factor to consider is that higher clock speeds alone account for more errors, so o/c at higher speeds would show more errors overall. Finally, there are so MANY factors that the study cannot measure, such as memory o/c, actual hardware failures vs. a temporary hardware fault in the kernel, and software failures that mimic a kernel fault.
It’s a start but hopefully folks will not read into it (like this article did) that o/c is solely responsible for many hardware crashes.

VirtualMark

You made some good points. I overclocked mine to 5GHz, but had to run the voltage higher than I liked. So I backed the speed down to 4.5GHz; it’s rock solid now. Did a 24hr stress test, no problems. So I don’t think overclocking is unstable if you do it properly.

Joel Hruska

John,

2% of a million is still 20,000 machines. That’s quite a few. Furthermore, a report is only sent to Microsoft if you choose to send it. Presumably after OCing (and crashing) you don’t send in a report every single time.

Finally, the fault-type control targeted MCEs, not all hardware faults. Errors that did not result in an MCE were not evaluated. The issues you discuss with regard to actual vs. temporary hardware faults were controlled for.

higherstandard2

The submissions wouldn’t be creating a “bubble” that misrepresents the data, though. If you have several errors in a row that are caused by the OC, then it is a valid submission of the error. Just because an individual experiences errors while trying to dial in an appropriate overclock of their system doesn’t mean that the errors never happened; those particular errors are solely a result of the OC. It actually backs up the reasoning that the systems are more stable at their intended clock speed.

(This isn’t saying that I would never or haven’t overclocked a computer – just think back to the days when you were better off buying a Celeron 300A (Malaysian built) processor that was stable to MHz at half the cost of a PII 400!)

WaltzinMatilda

My husband has built me several PCs. All have been trouble free.

The current one is 4-5 years old, runs like a clock, and has never missed a beat.

The custom PCs he builds are the same.

By contrast, his shop sees a constant parade of Dells and HPs.

:)

Waltzin Matilda

http://twitter.com/derekarnold Derek Arnold

Your anecdotal point doesn’t mean anything scientifically. It’s just something to make you and your husband feel good.

http://codeflow.org/ Florian Bösch

It does mean something. Building systems is a science/art/craft, acquired over many years of hard trial and error, lessons learned, blog posts consumed, and so on. It means that if you were to examine this scientifically, you’d find that people who can build reliable systems can do so repeatedly. And people who can’t will build crappy systems until they learn to do it better.

AndrewBinstock

Derek’s right. A single data point means nothing. The fact that you choose to extrapolate her comment to mean all kinds of things she never said does not make her sole data point any more meaningful.

shoalcreek5

What would be meaningful would be to see some measure of variance or deviation of the data.

My hypothesis would be in line with Florian and WaltzinMatilda–that skilled system builders would have slightly lower failure rates than assembly line machines, while hobbyists and average users that built their own machines would tend to have much higher failure rates than assembly line computers.

My hypothesis would have support if the variance in whitebox failure rates was higher than the variance in name brand computer failure rates. Lower or about the same variance in the two data sets would disprove my hypothesis.
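As a rough sketch of how that comparison could be run, using Levene’s test for equality of variances (the per-builder failure rates below are made up purely for illustration, since the study doesn’t publish per-builder data):

```python
from scipy.stats import levene

# Hypothetical per-builder failure rates (fraction of each builder's machines
# that crashed). Invented numbers, for illustration only.
whitebox_rates  = [0.02, 0.09, 0.01, 0.15, 0.04, 0.12, 0.03]
brandname_rates = [0.04, 0.05, 0.06, 0.05, 0.04, 0.06, 0.05]

stat, p = levene(whitebox_rates, brandname_rates)
print(f"Levene W = {stat:.2f}, p = {p:.3f}")
# A significantly larger spread among white-box builders would be consistent
# with the idea that skilled builders do better and careless ones do worse.
```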

bobdvb

I agree, and in particular because we know that HP+Dell ship substantially more volume, their machines are more likely to appear for repair. If Mr Waltzin is selling half a dozen custom machines a week he’ll be doing OK, but by comparison HP+Dell are shipping containers full every day.

lostit2

Ok Derek,
scientifically maybe not if you say so, but as empirical data it’s valid.
I have historically received Compaq/HP’s with faults from new. Also brand name routers DOA.
I have installed systems that have accumulated over 700 days – no reboot (no Windows though).

As far as overclocking is concerned, a redline is a redline, if you know better go F1 racing, otherwise follow the instructions.

@AndrewBinstock – WaltzinMatilda’s example is not spurious; if a view is only meaningful to the holder, withdraw yours.
@pyalot:disqus I’m with you, use what you have confidence in.

far1a

In my opinion much of this data is meaningless. Take a look at desktop vs. laptop, for example: there are a lot of custom-built desktops, and many of those are very badly put together. When I was younger I often built PCs with high-end graphics cards and cheap RAM and motherboards; needless to say, those machines weren’t very stable. If you build a desktop following compatibility lists, common sense, and overall know-how, it will be many times more stable than a laptop.

shoalcreek5

This data is aggregate and may or may not be similar to individual experience. I stress my computers when I first get them or build them to reveal any hardware weaknesses before the return period or warranty period is up. Some laptops just will not crash from hardware failure until after many years of hard use and abuse, while others crash every time you put a moderate load on all CPU cores.

For example, my little Toshiba powers through everything I throw at it. Sometimes I have to put it on an active laptop cooler to keep CPU temp in the acceptable range, but it keeps on going. In contrast, I used to have an Acer that looked way better on paper than my Toshiba, but it would blue screen in Windows and core dump in Linux (I dual boot) every time I even moderately loaded any 2 of the 4 CPU cores.

Typo333

“Ask enthusiasts what they think about systems built by Dell, HP, or any other big brand manufacturer, and you aren’t likely to hear much good. Actual data proves that major vendors actually have fewer problems than the systems built by everyone else.”

I think you’re missing the point. People who can install their own RAM value being able to fix it themselves more than they value it not crashing. Note that even brand-name systems did not have a 0% defect rate. So if your definition of “problem” is simply “crashes (ever)”, then yes, brand-name systems do better. But when a machine crashes, you don’t just throw it in the trash. On a home-built system, you can open it up and replace a bad part pretty easily, and home-builders probably have extras of many parts lying around anyway. On a brand-name system, you may have to special-order a part, e.g., do Dell power supplies still use custom pinouts that fit an ATX power supply plug but fry your motherboard if you use an actual ATX power supply?

It’s like static typing in programming languages. Can static typing find problems? Sure, it can happen. Even fans of dynamically typed languages will admit that. But type errors just aren’t that common. The kind of people who want that are the kind of people who want to pretend everything can work from the start, like someone who buys a brand-name PC and assumes that it can’t crash. If you know what you’re doing, though, the ability to work more quickly, and fix problems when (not if) they occur, is much more useful in the long run.

Joel Hruska

Typo333,

The report speculates that white box builders may be willing to cut corners more than Dell/HP/Lenovo. Certainly the failure rates imply an overall difference.

Unfortunately there’s no way to drill deeply enough to break out “Well made enthusiast boxes” from “Used cheap crappy parts.”

But to answer your overall question, AFAIK, the days of custom PSUs are long gone. On a desktop, you can always replace RAM, HDD, and CPU (though CPU choices may be limited by model, while RAM slots might be limited as well). Even lower-end systems often have a PCI-Express slot, and PSUs are built to the standard spec.

It won’t be much longer before the PCI-E x16 slot matters rather little. An x1 PCI-E slot built on 3.0 will deliver the same bandwidth as an x4 slot from the original PCI-e standard. That’s not much by enthusiast standards, but it’s far better than the 133MB/s we were stuck with on PCI when OEMs neglected to drop in an AGP card years ago.
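For reference, here is the rough per-lane arithmetic behind that claim (approximate throughput after encoding overhead; treat the numbers as ballpark figures):

```python
# Approximate per-lane throughput in MB/s after encoding overhead:
#   PCIe 1.x: 2.5 GT/s with 8b/10b encoding  -> ~250 MB/s per lane
#   PCIe 2.0: 5.0 GT/s with 8b/10b encoding  -> ~500 MB/s per lane
#   PCIe 3.0: 8.0 GT/s with 128b/130b coding -> ~985 MB/s per lane
per_lane = {"PCIe 1.x": 250, "PCIe 2.0": 500, "PCIe 3.0": 985}

x1_gen3 = 1 * per_lane["PCIe 3.0"]    # ~985 MB/s
x4_gen1 = 4 * per_lane["PCIe 1.x"]    # ~1000 MB/s
legacy_pci = 133                      # classic 32-bit/33MHz PCI, shared bus

print(x1_gen3, x4_gen1, legacy_pci)   # 985 1000 133
```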

bobdvb

I wouldn’t suggest that white label designers cutting corners is the reason; I would suggest that the brands are able to qualify the components better for the combination of hardware. For any given chipset, motherboard, CPU, PSU, and cooling profile there is an optimum design, and it is the business of the brands to ensure they reach that optimum not just the first time but every time. The brands spend weeks trying dozens of combinations of components, but the average Joe can’t afford to do that.

Zoel Krieger

I like the recurrent-failure concept. However, I offer a cause: most users reboot on the first failure and *ignore it* as just the cost of using a PC. It’s only after that second failure that we start looking for a cause and something to fix. It also depends on how predictable an error is. If I know a specific game causes my video driver to throw an error and crash the PC, I might put up with that failure, waiting for a patch from the manufacturer to address it. So the recurrence finding, in my mind, should have been expected rather than surprising.

Do you know any helpdesk that doesn’t tell people to first reboot and see if the problem repeats?

hotteamix

They also made Metro based on gathered user data. Take that as you may.

Ron007

They also designed the Ribbon using statistics.
“There are Lies, damned lies, and statistics”
Statistics can be twisted to meet any bias.

http://codeflow.org/ Florian Bösch

You know it’s faulty to analyze custom-built boxes as if they were all built by the same “DIY person manufacturer,” just as all Dells come from Dell.

The most fundamental truth (not speculation, not even a statistic, truth) is this: if you stuff crap into your box, you get a crap box. If you pick crap RAM, crap hard disks, crap processors, crap motherboards, etc., you cannot reasonably expect to get a non-crap system.

And this my friend is where you see custom built systems really differ from Dell or HP. If you build a box, you know every single component that goes in there. You know exactly why you picked the component. And you know where you cut corners.

Now not all custom-built boxes are the same; some people have been doing this longer than others, and philosophies differ. But I guarantee you, and this is again truth, not statistics, not speculation: if you get an experienced system builder to build you a box, from scratch, and tell him not to skimp on any component, that system will be *VASTLY* superior to anything that Dell or HP peddles.

Marrach

Thank you for a cogent statement.

For me the term ‘White Box’ covers too much territory. Whatever my opinion of Dell or HP may be– I can presume ‘Consistency’ from any particular line of their PC’s.

Even with a conscientious Tech in a local PC shop building a ‘White Box’ for a customer off the street- he can be hamstrung with price requirements– Splurge on the CPU, but the buyer doesn’t want to Pay for or wait for the properly matched and tested Memory…or a bad component matchup– usually a gaming Video Card. Ordinary Folks can’t really understand that component SPECs MEAN something! Motherboard components are not Lego Blocks. Just because it’ll fit the socket, doesn’t mean the two components will make nice with each other.

When I build a Server, I’m gonna spend more for that Network Application Box than I will for the plain PC that just accesses the Application. And for the Workstation that is only gonna access eligibility Websites and make appointments, I will not splurge on CPU or Memory or Speed…but the Front Desk ladies will have a PC that does EXACTLY what they need in order to do their job. That’s why they last for over 10 years.

When I RARELY build, or help build, for a friend– I ALWAYS ask them– WHAT are they gonna use the PC for? WHICH PROGRAMS are they gonna use? Then I decide on the components and give them the list– with the proviso that they either accept it and let me build it– or they can call 1-800-DELL and spend at least $500 more.

Just the same, the Statistic results were interesting. The Laptop figures were particularly brow raising.

Daniel Revas

I actually built Business Class Desktops, Workstations, and Servers for HP. I don’t own an HP anything!

They had an entire line building Desktops where no one had ANY background in computers except the line supervisor, and the primary language on the line was Spanish. This is in the mid-west mind you.
Care to guess what the QA failure rate for that line was like compared to the other lines?

Predictably horrible. Oh, and before you start screaming that I’m being “racist,” I happen to be Hispanic. The problem was a lack of competency to be doing the work.

Not that the other Temps that they often brought in as fillers were much better. Though in all fairness, the QA people worked aggressively to catch the problems. But watching those lines work was at times like watching sausage being made…you really don’t want to know how it was done.

http://www.lebenslustiger.com/ lebenslustiger

good study – white box though might have quite a wide spread on it – depending on how good the person/shop is that assembles the machines.

actionjksn

There are a lot of variables as far as whether an overclocked machine is going to be reliable or not. Such as: did they build a PC with a high-quality name-brand motherboard with 100% solid capacitors and other extra-durable components? Did they use a high-quality power supply with low noise and ripple, tight voltage tolerances, and over-spec’d wattage? What kind of CPU cooling and case ventilation? Did they use RAM that was designed for overclocking? And after using all these high-end parts, did they try to really push it to the limit?

For instance, I built my Core 2 Quad Q9550 system with all top-shelf parts and overclocked from 2.83GHz to 3.4GHz at stock voltage. I did this several years ago and left SpeedStep on. It stays overclocked 24/7 and has never crashed. But if someone else had the same machine they likely would have gone for 3.6GHz and then 3.8GHz+, and it would probably go considerably higher.

So to sum it up, overclocking failures are caused by people using junk components, not knowing what they’re doing, and pushing systems beyond their limits, not by overclocking in itself. Another thing is that otherwise identical processors have more or less headroom depending on how lucky the purchaser got in the binning process. And just think how many people overclock who don’t know what they’re doing and can’t afford the right components for a properly overclocked machine, and how many don’t even really care if they crash the machine, because they want to see how high they can go. I’m not really interested in max OC because I know there is a diminishing return on investment the higher you overclock, and I want to make my parts last.

http://pulse.yahoo.com/_VAXUPLWSPNK55YLKNWXTOQHOPY q

Apple propaganda around the corner, again! You should at least try to conceal the Apple-coined use of the “PC” term (with the meaning you’re using) a little more, or use a more specific name: Windows machines.

Stacey Bright

There is one obvious problem with this data in regards to overclocking. That is the fact that overclockers typically run some sort of stability test that will cause a crash or crashes while attempting to find their hardware’s limits. These crashes are almost expected, to an extent, by the user until they find the point which is deemed stable. This would also skew the data for subsequent crashes from the same cause, due to the ‘trial and error’ nature of overclocking. It’s most likely impossible for them to discern whether any data was from PCs deemed seemingly stable and crashing at a time other than during a torture test. The only time you’re guaranteed subsequent crashes from overclocking attempts is from booting into Windows with untested and unstable memory settings, which usually results in the need of an OS reinstall.

Tractorman2011

I’m with the people who say it would be nice if there were more information so we could make better conclusions. For instance, it would be great if we could control for a few issues: were the computers that got the bad electrolytic capacitors a few years ago (Dell, Asus, and others–maybe HP? Been a while) overrepresented or underrepresented in these statistics? What fraction of the people who got the “report the problem” popup actually sent it in and were the computers used by this group different from the ones who didn’t send the report? Are people who build their own computers (probably in the “white box” group here) more or less likely to send the report?

FWIW, these are the kinds of issues facing any researcher studying “real world” phenomena. The responsible ones admit the limitations of their studies, but it doesn’t stop the people trying to solicit clicks or newspaper sales from announcing that the study answered age-old questions, changed the lives of millions, or shattered the myths. When you read the headline, you can almost hear the “here we go again” from people experienced with interpreting research.

http://profile.yahoo.com/2BBKFXBVDFK3JYTMQJOO3IY5RI Richard

First off, this is only reported errors/failures; most people don’t send info to MS or any other manufacturer because they know that a lot more info is sent than they are comfortable with.
Also, I have built systems with white box boards and major manufacturers’ boards, and guess what, they all come from the same few factories. Do some research… As long as you use the right tools and parts, custom systems last twice as long and have fewer failures. Most of mine run eight years plus with no hardware failures. Most failures I have found are software related.

Arth86

So I should trust Microsoft, a corporation whose software stability track record can’t hold a candle to most open source products made by similarly skilled communities, to tell me the foolishness of creating my own hardware configurations?

Thank you for the offer, but I’ll stick to what works for me.

Metasonix metasonix

These comments are Moronville. But then, this is an overclocking gamer site.
Hardcore gamers are generally arrogant dullards. Anyone stupid enough to
overclock a PC, use it to play EA sports games, and worship Gabe Newell,
isn’t someone I would ask “opinions” of. He’s bought and sold, like a sausage.
Old-fashioned cannon fodder.

You clowns need reminding that modern PCs are staggeringly more efficient,
reliable, etc. etc. than mainframes or minicomputers of 30 years ago, and thus
you have no right to complain, or say anything. It’s like an improvement of millions,
or maybe hundreds of millions.

Nothing, IN HISTORY, has EVER improved as quickly as the personal computer.
Nothing. If things had proceeded at the usual pace of development, a PC
capable of running Windows XP would be the size of a very large refrigerator,
cost $500,000, and break down every 2-3 weeks. You’d be lucky to clock the thing
at 50 MHz, too. Modern PCs are so cheap they’re disposable.

But of course, you’re not listening.

Sometimes I wonder if cheap, fast PCs are a substantial reason why our fellow
Americans act like spoiled children and vote for right-wing lunatics who enact
laws that harm them directly, by making jobs scarce and bankrupting people.
It can’t be just the high-fructose corn syrup, and it’s damn well not the chemtrails.
(How many ExtremeTech regulars believe in chemtrails, I wonder?)

Joel Hruska

Wow. Did someone crap in your Cheerios on Tuesday morning?

I mean, while you’re at it, maybe you should make some comments on race, gender, and sexual orientation. Just to make sure you offend everyone equally. Consistency is important.
