OEM is more stable than white box. Laptops are more stable than desktops. An underclock of as little as 0.5% has a huge impact on stability. Overclocking has a substantial likelihood of failure. Once a hardware crash/failure has occurred in any of the three measured components (CPU, memory, disk), you are doomed from there forward. Disk failures have the most rapid recurrence rate.

Many of us won't be shocked by some of these details (and more within the document). In particular, I find the last one, regarding disks, the least shocking of all.

"Welcome back my friends to the show that never ends. We're so glad you could attend. Come inside! Come inside!"

CPU machine check exceptions are more likely to cause an OS crash than DRAM errors. I was initially somewhat surprised by this; however, after thinking about it a bit more, it makes sense. The whole point of machine check exceptions is to prevent the CPU from doing something off the wall, essentially a virtual panic button to immediately shut everything down. AFAIK *all* machine check exceptions result in an OS crash (BSOD). OTOH most DRAM errors will probably result in an *application* crash or silent data corruption (neither of which are represented in the data used for this study).
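To make the distinction concrete, here is a toy sketch (my own invention, not the study's methodology) of bucketing crash reports by the hardware component their bugcheck implicates. The stop codes are real Windows codes; the sample data and the mapping are hypothetical for illustration.

```python
# Hypothetical sketch: bucketing Windows crash reports by the component
# their bugcheck (stop) code implicates. An MCE always surfaces as a
# BSOD (0x124), while DRAM errors only sometimes take down the OS.
from collections import Counter

BUGCHECK_HINTS = {
    0x124: "cpu_mce",  # WHEA_UNCORRECTABLE_ERROR: machine check exception
    0x01A: "memory",   # MEMORY_MANAGEMENT: often (not always) bad DRAM
    0x050: "memory",   # PAGE_FAULT_IN_NONPAGED_AREA
    0x07A: "disk",     # KERNEL_DATA_INPAGE_ERROR: failed read from disk
}

def tally(reports):
    """Count crash reports per suspected component; unrecognized codes
    (driver bugs, software faults, etc.) are lumped under 'other'."""
    return Counter(BUGCHECK_HINTS.get(code, "other") for code in reports)

crashes = [0x124, 0x01A, 0x124, 0x07A, 0x0EF, 0x050]  # invented sample
print(tally(crashes))  # e.g. cpu_mce: 2, memory: 2, disk: 1, other: 1
```

Note the asymmetry the paragraph above describes: every 0x124 in such data is a hardware event, but memory-related stop codes undercount DRAM faults, since many of them only kill an application or silently corrupt data.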

They're using OS crashes as a proxy for system instability. While I understand their motivation for doing so (the data is readily available via automated crash reports), I'd be much more interested in knowing how frequently user data is lost or corrupted. Unfortunately, collecting data to do this analysis would be impractical.

The years just pass like trains. I wave, but they don't slow down. -- Steven Wilson

Yes, the authors delved into the fact that their methodology was useless for detecting soft errors in consumer-level non-ECC RAM. That leaves us with a bit of mystery about how poor the memory we use really is. The rest of the document details that consumer-level equipment does not stand up to the tolerances of server-level equipment. That may not be shocking to some people, but I firmly believe there is a group of enthusiasts and IT pros who believe the extra cost of, say, a Xeon versus an i7 is just raw profit. This document (and the other cited studies) shows that you do get greater stability for your money.

"Welcome back my friends to the show that never ends. We're so glad you could attend. Come inside! Come inside!"

And as I've noted on these forums (repeatedly), the question of RAM stability/reliability is why I prefer to use ECC RAM even for desktops. This, in turn, is one of the reasons I remain in the AMD camp and buy Asus motherboards almost exclusively. An inexpensive Asus motherboard plus an Athlon II, Phenom II, or FX CPU will get you an ECC capable platform for a fraction of the cost of an equivalent Intel-based solution (since Intel forces you to upgrade to a workstation/server mobo and Xeon CPU if you want ECC support).

The years just pass like trains. I wave, but they don't slow down. -- Steven Wilson

Ryu Connor wrote:Yes, the authors delved into the fact that their methodology was useless for detecting soft errors in consumer-level non-ECC RAM. That leaves us with a bit of mystery about how poor the memory we use really is. The rest of the document details that consumer-level equipment does not stand up to the tolerances of server-level equipment. That may not be shocking to some people, but I firmly believe there is a group of enthusiasts and IT pros who believe the extra cost of, say, a Xeon versus an i7 is just raw profit. This document (and the other cited studies) shows that you do get greater stability for your money.

Except that the conditions consumer PCs run in are much more variable: they are less likely to be on battery backup, less likely to sit in dust-free environments (leading to overheating issues), and their disks are more likely to be exposed to g-shock hazards. And then there is the case of secondary PC components (PSUs, etc.) potentially being less reliable, on average, than what most server racks use. In the end, this simply does not provide enough data to conclude that the extra cost of a Xeon over an i7 is not in fact "all profit".

What stands out in opposition to your viewpoint is that laptops are more stable than desktops. Laptops have to endure the same poor conditions as desktops, and yet their specialized consumer parts handle them better. It is no silver bullet for the question, but it does further support the idea that the design and market aims of the components matter.

A more curious question is why OEMs have better stability than white boxes despite enduring similar conditions and using similar parts.

"Welcome back my friends to the show that never ends. We're so glad you could attend. Come inside! Come inside!"

What are the authors defining "white box" as? If the "white box" population consists mostly of enthusiast DIY systems, then it is no surprise that OEM systems are more stable in the samples: the overwhelming majority of overclocked systems (almost 99%) are in the "enthusiast" ring. Overclocking is known to reduce long-term stability in exchange for more performance. OEM systems in the last 10-15 years have been extremely difficult to overclock, since manufacturers remove the options for it at the software level, and the OEM crowd has little or no interest in overclocking, if they even know how to do it in the first place.

I'm willing to bet that once you remove overclocked systems from the samples, the differences between OEM and DIY are going to be marginal at best. Both suffer from el cheapo, bargain-basement components trying to work in tandem without blowing up in your face. Both also have a minority of users (prosumers) who are willing to spend the extra $$$$ and time to make sure they get quality components that have been thoroughly tested to work without incident.

Memory issues are still the overwhelming cause of instability in a modern system. Memory doesn't like running beyond spec or enduring high temperatures for long periods of time. The only problem with the sampling is that it fails to factor in the motherboard and memory controller as possible problem spots. From my own personal experience, I have dealt with memory and motherboard combinations that refuse to work at all at certain memory dividers (e.g. 1:1, 5:6) that are still within "spec", but work "flawlessly" with other ratios (2:3).
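For readers unfamiliar with dividers: the memory clock is derived from the base/FSB clock through a ratio, so the same nominally in-spec memory frequency can be reached through different ratios. A toy sketch (my own numbers; the ratios are the ones mentioned above, read as FSB:DRAM):

```python
# Toy illustration (invented numbers): how an FSB:DRAM divider ratio
# maps a base clock to a memory clock.
from fractions import Fraction

def mem_clock(base_mhz, fsb_dram_ratio):
    """Memory clock for a given base clock and FSB:DRAM ratio.
    A 5:6 ratio runs memory faster than the base clock."""
    return base_mhz / fsb_dram_ratio

base = 200  # MHz, a typical DDR-era base clock
print(mem_clock(base, Fraction(1, 1)))  # 200 MHz
print(mem_clock(base, Fraction(5, 6)))  # 240 MHz
print(mem_clock(base, Fraction(2, 3)))  # 300 MHz
```

Which is why a board can be unstable at one ratio and fine at another while the resulting DRAM frequency stays within the module's rating: the divider changes the clock relationships, not just the final number.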

I'm curious to see whether relaxing timings has any effect on long-term memory reliability.

Only 2% of the sample was overclocked, with the caveat that only 477,464 machines within the sample could have their rated clock speed identified.

Overclocked is defined as running more than 5% outside of rated speed.

Study wrote:We have divided the analysis between two CPU vendors, labeled “Vendor A” and “Vendor B.” The table shows that CPUs from Vendor A are nearly 20x as likely to crash a machine during the 8 month observation period when they are overclocked, and CPUs from Vendor B are over 4x as likely. After a failure occurs, all machines, irrespective of CPU vendor or overclocking, are significantly more likely to crash from additional machine check exceptions.

The data also implies a substantial difference between AMD and Intel in the manufacturing quality of their chips. Which vendor is which in this study makes for an interesting guessing game.

It also implies that overclocking will sooner rather than later bite you in the ass.

As for OEM vs. white box:

Study wrote:We identify a machine as brand name if it comes from one of the top 20 OEM computer manufacturers as measured by worldwide sales volume. To avoid conflation with other factors, we remove overclocked machines and laptops from our analysis.

So overclocking did not taint the result that OEM is more stable than white box. Anything not from one of the top 20 OEMs is a white box, so DIY boxes do fall into the white-box category.

Edit:

As one answer to my own musings: most OEMs slightly underclock their machines.

Study wrote:Therefore, we further partitioned the non-overclocked machines into underclocked machines, which run below their rated frequency (65% of machines), and rated machines, which run at or no more than 0.5% above their rated frequency (32% of machines). As shown in Figure 5, underclocked machines are between 39% and 80% less likely to crash during the 8 month observation period than machines with CPUs running at their rated frequency.

A small change can have a rather large payback in stability.
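The frequency buckets quoted above can be sketched as a small classifier (my own code; the thresholds come from the excerpts, and since the quotes don't say how machines between 0.5% and 5% above rated were treated, that gap is left unclassified here):

```python
# Sketch of the study's frequency buckets, per the quoted excerpts:
# below rated -> underclocked; at or up to 0.5% above -> rated;
# more than 5% above -> overclocked. The 0.5%-5% gap is not described
# in the excerpts, so it falls through to "unclassified".
def classify(measured_mhz, rated_mhz):
    ratio = measured_mhz / rated_mhz
    if ratio < 1.0:
        return "underclocked"
    if ratio <= 1.005:
        return "rated"
    if ratio > 1.05:
        return "overclocked"
    return "unclassified"

print(classify(2992, 3066))  # underclocked: slight OEM-style underclock
print(classify(3600, 3200))  # overclocked: 12.5% above rated
```

The striking part of the quoted result is how narrow the "rated" band is: even machines sitting just below their rated frequency crashed 39-80% less often than machines inside that half-percent band.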

"Welcome back my friends to the show that never ends. We're so glad you could attend. Come inside! Come inside!"

Ryu Connor wrote:OEM is more stable than white box. An underclock of as little as 0.5% has a huge impact on stability. Overclocking has a substantial likelihood of failure.

I think that OEMs do want more stable systems (lower support costs), while consumers want more performance for the buck. Therefore OEMs are more likely to use a decent case with good airflow and, especially, to ensure that the PSU is of sufficient quality for the rated load, instead of going for the most expensive high-end gfx card. These two parameters are often neglected.

I am also a fan of ECC memory, although it isn't strictly necessary for a gaming-only system. My AMD system with ECC RAM and a Seasonic 80+ Gold 750W PSU never crashes. Think 100% load, 24/7, for days at a stretch. My point is, it can be done. It's just that enthusiasts don't care if they have to reset once in a while; they'd rather have 10% more performance.