Posted
by
kdawson
on Tuesday October 06, 2009 @01:57PM
from the forget-me-not dept.

An anonymous reader writes "A Google study of DRAM errors in their data centers found that they are hundreds to thousands of times more common than has been previously believed. Hard errors may be the most common failure type. The DIMMs themselves appear to be of good quality, and bad mobo design may be the biggest problem." Here is the study (PDF), which Google engineers published with a researcher from the University of Toronto.

Add to that the fact that Google (apparently) tends to run their data centers "hot" compared to what is commonly accepted, and use significantly cheaper components, and you've got a good explanation for why their error count is as high as it is.

Yeah, but let's look at the more common situation - a home. Variable temperatures, most likely QUITE variable power quality, a low-quality PSU, and almost certainly no UPS to make up for it. Add that to low-quality commodity components (mobo & RAM).

I'd not be surprised to find the problem much more prevalent in non-datacenter environments.

Switching to high-quality memory, PSU & UPS has made my systems unbelievably reliable the last several years. YMMV, but I doubt by much.

No, I don't believe so. They use server boards, custom made to their specs.

I suppose it depends on how you define "server board". Room for tons of ECC RAM and two CPUs is server or serious-workstation class (or maybe I-just-use-Notepad-and-my-sales-guy-is-on-commission class), but I think once you're on to custom boards that only use certain voltages of electricity, you've moved into a class by yourself.

And, I'm pretty sure that those specs include ECC memory - that is the standard for servers, after all.

I find that more often than not, when people get blue screens or frequent crashes, it's due to a bad RAM chip. I think it's kind of a bad thing that most motherboards don't really test the RAM when you boot up. Usually running the full RAM test will pick up on most memory errors; you don't even need to run memtest. Sure, you save a few seconds on boot-up, but it's often better to know there is a problem with your memory than to go on for months thinking there is some other problem.
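The idea behind that full RAM test (and behind tools like memtest) is simple: write a known pattern to every location, read it back, and flag mismatches. Here's a minimal sketch of a walking-ones pattern test in Python; note it only exercises a Python bytearray for illustration, since a user-space script can't reliably target physical DIMMs the way firmware or memtest can, and the function name is my own invention:

```python
# Illustrative walking-ones memory test: write each single-bit pattern
# to every byte, then read it all back and record any mismatch.
def walking_ones_test(buf: bytearray) -> list[int]:
    """Return the offsets where a written pattern failed to read back."""
    bad = []
    for pattern in (0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80):
        for i in range(len(buf)):
            buf[i] = pattern        # write phase
        for i in range(len(buf)):
            if buf[i] != pattern:   # read-back phase
                bad.append(i)
    return bad

print(walking_ones_test(bytearray(4096)))  # healthy memory -> []
```

A real tester adds moving inversions, address-in-address patterns, and multiple passes to catch coupling faults, but the write/read/compare loop is the core of it.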

I tried, based on the abstract. Wound up with a figure of 8% of 2-gigabyte systems having 10 RAM failures per hour and the other 92% being just peachy. While a few bits going south is AFAIK the most common failure mode for RAM, some of those RAM sticks must be complete no-POST duds and some are errors-up-the-wazoo cases with massive swaths of RAM corrupted, so that throws my back-of-the-envelope math WAY off...
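For what it's worth, the parent's rough math can be redone in a few lines. The input numbers below are the parent's reading of the abstract (not figures taken from the paper itself), and the fleet size is an arbitrary assumption:

```python
# Back-of-the-envelope fleet error rate, using the parent's figures.
affected_fraction = 0.08   # parent's reading: 8% of 2 GB systems see errors
errors_per_hour = 10       # parent's reading: errors/hour on an affected box
fleet = 10_000             # hypothetical machine count, purely illustrative

fleet_errors_per_hour = fleet * affected_fraction * errors_per_hour
print(fleet_errors_per_hour)               # 8000.0 errors/hour fleet-wide
print(fleet_errors_per_hour * 24 * 365)    # scaled to errors per year
```

The averaging is exactly why the distribution matters: a handful of errors-up-the-wazoo sticks can dominate that mean while most machines see nothing.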

In other words, big numbers make Gronk head hurt. Gronk go make fire. Gronk go make boat. Gronk go make fire-in-a-boat. Gronk no happy with fire-in-a-boat. Boat no work, and fire no work, all at same time.

The fact that these DIMMs were "stressed" makes me wonder about the validity of the test. Heat stress, among other things, will multiply errors far beyond what you will see in normal service.

The problem with something like this is the assumption that Google world == real world.

This RAM is all running on custom Google boards that no one else has access to, with custom power supplies in custom cases in custom storage units. To the researchers' credit, they split things by platform later on, but that just means Google-custom-jobbie-1 and Google-custom-jobbie-2, not Intel board/Asus board/Gigabyte board. Without listing the platforms down to chipsets and CPU types (not gonna happen), it's hard to compare data and check methodology.

While Google is the only place you're going to find literal metric tons of RAM to play with, the common factor that it's all Google might be throwing the numbers off. At least some confirmation that these numbers hold at someone else's data center would be nice.

That window looked out onto a pile of coal, so the culprit was assumed to be low-level alpha radiation.

Alpha radiation is stopped by a sheet of office paper. It certainly wouldn't make it through the window, through the machine case, electromagnetic shield, circuit board, chip case, and into the silicon. Even beta radiation would be unlikely to make it that far.

What is much more likely: thermal effects, i.e., infrared from the sun heating up machines near the window.

My takeaway from this paper is that maybe Google should hire more technicians who are experienced with non-ECC RAM systems. They even believed, prior to this study, that soft errors were the most common error type. I could have told you from the start that was bunk. In over 15 years of burn-in tests as part of PC maintenance, the number of soft errors I have observed is... 0. Either the hardware can make it through the test with no errors, or there is a DIMM that will produce several errors over a 24-hour test. This doesn't mean that random soft errors never happen when I'm not looking/testing, but the 'conventional wisdom' that soft errors are the predominant memory error doesn't even pass the laugh test.

From looking at the numbers in this report, I get the feeling that hardware vendors are using ECC as an excuse to overlook flaws in flaky hardware. I would now be really interested in a study that compares the real-world reliability of ECC vs. non-ECC hardware that has been properly QC'd. I'll wager the results would be very interesting, even if ECC still proves itself worth the extra money.

Yeah, but let's look at the more common situation - a home. Variable temperatures, most likely QUITE variable power quality, a low-quality PSU, and almost certainly no UPS to make up for it. Add that to low-quality commodity components (mobo & RAM).

The vast majority of people have laptops now, which come with a built-in UPS.

Comparing ECC to mob protection is not a very good analogy. ECC lets you detect and in some cases fix memory errors. The key is the detection part.

If you get a single-bit error that results in corrupt data, you won't know about it unless you have ECC or verify that data some other way. Verifying data multiple times is computationally expensive and degrades performance, and most server OSes and software don't do it anyway.

Beyond detection, the fact that you know it was the memory that corrupted the data (rather than, say, an HDD read error or a malfunctioning CPU) is valuable. It's much better to be able to say "DIMM 3 is failing" than "there is a fault, let me spend time and effort figuring out where it is." Of course it isn't always that easy, but it's still better than non-ECC.
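The detect-and-correct idea is easy to see in miniature. Real ECC DIMMs apply a SECDED code to 64-bit words in hardware, but the classic Hamming(7,4) code below shows the same mechanism on 4 data bits; the function names and list-based encoding are my own illustrative choices, not anyone's actual implementation:

```python
# Hamming(7,4): 4 data bits protected by 3 parity bits. A nonzero
# syndrome both detects a single-bit error and names its position.
# Positions are 1-based; parity bits sit at positions 1, 2 and 4.
def encode(d: list[int]) -> list[int]:
    c = [0, 0, d[0], 0, d[1], d[2], d[3]]   # index 0 = position 1
    c[0] = c[2] ^ c[4] ^ c[6]               # parity over positions 3,5,7
    c[1] = c[2] ^ c[5] ^ c[6]               # parity over positions 3,6,7
    c[3] = c[4] ^ c[5] ^ c[6]               # parity over positions 5,6,7
    return c

def decode(c: list[int]) -> tuple[list[int], int]:
    """Return (data bits, 1-based position of corrected bit, or 0)."""
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4         # 0 means no error seen
    if syndrome:
        c[syndrome - 1] ^= 1                # flip the bad bit back
    return [c[2], c[4], c[5], c[6]], syndrome

word = encode([1, 0, 1, 1])
word[4] ^= 1                                # simulate a single-bit flip
data, pos = decode(word)
print(data, pos)                            # [1, 0, 1, 1] 5
```

That second return value is the point the parent is making: the code doesn't just hand back clean data, it tells you *which bit* went bad, which is what lets a server log "correctable error on DIMM 3" instead of failing silently.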