In my small home office, I have a dedicated machine for databases. It's ~4 years old, but works well for my needs. It has two VelociRaptors for the temp and log files, plus two Caviar Blacks.

Soon after building it, I wished that I had used ECC memory instead. I'm considering upgrading with two of these unbuffered Kingston kits. Is this a good idea? Will I gain enough added protection to make this worth it?

$100 seems very reasonable for this. (By the way, this itch is partly the fault of JBI's recent thread.)

The memory controller is what determines ECC support, rather than the CPU in this case (since the memory controller isn't on the CPU), so you should be OK to do that without a CPU swap.

As to what you gain, I dunno. It seems to me that ECC isn't worth the cost simply because if you're getting data corruption errors due to memory, you just have bad memory in the first place and replacing it with cheaper unbuffered non-ECC should fix it just as well. Someone may correct me here, though.

I do not understand what I do. For what I want to do, I do not do. But what I hate, I do.

The first C in ECC stands for Correcting, so ECC memory will fix single-bit errors that non-ECC memory won't, and will at least detect (but not correct) double-bit errors. I use ECC in my desktops, and while it's probably true that high-quality non-ECC should be alright, if you're using the system for something serious like a database, I'd say go for the ECC.
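For anyone curious how "correct one bit, detect two" can work, here is a toy sketch using an extended Hamming code on 4 data bits. This is only an illustration: real ECC DIMMs protect 64 data bits with 8 check bits, and the function names `encode` and `decode` are made up for this example.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming codeword (toy SECDED)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8                      # index 0: overall parity, 1..7: Hamming(7,4)
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]   # check bit covering positions 1,3,5,7
    code[2] = code[3] ^ code[6] ^ code[7]   # check bit covering positions 2,3,6,7
    code[4] = code[5] ^ code[6] ^ code[7]   # check bit covering positions 4,5,6,7
    code[0] = sum(code[1:]) % 2             # makes the total number of 1s even
    return code


def decode(code):
    """Return (nibble, status); nibble is None if the error is uncorrectable."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos             # XOR of the positions holding a 1
    overall_bad = sum(code) % 2 == 1
    if syndrome == 0 and not overall_bad:
        status = "ok"
    elif overall_bad:
        code[syndrome] ^= 1             # single flip: the syndrome is its position
        status = "corrected"
    else:
        return None, "uncorrectable"    # two flips: detected, but cannot be located
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return nibble, status


word = encode(0b1011)
word[5] ^= 1                            # one flipped bit
print(decode(word))
word[2] ^= 1                            # a second flipped bit
print(decode(word))
```

The first call recovers the original value; the second reports an error it cannot fix, which is the same behavior described above for multi-bit errors.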

Evan wrote:if you're using the system for something serious like a database, I'd say go for the ECC.

Cool, I'll place the order. I'm glad to hear that this specific setup will have real protection. I was concerned that there was some catch and the full benefits wouldn't be realized on this system (because it's essentially an enthusiast machine with a quasi-server single-socket motherboard, and doesn't have true server parts throughout).

derFunkenstein wrote:The memory controller is what determines ECC support, rather than the CPU in this case (since the memory controller isn't on the CPU), so you should be OK to do that without a CPU swap.

Memory is cheap enough now that I'd go for it, *especially* if you've got another system that can make use of the non-ECC stuff.

derFunkenstein wrote:As to what you gain, I dunno. It seems to me that ECC isn't worth the cost simply because if you're getting data corruption errors due to memory, you just have bad memory in the first place and replacing it with cheaper unbuffered non-ECC should fix it just as well.

The point of ECC isn't really to compensate for defective memory (though it can do that too if the errors are all single-bit). The main reason for having ECC is to correct for random "flipped bits" which are caused by cosmic rays (or other background radiation) hitting the memory cells. The rate of these errors actually increases (quite a bit...) with altitude because there's less atmosphere (which acts as a shield) between your RAM chips and outer space. So if you are in (say) Denver, ECC is more important than if you live in Chicago!

This old FAQ from Corsair makes some claims regarding the average statistics of a soft error:

The RAM Guy wrote:Q: Come on... cosmic rays? Really, how often does this occur?

A: The Ram Guy consulted with some experts on this one. Basically, it's a statistics problem. But, when you do the math, a soft error is likely to occur in a system with 256 Mbytes of memory about every 750 hours! And, the more memory you have, the more frequently soft errors will occur.

Those numbers seem a tad high. One would think my 24GB system would be falling over rapidly with values as reported.
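To put a number on that, here is the FAQ's figure scaled up to 24GB. The only assumption is that the error rate grows linearly with capacity, which is what the FAQ's "the more memory you have" remark suggests; the function name is made up for illustration.

```python
HOURS_PER_ERROR_PER_256MB = 750  # figure quoted from the Corsair FAQ


def hours_between_soft_errors(memory_mb):
    """Expected hours between soft errors, assuming the rate scales linearly with capacity."""
    return HOURS_PER_ERROR_PER_256MB * 256 / memory_mb


hours = hours_between_soft_errors(24 * 1024)
print(f"24 GB: roughly one soft error every {hours:.1f} hours")  # 7.8 hours
```

So if the old figure still held, a 24GB box would see about three flipped bits a day somewhere in RAM.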

"Welcome back my friends to the show that never ends. We're so glad you could attend. Come inside! Come inside!"

Ryu Connor wrote:This old FAQ from Corsair makes some claims regarding the average statistics of a soft error:

The RAM Guy wrote:Q: Come on... cosmic rays? Really, how often does this occur?

A: The Ram Guy consulted with some experts on this one. Basically, it's a statistics problem. But, when you do the math, a soft error is likely to occur in a system with 256 Mbytes of memory about every 750 hours! And, the more memory you have, the more frequently soft errors will occur.

Those numbers seem a tad high. One would think my 24GB system would be falling over rapidly with values as reported.

That's because most of those bit flips go unnoticed until one of them messes up something critical. This is unacceptable for CAD, engineering, medical and scientific work, where a single bit flip could ruin a critical calculation.

For OP, you need to get an X38/X48 board in order to get ECC support for your Core 2 Quad. FYI, not all X38/X48 boards have ECC support, so research the motherboard features carefully. Asus seems to be a relatively safe bet, as they tend to offer ECC support on platforms that allow it.

Ryu Connor wrote:This old FAQ from Corsair makes some claims regarding the average statistics of a soft error:

The RAM Guy wrote:Q: Come on... cosmic rays? Really, how often does this occur?

A: The Ram Guy consulted with some experts on this one. Basically, it's a statistics problem. But, when you do the math, a soft error is likely to occur in a system with 256 Mbytes of memory about every 750 hours! And, the more memory you have, the more frequently soft errors will occur.

Those numbers seem a tad high. One would think my 24GB system would be falling over rapidly with values as reported.

Those numbers are consistent with what I saw first-hand a number of years ago when I worked at Fermilab, and we were looking at DRAM error rates pretty seriously.

Yes, DRAM tech has improved quite a bit over the years, which tends to drive error rates down. On the other hand, as densities go up, the amount of electrical charge required to distinguish a '0' from a '1' gets smaller, making a cosmic ray "hit" more likely to change the charge in a cell enough to flip a bit. I honestly don't know how much effect (or in which direction) these competing processes have had on real-world error rates.

And no, I wouldn't expect your 24GB system to fall over on a regular basis even if the rates are still roughly 1 error per 750 hours per 256MB. Here's why: In order for the system to "fall over" (i.e. BSOD or application crash), you need to either A) corrupt a data address that contains a pointer or array index, in a way which eventually causes an invalid memory access that is caught by the hardware; or B) corrupt a code address (program instruction), in a way which results in either jumping to an invalid code address, or (indirectly) causes an invalid memory access along the lines of mechanism A. At any given time, only a small percentage of your RAM contents will be vulnerable to these types of errors.

Flipped bits in the rest of your RAM will either be C) completely benign (corrupted data will be overwritten by something else before it is actually used for anything); D) mostly benign (result in a fleeting glitch in audio or video playback, or a one-time application misbehavior that goes away if you attempt the same operation again); or E) potentially harmful but not easily detectable (silently flipped bits here or there in your data).

A and B give people who worry about system uptime nightmares; E is the bugaboo for people who worry about data integrity.

The years just pass like trains. I wave, but they don't slow down.-- Steven Wilson

Captain Ned wrote:How deep are the tunnels at FermiLab, and how deep do they have to be to block cosmic rays (mainly highly-energetic protons)?

The main accelerator ring tunnels are not very deep (just barely below the surface). The tunnels/halls for the neutrino experiments are deeper (few hundred feet down). The stuff I was dealing with was not underground though; it was part of a supercomputer system used for QCD simulations. We started seriously looking into error rates because we had error *detection* (single-bit parity) but no error *correction*, and soft errors were causing simulation runs to crash out at an unacceptable rate (multiple times/day).
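As a rough illustration of why single-bit parity gives detection but not correction, here is a small sketch. The 8-bit word and the helper names are made up for the example and are not taken from the actual hardware.

```python
def even_parity_bit(bits):
    """Parity bit that makes the total number of 1s (data plus parity) even."""
    return sum(bits) % 2


def parity_ok(bits, parity):
    """True if the stored parity bit still matches the data bits."""
    return even_parity_bit(bits) == parity


word = [1, 0, 1, 1, 0, 0, 1, 0]
stored_parity = even_parity_bit(word)

one_flip = list(word)
one_flip[3] ^= 1          # a single soft error

two_flips = list(one_flip)
two_flips[6] ^= 1         # a second soft error in the same word

print(parity_ok(word, stored_parity))       # True: clean word passes
print(parity_ok(one_flip, stored_parity))   # False: error detected, location unknown
print(parity_ok(two_flips, stored_parity))  # True: double error slips through
```

A failed check only says that some bit in the word is wrong, not which one, so the only safe reaction is to stop the run.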

Somewhat surprisingly, most data on soft error rates (circa early 1990s) was classified, and since our project was not classified we were not allowed to see that data. So we did our own tests and analysis. We actually looked at leakage from the accelerator as a possible source (and ruled it out). Ultimately we concluded that the errors were being caused by cosmic rays, and low-level radioisotope contaminants (in the packaging of the RAM chips themselves and the solder used to assemble the circuit boards).

Since retrofitting ECC to the system was not feasible, we adopted a checkpointing system where the state of the simulation was periodically dumped to a RAID-0 array, allowing the simulation to be reloaded from the checkpoint and restarted when memory errors were detected.
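The checkpoint-and-restart idea can be sketched in a few lines. This is a minimal illustration only, not the code we ran: the state layout, the `MemoryErrorDetected` exception, and the step function are all hypothetical stand-ins.

```python
import os
import pickle
import tempfile


class MemoryErrorDetected(Exception):
    """Hypothetical stand-in for a parity error reported by the hardware."""


def save_checkpoint(path, state):
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, path)          # swap in the new file only once it is fully written


def load_checkpoint(path):
    with open(path, "rb") as f:
        return pickle.load(f)


def run(total_steps, checkpoint_every, path, step_fn):
    state = {"step": 0, "value": 0}
    save_checkpoint(path, state)
    restarts = 0
    while state["step"] < total_steps:
        try:
            state = step_fn(state)
        except MemoryErrorDetected:
            state = load_checkpoint(path)   # throw away suspect state, roll back
            restarts += 1
            continue
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state, restarts


def make_flaky_step(fail_at_step):
    """Step function that simulates one detected memory error."""
    already_failed = []

    def step(state):
        if state["step"] == fail_at_step and not already_failed:
            already_failed.append(True)
            raise MemoryErrorDetected
        return {"step": state["step"] + 1, "value": state["value"] + state["step"]}

    return step


with tempfile.TemporaryDirectory() as workdir:
    final_state, restarts = run(10, 5, os.path.join(workdir, "sim.ckpt"), make_flaky_step(7))
print(final_state, restarts)
```

The trade-off is the usual one: frequent checkpoints cost I/O time, while infrequent ones mean more work is redone after each error.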

The electronics industry has subsequently gotten better at controlling radioisotopes in chip packages and solder, leaving cosmic rays as the dominant source of soft errors.

Edit: I don't know how deep you need to be to block cosmic rays.


wibeasley wrote:There's no guarantee that my research is flawless, but the motherboard manual (and the page I linked to in the OP) claim that ECC is supported. Are you seeing something I'm missing?

The 3200 northbridge supports ECC. I would also be rather surprised if any motherboard marketed as a "server" product (other than ultra-low power Atom-based stuff, but that's another story...) lacks ECC support. It is a feature that enterprise server customers demand, and it would be silly for Asus to leave it out.

Many *desktop* boards indicate that they can run with ECC DIMMs, but they are actually just ignoring the extra parity bits, and not using the ECC capabilities. I think your P5BV-C should be fine though.

I did run across the following interesting tidbit, however: https://bugzilla.redhat.com/show_bug.cgi?id=564274 (The TLDR version: Some Asus server boards have a BIOS bug that results in the ECC controller not getting initialized properly if you enable "Quick Boot" in the BIOS; so if you want to use ECC RAM, make sure that feature is turned off!)


Yes, AMD's continued support of ECC across most of their product line (unlike Intel, who apparently sees it as a way of segmenting their product into "desktop" and "server" classes) is one of the reasons I have tended to stick with AMD over the years. This seems to be changing with the Fusion APUs, however; I don't think the FM1 socket has pins reserved for the ECC bits.


The ECC capability is recognized in CPU-Z. But Memtest86+ 4.20 reports, "Chipset: Intel 320/3210 (ECC : Disabled)". Does this mean that it's disabled for testing purposes? Or that it's not working (and won't be when I boot into the real OS)?

I've Googled for 30 minutes and don't see anything convincing. I swapped the CPUs so it now has the Xeon 3360, but that didn't change anything.

Regardless of Memtest, what's the best way to verify that the chipset is taking advantage of the ECC capability?

I'm pretty sure the Memtest86+ display is showing the ECC configuration coming in (i.e., it appears to not be working).

Per my post from about a week ago, did you disable "quick boot" in the BIOS? Also check the BIOS for any other settings related to the memory controller.

I'm not sure how to tell whether ECC is enabled in Windows. In Linux you can search the system logs for lines containing the string "EDAC". I suppose you could try booting a Linux live CD to check it...

...which actually told me something I didn't realize until just now: the version of Linux I'm running on this system apparently doesn't know how to log ECC errors if you're using DDR3 RAM! Oh well, at least ECC is enabled...


That was with Quick Boot enabled and disabled (as well as Memory Remap en/disabled). Sorry JBI- after the effort you spent finding and summarizing that long bugtracker, I shouldn't have forgotten to address it (and thank you) in the last post.

And thanks for the example grep code. I got confused interpreting pages of output from the dmidecode command. The grep code says:

After re-enabling Quick Boot, I went back to run JBI's grep code. The output is unchanged. This was with an Ubuntu 11.04 Live USB key last night, and 11.10 this morning, using the latest BIOS (released in 2008). Memtest ran 6 hours last night without an error, albeit with ECC disabled.

Is it time to mess with Asus support?

Edit: Corrected "EDAC MCO" to "EDAC MC0".

Last edited by wibeasley on Mon Nov 28, 2011 2:19 pm, edited 1 time in total.

You can search for the relevant section of the file by typing "/EDAC" (without the quotes) and hitting Enter; arrow keys scroll up/down through the file. The fact that your log output is very different from mine isn't conclusive since you're on an Intel platform and I'm on AMD.

I *think* the fact that you were getting the errors when Quick Boot was enabled implies that ECC is turned on, but I am not sure.


One other thought: On the systems I have access to, there does appear to be a 100% correlation between the presence of the "EDAC MC0: Giving out device to ..." line, and ECC being enabled. Systems running the same kernel that don't have ECC RAM installed have a different line, indicating that ECC is disabled.

So maybe you're good.


wibeasley wrote:I hope so, but I'm skeptical. Did you omit that line for the first post this morning?

D'oh, my bad; that'll teach me to over-generalize. That system is the one exception, due to the version of EDAC bundled with Ubuntu 10.04 not supporting AMD's DDR3 controller. But in that case it clearly indicates that the BIOS has enabled ECC...

On the DDR2 systems I've looked at, there does indeed appear to be 100% correlation. Furthermore, on systems where the memory controller does not support ECC, there's nothing in the log about EDAC at all; and if the memory controller supports it but ECC RAM is not installed/enabled, there's a line that specifically indicates that it is disabled.
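That rule of thumb can be written down as a small helper. This is only a sketch of the heuristic described in this thread, not an official EDAC interface: the function name and return labels are made up, and the only log text it matches is the "EDAC MC0: Giving out device to" phrase quoted above. Anything else is left for a human to read.

```python
def classify_edac(log_text):
    """Apply the thread's rule of thumb to kernel log text."""
    edac_lines = [line for line in log_text.splitlines() if "EDAC" in line]
    if not edac_lines:
        # No EDAC messages at all: the observation above was that this goes
        # with a memory controller that has no ECC support.
        return "no-edac-messages"
    if any("EDAC MC0: Giving out device to" in line for line in edac_lines):
        # This line correlated with ECC being enabled on the DDR2 systems checked.
        return "ecc-probably-enabled"
    # EDAC is mentioned but not with the expected line; read those lines by hand.
    return "inconclusive"


# Made-up sample inputs for illustration; real log lines carry more detail.
print(classify_edac("usb 1-1: new high-speed USB device"))
print(classify_edac("EDAC MC0: Giving out device to ..."))
print(classify_edac("EDAC MC: some other message"))
```

On a live system you could feed it the output of `dmesg`, for example via `subprocess.run(["dmesg"], capture_output=True, text=True).stdout`.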
