
That's quite a paradox there. In order to see the problem if there is one, you need ECC RAM. But if you don't have ECC RAM, there's no way to see it.

Normally RAM errors are quite obvious. The vast majority of my RAM is used by programs, not data, so a single-bit error would most likely crash a program. And most RAM errors come from faulty sticks (or similar) anyway, and are therefore very obvious.

If you don't store any data on the PC itself and use it only as a browsing / video-streaming client, then you don't need ECC. In fact, using ECC will make the system slower due to the additional error-detection calculations.

But once you start storing data on it, ECC becomes important. This has been shown in a Google publication from 2009:

DRAM errors can cause nasty data corruption on disk, which can be catastrophic (if important filesystem structures are affected) or silent. An error rate of 8% per DIMM per year (for server memory, which usually sits on a well-designed mobo behind high-quality PSUs) is not negligible.

Except the article said that only 8% of dimms were affected, not 8% per dimm per year.

I have 8 GB and that's more than enough for complicated 1080p video processing, and if I'm only working with 720p material I can start up a virtual machine to do that in parallel. Or up to two larger virtual machines, each of them (including the host OS) running development environments, which are also known to take up quite some RAM. Anyway, I think 8 GB will be enough for most people, and even if you want to be future-proof, above 16 GB you'll most likely just be wasting money. Sure, there are use cases for 32 GB of RAM; I just doubt you fall into that category. TBH, if you've been using 2 GB until now and you're not planning to change your habits by a very large amount, you probably don't even need 16 GB.

Except the article said that only 8% of dimms were affected, not 8% per dimm per year.

This is what the article says (emphasis mine, extra backslash in original):

Originally Posted by http://research.google.com/pubs/pub35162.html

we observe DRAM error rates that are orders of magnitude higher than previously reported, with 25,000 to 70,000 errors per billion device hours per Mbit and more than 8\% of DIMMs affected by errors per year.

Do note that the study was published in 2009, and observed servers over a period of 2.5 years. So you are mostly talking about machines which were commissioned in 2006 or earlier. And the errors are clustered, with DIMMs observing one error being more likely to observe another error.
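Those quoted rates translate into surprisingly large mean numbers. A back-of-envelope sketch (the function name and the 8 GB figure are illustrative assumptions; also remember the study found errors heavily clustered on a minority of DIMMs, so a median machine sees far fewer than the mean suggests):

```python
# Mean error counts implied by the Google study's quoted rates:
# 25,000-70,000 errors per billion device hours per Mbit.
# These are averages across the fleet; errors cluster on few DIMMs.

HOURS_PER_YEAR = 24 * 365

def mean_errors_per_year(capacity_gb, rate_per_1e9_device_hours_per_mbit):
    mbits = capacity_gb * 1024 * 8          # GB -> Mbit
    errors_per_hour = rate_per_1e9_device_hours_per_mbit * mbits / 1e9
    return errors_per_hour * HOURS_PER_YEAR

for rate in (25_000, 70_000):
    print(f"8 GB at rate {rate}: ~{mean_errors_per_year(8, rate):.0f} errors/year")
```

The same function also shows the capacity scaling discussed below: doubling the number of bits doubles the expected error count, all else being equal.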

I've been looking at this issue, for either later this year or early next year. In my case, it's for a work-from-home machine. (There are other reasons for putting my own money into this, instead of my employer's, but that's a different issue.)

For the things I wish to do, basically silicon CAD, I need as much memory as I can get. These days 8G doesn't really cut it for me, any more. As others have said, if a bit flips in code, you'll likely crash. If a bit flips in data, it might later cause a crash due to becoming invalid - the real problem is if that data becomes "reasonable, but wrong." In that case there will be a lurking, potentially undetected error. I want ECC.

Given that it's my personal budget, that also means AMD. To get ECC on Intel you need a Xeon CPU, then a Xeon motherboard, then typically buffered ECC DIMMs; system cost more than doubles. With AMD, by careful motherboard selection you get ECC for "free". Then you just need to buy unbuffered ECC DIMMs, which are more expensive than ordinary DIMMs, but not as much as buffered ECC DIMMs. System cost goes up a few hundred or so.

One caveat... That "32G" motherboard will only take 24G once you're using ECC DIMMs. The 32G capacity appears to be right at the drive limit of the chipset/CPU. Once you add the extra ~11% of chips to support ECC you're over spec, and drop back to 24G.

I had a motherboard picked out a few months ago, basically a 990-series chipset, but I presume things have changed since. My CAD needs are primarily 2D, so while I want decent graphics performance, I'm not talking SLI/Crossfire and such. I had also been thinking of a Piledriver/Vishera, like you. But I've waited long enough now, I'm kind of wondering what Steamroller will bring. My current workload is primarily integer-intensive, which shouldn't suffer from AMD's poorer floating point, as compared to Intel. But Piledriver is supposed to improve floating point, and there's always future workload.

But, is it really worth it?
How common are bit-flip errors in modern RAM?
Is the performance drop noticeable?

1. I think it is, partly because I leave my computers running 24/7 and don't reboot for many months. If you power them off at night, or reboot frequently, it's less of an issue.

2. Bit-flip errors are quite common in "modern" RAM. I'm not sure how you define "modern", but the likelihood of experiencing a bit-flip error increases with the number of bits. For example, with 8 GB of RAM you are eight times more likely to experience an error than with only 1 GB. As typical RAM capacities continue to grow, so does the likelihood of experiencing an error. And that assumes a constant error rate; some believe the smaller memory cells in modern RAM are more error-prone than the larger cells of older fab processes.

3. There is no performance drop. Zero. Benchmark the same system back to back with ECC enabled and disabled and any observed difference will be within the margin of error. Any "performance drop" is purely theoretical and not measurable in the real world, not with any standard desktop or server applications.

Originally Posted by PreferLinux

I've got 12 GB of RAM, and never seen a problem.

Exactly, and that's why it's dangerous. You frequently won't "see" a problem with bit-flip errors. It's called silent corruption. Read the Wikipedia page on silent corruption. It happens to hard drives and it happens in RAM. Your data becomes corrupt and you don't notice until it's too late. Ever opened a JPG file you had saved, a photo from your camera for example, and found it has weird corruption, colored stripes, etc.? That's a bit-flip error. The bit flip may have occurred on your hard drive (very common) or it may have occurred in RAM (also common).
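A toy sketch of why such corruption is "silent": flip a single bit in a buffer and the result is still perfectly valid data, just wrong. (The buffer contents and the bit position are made up purely for illustration.)

```python
# Toy illustration of a silent single-bit flip: the corrupted buffer
# is still valid text, so nothing crashes -- the value is just wrong.
data = bytearray(b"account balance: 1024")

byte_index = 17    # hypothetical location of the flip: the digit '1'
bit = 3
data[byte_index] ^= (1 << bit)   # '1' (0x31) becomes '9' (0x39)

print(data.decode())   # decodes fine, but the number has changed
```

No exception, no crash: without a checksum or ECC, nothing in the system can tell that anything went wrong.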

The only way to completely eliminate the effects of a bit-flip error is to use ECC memory, and to use RAID disks for your storage.

About error rates:
In the Google study they found that the number of errors rose as memory chips aged. This somewhat offsets the capacity effect: newer modules typically have higher capacity (and thus more bits to flip), but they are also younger and therefore show fewer age-related errors.

About performance:
On my AMD FX-8350 I measured kernel compile times with ECC enabled and disabled, and found no difference apart from the normal variation.
Modern high-end GPUs since Nvidia Fermi and AMD Southern Islands use a different method of ECC which stores the check bits in "normal" memory rather than on extra memory chips and data lines. There you can measure a performance impact.

On data corruption:
RAID does not typically detect bit flip errors on hard disks, unless it employs some kind of data integrity check. Some expensive hardware RAID controllers do that, and ZFS RAID-Z does it too.
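A minimal sketch of the end-to-end checksum idea that ZFS-style integrity checking is built on (heavily simplified: real ZFS stores per-block checksums in parent metadata rather than alongside the data, uses Fletcher or SHA-256, and can self-heal from redundant copies; the function names here are invented for illustration):

```python
# Sketch of detecting silent disk corruption with a stored checksum.
import zlib

def write_block(data: bytes):
    """Store a checksum together with the data at write time."""
    return data, zlib.crc32(data)

def read_block(data: bytes, stored_crc: int) -> bytes:
    """Verify the checksum at read time; a mismatch means corruption."""
    if zlib.crc32(data) != stored_crc:
        raise IOError("checksum mismatch: silent corruption detected")
    return data

block, crc = write_block(b"important records")
corrupted = bytearray(block)
corrupted[0] ^= 0x01                 # simulate a single bit flip on disk
try:
    read_block(bytes(corrupted), crc)
except IOError as e:
    print(e)
```

Plain mirroring or parity RAID without such a check will happily return the flipped bits, because it only reconstructs data when a drive reports a read failure.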