> Date: Mon, 8 Feb 2016 17:57:48 +0100
> From: Arno Wagner <arno at wagner.name>
> From my experience shoveling a few hundred TBs of research data
> around when 200GB disks were standard, the only undetected errors
> I ever found were due to memory corruption due to a weak RAM bit
> in one server that did not have ECC memory. Those amounted to
> 3 errors in 30TBs of recorded data. I never had undetected read
> errors from disk (and since all data was bzip2 compressed,
> errors would have been found), so I tend to view these as not
> a disk problem, but likely happening someplace after the data
> leaves the disk.
I can confirm scenarios like this. Some years ago, I moved a couple of
TB from one machine to another and was paranoid enough to individually
checksum the files, and discovered a few that weren't right. Since
both the source and destination disks were LUKS, I could immediately
rule out large swaths of the disk subsystems: a corrupted sector would
have led to entire blocks of garbage after decryption due to the
operation of the cipher, whereas the errors I found were a handful of
incorrect bits in each case.
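For anyone wanting to do the same sort of verification, a minimal
sketch of the idea (paths, script name and hash choice are
illustrative, not what I actually ran at the time) could look like
this:

    #!/usr/bin/env python3
    # Hash every file under two trees and report any path whose
    # contents differ (or which exists on only one side).
    import hashlib, os, sys

    def sha256_of(path, bufsize=1 << 20):
        h = hashlib.sha256()
        with open(path, 'rb') as f:
            for chunk in iter(lambda: f.read(bufsize), b''):
                h.update(chunk)
        return h.hexdigest()

    def walk_hashes(root):
        hashes = {}
        for dirpath, _, filenames in os.walk(root):
            for name in filenames:
                full = os.path.join(dirpath, name)
                hashes[os.path.relpath(full, root)] = sha256_of(full)
        return hashes

    if __name__ == '__main__':
        src, dst = sys.argv[1], sys.argv[2]
        a, b = walk_hashes(src), walk_hashes(dst)
        for rel in sorted(set(a) | set(b)):
            if a.get(rel) != b.get(rel):
                print('MISMATCH', rel)

The important point is simply hashing on both ends after the copy, so
that corruption introduced anywhere along the path shows up.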
I narrowed this down reasonably quickly to the source machine getting
rare and data-dependent read errors from its RAM, but -only- when the
machine had automatically reduced its CPU clock rate because it was
unloaded. If I nailed the CPU at 100% in some other process, the RAM
errors went away, as they did if I simply disabled CPU clock rate
adjustment altogether.
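If anyone wants to hunt for that kind of failure, a crude sketch along
these lines can sometimes show it while the machine sits idle at its
low clock (buffer size, patterns, delays and iteration count are
arbitrary choices for illustration; this is nowhere near a real
memtest):

    #!/usr/bin/env python3
    # Crude RAM pattern check: fill a large buffer, let the machine
    # idle so the CPU can drop to its low clock, then see whether any
    # bytes read back differently from what was written.
    import time

    SIZE = 512 * 1024 * 1024              # 512 MiB test buffer
    PATTERNS = [b'\x00', b'\xff', b'\xaa', b'\x55']

    for i in range(100):
        for p in PATTERNS:
            buf = bytearray(p) * SIZE     # write the pattern
            time.sleep(5)                 # give the clock time to scale down
            bad = SIZE - buf.count(p[0])  # re-read and count mismatches
            if bad:
                print(f'pass {i}: {bad} bad byte(s) with pattern {p.hex()}')

In my case the giveaway was exactly that: the mismatches only showed
up when the box was otherwise idle.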