Posted
by
timothy
on Friday December 04, 2009 @07:50AM
from the don'tcha-love-finding-corrupted-files dept.

storagedude points out this article about one of the perils of digital storage, the author of which "says massive digital archives are threatened by simple bit errors that can render whole files useless. The article notes that analog pictures and film can degrade and still be usable; why can't the same be true of digital files? The solution proposed by the author: two headers and error correction code (ECC) in every file."

If this type of thing is implemented at the file level, every application is going to have to do its own thing. That means too many implementations, most of which won't be very good or well tested. It also means application developers will be busy slogging through error correction data in their files rather than the data they actually wanted to persist. I think the article offers a number of good ideas, but it would be better to do most of them at the filesystem level and perhaps some at the storage layer.
Also, if we can present the same logical file to the application on read, even if every 9th byte on disk is parity, that is a plus, because it means legacy apps can get the enhanced protection as well.
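As a rough sketch of the idea (hypothetical function names, using simple XOR parity in the style of RAID): store a parity byte after every 8 data bytes on disk, and strip it back out on read so the application never knows it's there.

```python
def write_with_parity(data: bytes) -> bytes:
    """Append an XOR parity byte after every 8 data bytes."""
    out = bytearray()
    for i in range(0, len(data), 8):
        chunk = data[i:i + 8]
        parity = 0
        for b in chunk:
            parity ^= b
        out += chunk
        out.append(parity)
    return bytes(out)

def read_logical(stored: bytes) -> bytes:
    """Strip the parity bytes so the application sees the original file."""
    out = bytearray()
    for i in range(0, len(stored), 9):
        out += stored[i:i + 9][:-1]  # last byte of each group is parity
    return bytes(out)
```

As in RAID, XOR parity lets you rebuild one known-bad byte per group. Real filesystems use much stronger codes, but the point stands: the application sees the same logical file either way.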

"says massive digital archives are threatened by simple bit errors that can render whole files useless.
Isn't this what filesystem devs have been concentrating on for about 5 years now?

Not just 5 years. ZFS's checksum on every data block and RAID-Z (no RAID write hole) are innovative and obviously the next step in filesystem evolution. But attempts at redundancy aren't new. I'm surprised the article is discussing relatively low-tech, old-hat ideas such as two filesystem headers. Even DOS's FAT used this RAID1-type brute-force redundancy by keeping two copies of the FAT. The Commodore Amiga's filesystem did this better than Microsoft back in 1985 by having forward and backward links in every block, which made it possible to repair block pointer damage by searching for a reference to the bad block in the preceding and following block.
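The Amiga trick can be sketched in a few lines (a toy structure, not the actual on-disk format): if every block records its neighbours, a damaged forward pointer can be re-derived by searching for the block that points back.

```python
# toy block chain: each block records its predecessor and successor by index
blocks = [
    {"prev": None, "next": 1, "data": b"first"},
    {"prev": 0, "next": 2, "data": b"middle"},
    {"prev": 1, "next": None, "data": b"last"},
]

def repair_next_pointer(blocks, i):
    """If block i's 'next' link is damaged, find the block whose 'prev'
    link names i -- that block must be the true successor."""
    for j, blk in enumerate(blocks):
        if blk["prev"] == i:
            return j
    return None
```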
And I suppose if ZFS doesn't catch on, 25 or 30 years from now Apple or Microsoft will finally come up with it and say, "Hey look what we invented!"

Not all RAID implementations check for errors. Some cheaper hardware will duplicate data or make parity, but never check for corruption. Instead, they only use the duplicated data for recovery purposes. Nothing says fun like rebuilding a RAID drive only to find your parity data was corrupt, but you won't know that until you try to use your new drive and weird things happen.

Will the corruption affect your FS or will it affect your data? 8-ball says........

RAID is not backup or archive. If you have a RAID1 system with bit errors on one disk - you now have them on the other disk.

Using more complex RAID configs does not necessarily solve the problem.

With archiving, the problem becomes apparent after you pull out your media after 7 years of not using it. Is that parallel ATA hard drive you stored it on still good? Do you have a connection for it? What about those Zip and Jaz drives you used? Do you have a method of reading them? Are those DVDs and CDs still good?

File systems are in the software domain. If you aren't getting good data (what was written) off the drive, the file system ideally shouldn't be able to do any better than the hardware did with the data. Of course, in reality the hardware uses a fixed redundancy model that offers less reliability than some people would like. The danger of software-based solutions is that they allow hardware manufacturers to offer even less redundancy, or even NO redundancy at all, causing a need for even MORE software-based redundancy.

I agree that filesystem-level error correction is a good idea. Having the option to specify ECC options for a given file or folder would be great functionality to have. The idea presented in this article, however, is that certain compressed formats don't need ECC for the entire file. Instead, as long as the headers are intact, a few bits flipped here or there will result in only some distortion; not a big deal if it's just vacation photos/movies.

By only having ECC in the headers, you would save a good deal of storage space.

If this type of thing is implemented at the file level, every application is going to have to do its own thing. That means too many implementations, most of which won't be very good or well tested. It also means application developers will be busy slogging through error correction data in their files rather than the data they actually wanted to persist. I think the article offers a number of good ideas, but it would be better to do most of them at the filesystem level and perhaps some at the storage layer.

Also, if we can present the same logical file to the application on read, even if every 9th byte on disk is parity, that is a plus, because it means legacy apps can get the enhanced protection as well.

Precisely. This is what things like torrents, RAR files with recovery blocks, and filesystems like ZFS are for: so every app developer doesn't have to roll their own, badly.

Don't forget PAR2 [wikipedia.org]. I never burn a DVD without 10%-20% redundancy as par2 files. Even if the filesystem gets too damaged to read, I can usually dd the whole disk and let par2 recover the files.

A simple solution would be to add a file full of redundancy data alongside the original on the archival media. A simple application could be used to repair the file if it becomes damaged, or test it for damage before you go to use it, but the original format of the file remains unchanged, and your recovery system is file system agnostic.
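A minimal sketch of such a sidecar tool (hypothetical `.chk` format; this version only detects damage per chunk — actual repair would need parity data, as par2 provides):

```python
import hashlib
import json

CHUNK = 4096

def make_sidecar(path):
    """Write path + '.chk': a JSON list of per-chunk SHA-256 digests."""
    digests = []
    with open(path, "rb") as f:
        while chunk := f.read(CHUNK):
            digests.append(hashlib.sha256(chunk).hexdigest())
    with open(path + ".chk", "w") as f:
        json.dump(digests, f)

def damaged_chunks(path):
    """Return the indices of chunks whose digest no longer matches."""
    with open(path + ".chk") as f:
        digests = json.load(f)
    bad = []
    with open(path, "rb") as f:
        for i, want in enumerate(digests):
            if hashlib.sha256(f.read(CHUNK)).hexdigest() != want:
                bad.append(i)
    return bad
```

Because the original file is untouched and the sidecar is just another file, this works on any filesystem and any medium.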

How about an archive format, essentially a zip file, which contains additional headers for all of its contents? Something like a manifest would work well. It could easily be an XML file with the header information and other meta-data about the contents of the archive. This way you get a good compromise between having to store an entire filesystem and the overhead of putting this information in each file.

Add another layer by using a striped data format with parity and you have the ability to reconstruct any damaged portion.

Also, if we can present the same logical file to the application on read, even if every 9th byte on disk is parity, that is a plus, because it means legacy apps can get the enhanced protection as well.

Not to nitpick, but I'm going to nitpick...

Parity bits don't allow you to correct an error, only check that an error is present. This is useful for transfer protocols, where the system knows to ask for the file again, but the final result on an archived file is the same: file corrupt. With parity, though, you have an extra bit that can be corrupted.

In other words, parity is an error check system; what we need is ECC (error check and correct). A single parity bit is a 1-bit check, 0-bit correct. We need codes with enough redundant bits to locate a flipped bit, not merely notice one.
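The difference is easy to see with a Hamming(7,4) code, the classic minimal example: three overlapping parity bits over four data bits yield a syndrome that points at the position of a single flipped bit, so the error can be corrected rather than just detected.

```python
def hamming74_encode(nibble):
    """Encode 4 data bits as a 7-bit codeword (parity at positions 1, 2, 4)."""
    d1, d2, d3, d4 = nibble
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_decode(code):
    """Correct up to one flipped bit, then return the 4 data bits."""
    c = code[:]
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]  # parity over positions 1,3,5,7
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]  # parity over positions 2,3,6,7
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]  # parity over positions 4,5,6,7
    syndrome = s1 + 2 * s2 + 4 * s3
    if syndrome:
        c[syndrome - 1] ^= 1        # syndrome is the 1-based error position
    return [c[2], c[4], c[5], c[6]]
```

A lone parity bit can only answer "error or no error"; the three overlapping parities here form a 3-bit syndrome that names one of seven positions, which is exactly the extra information correction requires.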

It already exists, it's called ZFS on Solaris boxxen. Each block uses a checksum, it can correct itself on each read, and it can generally indicate a failing disk. This truly is the filesystem every other one is playing catch-up with.

I agree with you about ZFS. Getting Sun to dual license it under the GPL would mean it becoming a de facto standard everywhere but Macs and Windows. However, we should have ECC in the file format, so regardless of what type of machine a file is stored on, there is a good chance of repairs for a file.

On an ISO basis, one can use a utility like dvdisaster to append ECC info to the end of an ISO file before burning to CD or DVD. Since this is transparent to user action, the added ECC only really comes into play when the disc starts to fail.

Actually, ZFS does not use ECC unless you are using RAID-Z. It does use checksums for all blocks, but those can only detect errors; if you're not using RAID-Z or a mirror, or don't have copies=(2|3), the error will be fatal, but at least you'll know about it.

Having ECC in metadata is fine, but in a perfect world, applications shouldn't ignore ECC. Especially when a file is being used and modified. Instead, either the app should update the ECC records when any writes are performed, or at least when the file is being closed.

I've been doing this for a while now for files that I back up to CD or DVD, just in case the media doesn't have that long a life. I've actually had much more trouble with sectors going bad on hard drives than on CD or DVD, but that could change. I really hate that I lost an irreplaceable movie file I took when I was in Japan, because a sector went bad on my server (long since gone now) that kept me from even copying the good sectors from it. I couldn't even run fsck or anything since it was a leased server.

>>>"...analog pictures and film can degrade and still be usable; why can't the same be true of digital files?"

The ear-eye-brain connection has ~500 million years of development, and has learned the ability to filter-out noise. If for example I'm listening to a radio, the hiss is mentally filtered-out, or if I'm watching a VHS tape that has wrinkles, my brain can focus on the undamaged areas. In contrast when a computer encounters noise or errors, it panics and says, "I give up," and the digital radio or digital television goes blank.

What we need is a smarter computer that says, "I don't know what this is supposed to be, but here's my best guess," and displays noise. Let the brain then takeover and mentally remove the noise from the audio or image.

When I was looking for a digital-to-analog converter for my TV, I returned all the ones that displayed blank screens when the signal became weak. The one I eventually chose (x5) was the Channel Master unit. When the signal is weak it continues displaying a noisy image, rather than go blank, or it reverts to "audio only" mode, rather than go silent. It lets me continue watching programs rather than be completely cutoff.

And how well did that work for your last corrupted text file? Or a printer job that the printer didn't know how to handle? My guess is you could pick out a few words and the rest was random garble. The mind is good at filtering out noise, but it is an intrinsically hard problem to do the same thing with a computer. Say a random bit is missed and the whole file ends up shifted one to the left; how does the computer know that the combination of pixel values it is displaying should start one bit out of sync so that everything afterwards lines up?

Extremely well. I read all 7 Harry Potter books using a corrupted OCR scan. From time to time the words might read "Harry Rotter pinted his wand at the eneny," but I still understood what it was supposed to be. My brain filtered out the errors. That corrupted version is better than the computer saying, "I give up" and displaying nothing at all.

>>>Say a random bit is missed and the whole file ends up shifted one to the left

Sure, but you'd need to not compress anything and introduce redundancy. Most people prefer using efficient formats and doing error checks and corrections elsewhere (in the filesystem, by external correction codes - par2 springs to mind, etc).

What we need is a smarter computer that says, "I don't know what this is supposed to be, but here's my best guess," and displays noise. Let the brain then takeover and mentally remove the noise from the audio or image.

Audio CDs have always done this. Audio CDs are also uncompressed*.

The problem, I suspect, is that we have come to rely on a lot of data compression, particularly where video is concerned. I'm not saying this is the wrong choice, necessarily, because video can become ungodly huge without it (NTSC SD video -- 720 x 480 x 29.97 -- in the 4:2:2 colour space, 8 bits per pixel per plane, will consume 69.5 GiB an hour without compression), but maybe we didn't give enough thought to stream corruption.

Mini DV video tape, when run in SD, uses no compression on the audio, and the video is only lightly compressed, using a DCT-based codec, with no delta coding. In practical terms, what this means is that one corrupted frame of video doesn't cascade into future frames. If my camcorder gets a wrinkle in the tape, it will affect the frames recorded on the wrinkle, and no others. It also makes a best-guess effort to reconstruct the frame. This task may not be impossible with more dense codecs that do use delta coding and motion compensation (MPEG, DiVX, etc), but it is certainly made far more difficult.

Incidentally, even digital cinemas are using compression. It is a no-delta compression, but the individual frames are compressed in a manner akin to JPEGs, and the audio is compressed either using DTS or AC3 or one of their variants in most cinemas. The difference, of course, is that the cinemas must provide a good presentation. If they fail to do so, people will stop coming. If the presentation isn't better than watching TV/DVD/BluRay at home, then why pay the $11?

(* I refer here to data compression, not dynamic range compression. Dynamic range compression is applied way too much in most audio media)

You're right about CDs always having contained this type of ECC mechanism. For that matter, you will see this type of ECC in radio-based communications infrastructure and in data that gets written to hard disks too! In other words, all modern data storage devices (except maybe Flash) contain ECC mechanisms that allow burst error detection and correction.

So now we're talking about being doubly redundant. Put ECC on the ECC? I'm not sure that helps.

One thing that helps is to specifically look at damage mechanisms and then come up with a strategy that offers the best workaround. As an example, in the first-generation TDMA digital cellular phone standard, it was recognized that channel noise was typically bursty. You could lose much of one frame of data, but adjacent frames would be OK. So they encoded important data with a 4:1 forward error correction scheme, and then interleaved the encoded data over two frames. If you lost a frame, the de-interleaving spread the damage thinly across two frames, where the 4:1 coding could correct it.
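Interleaving itself is simple; here is a byte-level sketch (a toy illustration, not the actual TDMA bit layout): alternate the bytes of two encoded frames so that a burst wiping out one run of the stream takes only half of each frame, an amount the FEC can then repair.

```python
def interleave(frame_a: bytes, frame_b: bytes) -> bytes:
    """Alternate bytes of two equal-length frames into one stream."""
    out = bytearray()
    for a, b in zip(frame_a, frame_b):
        out.append(a)
        out.append(b)
    return bytes(out)

def deinterleave(stream: bytes):
    """Split the stream back into the two original frames."""
    return stream[0::2], stream[1::2]
```

A burst of N consecutive lost bytes in the interleaved stream costs each frame only about N/2 bytes, which is the whole point of the technique.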

but how much of our ability to "read through" noise is because we know what data to expect in the first place?

I know when listening to a radio station that's barely coming in, many times it will sound like random noise, but at some point I will hear a certain note or something and suddenly I know what song it is. Now that I know what song it is, I have no problem actually "hearing" the song. I know what to expect from the song, or have picked up on certain cues, and I'm guessing that pattern recognition goes a long way.

That would be trivial to do if we were still doing BMP and WAV files, where one bit = one speck of noise. But on the file/network level we use a ton of compression, and the result is that a bit error isn't a bit error in the human sense. Bits are part of a block, and other blocks depend on that block. One change and everything that comes after until the next key frame changes. That means two vastly different results are suddenly a bit apart.

What files does a single bit error irretrievably destroy? Obviously it may cause problems, even very annoying problems when you go to use the file. But unless that one bit is in a really bad spot that information is pretty recoverable.

That's complete nonsense. Just for one example, if the bit is part of a numeric value, depending on where that bit is, it could make the number off anywhere from 1 to 2 BILLION or even a lot more, depending on the kind of representation being used.

Most modern compression formats will not tolerate any errors. With LZ a single bit error could propagate over a long expanse of the uncompressed output, while with Arithmetic encoding the remainder of the file following the single bit error will be completely unrecoverable.

Pretty much only the prefix-code style compression schemes (Huffman, for one) will isolate errors to short segments, and then only if the compressor is not of the adaptive variety.
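This is easy to demonstrate with zlib (DEFLATE is LZ77 plus Huffman coding): flip a single bit in the middle of a compressed stream and the original data is no longer recoverable as-is.

```python
import zlib

original = b"the quick brown fox jumps over the lazy dog " * 50
damaged = bytearray(zlib.compress(original))
damaged[len(damaged) // 2] ^= 0x01  # flip one bit mid-stream

try:
    intact = zlib.decompress(bytes(damaged)) == original
except zlib.error:  # invalid Huffman codes or failed Adler-32 check
    intact = False
```

Either the flipped bit derails the Huffman decoding entirely, or the output changes and the Adler-32 checksum rejects it; in neither case do you get your file back.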

Doesn't rar have a capacity to have some redundancy though? I seem to recall that downloading multi-part rar files from usenet a while ago that included some extra files that could be used to rebuild the originals in the case of file corruption (which happened fairly often with usenet).

I'd venture to say TrueCrypt containers, when that corruption occurs at the place where they store the encrypted symmetric key. Depending on the size of said container, it could be the whole hard disk. :)

Fortunately, newer versions of TC have two headers, so if the main one is scrozzled, you can check the "use backup header embedded in the volume if available" under Mount Options and have a second chance of mounting the volume.

I've got a 10 gig .tar.bz2 file that I've only been partially able to recover due to a couple of bad blocks on a hard drive. I ran bzip2recover on it, which broke it into many, many pieces, and then put them back together into a partially recoverable tar file. Now I just can't figure out how to get past the corrupt pieces. :(

Many of the jpgs that I took with my first digital camera were damaged or destroyed by bit corruption. I doubt I'm the only person who fell to that problem; those pictures were taken back in the day when jpg was the only option available on cameras, and many of us didn't know well enough to save them in a different file format afterwards. Now I have a collection of images where half the image is missing or disrupted, and many others that simply don't open at all anymore.

With modern compression and encryption algorithms, a single bit flipped can mean a *lot* of downstream corruption, especially in video that uses deltas, or encryption algorithms that are stream based, so all bits upstream have some effect as the file gets encrypted. A single bit flipped will render an encrypted file completely unusable.

Most of the storage media in common use (disks, tapes, CD/DVD-R) already use ECC at the sector or block level and will fix "single bit" errors at the firmware level transparently. What is more of an issue at the application level are "missing block" errors: when the low-level ECC fails, the storage device signals "unreadable sector," and one or more blocks of data are lost.

Of course this can be fixed by "block redundancy" (like RAID does), "block recovery checksums" or old-fashioned backups.

RAID 0 does not offer "block redundancy".
If I have old-fashioned backups, how can I determine that my primary file has sustained damage? The OS doesn't tell us. In fact, the OS probably doesn't even know.
Backup software often allows you to validate the backup immediately after the backup was made, but not, say, a year later.
The term "checksum" is over-used. A true check-sum algorithm *may* tell you that a file has sustained damage. It will not, however, tell you how to correct it. A "CRC" (cyclic-redundancy check) is a stronger detector, but it likewise only detects damage; it cannot repair it.

Have you ever wondered why the numbers in RAID levels are what they are? RAID-1 is level 1 because it's like the identity property; everything's the same (kinda like multiplying by 1). RAID-0 is level 0 because it is not redundant; i.e., there's zero redundancy (as in, multiplying by 0). It's an array of inexpensive disks, sure, but calling RAID-0 a RAID is a misnomer at best. RAID-0 was not in the original RAID paper, in fact.

No one talking about data protection uses RAID-0, and it's therefore irrelevant.

It means we need error correction at every level---error correction at the physical device (already in place, more or less) and error correction at the file system level (so even if a few blocks from a file are missing, the file system auto-corrects itself and still functions---up to some point, of course).

It is about time that somebody (hopefully some of the commercial vendors AND the open source community too) get wise to the problems of digital storage.

I always create files with unique headers and consistent version numbering to allow for minor as well as major file format changes. For storage/exchange purposes, I make the format expandable, where each subfield/record has an individual header with a field type and a length indicator. Each field is terminated with a unique marker (two NULL bytes) to make the format resilient to errors in the headers, with possible resynchronisation through the markers. The format is in most situations backward compatible to a certain extent, as an old program can always ignore fields/subfields it does not understand in a newer format file. If that is not an option, the major version number is incremented. This means that a version 2.11 program can read a version 2.34 file with only minor problems. It will not be able to write to that format, though. The same version 2.11 program would not be able to correctly read a version 3.01 file either.
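That scheme can be sketched as a simple TLV (type-length-value) layout (a hypothetical layout, not the poster's exact format): each field carries a type and a length, is terminated by two NUL bytes, and readers skip types they don't recognise, which is what makes old programs tolerant of newer files.

```python
import struct

def write_field(buf: bytearray, ftype: int, payload: bytes):
    """Field = 2-byte type, 4-byte length, payload, then a NUL-NUL end marker."""
    buf += struct.pack(">HI", ftype, len(payload)) + payload + b"\x00\x00"

def read_fields(data: bytes, known_types: set):
    """Yield (type, payload) pairs, silently skipping unknown field types."""
    pos = 0
    while pos + 6 <= len(data):
        ftype, length = struct.unpack_from(">HI", data, pos)
        payload = data[pos + 6 : pos + 6 + length]
        pos += 6 + length + 2  # step over payload and the NUL-NUL marker
        if ftype in known_types:
            yield ftype, payload
```

The NUL-NUL markers also give a reader something to scan for when a length field itself is corrupted, which is the resynchronisation idea above.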

I have not implemented ECC in the formats yet, but maybe the next time I do an overhaul... I will have to ponder that. Maybe not; my programs seem too ephemeral for that... Then again, that's what people once thought about their 1960s COBOL programs.

Because all of those are compressed, and take up a tiny fraction of the space that a faithful digital recording of the information on a film reel would take up. If you want lossless-level data integrity, use lossless formats for your masters.

Old documents, saved in 'almost like ascii', are still 'readable'. I once salvaged a document from some obscure ancient word processor by opening it in a text editor. I also found some "images" (more like icons) on the same disk (a copy of a floppy); even these I could "read" (by changing the page width of my text editor to fit the width of the uncompressed image).

I remember reading a story of a guy who had to download a file from Apple that was over 4 gigabytes, and had to attempt it several times because each came back corrupted due to some problem with his internet. Eventually, he gave up and found the file on bit torrent, but realized if he saved it in the same location as the corrupted file, it would check the file and then overwrite it with the correct information. He was able to fix it in under an hour using bittorrent rather than trying to re-download the file while crossing his fingers and praying for no corruption.

Quite frankly, data is so duplicated today that bit-rot is not really an issue if you know what tools to use, especially if you use tools like QuickPar on important data, which can handle bad blocks.

Much data is easily duplicated; the data you want to save, if it is important, should be backed up with care.

Even though much of the data I download is easily downloaded again, the stuff I want to keep I QuickPar the archives and burn to disc, and for really important data that is irreplaceable I make multiple copies.

It basically allows you to create an archive that's selectively larger, but contains an amount of parity such that you can have XX% corruption and still 'unzip.'

"The original idea behind this project was to provide a tool to apply the data-recovery capability concepts of RAID-like systems to the posting and recovery of multi-part archives on Usenet. We accomplished that goal." [http://parchive.sourceforge.net/]

I would like this. Some options I could work with: extensions to current CD/DVD/Blu-ray ISO formats, a new version of "ZIP" files, and a new version of TrueCrypt files. If done in an open-standards way, I could be somewhat confident of support many years from now when I may need to read the archives. Obviously, backwards compatibility with earlier ISO/file formats would be a plus.

Ten years ago my old company used to advocate that individuals who wanted to convert paper to digital first put the documents on microfilm and then scan them. That way, when their digital media got damaged or lost, they could always recreate it. Film lasts a long, long time when stored correctly. Unfortunately, that still seems to be the best advice, at least if you are starting from an analog original.

I believe that forward error correction is an even better model. Already used for error-free transmission of data over error-prone links in radio, and on USENET via the PAR format, what better way to preserve data than with FEC? Save your really precious files as Parchive files (PAR and PAR2). You can spread them over several discs, or just one disc with several of the files on it.

It's one thing to detect errors, but it's a wholly different universe when you can also correct them.

The problem is not error correction, but the linearity of the data. Using only 256 pixels you could represent an image the brain can interpret. The problem is, the brain cannot interpret an image made from the first 256 pixels, as that would probably be a line half as long as the image width, consisting of mostly irrelevant data.
If I wanted to make a fail-proof image, I would split it into squares of, say, 9 (3x3) pixels, and then put only the central pixel (every 5th px) values in the byte stream. Once that is done, repeat the process for the surrounding pixels.

I know I could still read the book even if every fifth letter were replaced by an incorrect one.

That would depend on the book. Your brain could probably error-correct a novel easily enough under those conditions (especially if it was every fifth character, and not random characters at a rate of 20%). But I doubt anyone could follow, say, a math textbook with that many errors.

Tahoe-LAFS uses zfec, encryption, integrity checking based on SHA-256, digital signatures based on RSA, and peer-to-peer networking to take a bunch of hard disks and make them into a single virtual hard disk which is extremely robust: http://allmydata.org/trac/tahoe

Because no-one yet has ever managed to pull things from this theoretical "historical" layer without at least something like an electron microscope costing tens or hundreds of thousands, thousands of hours of skilled *manual* work, and having to crack the damn hard drive open and destroy it (if at all)? I believe there is still a challenge going around with a hard drive that was "zeroed" quite simply, and if anyone can recover the password in the single file that was on it before it was zeroed, they can get a prize.

That would like, totally be possible! Except of course that you would probably find that there are a _LOT_ of files of 20GB or less that would give you those exact same hashes. Some of those files will be a valid blu-ray file. But apart from that, way to go!

It'd be interesting if someone were to do the math on this one; I think you'd be disappointed.
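The math is a one-line pigeonhole argument. Assuming SHA-256 and a 20 GB (2 x 10^10 byte) file: there are 2^(1.6x10^11) possible files but only 2^256 hash values, so on average an astronomical number of distinct files share each hash.

```python
FILE_BITS = 20 * 10**9 * 8  # bits in a 20 GB file
HASH_BITS = 256             # SHA-256 output size

# by pigeonhole, roughly 2**collision_exponent distinct 20 GB files
# map onto each individual hash value
collision_exponent = FILE_BITS - HASH_BITS
```

So yes, collisions exist in overwhelming numbers; the catch is that actually finding any of them, let alone one that happens to be a valid Blu-ray image, is computationally infeasible.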