Re: [SOLVED] EXT4 Data Corruption Bug Linux 3.6.2 & 3.6.3

Ted released a patch in this post which he believes may fix the issue. He goes on to write, "...we know that my patch definitely restores the behaviour previous to commit eeecef0af5, so it can't hurt, but we do want to make 100% sure that it really fixes the problem."

Re: [SOLVED] EXT4 Data Corruption Bug Linux 3.6.2 & 3.6.3

graysky wrote:

Ted released a patch in this post which he believes may fix the issue. He goes on to write, "...we know that my patch definitely restores the behaviour previous to commit eeecef0af5, so it can't hurt, but we do want to make 100% sure that it really fixes the problem."

I have patched this into 3.6.3 just fine.

Do you mean you patched it into the Linux-CK kernel repo you maintain?

Re: [SOLVED] EXT4 Data Corruption Bug Linux 3.6.2 & 3.6.3

It should be fine since, in Ted's own words, "...we know that my patch definitely restores the behaviour previous to commit eeecef0af5, so it can't hurt, but we do want to make 100% sure that it really fixes the problem." But I am still reluctant to push it since this seems to be an evolving situation... perhaps I'll change my mind. I dunno.

Re: [SOLVED] EXT4 Data Corruption Bug Linux 3.6.2 & 3.6.3

Thanks, mcover! I suppose I'll stick with 3.5.6 in that case until 3.7 comes out. I had originally downgraded from 3.6 because of network issues, which may have been entirely unrelated to the 3.6 kernel of course. But I'm still glad I reverted early. Better safe than sorry where the filesystem is concerned. I've seen quite a few hard drives that were totaled by ReiserFS 3, so I was really hoping EXT4 would prove rock-solid by comparison.

Re: [SOLVED] EXT4 Data Corruption Bug Linux 3.6.2 & 3.6.3

It looks like my original analysis may not have been correct. At least, Eric and I haven't been able to figure out a way to trigger the problem based on my hypothesis of what had been going wrong. Still, the commit in question *does* change things, and so it's still the most likely culprit. (There were no ext4-related changes between v3.6..v3.6.1 and v3.6.2..v3.6.3, and I've looked at all of the changes between v3.6.2 and v3.6.3; all of the other changes look innocuous.) I have a patch (sent around 1:23 am Eastern on Wed., Oct. 24th to the ext4 list on the relevant mail thread) which should revert the problematic change in behavior, as well as put in a check which looks for the original conditions which might have triggered the problem, and prints a warning plus a stack trace so we can really understand what is going on. I don't want to consider this fixed until we have a reproduction case, so we can state with 100% certainty that we understand how it was triggered, and so we know that the proposed patch really does fix things.

That being said, please note that Fedora 17 is apparently on 3.6.2, and so far we only have two users who have reported the problem (or more specifically, both have reproduced file system corruptions with very similar symptoms, one running v3.6.2 and one running v3.6.3). The fact that they have reported the problem on very different hardware (one using a USB stick, the other using a Software RAID-5 setup) means it's not likely a hardware-induced problem. However, this could potentially just be bad luck, since the fs corruption that was reported could have been explained by a random hardware glitch. With two users reporting it, though, we have to treat it as potentially a real bug, and so I've gone back and re-audited all of the ext4-related commits that went into the v3.6.x stable kernel series.

If you think you have a related, similar bug, please check which kernel version you are using, and get the EXT4 error messages from the syslogs, and report it to me and the ext4 list. And if you can reproduce it reliably, I definitely want to hear from you. :-)
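For anyone gathering that information, here is a rough sketch of how the two checks could be scripted. It is only an illustration under stated assumptions: the helper names (`is_reported_version`, `ext4_messages`, `report`) are made up for this example, the `EXT4-fs error` / `EXT4-fs warning` prefixes are what such kernel messages typically look like, and the log path depends on your distribution (on systemd systems the kernel log may only be reachable through the journal).

```python
import platform
import re

# Assumption: ext4 problems show up in the kernel log with one of these prefixes.
EXT4_PATTERN = re.compile(r"EXT4-fs (error|warning)")

# The two kernel versions the reports in this thread came from.
REPORTED_PATTERN = re.compile(r"3\.6\.[23](\D|$)")


def is_reported_version(release):
    """Return True if the release string is 3.6.2 or 3.6.3 (any suffix)."""
    return REPORTED_PATTERN.match(release) is not None


def ext4_messages(lines):
    """Return the log lines that look like ext4 errors or warnings."""
    return [line.rstrip("\n") for line in lines if EXT4_PATTERN.search(line)]


def report(log_path):
    """Print the running kernel version and any ext4 messages in log_path.

    log_path is an assumption; pass whatever file holds your kernel log.
    """
    release = platform.release()
    note = "one of the reported versions" if is_reported_version(release) else "not 3.6.2/3.6.3"
    print("kernel:", release, "(" + note + ")")
    with open(log_path, errors="replace") as log:
        for line in ext4_messages(log):
            print(line)
```

Calling `report("/var/log/messages")` (adjust the path for your system) prints the kernel version followed by any matching lines, which is roughly what you would paste into a report to the ext4 list.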

Re: [SOLVED] EXT4 Data Corruption Bug Linux 3.6.2 & 3.6.3

When I woke up, my system was mostly unresponsive, and when I tried to log in on a VT I got some I/O error for /dev/sda. A hard reboot and the journal recovery at startup made that go away, but I think it must be because of this. As there is nothing in the logs, how do you check for corrupted data?
