Comments

Hi,
The attached is a patch to fix this problem,
please have a look.
2010/1/29 丁定华 <dingdinghua85@gmail.com>:
> Hi,> I think what we need to do is to distinguish the buffers which can> be discarded after> this transaction commits and which we must wait until the next transaction> commits, and as you wrote, set b_next_transaction to running> transaction and file them> into t_forget list is a good way.>> 2010/1/28 Jan Kara <jack@suse.cz>:>> Hi,>>>> On Thu 28-01-10 09:23:43, 丁定华 wrote:>>> As you wrote, if T2!=T1, then T2 is committing transaction while T1>>> is running transaction, and if T1 complete commit, we don't care>>> about the content of buffers. But there is a prerequisite -->"T1>>> complete commit", if T1 start commit and another transaction T3>>> becomes the new running transaction, T3 may need to reuse T2 log>>> space and force checkpoint, and since we have clean the BH_dirty bit>>> of buffers after T2 commits, so T2 may be freed before T1 complete>>> commit, and unfortunately, T1 doesn't complete commit, so after>>> replay, updates of T2 get lost, fs becomes inconsistent.>> Hmm, you are right. That could probably happen. It only hits ext3/4 in>> data=journaled mode but the bug is there. But it's hard to fix it in a>> reasonable way - if we would just leave the dirty bit set, we will have>> aliasing problems - buffer B is attached to some page which used to be from>> file F, so unmap_underlying_metadata will not find it because it looks only>> into page cache the block device, not to the pagecaches of individual>> files. So if we reallocate the block of B for some other file G before the>> buffer B is checkpointed, we have two dirty buffers representing the same>> block and thus data corruption can happen.>> Maybe we could handle them by setting b_next_transaction to the>> transaction that deleted the buffer (in jbd2_journal_unmap_buffer) and>> setting buffer freed like we do now. Commit code would handle freed>> buffers like:>> If b_next_transaction is set, file buffer to forget list of the next>> transaction. If b_next_transaction isn't set, clear all dirty bits.>> What do you think?>>>> Honza>>>>> 2010/1/27 Jan Kara <jack@suse.cz>:>>> > Hi,>>> >>>> > On Wed 27-01-10 10:32:18, 丁定华 wrote:>>> >> I'm a little confused about BH_Freed bit. The only place it is set>>> >> is journal_unmap_buffer, which is called by jbd2_journal_invalidatepage when>>> >> we want to truncate a file. Since jbd2_journal_invalidatepage is called>>> >> outside of transaction, We can't make sure whether the "add to orphan">>> >> operation belongs to committing transaction or not, so we can't touch the>>> >> buffer belongs to committing transaction, instead BH_Freed bit is set to>>> >> indicate that this buffer can be discarded in running transaction. But i>>> >> think we shouldn't clear BH_JBDdirty in jbd2_journal_commit_transaction, as>>> >> following codes does:>>> >> /* A buffer which has been freed while still being>>> >> * journaled by a previous transaction may end up still>>> >> * being dirty here, but we want to avoid writing back>>> >> * that buffer in the future now that the last use has>>> >> * been committed. That's not only a performance gain,>>> >> * it also stops aliasing problems if the buffer is left>>> >> * behind for writeback and gets reallocated for another>>> >> * use in a different page. */>>> >> if (buffer_freed(bh)) {>>> >> clear_buffer_freed(bh);>>> >> clear_buffer_jbddirty(bh);>>> >> }>>> >> Note that, *We can't make sure "current running transaction" can complete>>> >> commit work.* If we clear BH_JBDdirty bit here, this buffer may be freed>>> >> here, the log space of older transaction may be freed before the "current>>> >> running transaction" complete commit work, and if this happends, filesystem>>> >> will be inconsistent.>>> > Let me sketch the situation here:>>> > The file F gets truncated. The inode is added to orphan list in some>>> > transaction T1, only then jbd2_journal_invalidatepage can be called.>>> > As you wrote above, it can happen that jbd2_journal_invalidatepage on>>> > buffer B runs when some transaction T2 containing B is being committed and>>> > in that case we set BH_Freed. If T2 != T1 - i.e., T2 is being committed>>> > and T1 is the running transaction, note that we clear the dirty bit only>>> > when T2 is fully committed and we are processing forget list. So buffer has>>> > been properly written to T2 and we just won't write it in the transaction>>> > T1. And that is fine because as soon as transaction T1 finishes commit, we>>> > don't care about what happens with buffers of F because the fact that F is>>> > truncated is recorded and in case of crash we finish truncate during>>> > journal replay. And if we crash before T1 finishes commit, we don't care>>> > about contents of T1 either. If T2 == T1, the above reasoning applies as>>> > well and the situation is even simpler.>>> >>>> > Honza>>> > -->>> > Jan Kara <jack@suse.cz>>>> > SUSE Labs, CR>>> >>>>>>>>>>>>> -->>> 丁定华>> -->> Jan Kara <jack@suse.cz>>> SUSE Labs, CR>>>>>> --> 丁定华>