Re: [Qemu-devel] [PATCH] bdrv_aio_flush

From: Andrea Arcangeli
Subject: Re: [Qemu-devel] [PATCH] bdrv_aio_flush
Date: Mon, 1 Sep 2008 14:25:21 +0200

On Mon, Sep 01, 2008 at 12:27:02PM +0100, Ian Jackson wrote:
> I think this is fine. We discussed this some time ago. bdrv_flush

You mean fsync is just fine, or you mean replacing fsync with
aio_fsync is fine and needed? ;)

> guarantees that _already completed_ IO operations are flushed. It
> does not guarantee that in flight AIO operations are completed and
> then flushed to disk.

In case you meant fsync is just fine: Linux will use
WIN_FLUSH_CACHE/WIN_FLUSH_CACHE_EXT, see idedisk_prepare_flush:

    if (barrier) {
        ordered = QUEUE_ORDERED_DRAIN_FLUSH;
        prep_fn = idedisk_prepare_flush;
    }

So if we don't want guest journaling to break with SCSI/virtio, we
have to make sure the AIO is committed to disk before the flush returns.
To be clear: this is only a problem if there's a power outage on the
host.

> [..] I can't see any reason to think that the
> `write cache' which is referred to by the spec is regarded as
> containing data which has not yet been DMAd from the host to the disk
> because the command which does that transfer is not yet complete.

I'm not sure I follow. IDE is safe because it submits one command at
a time and we don't simulate a dirty write cache. So by the time
bdrv_flush is called, the previous aio_write has already completed, and
in turn the dirty data is already visible to the kernel, which will
write it to disk with fsync.

But anything a bit more clever than IDE that allows the guest to
submit a barrier in a TCQ way, like SCSI or virtio, will break
guest journaling if fsync is used. By the time the flush operation
returns, all previous data must be written to disk. Or at least the
flush operation should return in order, so anything after the barrier
operation should be written after the previous stuff. fsync can't
guarantee that, because it'll return immediately even if the aio queue
is huge. And after the aio queue is flushed to the kernel writeback
cache, the kernel is free to write the writeback cache in whatever
order it wants (in Linux it'll try to write it in dirty-inode order
first, and then in logical order according to the offset of the dirty
data into the inode, looking up the inode radix tree).
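
To make the difference concrete, here is a minimal sketch in plain POSIX AIO. It is not QEMU code and not the patch itself: write_then_flush and wait_for_aio are hypothetical names made up for illustration. The idea is to queue the flush with aio_fsync behind the pending aio_write, so the flush only completes after the writes queued before it, instead of calling fsync while the write may still be in flight. On older glibc this may need -lrt at link time.

```c
#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: block until one AIO request has finished.
 * Returns 0 on success, -1 on failure with errno set. */
static int wait_for_aio(struct aiocb *cb)
{
    const struct aiocb *const list[1] = { cb };
    ssize_t ret;
    int err;

    while ((err = aio_error(cb)) == EINPROGRESS) {
        if (aio_suspend(list, 1, NULL) == -1 && errno != EINTR)
            return -1;
    }
    if (err == -1)
        return -1;              /* invalid request, errno already set */

    ret = aio_return(cb);       /* reap the request exactly once */
    if (err != 0) {
        errno = err;
        return -1;
    }
    return ret < 0 ? -1 : 0;
}

/* Hypothetical helper: queue a write, queue a flush behind it, and
 * return only once both have completed. */
int write_then_flush(int fd, const void *buf, size_t len, off_t offset)
{
    struct aiocb wr, fl;
    int rc_fl, rc_wr;

    memset(&wr, 0, sizeof wr);
    wr.aio_fildes = fd;
    wr.aio_buf = (void *)buf;
    wr.aio_nbytes = len;
    wr.aio_offset = offset;
    if (aio_write(&wr) == -1)
        return -1;

    /* A plain fsync(fd) here could return while 'wr' is still queued.
     * aio_fsync covers the requests queued on fd at the time of the call. */
    memset(&fl, 0, sizeof fl);
    fl.aio_fildes = fd;
    if (aio_fsync(O_SYNC, &fl) == -1) {
        wait_for_aio(&wr);      /* 'wr' must not outlive this stack frame */
        return -1;
    }

    rc_fl = wait_for_aio(&fl);
    rc_wr = wait_for_aio(&wr);
    return (rc_fl == 0 && rc_wr == 0) ? 0 : -1;
}
```

A caller would use it as write_then_flush(fd, data, len, 0) and treat a return of 0 as "the data written before the barrier is on disk". An asynchronous flush in the emulator would return to the guest from a completion callback rather than block like this, but the ordering argument is the same.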