O_DIRECT kicks off bios and returns -EIOCBQUEUED to indicate its intention to
call aio_complete() once the bios complete.  As we return from submission we
must preserve the -EIOCBQUEUED return code so that fs/aio.c knows to let the
bio completion call aio_complete().  This stops us from returning errors after
O_DIRECT submission.

But we have a few places that are very interested in generating errors after
bio submission.

The most critical of these is invalidating the page cache after a write.  This
avoids exposing stale data to buffered operations that are performed after the
O_DIRECT write succeeds.  We must do this after submission because the user
buffer might have been an mmap()ed buffer of the region being written to.  The
get_user_pages() in the O_DIRECT completion path could have faulted in stale
data.

So this patch introduces a helper, aio_propogate_error(), which queues
post-submission errors in the iocb so that they are given to the user
completion event when aio_complete() is finally called.
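
As an illustration only, the helper's contract can be modelled in user space.
This is a minimal sketch, not kernel code: struct model_kiocb and
model_propagate_error() are hypothetical stand-ins for struct kiocb and
aio_propogate_error(), and the EIOCBQUEUED definition is supplied locally
because user-space <errno.h> does not export it.

```c
#include <errno.h>

/* Assumption: a local definition mirroring the kernel-internal errno. */
#ifndef EIOCBQUEUED
#define EIOCBQUEUED 529
#endif

/* Hypothetical stand-in for struct kiocb; only the field we need. */
struct model_kiocb {
	int ki_pending_err;	/* first post-submission error, 0 if none */
};

/*
 * Same decision logic as the helper in this patch:
 *  - if the iocb was not queued, the new error simply becomes the
 *    return code;
 *  - if it was queued, stash the first error in the iocb and keep
 *    returning -EIOCBQUEUED so the caller does not complete it twice.
 */
static int model_propagate_error(struct model_kiocb *iocb, int existing_err,
				 int new_err)
{
	if (existing_err != -EIOCBQUEUED)
		return new_err;
	if (!iocb->ki_pending_err)
		iocb->ki_pending_err = new_err;
	return -EIOCBQUEUED;
}
```

In this model a queued iocb that later sees -EIO keeps returning
-EIOCBQUEUED to its caller while ki_pending_err holds -EIO; a second, later
error does not replace the first one.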

To get this working we change the aio_complete() path so that the ring
insertion is performed as the final iocb reference is dropped.  This gives the
submission path time to queue its pending error before it drops its reference.
This increases the space in the iocb as it has to record the two result codes
from aio_complete() and the pending error from the submission path.
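
The ordering described above can be sketched as a single-threaded user-space
model.  Everything here is hypothetical (struct model_iocb, model_complete(),
model_put()); the real code takes ctx->ctx_lock and inserts into the
completion ring, which this sketch leaves out.  Passing res2 through unchanged
when a pending error exists is an assumption of the sketch, not something the
description above states.

```c
/* Hypothetical model of an iocb that defers its completion event. */
struct model_iocb {
	int refs;		/* outstanding references */
	int completed;		/* set once model_complete() has run */
	long res;		/* first result code from completion */
	long res2;		/* second result code from completion */
	int pending_err;	/* error queued by the submission path */
};

/* Hypothetical stand-in for the event handed to userspace. */
struct model_event {
	long res;
	long res2;
};

/* Completion only records its result codes; no event is emitted yet. */
static void model_complete(struct model_iocb *iocb, long res, long res2)
{
	iocb->res = res;
	iocb->res2 = res2;
	iocb->completed = 1;
}

/*
 * Drop one reference.  Only the final drop builds the event, so an
 * error queued before that drop overrides the recorded result.
 * Returns 1 if an event was written to *ev, 0 otherwise.
 */
static int model_put(struct model_iocb *iocb, struct model_event *ev)
{
	if (--iocb->refs > 0)
		return 0;
	if (!iocb->completed)
		return 0;
	ev->res = iocb->pending_err ? iocb->pending_err : iocb->res;
	ev->res2 = iocb->res2;
	return 1;
}
```

With refs starting at 2 (one for submission, one for completion), the bio
completion can call model_complete() and drop its reference first; the
submission path then sets pending_err and drops the last reference, and the
event carries the pending error rather than the byte count.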

This was tested by running O_DIRECT aio-stress concurrently with buffered reads
while triggering EIO in invalidate_inode_pages2_range() with the help of a
debugfs bool hack.  Previously the kernel would oops as fs/aio.c and bio
completion both called aio_complete().  With this patch aio-stress sees -EIO.

-	/* add a completion event to the ring buffer.
-	 * must be done holding ctx->ctx_lock to prevent
-	 * other code from messing with the tail
-	 * pointer since we might be called from irq
-	 * context.
-	 */
+	/*
+	 * We queue up the completion codes into the iocb.  They are combined
+	 * with a potential error from the submission path and inserted into
+	 * the ring once the last reference to the iocb is dropped.  Cancelled
+	 * iocbs don't insert events on completion because userland was given
+	 * an event directly as part of the cancelation interface.
+	 */
 	spin_lock_irqsave(&ctx->ctx_lock, flags);

+/*
+ * This function is used to make sure that an error is communicated to
+ * userspace on iocb completion without stopping -EIOCBQUEUED from bubbling up
+ * to fs/aio.c from the place where it originated.
+ *
+ * If we have an existing -EIOCBQUEUED it must be returned all the way to
+ * fs/aio.c so that it doesn't double-complete the iocb along with whoever
+ * returned -EIOCBQUEUED.  In that case we put the new error in the iocb.  It
+ * will be returned to userspace *instead of* the first result code given to
+ * aio_complete().  Use this only for errors which must overwrite whatever the
+ * return code might have been.  The first non-zero new_err given to this
+ * function for a given iocb will be returned to userspace.
+ */
+static inline int aio_propogate_error(struct kiocb *iocb, int existing_err,
+				      int new_err)
+{
+	if (existing_err != -EIOCBQUEUED)
+		return new_err;
+	if (!iocb->ki_pending_err)
+		iocb->ki_pending_err = new_err;
+	return -EIOCBQUEUED;
+}
+
 /* for sysctl: */
 extern unsigned long aio_nr;
 extern unsigned long aio_max_nr;
diff -r 8a740eb579d4 mm/filemap.c
--- a/mm/filemap.c	Mon Feb 19 13:12:20 2007 -0800
+++ b/mm/filemap.c	Mon Feb 19 13:16:00 2007 -0800
@@ -2031,7 +2031,7 @@ generic_file_direct_write(struct kiocb *
 	    ((file->f_flags & O_SYNC) || IS_SYNC(inode))) {
 		int err = generic_osync_inode(inode, mapping, OSYNC_METADATA);
 		if (err < 0)
-			written = err;
+			written = aio_propogate_error(iocb, written, err);
 	}
 	return written;
 }
@@ -2396,7 +2396,7 @@ generic_file_direct_IO(int rw, struct ki
 			int err = invalidate_inode_pages2_range(mapping,
 					offset >> PAGE_CACHE_SHIFT, end);
 			if (err)
-				retval = err;
+				retval = aio_propogate_error(iocb, retval, err);
 		}
 	}
 	return retval;