On Wed, Dec 27, 2006 at 09:08:56PM +0530, Suparna Bhattacharya wrote:
> (2) Most of these other applications need the ability to process both
> network events (epoll) and disk file AIO in the same loop. With POSIX AIO
> they could at least sort of do this using signals (yeah, and all associated
> issues). The IO_CMD_EPOLL_WAIT patch (originally from Zach Brown with
> modifications from Jeff Moyer and me) addresses this problem for native
> linux aio in a simple manner. Tridge has written a test harness to
> try out the Samba4 event library modifications to use this. Jeff Moyer
> has a modified version of pipetest for comparison.

Enable epoll wait to be unified with io_getevents

From: Zach Brown, Jeff Moyer, Suparna Bhattacharya

Previously there have been (complicated and scary) attempts to funnel
individual aio events down epoll or vice versa. This instead lets one
issue an entire sys_epoll_wait() as an aio op. You'd set up epoll as
usual and then issue epoll_wait aio ops which would complete once epoll
events had been copied. This will enable a single io_getevents() event
loop to process both disk AIO and epoll notifications.

From an application standpoint a typical flow works like this:
- Use epoll_ctl as usual to add/remove epoll registrations.
- Instead of issuing sys_epoll_wait, set up an iocb using
  io_prep_epoll_wait (see examples below), specifying the epoll events
  buffer to fill up with epoll notifications. Submit the iocb using
  io_submit.
- Now io_getevents can be used to wait for both epoll waits and disk aio
  completion. If a returned AIO event is of type IO_CMD_EPOLL_WAIT, the
  corresponding result value indicates the number of epoll notifications
  in the iocb's event buffer, which can then be processed just as one
  would process results from a sys_epoll_wait().
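The flow above can be sketched roughly as below. Note that
io_prep_epoll_wait() and IO_CMD_EPOLL_WAIT come from this patch series
and are not in stock libaio, so the helper's exact signature here is an
assumption based on the description, not something to compile against an
unpatched system:

```c
/* Sketch only: io_prep_epoll_wait() and IO_CMD_EPOLL_WAIT are provided
 * by this patch series, not stock libaio; the signature is assumed. */
#include <libaio.h>
#include <sys/epoll.h>

#define MAX_EPOLL_EVENTS 64

static void event_loop(io_context_t ctx, int epfd)
{
	struct epoll_event ep_events[MAX_EPOLL_EVENTS];
	struct iocb epoll_iocb, *iocbs[1] = { &epoll_iocb };
	struct io_event aio_events[16];
	long i, n;

	for (;;) {
		/* Submit the epoll wait itself as an aio op
		 * (assumed helper from the patched libaio). */
		io_prep_epoll_wait(&epoll_iocb, epfd, ep_events,
				   MAX_EPOLL_EVENTS, -1);
		io_submit(ctx, 1, iocbs);

		/* One wait now covers both disk aio completions
		 * and epoll readiness. */
		n = io_getevents(ctx, 1, 16, aio_events, NULL);
		for (i = 0; i < n; i++) {
			if (aio_events[i].obj == &epoll_iocb) {
				/* res holds the number of epoll
				 * notifications copied into ep_events;
				 * process ep_events[0..res-1] just as
				 * after sys_epoll_wait(). */
			} else {
				/* ordinary disk aio completion */
			}
		}
	}
}
```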

There is obviously a little overhead compared to using sys_epoll_wait(),
due to the extra step of submitting the epoll wait iocb, most noticeable
when there are very few events processed per loop. However, the goal here
is not to build an epoll alternative but merely to allow network and disk
I/O to be processed in the same event loop, which is where the efficiencies
really come from. Picking up more epoll events in each loop can amortize
the overhead across many operations and mitigate the impact.

Thanks to Arjan van de Ven for helping figure out how to resolve the
lockdep complaints. Both ctx->lock and ep->lock can be held in certain
wait queue callback routines, thus being nested inside q->lock. However,
this excludes the ctx->wait and ep->wq wait queues, which can safely be
nested inside ctx->lock or ep->lock respectively. So we teach lockdep to
recognize these as distinct classes.
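A minimal sketch of the annotation approach, putting a wait queue's
internal spinlock into its own lockdep class (the key name and helper
here are illustrative, not the patch's actual identifiers):

```c
/* Illustrative sketch: give ctx->wait's lock its own lockdep class so
 * nesting it inside ctx->lock is not confused with the wait queue
 * locks that ctx->lock itself nests inside. */
static struct lock_class_key ioctx_wait_key;

static void ioctx_init_wait(struct kioctx *ctx)
{
	init_waitqueue_head(&ctx->wait);
	lockdep_set_class(&ctx->wait.lock, &ioctx_wait_key);
}
```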

 /*
  * Calculate the timeout by checking for the "infinite" value ( -1 )
@@ -1569,16 +1643,13 @@ retry:
 	 * We need to sleep here, and we will be wake up by
 	 * ep_poll_callback() when events will become available.
 	 */
-	init_waitqueue_entry(&wait, current);
-	__add_wait_queue(&ep->wq, &wait);
-
 	for (;;) {
 		/*
 		 * We don't want to sleep if the ep_poll_callback() sends us
 		 * a wakeup in between. That's why we set the task state
 		 * to TASK_INTERRUPTIBLE before doing the checks.
 		 */
-		set_current_state(TASK_INTERRUPTIBLE);
+		prepare_to_wait(&ep->wq, wait, TASK_INTERRUPTIBLE);
 		if (!list_empty(&ep->rdllist) || !jtimeout)
 			break;
 		if (signal_pending(current)) {
@@ -1587,12 +1658,16 @@ retry:
 	}

 /*
+ * Same as schedule_timeout, except that it checks the wait queue context
+ * passed in, and in case of an asynchronous waiter it does not sleep,
+ * but returns -EIOCBRETRY to allow the operation to be retried later when
+ * notified, unless it has been cancelled in which case it returns -EINTR
+ */
+fastcall signed long __sched schedule_timeout_wait(signed long timeout,
+						   wait_queue_t *wait)
+{
+	struct kiocb *iocb;
+
+	if (is_sync_wait(wait))
+		return schedule_timeout(timeout);
+
+	iocb = io_wait_to_kiocb(wait);
+	if (kiocbIsCancelled(iocb))
+		return -EINTR;
+
+	return -EIOCBRETRY;
+}
+
+
+/*
  * We can use __set_current_state() here because schedule_timeout() calls
  * schedule() unconditionally.
  */