Recognizing the limitations of both poll() and select(), the 2.6 Linux kernel introduced the event poll (epoll) facility. While more complex than the two earlier interfaces, epoll solves the fundamental performance problem shared by both of them, and adds several new features.

Both poll() and select() (discussed in Chapter 2) require the full list of file descriptors to watch on each invocation. The kernel must then walk the entire list of file descriptors to be monitored. When this list grows large (it may contain hundreds or even thousands of file descriptors), walking the list on each invocation becomes a scalability bottleneck.

Epoll circumvents this problem by decoupling the monitor registration from the actual monitoring. One system call initializes an epoll context, another adds monitored file descriptors to or removes them from the context, and a third performs the actual event wait.

Creating a New Epoll Instance

An epoll context is created via epoll_create():

#include <sys/epoll.h>

int epoll_create (int size);

A successful call to epoll_create() instantiates a new epoll instance, and returns a file descriptor associated with the instance. This file descriptor has no relationship to a real file; it is just a handle to be used with subsequent calls using the epoll facility. The size parameter is a hint to the kernel about the number of file descriptors that are going to be monitored; it is not the maximum number. On early 2.6 kernels, passing in a good approximation results in better performance, but the exact number is not required. Since Linux 2.6.8, the kernel ignores the hint, although the value must still be greater than zero. On error, the call returns -1, and sets errno to one of the following:

EINVAL

The size parameter is not a positive number.

ENFILE

The system has reached the limit on the total number of open files.
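As a minimal sketch of the call described above, the following wraps epoll_create() in a small helper. The helper name create_epoll_instance() and its hint parameter are illustrative assumptions, not part of the epoll API.

```c
#include <stdio.h>
#include <sys/epoll.h>
#include <unistd.h>

/*
 * Hypothetical helper: create an epoll instance, passing `hint` as the
 * size argument. Returns the epoll file descriptor, or -1 on error.
 */
int create_epoll_instance (int hint)
{
        int epfd;

        epfd = epoll_create (hint);
        if (epfd < 0)
                perror ("epoll_create");

        return epfd;
}
```

The returned descriptor is an ordinary file descriptor as far as cleanup goes: when the program is done with the instance, it should release it with close().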

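Controlling Epoll

The epoll_ctl() system call adds file descriptors to, and removes them from, a given epoll context. It takes four parameters, in order: epfd, op, fd, and event.

A successful call to epoll_ctl() controls the epoll instance associated with the file descriptor epfd. The parameter op specifies the operation to be taken against the file associated with fd. The event parameter further describes the behavior of the operation.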
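For reference, the parameters named above correspond to the declarations shown in the comment below, as documented in the Linux epoll_ctl man page. The helper make_event() is a hypothetical convenience added only to show how the two fields of the structure are filled in.

```c
#include <stdint.h>
#include <sys/epoll.h>

/*
 * Declared by <sys/epoll.h>; repeated here for reference only:
 *
 *   int epoll_ctl (int epfd, int op, int fd, struct epoll_event *event);
 *
 *   struct epoll_event {
 *           uint32_t     events;    bit mask of events to monitor
 *           epoll_data_t data;      user data, returned with each event
 *   };
 *
 * epoll_data_t is a union with the members: void *ptr; int fd;
 * uint32_t u32; uint64_t u64.
 */

/* Hypothetical helper: build the event argument for a watch on fd. */
struct epoll_event make_event (int fd, uint32_t events)
{
        struct epoll_event event = { 0 };

        event.events = events;
        event.data.fd = fd;

        return event;
}
```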

Here are valid values for the op parameter:

EPOLL_CTL_ADD

Add a monitor on the file associated with the file descriptor fd to the epoll instance associated with epfd, per the events defined in event.

EPOLL_CTL_DEL

Remove a monitor on the file associated with the file descriptor fd from the epoll instance associated with epfd.

EPOLL_CTL_MOD

Modify an existing monitor of fd with the updated events specified by event.

The events field in the epoll_event structure lists which events to monitor on the given file descriptor. Multiple events can be bitwise-ORed together. Here are valid values:

EPOLLERR

An error condition occurred on the file. This event is always monitored, even if it’s not specified.

EPOLLET

Enables edge-triggered behavior for the monitor of the file (see the upcoming section “Edge- Versus Level-Triggered Events”). The default behavior is level-triggered.

EPOLLHUP

A hangup occurred on the file. This event is always monitored, even if it’s not specified.

EPOLLIN

The file is available to be read from without blocking.

EPOLLONESHOT

After an event is generated and read, the file is automatically no longer monitored. A new event mask must be specified via EPOLL_CTL_MOD to reenable the watch.

EPOLLOUT

The file is available to be written to without blocking.

EPOLLPRI

There is urgent out-of-band data available to read.
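The flags above are combined with a bitwise OR. As one illustration, the sketch below registers a one-shot watch for readability and later rearms it with EPOLL_CTL_MOD, as the EPOLLONESHOT entry describes. The function names watch_once() and rearm_once() are assumptions made for this example.

```c
#include <sys/epoll.h>

/* Register fd so that a single readable event is reported. */
int watch_once (int epfd, int fd)
{
        struct epoll_event event = { 0 };

        event.events = EPOLLIN | EPOLLONESHOT;
        event.data.fd = fd;

        return epoll_ctl (epfd, EPOLL_CTL_ADD, fd, &event);
}

/* Reenable the watch after its one event has been delivered. */
int rearm_once (int epfd, int fd)
{
        struct epoll_event event = { 0 };

        event.events = EPOLLIN | EPOLLONESHOT;
        event.data.fd = fd;

        /* The fd is still registered, so this must be MOD, not ADD. */
        return epoll_ctl (epfd, EPOLL_CTL_MOD, fd, &event);
}
```

Between the delivery of the event and the call to rearm_once(), the descriptor stays registered but produces no further events, which is why the rearm uses EPOLL_CTL_MOD.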

The data field inside the epoll_event structure is for the user’s private use. The contents are returned to the user upon receipt of the requested event. The common practice is to set event.data.fd to fd, which makes it easy to look up which file descriptor caused the event.
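Because data is a union, a program can store a pointer in data.ptr instead of a descriptor in data.fd, but not both at once. The sketch below attaches a pointer to per-descriptor state; struct connection and both function names are hypothetical and exist only for this illustration.

```c
#include <stddef.h>
#include <sys/epoll.h>

/* Hypothetical per-descriptor state; not part of the epoll API. */
struct connection {
        int fd;
        const char *label;
};

/* Watch conn->fd for input and attach conn itself as the private data. */
int watch_connection (int epfd, struct connection *conn)
{
        struct epoll_event event = { 0 };

        event.events = EPOLLIN;
        event.data.ptr = conn;  /* union: set ptr or fd, never both */

        return epoll_ctl (epfd, EPOLL_CTL_ADD, conn->fd, &event);
}

/*
 * Wait up to timeout_ms milliseconds and return the connection that
 * became ready, or NULL if none did or an error occurred.
 */
struct connection *next_ready (int epfd, int timeout_ms)
{
        struct epoll_event event;

        if (epoll_wait (epfd, &event, 1, timeout_ms) != 1)
                return NULL;

        return event.data.ptr;
}
```

Since the descriptor is kept inside the structure, nothing is lost by giving up data.fd, and the event handler gets its state back without a lookup table.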

Upon success, epoll_ctl() returns 0. On failure, the call returns -1, and sets errno to one of the following values:

EBADF

epfd is not a valid epoll instance, or fd is not a valid file descriptor.

EEXIST

op was EPOLL_CTL_ADD, but fd is already associated with epfd.

EINVAL

epfd is not an epoll instance, epfd is the same as fd, or op is invalid.

ENOENT

op was EPOLL_CTL_MOD or EPOLL_CTL_DEL, but fd is not associated with epfd.

ENOMEM

There was insufficient memory to process the request.

EPERM

fd does not support epoll.

As an example, a new watch on the file associated with fd can be added to the epoll instance epfd by calling epoll_ctl() with EPOLL_CTL_ADD and an event structure describing the events of interest.
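A minimal sketch of adding such a watch, and of changing it afterward, might look like the following. The function names add_watch() and modify_watch() and the particular event masks are assumptions chosen for illustration.

```c
#include <stdio.h>
#include <sys/epoll.h>

/* Add a watch on fd to epfd for both readability and writability. */
int add_watch (int epfd, int fd)
{
        struct epoll_event event = { 0 };
        int ret;

        event.data.fd = fd;     /* handed back to us with each event */
        event.events = EPOLLIN | EPOLLOUT;

        ret = epoll_ctl (epfd, EPOLL_CTL_ADD, fd, &event);
        if (ret)
                perror ("epoll_ctl");

        return ret;
}

/* Change an existing watch on fd so that only readability is reported. */
int modify_watch (int epfd, int fd)
{
        struct epoll_event event = { 0 };
        int ret;

        event.data.fd = fd;
        event.events = EPOLLIN;

        ret = epoll_ctl (epfd, EPOLL_CTL_MOD, fd, &event);
        if (ret)
                perror ("epoll_ctl");

        return ret;
}
```

Calling add_watch() twice on the same descriptor fails the second time with EEXIST, and calling modify_watch() on a descriptor that was never added fails with ENOENT, matching the error list above.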

Note that the event parameter can be NULL when op is EPOLL_CTL_DEL, as there is no event mask to provide. Kernel versions before 2.6.9, however, erroneously require this parameter to be non-NULL. For portability to these older kernels, you should pass in a valid non-NULL pointer; it will not be touched. Kernel 2.6.9 fixed this bug.

Waiting for Events with Epoll

The system call epoll_wait() waits for events on the file descriptors associated with the given epoll instance. It takes four parameters, in order: epfd, events, maxevents, and timeout.

A call to epoll_wait() waits up to timeout milliseconds for events on the files associated with the epoll instance epfd. Upon success, events points to memory containing epoll_event structures describing each event, up to a maximum of maxevents events. The return value is the number of events, or -1 on error, in which case errno is set to one of the following:

EBADF

epfd is not a valid file descriptor.

EFAULT

The process does not have write access to the memory pointed at by events.

EINTR

The system call was interrupted by a signal before it could complete.

EINVAL

epfd is not a valid epoll instance, or maxevents is equal to or less than 0.
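Of these errors, EINTR is usually not fatal: the call can simply be reissued. A minimal sketch of such a retry loop follows; the wrapper name wait_retry() is an assumption, and the prototype in the comment is the one documented in the epoll_wait man page.

```c
#include <errno.h>
#include <sys/epoll.h>

/*
 * Declared by <sys/epoll.h>; repeated here for reference only:
 *
 *   int epoll_wait (int epfd, struct epoll_event *events,
 *                   int maxevents, int timeout);
 */

/*
 * Call epoll_wait(), reissuing it whenever a signal interrupts it.
 * Note that each retry starts the full timeout over again.
 */
int wait_retry (int epfd, struct epoll_event *events,
                int maxevents, int timeout)
{
        int nr_events;

        do {
                nr_events = epoll_wait (epfd, events, maxevents, timeout);
        } while (nr_events < 0 && errno == EINTR);

        return nr_events;
}
```

Restarting the whole timeout is a simplification. A program that needs an accurate deadline would have to track the elapsed time and shorten the timeout on each retry.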

If timeout is 0, the call returns immediately, even if no events are available, in which case the call will return 0. If the timeout is -1, the call will not return until an event is available.
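The two special timeout values can be sketched as a pair of thin wrappers. The names poll_now() and wait_forever() are hypothetical and are used here only to label the two behaviors.

```c
#include <sys/epoll.h>

/* timeout 0: report whatever is ready right now, without blocking. */
int poll_now (int epfd, struct epoll_event *events, int maxevents)
{
        return epoll_wait (epfd, events, maxevents, 0);
}

/* timeout -1: block until at least one event is available. */
int wait_forever (int epfd, struct epoll_event *events, int maxevents)
{
        return epoll_wait (epfd, events, maxevents, -1);
}
```

A return value of 0 from poll_now() therefore means "nothing is ready yet" rather than an error.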

When the call returns, the events field of the epoll_event structure describes the events that occurred. The data field contains whatever the user set it to before invocation of epoll_ctl().
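Putting the pieces together, a wait followed by a walk over the returned events might be sketched as follows. The function name report_events() and the MAX_EVENTS limit of 64 are assumptions, and the code relies on every watch having been registered with data.fd set to its descriptor.

```c
#include <stdio.h>
#include <sys/epoll.h>

#define MAX_EVENTS 64   /* illustrative limit on events handled per call */

/*
 * Wait up to `timeout` milliseconds and print each event that occurred.
 * Returns the number of events, or -1 on error.
 */
int report_events (int epfd, int timeout)
{
        struct epoll_event events[MAX_EVENTS];
        int nr_events, i;

        nr_events = epoll_wait (epfd, events, MAX_EVENTS, timeout);
        if (nr_events < 0) {
                perror ("epoll_wait");
                return -1;
        }

        for (i = 0; i < nr_events; i++) {
                /* Valid only because each watch stored its fd in data.fd. */
                int fd = events[i].data.fd;

                if (events[i].events & (EPOLLERR | EPOLLHUP))
                        printf ("error or hangup on fd=%d\n", fd);
                if (events[i].events & EPOLLIN)
                        printf ("fd=%d is readable\n", fd);
                if (events[i].events & EPOLLOUT)
                        printf ("fd=%d is writable\n", fd);
        }

        return nr_events;
}
```

Only the first nr_events entries of the array are meaningful; the remaining entries are left untouched by the kernel.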

Edge- Versus Level-Triggered Events

If the EPOLLET value is set in the events field of the event parameter passed to epoll_ctl(), the watch on fd is edge-triggered, as opposed to level-triggered.

Consider the following events between a producer and a consumer communicating over a Unix pipe:

1. The producer writes 1 KB of data onto a pipe.

2. The consumer performs an epoll_wait() on the pipe, waiting for the pipe to contain data, and thus be readable.

With a level-triggered watch, the call to epoll_wait() in step 2 will return immediately, showing that the pipe is ready to read. With an edge-triggered watch, an event is generated only by the change in state, that is, by the write in step 1. Once the consumer has been notified of that write, a later call to epoll_wait() will not return until more data is written onto the pipe, even if the pipe is still readable at the time of the call.
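The difference can be observed with a small experiment: write once to a pipe, then poll its read end twice without reading anything. The sketch below does this; the function name count_notifications() is an assumption, and a timeout of 0 is used so that the second poll reports "no event" instead of blocking.

```c
#include <stdint.h>
#include <sys/epoll.h>
#include <unistd.h>

/*
 * Write one byte to a pipe, then poll its read end twice without
 * reading. `trigger` is 0 for a level-triggered watch or EPOLLET for
 * an edge-triggered one. Returns how many of the two polls reported
 * the pipe readable, or -1 if the setup failed.
 */
int count_notifications (uint32_t trigger)
{
        struct epoll_event event = { 0 }, out;
        int pipefd[2], epfd, i, count = 0;

        if (pipe (pipefd) < 0)
                return -1;

        epfd = epoll_create (1);
        if (epfd < 0) {
                close (pipefd[0]);
                close (pipefd[1]);
                return -1;
        }

        event.events = EPOLLIN | trigger;
        event.data.fd = pipefd[0];

        if (epoll_ctl (epfd, EPOLL_CTL_ADD, pipefd[0], &event) == 0 &&
            write (pipefd[1], "x", 1) == 1) {
                for (i = 0; i < 2; i++)
                        if (epoll_wait (epfd, &out, 1, 0) == 1)
                                count++;
        } else {
                count = -1;
        }

        close (epfd);
        close (pipefd[0]);
        close (pipefd[1]);

        return count;
}
```

With trigger set to 0, both polls report the pipe readable, because the data is still there. With EPOLLET, only the first poll does: the single write produced a single edge, and no new data has arrived since.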

Level-triggered is the default behavior. It is how poll() and select() behave, and it is what most developers expect. Edge-triggered behavior requires a different approach to programming, commonly utilizing nonblocking I/O, and careful checking for EAGAIN.
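One common shape for that approach is to put the descriptor into nonblocking mode and, on each event, read until the kernel reports EAGAIN. The following is a sketch under that assumption; set_nonblocking() and drain_fd() are hypothetical helper names, and the data read is simply discarded.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Put fd into nonblocking mode. Returns 0 on success, -1 on error. */
int set_nonblocking (int fd)
{
        int flags;

        flags = fcntl (fd, F_GETFL, 0);
        if (flags < 0)
                return -1;

        return fcntl (fd, F_SETFL, flags | O_NONBLOCK) < 0 ? -1 : 0;
}

/*
 * Read from a nonblocking fd until no more data is available.
 * Returns the number of bytes consumed, or -1 on a real error.
 */
long drain_fd (int fd)
{
        char buf[4096];
        long total = 0;
        ssize_t n;

        for (;;) {
                n = read (fd, buf, sizeof (buf));
                if (n > 0) {
                        total += n;     /* a real program would process buf here */
                        continue;
                }
                if (n == 0)
                        break;          /* end of file */
                if (errno == EINTR)
                        continue;       /* interrupted by a signal; try again */
                if (errno == EAGAIN || errno == EWOULDBLOCK)
                        break;          /* fully drained; safe to wait again */
                return -1;
        }

        return total;
}
```

Draining fully matters with an edge-triggered watch because data left behind generates no further event; the next notification arrives only when new data does.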

The terminology comes from electrical engineering. A level-triggered interrupt is issued whenever a line is asserted. An edge-triggered interrupt is caused only during the rising or falling edge of the change in assertion. Level-triggered interrupts are useful when the state of the event (the asserted line) is of interest. Edge-triggered interrupts are useful when the event itself (the line being asserted) is of interest.