Thursday, October 30, 2008

In recent Linux kernels, especially 2.6.27, a number of system calls have changed, or new versions of existing system calls have been added, to allow more control over the file descriptors created by those system calls. (Most of this work has been done by Ulrich Drepper.) These changes have taken the form of either adding new bits to the flags bit-mask argument of an existing system call, if it had such an argument, or creating a new version of the system call that adds an extra flags argument. In most cases, two new flags have been added: a close-on-exec flag, and a non-blocking flag, which we describe shortly.

The changes are summarized in the table below. In this table, the Kernel column indicates the kernel version where the change occurred, and the Glibc column indicates the version of glibc that adds the corresponding wrapper functions and/or header file definitions. (Note: glibc 2.9 is not yet released.)

A proposed analogous change for accept(2) was debated and then spent some time in limbo. That proposal, paccept(), supported the flags SOCK_CLOEXEC and SOCK_NONBLOCK and took a signal mask argument that was treated like the one for pselect(2). It has recently re-emerged in a somewhat modified form, accept4() (which was in fact the original proposal), that will probably go into Linux 2.6.28 or 2.6.29.

Perhaps one day there might even be an analogous change for mq_notify(3), since (on Linux, but not on most other systems) a message queue descriptor is really just a file descriptor.

The close-on-exec flag (*_CLOEXEC)

The addition of a close-on-exec flag was the primary motivator for the system call changes. Specifying this flag causes the file descriptor created by the system call to automatically have its close-on-exec flag set. (This flag causes the file descriptor to automatically be closed if the process does a successful execve(2).)

Before this flag existed, the close-on-exec flag of a file descriptor could be changed only after the descriptor had been created, using the fcntl(2) F_GETFD and F_SETFD operations. The two additional system calls were less of a problem than the fact that setting the flag on a new file descriptor required multiple (non-atomic) steps. This created a race condition in multithreaded programs: one thread could be between creating a file descriptor and setting its close-on-exec flag at the same time as another thread was performing a fork() plus execve(), so that the descriptor leaked into the newly executed program. Ulrich Drepper explains the resulting security issues in more detail.

The non-blocking flag (*_NONBLOCK)

The *_NONBLOCK flag causes the non-blocking flag to be set on the open file description associated with the new file descriptor. (For a discussion of the relationship of a file descriptor to an open file description, see the open(2) man page.)

Unlike the *_CLOEXEC flag, the *_NONBLOCK flag exists merely as a convenience: it saves two system call operations (fcntl(2) F_GETFL and F_SETFL) if we want to immediately set the non-blocking flag when opening a file descriptor.
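As a rough sketch of the saving, the following C fragment creates a non-blocking socket pair both ways. It assumes socketpair(2) accepts SOCK_NONBLOCK in its type argument, as on Linux 2.6.27 and later; the function names are invented for this example.

```c
#include <fcntl.h>
#include <sys/socket.h>
#include <unistd.h>

/* Set O_NONBLOCK with the classic F_GETFL / F_SETFL pair. */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL);          /* fetch file status flags */
    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/* Old approach: one call to create the sockets, then two fcntl()
   operations per descriptor. */
int socketpair_nonblock_old(int sv[2])
{
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) == -1)
        return -1;
    if (set_nonblocking(sv[0]) == -1 || set_nonblocking(sv[1]) == -1) {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }
    return 0;
}

/* New approach: a single call does everything. */
int socketpair_nonblock_new(int sv[2])
{
    return socketpair(AF_UNIX, SOCK_STREAM | SOCK_NONBLOCK, 0, sv);
}

/* Hypothetical helper: 1 if O_NONBLOCK is set, 0 if not, -1 on error. */
int is_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL);
    if (flags == -1)
        return -1;
    return (flags & O_NONBLOCK) != 0;
}
```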

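Note that there is deliberately no *_NONBLOCK flag for dup3(2). Such a flag would not be sensible: the new file descriptor shares an open file description with the old file descriptor, and the non-blocking flag lives in that shared description, so setting it would also change the behavior of the old descriptor.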
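The sharing can be observed with a small C experiment, sketched below under the assumption that glibc exposes dup3() when _GNU_SOURCE is defined. The helper names are illustrative only.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE                  /* needed for the dup3() declaration */
#endif
#include <fcntl.h>
#include <unistd.h>

/* Duplicate oldfd onto newfd, with close-on-exec set on newfd only.
   FD_CLOEXEC is a per-descriptor flag, so oldfd is unaffected. */
int dup_onto_cloexec(int oldfd, int newfd)
{
    return dup3(oldfd, newfd, O_CLOEXEC);
}

/* O_NONBLOCK is a file status flag stored in the open file description,
   so setting it through one descriptor changes it for every duplicate. */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL);
    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/* Hypothetical helpers: 1 if the flag is set, 0 if not, -1 on error. */
int is_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL);
    if (flags == -1)
        return -1;
    return (flags & O_NONBLOCK) != 0;
}

int has_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);
    if (flags == -1)
        return -1;
    return (flags & FD_CLOEXEC) != 0;
}
```

After dup_onto_cloexec(), calling set_nonblocking() on the old descriptor makes is_nonblocking() report the flag on the new descriptor as well, while has_cloexec() differs between the two.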

The flags argument added for the new system calls allows for other kinds of functionality to be added to these system calls in the future.

Future standards?

Ulrich Drepper has already done some work on getting some of these interface changes into the POSIX.1-2008 standard, which includes specifications of the O_CLOEXEC flag for open() and the F_DUPFD_CLOEXEC operation for fcntl(). In the future, some of the other changes may also make their way into the standard.

A note on the new system call names

The numbers in the names of the new system calls refer to the number of arguments that each system call has. This is an extension of a convention that was used for some existing Unix system calls, notably dup2(2), wait3(2), and wait4(2). Note that while the wrapper function for signalfd(2) has three arguments, the underlying signalfd4() system call really does have four arguments, as described in the man page. (However, this suggests that, in the end, this naming scheme might not have been the best choice.)
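For illustration, here is how the three-argument glibc wrapper might be called. This is only a sketch: the flag names SFD_CLOEXEC and SFD_NONBLOCK are taken from the signalfd(2) man page rather than from the text above, and make_sigint_fd() is a made-up helper name.

```c
#include <fcntl.h>
#include <signal.h>
#include <sys/signalfd.h>
#include <unistd.h>

/* Create a signalfd descriptor for SIGINT with both new flags set.
   The wrapper takes three arguments: fd, mask, flags. (The underlying
   signalfd4() system call additionally takes the size of the mask.)
   A real program would also block SIGINT with sigprocmask(2) so that
   the signal is delivered through the descriptor. */
int make_sigint_fd(void)
{
    sigset_t mask;

    sigemptyset(&mask);
    sigaddset(&mask, SIGINT);

    /* fd == -1 asks for a new descriptor rather than modifying one. */
    return signalfd(-1, &mask, SFD_CLOEXEC | SFD_NONBLOCK);
}
```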

Documentation is added for a set of new and changed system calls (which will be the subject of a future post) that extend the functionality of existing system calls that work with file descriptors. (The changes occurred in kernel 2.6.27.) The new and modified system calls add flags that allow the close-on-exec file descriptor flag to be set on the new descriptor, and the non-blocking file status flag to be set on its open file description, at the time the file is opened. The modified pages are: