Monday, June 30, 2008

So I came to write the man pages for utimensat() and futimens(), interfaces that are specified in the upcoming POSIX.1 revision, and recently added to Linux (kernel 2.6.22, glibc 2.6), and it's a familiar story: bugs, and yet more bugs.

This case is a little worse than usual (the gory details are below), but hardly exceptional: it was a similar story with other recent APIs, such as splice(), signalf(), and the timerfd API (the last of which was ultimately redesigned after some pushing by me). In each case, an interface that was already released in a stable kernel turned out to have easy-to-find bugs when I came to document and test it. All that was required was to come up with something like a reasonable written specification (i.e., a man page), and then start testing against that specification.

The following seems (to me) like a reasonable development model for new kernel userland APIs:

Write a more or less complete design specification (preferably something like a man page, which I'm more than happy to edit and review at this stage in the process), and perhaps even code up an initial version of the interface.

Get review comments about the design; revise the design if necessary.

Implement something like a final version of the interface.

Test the interface. Preferably: test in collaboration with other people who had nothing to do with the implementation (they will think about the interface in different ways from the implementer). If it's sensible to do so (and usually it is), produce a test suite to check the correctness of as many operational cases of the interface as possible (and then send that suite to the Linux Test Project).

Write the interface documentation (i.e., a man page) for userland programmers, and send it to me for editing and review. (If you do this smart, then you can just recycle the work from the first step.)

Get the interface accepted into mainline.

However, as far as I can see, the actual development model for kernel-userland interfaces seems to work something like this:

Code up the new interface.

Post the code to LKML, explaining that "I tested it, and it works", and, maybe, include some brief documentation of the interface.

Often, someone looks at the code and suggests some fixes. Usually, no one else does any testing.

Code gets accepted into mainline.

Usually, some few dot releases later: someone (often me) who had nothing to do with the earlier steps writes the man page.

Often, some few dot releases later: find and fix all of the bugs (e.g., in the case of utimensat(), released for 2.6.22, the fixes will only make it into 2.6.26 at the earliest).

Occasionally, some few dot releases later: realize that the interface could have been better designed, but it's too late now to change it, because we can't break the ABI.

Userland programmers end up dealing with kernel bugs they shouldn't have to deal with. (When this happens often, it can damage the reputation of the Linux kernel-userland interface, and userland programmers may become wary of adopting new interfaces, which can further delay any kind of real-world testing and adoption of the interface.)

Code that uses new interfaces may have to special case for ABI bugs that were present in the first few kernel releases that contained the bug.

Similarly, many kernel-userland interfaces could do with more design review as well. This can hurt even more than interfaces released with bugs, since, once an interface is released, it is at least difficult, but often impossible to modify the design, since doing so would break the kernel-userland ABI. (With timerfd, we got lucky: the interface was in any case broken by a bug, which gave a more or less blank slate for a redesign.) As a result, userland programmers usually just have to live with the bad design.

I'll have more time for testing and review nowadays, and hopefully I'll catch more problems before they are released into stable kernels, but there's certainly more than one person can do. And, as described above, there is a culture problem when it comes to adding kernel-userland interfaces on Linux. The question is whether the culture can be changed. (And here its worth repeating a point that I often make: one could perhaps try and blame individual developers for buggy interfaces, but that doesn't really get to the core issue (everyone is going to write bugs): the real problem is the process by which kernel-userland interface changes are accepted into the kernel.)

The Gory Details
For those who are interested, here are the problems I found with utimensat() and futimens(), explained by annotating the current draft of the man page.

This is the first of the advantages of the new interfaces: greater precision for setting timestamps. (The next revision of POSIX.1 specifies nanosecond file timestamps, and extends stat(2) to allow their retrieval. File system support is needed, and can't necessarily be retrofitted to some older file systems. Currently, Linux supports nanosecond timestamps on XFS, JFS, and ext4.)

With utimensat() the file is specified via the pathname given in pathname. With futimens() the file whose times- tamps are to be updated is specified via an open file descriptor, fd. For both calls, the new file timestamps are specified in the array times: times[0] specifies the new "last access time" (atime); times[1] specifies the new "last modifica- tion time" (mtime). Each of the elements of times speci- fies a time in seconds and nanoseconds since the Epoch (00:00:00, 1 Jan 1970, UTC), in a structure of the fol- lowing form: struct timespec { time_t tv_sec; /* seconds */ long tv_nsec; /* nanoseconds */ }; If the tv_nsec field of one of the timespec structures has the special value UTIME_NOW, then the corresponding file timestamp is set to the current time. If thetv_nsec field of one of the timespec structures has the special value UTIME_OMIT, then the corresponding file timestamp is left unchanged. In both of these cases, the value of the corresponding tv_sec field is ignored.

These are the other advantages of the new interfaces.

With utime(2) and utimes(2), to change just one of the timestamps, we must make a call to stat(2) to retrieve the current timestamps, use one of the timestamps to initialize the times element that we don't want to change, and then call utimensat() with the desired value for the other timestamp. This can be subject to race conditions: if another process updates the file timestamps between the two calls, then that update will be lost. UTIME_OMIT allows us to avoid this problem.

UTIME_NOW exists mainly as a convenience: it allows us to avoid fetching the current time (using gettimeofday(2) or similar) in order to set a file timestamp to "now".

If times is NULL, then both timestamps are set to the current time.Permissions requirements To set both file timestamps to the current time (i.e.,times is NULL, or both tv_nsec fields specify UTIME_NOW), either: 1. the caller must have write access to the file; 2. the caller's effective user ID must match the owner of the file; or 3. the caller must have appropriate privileges. To make any change other than setting both timestamps to the current time (i.e., times is not NULL, and bothtv_nsec fields are not UTIME_NOW and both tv_nsec fields are not UTIME_OMIT), either condition 2 or 3 above must apply. If both tv_nsec fields are specified as UTIME_OMIT, then no file ownership or permission checks are performed, and the file timestamps are not modified, but other error conditions may still be detected.utimensat() specifics [Details of dirfd and flags arguments omitted.]

RETURN VALUE On success, utimensat() and futimens() return 0. On error, -1 is returned and errno is set to indicate the error.ERRORS [Various other errors that are irrelevant to this post have been omitted.] EACCES times is NULL, or both tv_nsec values are UTIME_NOW, but the effective effective ID of the caller does not match the owner of the file, the caller does not have write access to the file, and the caller is not privileged (Linux: does not have either the CAP_FOWNER or the CAP_DAC_OVERRIDE capability). EACCES times is NULL, or both tv_nsec values are UTIME_NOW, and the file is marked immutable. EPERM The caller attempted to change one or both times- tamps to a value other than the current time, or to change one of the timestamps to the current time while leaving the other timestamp unchanged, (i.e., times is not NULL, both tv_nsec fields are not UTIME_NOW, and both tv_nsec fields are not UTIME_OMIT) but the caller's effective user ID does not match the owner of file, and the caller is not privileged (Linux: does not have the CAP_FOWNER capability). EPERM times is not NULL, and the file is marked append- only or immutable.NOTES On Linux, futimens() is a library function implemented on top of the utimensat() system call. To support this, the Linux utimensat() system call implements a non-standard feature: if pathname is NULL, then the system call modi- fies the timestamps of the open file referred to by the file descriptor dirfd (which may refer to any type of file). Using this feature, the call futimens(fd, times) is implemented as:utimensat(fd, NULL, times, 0);BUGS Several bugs afflict utimensat() and futimens() on ker- nels before 2.6.??.

Really, there should have been a test suite to go along with the initial implementation. There wasn't, unfortunately, so I wrote one (later revised a little before it went to LTP) whose results can be seen here.

I've posted patches to fix all of the bugs, and Andrew Morton has accepted them into -mm, and from there, they seem to have gone upstream to Al Viro. It's just a question of when they will get pushed into mainline. It'd be nice if they make it into 2.6.26, but maybe they won't, since it is already getting very late in the -rc cycle. (Update, 6 Jul 08: the patches have gone into -rc9.)

These bugs are either non-confor- mances with the POSIX.1 draft specification or inconsis- tencies with historical Linux behavior. * POSIX.1 specifies that if one of the tv_nsec fields has the value UTIME_NOW or UTIME_OMIT, then the value of the corresponding tv_sec field should be ignored. Instead, the value of the tv_sec field is required to be 0 (or the error EINVAL results).

This was a simple and obvious divergence from the specification.

* Various bugs mean that for the purposes of permission checking, the case where both tv_nsec fields are set to UTIME_NOW isn't always treated the same as specifyingtimes as NULL, and the case where one tv_nsec value is UTIME_NOW and the other is UTIME_OMIT isn't treated the same as specifying times as a pointer to a structure containing arbitrary time values. As a result, in some cases: a) file timestamps can be updated by a process that shouldn't have permission to perform updates; b) file timestamps can't be updated by a process that should have permission to perform updates; and c) the wrong errno value is returned in case of an error.

There are multiple error cases here:

If one of the tv_nsec fields is UTIME_OMIT and the other is UTIME_NOW, then the error EPERM should occur if the process's effective user ID does not match the file owner and the process is not privileged. Instead, the call successfully changes one of the timestamps.

If the file is not writable by the effective user ID of the process and the process's effective user ID does not match the file owner and the process is not privileged, and times is NULL, then the error EACCES results. This error should also occur if times points to a structure in which both tv_nsec fields are UTIME_NOW. Instead the call succeeds.

If a file is marked as append-only (see chattr(1)), then Linux traditionally (i.e., utime(2), utimes(2)), permits a NULLtimes argument to be used in order to update both timestamps to the current time. For consistency, utimensat() and futimens() should also produce the same result when given a times argument that points to a structure in which both tv_nsec fields are UTIME_NOW. Instead, the call fails with the error EPERM.

If a file is marked as immutable (see chattr(1)), then Linux traditionally (i.e., utime(2), utimes(2)), gives an EACCES error if times is NULL. For consistency, utimensat() and futimens() should also produce the same result when given a times that points to a structure in which both tv_nsec fields are UTIME_NOW. Instead, the call fails with the error EPERM.

* POSIX.1 says that a process that has write access to the file can make a call with times as NULL, or withtimes pointing to a structure in which both tv_nsec fields are UTIME_NOW, in order to update the both timestamps to the current time. However, futimens() instead checks whether the access mode of the file descriptor allows writing.

This means that a process with a file descriptor that allows writing could change the timestamps of a file for which it does not have write permission; conversely, a process with a read-only file descriptor won't be able to update the timestamps of a file, even if it has write permission on the file.

Tuesday, June 17, 2008

The POSIX man pages (man pages sections 0p, 1p, 3p) are now split out into a separate package, which can be downloaded here. This makes sense because the POSIX pages are logically separate: they are copies of specifications from the POSIX.1 standard (to be precise, we are currently redistributing pages from the 2003 Technical Corrigendum 1), that we have kindly been given permission to publish by the IEEE and The Open Group. Furthermore, the POSIX pages are only updated rarely (e.g., occasional formatting fixes, and periodic updates when a new version of the standard is released, such as the revision due to appear later this year), so there is no need to redistribute them with each release of man-pages.

After Stuart Brady noted that the quotes around single-quoted characters were not formatted well in UTF-8 output, I've globally replaced the characters used in the source files for quoting in man-pages in order to try and produce better formatted output for both ASCII and UTF-8 xterms, and also to bring greater consistency in general to the use of single and double quotes in the pages. This may not be the final word on the subject, but at least things should be somewhat improved.

Monday, June 9, 2008

Last month I started on what is, for the moment, my dream job: man-pages finally has a paid, full-time maintainer, thanks to a fellowship from the Linux Foundation. For the foreseeable future, that means I'll be working on:

Documenting every new Linux kernel-userland (and glibc) API, and every API change, that is released into the mainline kernel, ideally before actual release. (That's the ideal, but there's a quite a backlog, so I'm not going to achieve the ideal immediately.)

Testing new APIs, again ideally before they are released into the mainline kernel, and probably doing some light bug fixing while I'm at it.

Other than that, I'll be helping out with in-kernel documentation (the /Documentation directory) by providing editorial, review, and possibly other assistance.

When I find time, I'll try to post a longer job description (but have a look here in the meantime). In the meantime, I'm looking forward to going to the Linux Plumbers Conference, September 17-19, in Portland, OR, USA, where there is currently a planned track on kernel-userland APIs.

Friday, June 6, 2008

The Linux man-pages maintainer, Michael Kerrisk, and the maintainer of the French translation of man-pages, Alain Portal, proudly announce the simultaneous release of man-pages-2.80 in English original and French translation.