I have come up with a plan for switching to read-write vnode locks. I
believe the general concept has already been discussed in various
places, but the advantages of doing so are, roughly:
* It will give us better parallelism with SMP and (to a smaller
extent) preemptive scheduling on a single processoer.
* It will fix some existing vnode locking problems (which there are
PRs about).
* It will allow us (if done as proposed below) to get rid of almost
all of the special cases to avoid double-locking with `.', thereby
reducing code size substantially.
To facilitate this discussion, I will define two classes of operations
that affect vnodes:
* File operations. These affect only the `leaf' vnode. This class
includes things like stat(2), utimes(2) and truncate(2).
* Directory operations. These affect the directory containing the
target, and possibly the target as well. This class includes calls
like mkdir(2), rmdir(2), creat(3), rename(2), link(2) and unlink(2).
VOP_LOOKUP() shall be defined to take a directory vnode locked for
read (i.e. shared). If successful, it shall leave the parent locked
for read, and return the child vnode also locked for read. If
unsuccessful, it shall leave the parent locked as before.
Likewise, if successful, namei() shall leave both the parent and the
child locked for read. If unsuccessful, namei() shall unlock all
locks it obtained. The LOCKPARENT, WANTPARENT and LOCKLEAF flags
shall be eliminated; it shall be the responsibility of individual
operations to unlock the parent and/or leaf if they so desire before
continuing. File operations shall always do so; directory operations
shall never do so. All vnodes passed down from the high-level routine
(e.g. sys_mkdir(2)) to the corresponding file system routine
(e.g. VOP_MKDIR()) shall be locked for read.
Justification: Since we're using shared locks here, it doesn't matter
whether we unlock the parent in namei() or not; for uniformity, we
leave it up to the caller to do so. This does not cause us to do any
more (un)locking operations than we otherwise would have. The
restriction on arguments to VOP_*() routines is also for uniformity,
and to allow the file system to do additional tests (possibly such as
the subdirectory test for the directory-move case) before upgrading
any locks.
The locking order shall be defined by the five permissible locking
states:
* parent read, leaf read
* parent write, leaf read
* parent write, leaf write
* parent unlocked, leaf read
* parent unlocked, leaf write
In other words, the leaf shall not be upgraded to a write lock while a
read lock is held on the parent. (The two `child unlocked' cases
don't make sense.)
To enforce this locking order, and to prevent locking against oneself,
it is necessary for all directory operations to test for the `.' and
`..' cases. This can be (and should already be) done in the
individual operations similarly to how it is currently done in
VOP_LOOKUP() (e.g. with an expression like `vp == dvp || (flags &
ISDOTDOT) != 0'. All directory operations should fail for `.' and
`..' anyway.
Justification: This is a simple extension of the current locking
order. Since we're using shared locks, and it's permissible to hold
more than one in the same process, there's no reason for VOP_LOOKUP()
to know about `.' and `..' at all; the interface will, in this sense,
become `pure'. Since namei() will already return `.' or `..' to
directory operations, they already need to check for these cases; but
it's not clear that they all do it consistently, and this should be
audited. (E.g., it looks like sys_rmdir() ignores the `..' case and
relies on the link count check in the file system catching it. We
have to catch it early enough that we don't try to upgrade the parent
lock and end up locking in the wrong order.)
This will require substantial, but fairly straightforward, changes to
the file system code. It should, ultimately, eliminate a fair amount
of code, and fix a number of existing difficulties.
Comments, volunteers?