What's this?

In short: currently (i.e., prior to applying this patch), Linux has
capabilities, but they are (deliberately) crippled and thus essentially
useless, because nobody could agree on coherent semantics for them.
This patch uncripples them and attempts to give them reasonable
semantics that will, hopefully, break neither legacy Unix programs nor
those that use the current capabilities system (essentially, Bind9 and
NTP). Basically, capabilities are currently useless because they are
never inheritable (that is, preserved across execve()), and
this patch makes them so (but carefully enough so as not to confuse
existing programs). Furthermore, whereas the current Linux
capabilities are only “additional” capabilities (meaning
that normal, non-root processes have none, and adding capabilities
leads up to root), the patch also suggests (and, to some
extent, implements) a new bunch of
“regular” capabilities, which are present on all normal
processes and can be removed so as to provide some measure of
fault-containment for partially untrusted or potentially buggy
programs (thus, these new capabilities can be said to lead
down).

Note: Although I believe that this patch will not break anything, it
has so far received little testing and should be considered alpha
quality. It should on no account be applied on security-critical
systems or on a system where local users are not to be trusted: the
security implications are quite complex and I could quite possibly be
wrong in thinking that it doesn't open any local root hole.

Why is it abandoned?

This patch has been abandoned due to heavy criticism on the
linux-kernel mailing list: essentially because it abandoned
POSIX semantics, because
it made capabilities inheritable by default, which some people
do not want, and because it used the capabilities model (designed for
overprivileged processes) to also model underprivileged processes,
contrary to what was intended. So it was obvious that the patch could
never gain sufficient acceptance to be included in the kernel.
Rather than pursuing it independently, I am trying a more consensual
approach: I am splitting the changes in two completely independent
parts:

a very small and very
simple patch which adds an inhcaps mount option to
make capabilities inheritable by default on that filesystem (while
otherwise retaining the POSIX.1e semantics),
and

a new approach, cuppabilities
(not capabilities), at handling underprivileged
processes.

Where can I get it?

The present version (0.4.4) is to be applied against Linux version
2.6.18-rc6, although it should not be too picky about that. (The
possibility of serving a git tree is
being considered.)

What are capabilities?

Traditional Unix semantics know
only two levels of privileges: root and non-root.
Root processes are able to bypass essentially all security checks
(mandatory access controls) in the kernel, whereas non-root processes
are subject to all of them. There is no intermediate situation. This
all-or-nothing solution has the merit of simplicity, but it also means
that a program that requires any level of privileges must be made
suid root, making it a privileged
target for attack and thus dangerous. What capabilities do is split
the single “root” privilege into thirty-odd mostly
independent bits, so that programs requiring special privileges can be
given just those required and not full root privileges.

For example, the NTP
daemon needs only the CAP_SYS_TIME capability, in order
to set and skew the system clock: so a capability-aware version of it
starts as root (so with all capabilities) but drops all the
unnecessary ones early in the code.
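
As a rough illustration of the idea (not code from the patch), capability sets can be modelled as bitmasks. The Python sketch below is a toy model: the helper names cap_mask and drop_all_but are made up for this example, the capability numbers are the ones from linux/capability.h, and treating the full pre-patch set as 32 one-bits is a simplification.

```python
# Toy model only: capability sets represented as integer bitmasks.
CAP_CHOWN = 0       # numbers as defined in <linux/capability.h>
CAP_SYS_TIME = 25

# Simplification: treat all 32 pre-patch bits as defined capabilities.
FULL_SET = (1 << 32) - 1


def cap_mask(*caps):
    """Return the bitmask with exactly the given capability numbers set."""
    mask = 0
    for cap in caps:
        mask |= 1 << cap
    return mask


def drop_all_but(current, *keep):
    """Keep only the listed capabilities out of the current set."""
    return current & cap_mask(*keep)


# An NTP-like daemon starts with everything and keeps only CAP_SYS_TIME.
ntp_caps = drop_all_but(FULL_SET, CAP_SYS_TIME)
```

After the drop, the daemon can still adjust the clock but a compromise no longer yields, say, CAP_CHOWN.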

Furthermore, the patch discussed here adds a new bunch of capabilities which are
present in normal (non-root) processes and which can be removed to
give a process even lower abilities: for example, all daemons could be
run without the CAP_REG_SXID capability, thus making them
incapable of elevating privileges by executing s[ug]id
executables, so even if the daemon is compromised, the attacker could
less easily exploit possible local root holes on the attacked machine
(this could offer some measure of protection when running under chroot
is not feasible).

How are Linux capabilities currently crippled?

Currently (i.e., prior to applying this patch), Linux has a notion
of capabilities. However, it is almost entirely useless and,
therefore, almost entirely unused. Roughly speaking, all root
processes have all capabilities and all non-root processes have none:
whenever an executable is execve()d as root, it gains all
capabilities, and whenever it is execve()d as non-root, it
loses all. Thus, there is no way to export capabilities from one
program to another. Basically the only thing one can do with them is
for a daemon (e.g., Bind9) to
start as root, drop some (but not all) capabilities and switch to a
different uid (with a special,
prctl(PR_SET_KEEPCAPS,1,…), request to maintain
capabilities across setuid()): better than no knowledge
of capabilities at all but, still, not very useful. One cannot run a
given program with restricted capabilities except by patching the
program's code (that is, it must be made capability-aware, and very
few programs are): there is simply no way to restrict capabilities in
one program and from that point execute another (because all
capabilities will be lost on execve()).
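
The "roughly speaking" rule above can be written down in a couple of lines. This is only a caricature of the pre-patch behavior as described in this section, in Python, with a made-up function name (legacy_execve); it is not kernel code.

```python
FULL = (1 << 32) - 1   # every pre-patch capability bit
EMPTY = 0


def legacy_execve(caps_before, euid):
    """Rough pre-patch rule: the caller's capabilities are ignored.

    Only the (effective) uid of the new program matters.
    """
    return FULL if euid == 0 else EMPTY


# A root process that carefully restricted itself to one capability...
restricted = 1 << 25
# ...gets everything back when it executes another program as root,
after_exec_as_root = legacy_execve(restricted, euid=0)
# ...and keeps nothing when the new program runs as non-root.
after_exec_as_user = legacy_execve(restricted, euid=1000)
```

Either way the restriction chosen by the caller does not survive, which is why a wrapper that restricts capabilities and then executes another program is impossible pre-patch.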

Furthermore, Linux entirely disables one of its capabilities,
CAP_SETPCAP (which would have permitted transferring
capabilities from one process to another to some extent), because it
was incorrectly thought
to be responsible for a past sendmail-related exploit. There's really
no reason to disable this (useful) capability, and doing so further
cripples the already deficient Linux caps system.

What does this patch do, in more detail?

Most importantly, this patch makes capabilities inheritable: i.e.,
when a process execve()s another executable, capabilities
will be kept (even in the absence of filesystem support for capabilities);
well, it's not really that simple, because we have to make sure not to
break anything, but that's the gist of the idea: the detailed
semantics are described below.
The patch also restores the CAP_SETPCAP capability which
was removed for no real reason.

Furthermore, the patch adds a new bunch of capabilities: presently the
Linux capability sets are 32 bits wide, with normal non-root processes
having 0 bits everywhere, and this patch makes them
64 bits wide, with normal non-root processes having sixteen 1's in
the (new) upper half (normal root processes have 1's
everywhere, of course). Moving from 32-bit to 64-bit wide capability sets
means that the kernel-level interface changes; however, so as not to
break the (very few) programs and libraries that currently use
capset() and capget(), the kernel checks
the magic version number and will, if necessary, reply with the former
interface.
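
To make the layout concrete, here is a small Python sketch of the 64-bit sets. The positions of the regular bits (32 through 47) are those given elsewhere in this document; the function names are invented, and the idea that a caller of the old interface simply sees the low 32 bits is an assumption made for illustration, not something the patch is stated to do.

```python
ADDITIONAL_LOW = (1 << 32) - 1       # bits 0-31: the pre-patch capabilities
REGULAR = ((1 << 16) - 1) << 32      # bits 32-47: "regular" capabilities
FULL_64 = (1 << 64) - 1              # every bit set


def initial_caps(is_root):
    """Capability set of a normal process under the patch."""
    return FULL_64 if is_root else REGULAR


def legacy_view(caps64):
    """ASSUMPTION: an old-interface caller sees only the low 32 bits."""
    return caps64 & ADDITIONAL_LOW
```

Under that assumption an old program looking at a normal non-root process still sees an empty set, exactly as before the patch.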

The patch adds a number of such “regular” (a better
name would be welcome…) capabilities (most important among them
is CAP_REG_SXID, which controls a process's ability to
execute suid programs): but they are intended mostly as a
proof-of-concept and it is quite possible that they will be changed in
the future.

Finally, version 0.4.2 of the patch also adds filesystem support
for capabilities (through extended attributes): this is a merge of a patch provided by Serge
E. Hallyn, who is in no way to blame for my mischief.

The patch is also available in split
form: part 1 introduces 64-bit wide capability sets,
part 2 introduces the new inheritance rules, part 3
introduces the new (regular) capabilities, and part 4 (almost
entirely Serge's work) introduces the filesystem support.

What are the permitted, effective and inheritable capability sets (for a process)?

Each process (or, more accurately, each task) has, at all times,
not one but three sets of capabilities: they are called the
permitted, effective and inheritable
capability sets. Each capability can be present or absent in each of
the sets.

The effective set is the one which is actually used to
check permissions when making system calls that require capabilities.
For example, a process needs to have the CAP_CHOWN
capability in its effective set in order to execute the
chown() system call.

The permitted set is the set of capabilities to which
the process has access, at most. The effective set is, at all times,
a subset of the permitted set: when a given capability is present in
the permitted set, the process may, at will, add it to or remove it from
its effective set. Once a capability is removed from the permitted
set, however, it cannot be regained except by executing a
suid executable or by having another process use
CAP_SETPCAP. (This is quite similar to the effective and
real/saved uid's in the traditional Unix approach.)

The inheritable set is also a subset of the permitted
set, and corresponds to capabilities which will be passed across
execve() (note that fork() does not affect
capabilities in any way: both the parent and child processes receive
the same capability sets as before the fork()). However,
the fine print is a bit more complicated
(and, in any case, in the present, pre-patch, situation, capabilities
are simply not inherited).
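
The relationships between the three sets can be sketched as a tiny Python class. This is a model for explanation only: the class and method names are hypothetical and nothing here corresponds to actual kernel structures.

```python
class TaskCaps:
    """Toy model of a task's permitted, effective and inheritable sets."""

    def __init__(self, permitted, effective=0, inheritable=0):
        # Both other sets must be subsets of the permitted set.
        if (effective & ~permitted) or (inheritable & ~permitted):
            raise ValueError("effective/inheritable must be subsets of permitted")
        self.permitted = permitted
        self.effective = effective
        self.inheritable = inheritable

    def raise_effective(self, caps):
        """Make permitted capabilities effective; anything else is refused."""
        if caps & ~self.permitted:
            raise PermissionError("capability not in permitted set")
        self.effective |= caps

    def lower_effective(self, caps):
        """Temporarily give up capabilities (they stay permitted)."""
        self.effective &= ~caps

    def drop_permitted(self, caps):
        """Give up capabilities for good: remove them from every set."""
        self.permitted &= ~caps
        self.effective &= ~caps
        self.inheritable &= ~caps

    def may(self, cap):
        """Permission checks consult the effective set only."""
        return bool((self.effective >> cap) & 1)
```

Lowering and re-raising an effective bit is always possible while the bit stays permitted, much like switching between effective and saved uids; dropping it from the permitted set is one-way.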

A previous Linux capabilities patch which
I had written increased the number of sets to four, adding a
bounding set to the story. This was not met with much
enthusiasm and this functionality is now essentially replaced by the
CAP_REG_SXID capability.

What are the permitted, effective and inheritable capability sets (for an executable file)?

The executable's inheritable set corresponds to the set of
capabilities it is willing to receive upon execve(); the
executable's permitted set (a decidedly bad terminology! forced
would be much better, but I am told it is deprecated) are capabilities
which are automatically added upon execve(), whether the
process possessed them or not (thus, this is similar to the
traditional Unix suid mechanism); lastly, the
executable's effective set indicates which capabilities should be
initially made effective. (Unlike for processes, there is no
reason for an executable file's inheritable set to be a subset of the
permitted set; in fact, quite the contrary: inheritable bits are
interesting only when they are not in the permitted set. Sorry,
I'm not the one to blame for the confusion.)

Any executable in the absence of filesystem support for capabilities, or
any executable file which is not specially marked, is considered as
though it had every bit set in the inheritable and effective sets and
none in the permitted set. There are two exceptions. If it is suid
root, it also has a full permitted set (so it will gain
all capabilities upon execve()). If (in version 0.4.4 of
the patch) it is suid anything else or
sgid, all capability sets are equal to the
set of “regular” capabilities (so as to provide a
sanitized environment), except where this would break normal
Unix rules (for example, exec of a suid non-root or
sgid program from real-uid=0 should only
restrict the effective set; I'm afraid it's quite a mess).

There is also a (system-wide) capability bounding set, which
controls which capabilities can actually be gained upon
execve(): thus it can be used to permanently disable a
certain capability (for all future processes).

What about filesystem support for capabilities? Does this patch add it?

Version 0.4.2 of the patch adds (optional) filesystem support for
capabilities (but only for the low-order part
of capabilities, i.e., those 32 bits which existed before the patch):
it is controlled by an extended attribute named
security.capability (the format is as follows: the
attribute must contain four 32-bit words in little-endian format, the
first containing the version number 0x19980330 and the
next three containing the effective, permitted and inheritable sets, in
this order). As previously explained, this is the merge of a patch provided by Serge
E. Hallyn (though a few adaptations have been made, such as
making the CAP_REG_SXID capability and
nosuid mount option defeat the effective set in the
executable's capabilities). Version 0.3.1 has no filesystem
support.
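
The attribute layout just described is simple enough to build and parse by hand. The following Python sketch packs and unpacks a value in that layout; the function names are made up, and the layout is the one stated in this document, which should not be assumed to match what other kernels or tools store under the same attribute name.

```python
import struct

CAP_XATTR_NAME = "security.capability"
CAP_XATTR_VERSION = 0x19980330
_LAYOUT = "<4I"   # four 32-bit words, little-endian


def pack_file_caps(effective, permitted, inheritable):
    """Build the attribute value: version, effective, permitted, inheritable."""
    return struct.pack(_LAYOUT, CAP_XATTR_VERSION,
                       effective, permitted, inheritable)


def unpack_file_caps(blob):
    """Parse an attribute value, returning (effective, permitted, inheritable)."""
    if len(blob) != struct.calcsize(_LAYOUT):
        raise ValueError("unexpected attribute length")
    version, effective, permitted, inheritable = struct.unpack(_LAYOUT, blob)
    if version != CAP_XATTR_VERSION:
        raise ValueError("unexpected version word")
    return effective, permitted, inheritable
```

Only the low 32 bits of each set fit in this format, which matches the statement above that filesystem support covers only the pre-patch capabilities.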

With the present patch but without filesystem support, or for files
which are unmarked, executables are assumed to have a full set of inheritable (=allowed) and effective
capabilities (meaning that they will receive all inheritable and
effective capabilities from their parent: this is necessary so as not
to break Unix semantics) and an empty set of permitted (=forced)
capabilities, except when they are suid root, in which
case all sets contain all capabilities.

What are the semantics this patch creates for capabilities?

As explained above, each task has three
sets of capabilities, the permitted, effective and inheritable sets.
We must now describe how these sets are changed or consulted upon
certain system calls.

Whenever a permission needs to be checked, the effective
set is consulted. This is the standard Linux behavior, and I do not
change this.

When a process fork()s, its capability sets are not
modified: both the parent and child processes receive the same
capability sets as before the fork(). This is also
unchanged by the patch.

In order to set capabilities for a target task, the
following checks are observed. First, the target must be the same
task as the caller, or the caller must possess the
CAP_SETPCAP capability (in its effective set). Second,
the newly raised bits in the inheritable and permitted sets (of the
target) must be part of the current permitted set of the caller.
Third, the (new) effective and inheritable sets (of the target) must
remain subsets of the (new) permitted
set (of the target). All of this is current Linux code, unchanged by
the patch, except for the requirement that the inheritable set
be a subset of the permitted set: its absence may have been an oversight or
perhaps a different interpretation of what the inheritable set means,
so I found it cleaner to enforce the constraint by intersecting the
requested inheritable set with the new permitted set.
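
The three checks can be put into a short Python model. This is a sketch of the rules as stated in the paragraph above, not a transcription of the kernel code; the dictionary representation and the function name check_capset are choices made for this example.

```python
CAP_SETPCAP = 8   # number as defined in <linux/capability.h>


def check_capset(caller, target, requested, same_task):
    """Return the target's new sets, or raise PermissionError.

    caller, target and requested are dicts with bitmask entries
    'per', 'eff' and 'inh'.
    """
    # 1. Changing another task requires CAP_SETPCAP in the effective set.
    if not same_task and not (caller["eff"] >> CAP_SETPCAP) & 1:
        raise PermissionError("CAP_SETPCAP required")
    # 2. Newly raised permitted/inheritable bits must come from the
    #    caller's current permitted set.
    for name in ("per", "inh"):
        raised = requested[name] & ~target[name]
        if raised & ~caller["per"]:
            raise PermissionError("cannot raise bits the caller lacks")
    # 3. The new effective set must be a subset of the new permitted set;
    #    the inheritable set is intersected with it (the patch's addition).
    if requested["eff"] & ~requested["per"]:
        raise PermissionError("effective must be a subset of permitted")
    return {
        "per": requested["per"],
        "eff": requested["eff"],
        "inh": requested["inh"] & requested["per"],
    }
```

Note that an over-wide inheritable request is silently trimmed rather than rejected, which is the design choice described in the last sentence above.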

Then we have compatibility rules for set*uid(): the
reason for this is that legacy Unix programs gain or lose privileges
by using the seteuid(), setuid() and cousin
functions, so we must emulate them with capabilities and make sure
they have the same behavior. This is how we do it, when a program
does not explicitly request (using
prctl(PR_SET_KEEPCAPS,1,…)) to keep capabilities
upon set*uid(): when all three of the real,
effective and saved uid's are set to non-zero (meaning
the program wishes to permanently abandon its root privileges), all
three capability sets are cleared of their additional (system) parts
(all but bits 32–47); when the effective uid is
set to non-zero, only the effective set of capabilities is thus
affected, and when the effective uid is reset to zero,
the effective set is raised to the full permitted set. This is,
essentially, what the current Linux code does (except for the
inheritable set). When the program did explicitly request (using
prctl(PR_SET_KEEPCAPS,1,…)) to keep capabilities
upon set*uid(), then nothing is altered, except that the
inheritable set is cleared of additional (system)
capabilities (so as to avoid surprising programs which
expected capabilities not to be inherited): perhaps even this behavior
could be suppressed using
prctl(PR_SET_KEEPCAPS,2,…) or something similar.
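
A small Python model may help in following these compatibility rules. It implements the rules literally as worded above, which is a simplification: in particular it looks only at the new uids (plus the old effective uid) and does not check whether any uid was zero beforehand. The function name and data layout are invented for the example.

```python
REGULAR = 0xFFFF << 32   # bits 32-47, kept by every normal process
FULL = (1 << 64) - 1


def after_setuid(caps, old_euid, new_uids, keepcaps=False):
    """Model of the set*uid() compatibility rules.

    caps is a dict with 'per', 'eff', 'inh'; new_uids is the tuple
    (real, effective, saved) after the call.
    """
    per, eff, inh = caps["per"], caps["eff"], caps["inh"]
    new_euid = new_uids[1]
    if keepcaps:
        # Only the inheritable set loses its additional (system) bits.
        inh &= REGULAR
    elif all(uid != 0 for uid in new_uids):
        # Root privileges abandoned for good: clear all additional bits.
        per &= REGULAR
        eff &= REGULAR
        inh &= REGULAR
    elif new_euid != 0:
        # Temporary drop: only the effective set is affected.
        eff &= REGULAR
    elif old_euid != 0:
        # Effective uid back to zero: effective raised to permitted.
        eff = per
    return {"per": per, "eff": eff, "inh": inh}
```

For instance, a root process calling seteuid(1000) and later seteuid(0) loses and then regains its effective capabilities, while its permitted set is untouched throughout.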

Finally, we must describe how the three capability sets are
affected by execve(). Recall that there are also three capability sets associated with
an executable file. Let us call P(per), P(eff)
and P(inh) the permitted, effective and inheritable sets
for the task before execve(), P′(per),
P′(eff) and P′(inh) the
corresponding sets after execve(), and F(per),
F(eff) and F(inh) the permitted, effective and
inheritable sets for an executable file. Finally, call bnd
the system-wide capability bounding set. Then the rules enforced by
the patch are as follows:

P′(per) ← (P(inh) ∩ F(inh)) ∪ (F(per) ∩ bnd)

P′(eff) ← (P(inh) ∩ P(eff) ∩ F(inh) ∩ F(eff)) ∪ (F(per) ∩ F(eff) ∩ bnd)

P′(inh) ← P′(per)

The first rule is exactly the one documented in the
capabilities(7) manual page. The other two differ
slightly, but this is demonstrably unavoidable if we are not to break
traditional Unix semantics (the documented rule for the effective set
is
P′(eff) ← P′(per) ∩
F(eff) ≡ (P(inh) ∩ F(eff)
∩ F(inh)) ∪ (F(per) ∩
F(eff) ∩ bnd), but this implies that
P′(eff) does not depend on P(eff), thus
breaking the traditional Unix semantics that all of uid,
euid and suid are preserved upon
execve(); similarly, the documented rule for the
inheritable set, viz.,
P′(inh) ← P(inh), means that
if an executed suid program itself executes something
else, its privileges would be lost). To justify why the proposed
rules are intuitive, consider this: the first part of the expression
for P′(per) or P′(eff) represents
the capabilities inherited by the exec'ed program (thus, it
should be formed by combining those capabilities which the process had
before exec and was willing to pass on, and those which the file is
willing to inherit), and the second part represents the capabilities
provoked by the exec, and is determined solely by file
capabilities (the difference between the rule we use for the effective
set and the one documented in capabilities(7) should not
be a cause for alarm: the security-critical part is the one which
concerns the forced bits, i.e., the second part of the expression and,
for that, it is identical; in any case, no program can presently rely
on the documented behavior since it is not at all implemented!). As
for the rule on the inheritable set, it is quite intuitive (unless
they act otherwise, processes will propagate all their capabilities
rather than merely those they themselves received in that way) and
conforming to the Unix legacy behavior.

Now in the absence of filesystem
support for capabilities, we must examine what happens for
(a) a non-suid-root executable file
(F(inh) and F(eff) are full and
F(per) is empty), and (b) a suid-root
executable file (F(inh), F(eff) and
F(per) are all full). In the first case, the rules
become:

P′(per) ← P(inh)

P′(eff) ← (P(inh) ∩
P(eff))

P′(inh) ← P(inh)

—which is quite unsurprising. In the case (b), assuming
the capability bounding set has not been decreased by the
administrator, all sets are set to full, which is the desired
behavior.
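
These two special cases can be checked mechanically. The Python sketch below assumes the general rules to be P′(per) = (P(inh) ∩ F(inh)) ∪ (F(per) ∩ bnd), P′(eff) = (P(inh) ∩ P(eff) ∩ F(inh) ∩ F(eff)) ∪ (F(per) ∩ F(eff) ∩ bnd) and P′(inh) = P′(per); the first is the capabilities(7) rule, while the other two are inferred from the discussion and from the special cases above, so treat them as an assumption. Function names are made up.

```python
FULL = (1 << 64) - 1


def execve_caps(p_per, p_eff, p_inh, f_per, f_eff, f_inh, bnd=FULL):
    """Return (P'(per), P'(eff), P'(inh)) under the assumed rules."""
    # p_per is accepted for symmetry but unused: the old permitted set
    # does not enter the assumed rules.
    new_per = (p_inh & f_inh) | (f_per & bnd)
    new_eff = (p_inh & p_eff & f_inh & f_eff) | (f_per & f_eff & bnd)
    new_inh = new_per
    return new_per, new_eff, new_inh


def unmarked_file_caps(suid_root):
    """(F(per), F(eff), F(inh)) without filesystem capability support."""
    return (FULL if suid_root else 0, FULL, FULL)


# Case (a): ordinary executable, sets are simply carried over.
case_a = execve_caps(0b111, 0b011, 0b110, *unmarked_file_caps(False))
# Case (b): suid root executable, everything becomes full.
case_b = execve_caps(0, 0, 0, *unmarked_file_caps(True))
```

With these inputs, case (a) yields P(inh) for the permitted and inheritable sets and P(inh) ∩ P(eff) for the effective set, and case (b) yields full sets, in agreement with the text.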

Additionally (not in version 0.3.0 of the patch), the
compatibility rules for set*uid() (described above) are
applied also on execve(): this is to cover the
(presumably very rare) case when a process running as root (some
uid=0) executes a suid non-root
executable, thus switching to a different euid and
expecting to lose its effective capabilities (and possibly
permitted/inheritable also, in case the process has real
uid nonzero).

How do you know nothing will break?

Of course I can't be 100% sure unless I use a formal prover to
certify the semantics, which is not really
feasible. This is why I'd like the patch to be (1) peer-reviewed
and (2) tested (on non-security-critical systems at first!).
But I can offer some arguments.

First, I argue that non-caps-aware (legacy Unix) programs will
function exactly the same with the patch as before. For such
programs, each set of caps is either the regular bunch or the full
set; the effective set is the full set exactly when the effective
uid is zero, and the two other sets (permitted and
inheritable) are always equal and are the full set exactly when some
uid is zero. To see this, note that behavior upon
execve() is unsurprising: when executing a
suid root executable, all caps are set; when executing a
non-suid executable, all caps are preserved (since
non-caps-aware legacy programs always have the inheritable set equal
to the permitted set, this follows from the rules we described); and
the case of executing a suid non-root executable has also
been taken care of specifically (by applying the compatibility rules
for set*uid()). Behavior upon set*uid() is
also preserved: the compatibility rules ensure that the effective
capabilities are synchronized with euid being zero and
the permitted/inheritable capabilities with some uid
being zero; note that the patch does not modify the
set*uid() functions in any way, and only modifies the
compatibility rules insofar as to keep the inheritable set
synchronized with the permitted set (for non-caps-aware legacy
programs) and to retain regular caps.

Second, I argue that an attacker (non-root, obviously) cannot take
advantage of the patch. The critical point here is that
suid root executables are always executed with full
capabilities (or with no particular privileges, but nothing
intermediate): this is why the sendmail security hole is not repeated
(the problem was that by using the inheritable caps in an evil way one
could get sendmail to execute with some, but not all, of the root
privileges, in such a way that it could not drop its own privileges!);
in the case of this patch, a suid root executable is
always run with all capabilities set (at least, those of the
system-wide bounding set, but only the system administrator can affect
this). So the essential traditional Unix method for elevating
privileges (executing suid root programs) is preserved.
The case of suid non-root or sgid programs
will be discussed below.

Third, one must consider those (very few) programs which use the
capabilities in the (crippled) state in which they currently exist
under Linux. The only known example is Bind9, which has been tested
to run with the correct set of capabilities, but let us argue in
general: the only useful thing such a program can do is start as root,
drop some capabilities and switch to a different uid: as
far as that goes, nothing changes with the patch; even if the program
goes as far as assuming that capabilities will be lost upon
execve(), it gets what it expects because the
compatibility rules for set*uid() clear the inheritable
set of additional (system) bits. Furthermore, the kernel offers a
compatibility version of the
capset()/capget() interface so that binaries
will not break.

What about suid non-root programs?

The question arises of what should be done about suid
non-root (and sgid) programs. Version 0.4.4 of the patch
behaves differently, in this respect, from prior versions.

Prior versions did not change the capabilities upon non-root
suid/sgid exec. One might argue, however,
that the patch makes suid non-root programs vulnerable,
as they could be executed with fewer (regular) capabilities than they
expect. However, this is not believed to be a serious problem,
because (a) such programs are much rarer than suid
root programs, (b) damage, if any, would be more limited (no
special capabilities are at stake, only access to the filesystem),
(c) removing regular capabilities makes system calls fail with a
clean error code (nothing exotic like the setuid()
function, which exhibits a very subtle difference in behavior according
to whether the CAP_SETUID capability is set or not, and which made
the sendmail exploit possible), and (d) system calls can always
fail, so adding new causes for failure is not introducing anything
significantly different. So I claim that this behavior is safe.

However, since security is a matter of excessive paranoia, version
0.4.4 offers a different behavior by default: non-root
suid/sgid executables behave as though they
had the inheritable (=allowed), effective and permitted (=forced) sets
of capabilities all equal to the set of “regular” (normal,
non-root) capabilities. Considering the rules of
inheritance, this means that they start with exactly the regular
capabilities in every set. Well, it's a bit more complicated: when
root execs an sgid program, for example, it shouldn't
drop capabilities (if you want the gory details: if all
uids before exec are non-zero then all capability sets
are set to the regular caps, and if any is zero then the inheritable
(=allowed) and effective sets of the executable are assumed to be the
full set and the permitted (=forced) set to the regular caps; the
compatibility rules for set*uid() will take care of
dropping caps if root uid is actually dropped
permanently).
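
The "gory details" can be summarized in a few lines of Python. This sketch only models which file sets are assumed for a non-root suid/sgid executable, and then applies the capabilities(7) rule for the permitted set to show the effect; the function names are hypothetical and the subsequent set*uid() compatibility step is left out.

```python
REGULAR = 0xFFFF << 32   # bits 32-47
FULL = (1 << 64) - 1


def sxid_nonroot_file_caps(ruid, euid, suid):
    """(F(per), F(eff), F(inh)) assumed for a non-root suid/sgid file.

    The uids are those of the process before the exec.
    """
    if 0 in (ruid, euid, suid):
        # Some uid is root: do not take capabilities away here.
        return REGULAR, FULL, FULL
    return REGULAR, REGULAR, REGULAR


def new_permitted(p_inh, f_per, f_inh, bnd=FULL):
    """Permitted set after exec, per the capabilities(7) rule."""
    return (p_inh & f_inh) | (f_per & bnd)


# An unprivileged process that had dropped bit 36 (CAP_REG_WRITE)...
p_inh = REGULAR & ~(1 << 36)
f_per, f_eff, f_inh = sxid_nonroot_file_caps(1000, 1000, 1000)
# ...executes a non-root suid program, which starts with exactly REGULAR.
sanitized = new_permitted(p_inh, f_per, f_inh)
```

So the executable sees the normal set of regular capabilities regardless of what the caller had removed, which is the sanitizing effect intended by version 0.4.4.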

Is there a test suite somewhere?

There is an embryo for a test suite: see here.
Just extract it and type make (as root). It doesn't test
every aspect of the patch, though. Make sure to use the test suite
version which matches that of the patch!

How can one make something useful of this patch?

So far, an upgrade of the libcap library remains to be
written, so expect things to be a little rough. But it is still
possible to write simple programs which make use of the patch. (I
have chosen not to include linux/capability.h from the
programs and, rather, redefine the constants, which would be a very
bad habit in the long run but which is probably simpler while the code
is still experimental.)

The following program (which should be run by an unprivileged user)
runs a shell (or the program specified on the command line) without
the CAP_REG_SXID capability. This means that, from this
shell, it is impossible to elevate privileges by executing a
set[ug]id program: so it would be a good idea to execute
certain daemons from this wrapper.

The following program (which should be made suid root
and then run as an unprivileged user) runs a shell (or the program
specified on the command line) with the CAP_CHOWN
capability. So, from that shell, chown functions as root
(although the user is otherwise unprivileged): if you install this
program executable by a certain group, this effectively gives
chown privilege to the members of that group. (Of
course, CAP_CHOWN is an example: the same example could
be used with other capabilities—see the
capabilities(7) manual page for examples.)

What capabilities exist that I could play with?

With this patch, capabilities come in two bunches: additional
capabilities (numbers 0 through 31—and 48 through 63, but those
are unused) are not possessed by normal
non-root processes, and these are exactly the capabilities of an
unpatched Linux kernel, whereas regular capabilities, numbers 32
through 47, are normally possessed by all processes and can be removed
to make a process underprivileged. The patch offers six of those,
but they should be thought of more as a “proof of concept”
than as a serious proposal:

CAP_REG_FORK (number 32) allows the process to
fork().

CAP_REG_OPEN (number 33) allows the process to
open() a file.

CAP_REG_EXEC (number 34) allows the process to
execve() an executable.

CAP_REG_SXID (number 35) allows the process to gain
privileges by execve()ing an s[ug]id
executable. This is thought to be the most useful of the lot because
it provides a form of confinement against privilege escalation: it
would seem like a good idea to run various daemons with this
capability turned off. In version 0.4.2 of the patch, this also turns
off the permitted (=forced) set of capabilities on an executable
file.

CAP_REG_WRITE (number 36) [introduced in version
0.4.2 of the patch] is required for the process to perform any kind of
write operation on the filesystem. (This could also be quite
interesting—unfortunately, for the moment, it even forbids
writing to /dev/null, which confuses a lot of
scripts.)

CAP_REG_PTRACE (number 37) [introduced in version
0.4.3 of the patch] is required for the ptrace() system
call (except for self-inspection).
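
For quick reference, the six regular capabilities and their numbers can be collected in a small Python table, together with a helper that removes some of them from a set. This is a model for experimentation only: the enum and the helpers without and has are invented for this example and are not part of any library.

```python
from enum import IntEnum


class RegularCap(IntEnum):
    """Regular capabilities and their bit numbers, as listed above."""
    CAP_REG_FORK = 32
    CAP_REG_OPEN = 33
    CAP_REG_EXEC = 34
    CAP_REG_SXID = 35
    CAP_REG_WRITE = 36
    CAP_REG_PTRACE = 37


REGULAR = 0xFFFF << 32   # bits 32-47: what a normal process starts with


def without(caps, *dropped):
    """Return caps with the given capability numbers cleared."""
    for cap in dropped:
        caps &= ~(1 << cap)
    return caps


def has(caps, cap):
    """True if the capability number is present in the set."""
    return bool((caps >> cap) & 1)


# A daemon confined so that it cannot gain privileges via s[ug]id files.
daemon_caps = without(REGULAR, RegularCap.CAP_REG_SXID)
```

Here daemon_caps can still fork, open files and exec, but lacks the one bit that allows privilege elevation through s[ug]id executables.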

Further additions which might be considered could be: having a
capability required for any kind of network access.

What are the differences between the various versions of the patch?

Version 0.4.4 adds sanitizing of capabilities on non-root
suid/sgid execve(). It also
changes CAP_REG_SXID so that its absence will return
EPERM when attempting to execute a
suid/sgid executable (rather than execute it
with no permissions changed): this is to preserve the security of
executable but not readable images.

Version 0.4.3 restricts the regular capabilities to sixteen bits
(32 through 47) rather than 32. It also adds the
CAP_REG_PTRACE capability. Finally, it corrects a stupid
bug which forced the inheritable set of a process to be a subset of
the effective (rather than permitted) set.

Version 0.4.2 adds the CAP_REG_WRITE capability, fixes
a couple of bugs and adds restrictions on kill() and
whatnot (part of the filesystem patch by Serge E. Hallyn). It
also disallows the permitted (=forced) executable set in
nosuid-mounted filesystems and when the
CAP_REG_SXID capability is absent.

Version 0.4.0 merges with filesystem support.

Version 0.3.1 fixes the problem that a process with only
euid zero and other uid's nonzero executing
a suid non-root program would not lose all non-regular
permitted capabilities.