About Linux Capabilities

Linux capabilities are a modern evolution of the traditional UNIX
permission checks.
The goal is to split the permissions granted to privileged processes into a set
of capabilities (e.g. CAP_NET_RAW to open a raw socket), which can be
handled separately and assigned to individual threads.

Processes can gain specific capabilities by either being run by the superuser, or by
having the setuid/setgid bits or specific file capabilities set on their
executable file.
Once running, each process has a bounding set of capabilities which it can
enable and use; a process cannot gain capabilities outside of this set.
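The bounding set of a running process can be inspected through the CapBnd line of /proc/<pid>/status, which holds a hexadecimal bitmask. The following Python sketch decodes such a mask. The names `CAP_BITS`, `parse_bounding_mask`, and `decode_caps` are illustrative, and the table lists only a handful of capabilities (bit positions as in <linux/capability.h>), so it is not a complete decoder.

```python
# Bit positions follow <linux/capability.h>; only a small subset is listed.
CAP_BITS = {
    10: "CAP_NET_BIND_SERVICE",
    13: "CAP_NET_RAW",
    19: "CAP_SYS_PTRACE",
    21: "CAP_SYS_ADMIN",
    22: "CAP_SYS_BOOT",
}


def parse_bounding_mask(status_text):
    """Return the CapBnd bitmask found in the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("CapBnd:"):
            return int(line.split(":", 1)[1].strip(), 16)
    raise ValueError("no CapBnd line found")


def decode_caps(mask):
    """Name the known capabilities whose bit is set in the mask."""
    return [name for bit, name in sorted(CAP_BITS.items()) if mask & (1 << bit)]
```

On a Linux host, passing the contents of /proc/self/status to `parse_bounding_mask` and the result to `decode_caps` lists which of these capabilities are still in the bounding set of the current process.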

In the context of containers, capabilities are useful for:

Restricting the effective privileges of applications running as root

Allowing applications to perform specific privileged operations, without
having to run them as root

For the complete list of existing Linux capabilities and a detailed description
of this security mechanism, see the capabilities(7) man page.

Default capabilities

rkt enforces a default set of capabilities on applications.
This default set is tailored to stop applications from performing a large
variety of privileged actions, while not impacting their normal behavior.
Operations which are typically not needed in containers and which may
impact host state, e.g. invoking reboot(2), are denied in this way.

However, this default set is mostly meant as a safety precaution against erratic
and misbehaving applications, and will not suffice against tailored attacks.
As such, it is recommended to fine-tune the capabilities bounding set using one
of the customizable isolators available in rkt.

Capability Isolators

When running Linux containers, rkt provides two mutually exclusive isolators
to define the bounding set under which an application will be run:

os/linux/capabilities-retain-set

os/linux/capabilities-remove-set

These isolators cover different use cases and employ different techniques to
achieve the same goal of limiting available capabilities. As such, they cannot
be used together, and the recommended usage varies on a case-by-case basis.
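The mutual exclusion described above can be checked before an image is run. The sketch below assumes isolators are represented as dictionaries with a "name" key, as in an appc image manifest; `check_capability_isolators` is a hypothetical helper, not part of rkt.

```python
RETAIN = "os/linux/capabilities-retain-set"
REMOVE = "os/linux/capabilities-remove-set"


def check_capability_isolators(isolators):
    """Return the name of the capability isolator in use, or None.

    Raises ValueError if both kinds appear, since they are mutually exclusive.
    """
    names = {iso["name"] for iso in isolators if iso["name"] in (RETAIN, REMOVE)}
    if len(names) > 1:
        raise ValueError("retain-set and remove-set cannot be combined")
    return next(iter(names), None)
```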

As the granularity of capabilities varies for specific permission cases, a word
of warning is needed in order to avoid a false sense of security.
In many cases it is possible to abuse granted capabilities in order to
completely subvert the sandbox: for example, CAP_SYS_PTRACE allows access to
the stage1 environment and CAP_SYS_ADMIN grants a broad range of privileges,
effectively equivalent to root.
Many other ways to maliciously transition across capabilities have already been
reported.

Retain-set

os/linux/capabilities-retain-set allows for an additive approach to
capabilities: applications will be stripped of all capabilities, except the ones
listed in this isolator.

This whitelisting approach is useful for completely locking down environments
and whenever application requirements (in terms of capabilities) are
well-defined in advance. It ensures that only the specified capabilities
can ever be used.

For example, an application whose only privileged operation is binding to
port 80 will have CAP_NET_BIND_SERVICE as the only entry in
its "retain-set".

Remove-set

os/linux/capabilities-remove-set tackles capabilities in a subtractive way:
starting from the default set of capabilities, single entries can be further
forbidden in order to prevent specific actions.

This blacklisting approach is useful for limiting applications which have
broad requirements in terms of privileged operations, while still denying some
potentially malicious operations.

For example, an application that needs to perform multiple privileged
operations but is known to never open a raw socket will have
CAP_NET_RAW specified in its "remove-set".
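The difference between the additive and subtractive approaches can be made concrete with plain set arithmetic. This is a minimal sketch: `ASSUMED_DEFAULT` is a placeholder chosen for illustration and is not rkt's actual default list, and `bounding_set` is a hypothetical helper.

```python
# Placeholder default set for illustration only; NOT rkt's real default list.
ASSUMED_DEFAULT = frozenset(
    {"CAP_CHOWN", "CAP_KILL", "CAP_NET_BIND_SERVICE", "CAP_NET_RAW"}
)


def bounding_set(default, retain=None, remove=None):
    """Compute the resulting bounding set for one of the two isolators."""
    if retain is not None and remove is not None:
        raise ValueError("retain and remove are mutually exclusive")
    if retain is not None:
        # Retain-set: everything is dropped except the listed capabilities.
        return frozenset(retain)
    if remove is not None:
        # Remove-set: start from the default set and subtract the listed ones.
        return frozenset(default) - frozenset(remove)
    return frozenset(default)
```

With the placeholder default, retaining only CAP_NET_BIND_SERVICE yields a one-element set, while removing CAP_NET_RAW leaves the other three entries in place.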

Usage Example

The goal of these examples is to show how to build ACIs with acbuild,
where some capabilities are either explicitly blocked or allowed.
For simplicity, the starting point will be the official Alpine Linux image from
CoreOS, which ships with the ping and nc commands (from busybox). These
commands require the CAP_NET_RAW and CAP_NET_BIND_SERVICE capabilities,
respectively, in order to perform privileged operations.
To block their usage, the capabilities bounding set
can be manipulated via os/linux/capabilities-remove-set or
os/linux/capabilities-retain-set; both approaches are shown here.

Removing specific capabilities

This example shows how to block ping only, by removing CAP_NET_RAW from
the capabilities bounding set.

First, a local image is built with an explicit "remove-set" isolator.
This set contains the capabilities that need to be forbidden in order to block
ping usage (and only that):
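As a rough illustration of what such an isolator contains, the following Python sketch generates the JSON entry. It assumes the appc isolator layout of a "name" plus a "value" object holding a "set" list; `capability_isolator` is a hypothetical helper, and how the fragment is handed to acbuild is deliberately left out.

```python
import json


def capability_isolator(kind, caps):
    """Build an isolator entry; kind is "retain" or "remove"."""
    if kind not in ("retain", "remove"):
        raise ValueError("kind must be 'retain' or 'remove'")
    return {
        "name": f"os/linux/capabilities-{kind}-set",
        "value": {"set": list(caps)},
    }


# Isolator that forbids CAP_NET_RAW, which is what blocks ping.
print(json.dumps(capability_isolator("remove", ["CAP_NET_RAW"]), indent=2))
```

Calling the same helper with "retain" and a single capability would produce the entry for the opposite, whitelisting approach.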

This means that CAP_NET_RAW has been effectively disabled inside the container.
At the same time, CAP_NET_BIND_SERVICE is still available in the default bounding
set, so the nc command will be able to bind to port 80:

Allowing specific capabilities

In contrast to the example above, this one shows how to allow ping only, by
removing all capabilities except CAP_NET_RAW from the bounding set.
This means that all other privileged operations, including binding to port 80,
will be blocked.

First, a local image is built with an explicit "retain-set" isolator.
This set contains the capabilities that need to be enabled in order to allow
ping usage (and only that):

However, all other capabilities are no longer available to the application.
For example, using nc to bind to port 80 will now result in a failure due to
the missing CAP_NET_BIND_SERVICE capability:
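The two checks used in these examples (opening a raw socket, as ping does, and binding to a privileged port, as nc does) can also be approximated with a short probe. This Python sketch assumes an interpreter is available inside the container, which is not the case for the stock image described here, and the function names are illustrative. Whether a low port counts as privileged also depends on host settings, so the results are only indicative.

```python
import socket


def can_open_raw_socket():
    """Return True if an ICMP raw socket can be created (needs CAP_NET_RAW)."""
    try:
        sock = socket.socket(socket.AF_INET, socket.SOCK_RAW, socket.IPPROTO_ICMP)
    except PermissionError:
        return False
    sock.close()
    return True


def can_bind(port):
    """Return True if a TCP socket can bind to the given port on all interfaces.

    Other errors, such as the port already being in use, are not caught here.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind(("0.0.0.0", port))
        return True
    except PermissionError:
        return False
    finally:
        sock.close()
```

Under the remove-set example one would expect `can_open_raw_socket()` to be false and `can_bind(80)` to be true; under the retain-set example the results should be reversed.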

Overriding capabilities

Capability sets are typically defined when creating images, as they are tightly
linked to specific app requirements. However, image consumers may need to further
tweak or restrict the set of available capabilities in specific local scenarios.
This can be done either by permanently patching the manifest of specific images, or
by overriding capability isolators with command line options.

Patching images

Image manifests can be manipulated manually, by unpacking the image and editing
the manifest file, or with helper tools like actool.
To override an image's pre-defined capabilities set, replace the existing capabilities
isolators in the image with new isolators defining the desired capabilities.

The patch-manifest subcommand of actool manipulates the capabilities sets
defined in an image.
actool patch-manifest --capability changes the retain capabilities set.
actool patch-manifest --revoke-capability changes the remove set.
These commands take an input image, modify its existing capabilities sets, and
write the changes to an output image, as shown in the example:
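For readers editing an unpacked manifest by hand, the replacement step can be sketched in Python. This is not how actool is implemented; it assumes the manifest has been loaded as a dictionary with isolators stored under app.isolators (the appc image manifest layout), and `replace_capability_isolators` is a hypothetical helper.

```python
import copy

CAPABILITY_ISOLATORS = (
    "os/linux/capabilities-retain-set",
    "os/linux/capabilities-remove-set",
)


def replace_capability_isolators(manifest, new_isolator):
    """Return a copy of manifest whose capability isolators are replaced."""
    patched = copy.deepcopy(manifest)  # leave the input manifest untouched
    app = patched.setdefault("app", {})
    # Keep every isolator that is unrelated to capabilities.
    kept = [
        iso for iso in app.get("isolators", [])
        if iso["name"] not in CAPABILITY_ISOLATORS
    ]
    app["isolators"] = kept + [new_isolator]
    return patched
```

Dropping any existing capability isolator before appending the new one keeps the retain and remove sets from ending up side by side in the patched manifest.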

Overriding capabilities at run-time

Capabilities can be directly overridden at run time from the command-line,
without changing the executed images.
The --caps-retain option of rkt run manipulates the retain capabilities set.
The --caps-remove option manipulates the remove set.

Capabilities specified on the command line replace all capability settings in the image manifest.
As stated above, the --caps-retain and --caps-remove options are mutually exclusive:
only one can be specified at a time.

Capabilities isolators can be added on the command line at run time by
specifying the desired overriding set, as shown in this example:

Capability sets are application-specific configuration entries, and in a
rkt run command line, they must follow the application container image to
which they apply.
Each application within a pod can have different capability sets.
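The ordering rule above can be illustrated by assembling an argument list programmatically. In this Python sketch the image names are made up, `rkt_run_argv` is a hypothetical helper, and passing the capabilities as a comma-separated value after an equals sign is an assumption about the option syntax rather than something stated here.

```python
def rkt_run_argv(apps):
    """Build an argv list for rkt run.

    apps is a list of (image, mode, caps) tuples, where mode is "retain",
    "remove", or None. One mode per app keeps the two options exclusive.
    """
    argv = ["rkt", "run"]
    for image, mode, caps in apps:
        argv.append(image)  # the image comes first ...
        if mode is None:
            continue
        if mode not in ("retain", "remove"):
            raise ValueError("mode must be 'retain', 'remove', or None")
        # ... and its capability option follows it (value syntax is assumed).
        argv.append(f"--caps-{mode}={','.join(caps)}")
    return argv
```

Each tuple carries its own mode, so two applications in the same pod can use different capability sets.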

Recommendations

As with most security features, capability isolators may require some
application-specific tuning in order to be maximally effective. For this reason,
for security-sensitive environments it is recommended to have a well-specified
set of capabilities requirements and follow best practices: