About seccomp

Linux seccomp (short for SECure COMputing) filtering allows one to specify which
system calls a process should be allowed to invoke, reducing the kernel surface
exposed to applications.
This provides a clearly defined mechanism to build sandboxed environments, where
processes can run having access only to a specific reduced set of system calls.

In the context of containers, seccomp filtering is useful for:

Restricting applications from invoking syscalls that can affect the host

Reducing kernel attack surface in case of security bugs

For more details on how Linux seccomp filtering works, see
seccomp(2).

Predefined seccomp filters

By default, rkt comes with a set of predefined filtering groups that can be
used to quickly build sandboxed environments for containerized applications.
Each set is simply a reference to a group of syscalls, covering a single
functional area or kernel subsystem. They can be further combined to
build more complex filters, either by blacklisting or by whitelisting specific
system calls. To distinguish these predefined groups from real syscall names,
wildcard labels are prefixed with an @ symbol and are namespaced.

When using stage1 images with systemd >= v231, some predefined groups are also
available:

@systemd/clock for syscalls manipulating the system clock

@systemd/default-whitelist for a generic set of typically whitelisted syscalls

@systemd/mount for filesystem mounting and unmounting

@systemd/network-io for socket I/O operations

@systemd/obsolete for unusual, obsolete or unimplemented syscalls

@systemd/privileged for syscalls which need super-user capabilities

@systemd/process for syscalls acting on process control, execution and namespacing

@systemd/raw-io for raw I/O port access

When no seccomp filtering is specified, by default rkt whitelists all the generic
syscalls typically needed by applications for common operations. This is
the same set defined by @rkt/default-whitelist.

The default set is tailored to stop applications from performing a large
variety of privileged actions, while not impacting their normal behavior.
Operations which are typically not needed in containers and which may
impact host state, e.g. invoking umount(2), are denied in this way.

However, this default set is mostly meant as a safety precaution against erratic
and misbehaving applications, and will not suffice against tailored attacks.
As such, it is recommended to fine-tune seccomp filtering using one of the
customizable isolators available in rkt.

Seccomp Isolators

When running Linux containers, rkt provides two mutually exclusive isolators
to define a seccomp filter for an application:

os/linux/seccomp-retain-set

os/linux/seccomp-remove-set

These isolators cover different use-cases and employ different techniques to
achieve the same goal of limiting available syscalls. As such, they cannot
be used together, and recommended usage varies on a case-by-case basis.

Operation mode

Seccomp isolators work by defining a set of syscalls that can be either blocked
("remove-set") or allowed ("retain-set"). Once an application tries to invoke
a blocked syscall, the kernel will deny this operation and the application will
be notified about the failure.

By default, invoking blocked syscalls will result in the application being
immediately terminated with a SIGSYS signal. This behavior can be tweaked by
returning a specific error code ("errno") to the application instead of
terminating it.

For both isolators, this can be customized by specifying an additional errno
parameter with the desired symbolic errno name. For a list of errno labels, check
the reference at man 3 errno.
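
As a minimal sketch of the errno parameter, the value below could be used with os/linux/seccomp-remove-set to block socket(2) and return ENOTSUP instead of raising SIGSYS. The `set` and `errno` key names are assumed from the appc isolator format, and the syscall and errno chosen here are purely illustrative.

```shell
# Assumed JSON value for a seccomp isolator: "set" lists the filtered
# syscalls, "errno" names the error returned instead of raising SIGSYS.
ISOLATOR_VALUE='{ "set": ["socket"], "errno": "ENOTSUP" }'

# Print the value, e.g. to pipe it into an image build tool.
printf '%s\n' "$ISOLATOR_VALUE"
```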

Retain-set

os/linux/seccomp-retain-set allows for an additive approach to build a seccomp
filter: applications will not be able to use any syscalls, except the ones
listed in this isolator.

This whitelisting approach is useful for completely locking down environments
and whenever application requirements (in terms of syscalls) are
well-defined in advance. It allows one to ensure that exactly and only the
specified syscalls could ever be used.

For example, the "retain-set" for a typical network application will include
entries for generic POSIX operations (available in @systemd/default-whitelist),
socket operations (@systemd/network-io) and reacting to I/O
events (@systemd/io-event).
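
The retain-set described above for a network application could be written roughly as follows. This is a sketch only: `seccomp_set_json` is a hypothetical helper introduced here to render the entries as a JSON value, and the `set` key is assumed from the appc isolator format.

```shell
# seccomp_set_json (hypothetical helper): render syscall names and
# @-prefixed groups as the JSON value of a seccomp isolator.
seccomp_set_json() {
  entries=''
  for entry in "$@"; do
    # Append each entry as a quoted string, comma-separated.
    entries="${entries:+$entries, }\"$entry\""
  done
  printf '{ "set": [%s] }\n' "$entries"
}

# Retain-set for a typical network application, using the groups named above.
RETAIN_SET=$(seccomp_set_json \
  @systemd/default-whitelist \
  @systemd/network-io \
  @systemd/io-event)
printf '%s\n' "$RETAIN_SET"
```

Anything not covered by one of the three groups would be denied, so extra syscalls needed by the application have to be appended as further arguments.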

Remove-set

os/linux/seccomp-remove-set tackles syscalls in a subtractive way:
starting from all available syscalls, single entries can be forbidden in order
to prevent specific actions.

This blacklisting approach is useful for partially restricting applications which
have broad requirements in terms of syscalls, in order to deny access to some
clearly unused but potentially exploitable syscalls.

For example, an application that will need to perform multiple operations but is
known to never touch mountpoints could have @systemd/mount specified in its
"remove-set".

Usage Example

The goal of these examples is to show how to build ACI images with acbuild,
where some syscalls are either explicitly blocked or allowed.
For simplicity, the starting point will be a bare Alpine Linux image which
ships with ping and umount commands (from busybox). These
commands require the socket(2) and umount(2) syscalls, respectively, in order to
perform privileged operations.
To block their usage, a syscall filter can be installed via
os/linux/seccomp-remove-set or os/linux/seccomp-retain-set; both approaches
are shown here.

Blacklisting specific syscalls

This example shows how to block socket operations (e.g. with ping), by removing
socket(2) from the set of allowed syscalls.

First, a local image is built with an explicit "remove-set" isolator.
This set contains the syscalls that need to be forbidden in order to block
socket setup:
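
The build step could look like the sketch below. The image name `localhost/seccomp-remove-set-example`, the `quay.io/coreos/alpine-sh` dependency, the output file name and the `@rkt/default-blacklist` group are assumptions chosen for illustration, and the block does nothing when acbuild is not installed.

```shell
# Remove-set: the (assumed) default blacklist group plus socket(2).
REMOVE_SET='{ "set": ["@rkt/default-blacklist", "socket"] }'
IMAGE=seccomp-remove-set-example.aci   # hypothetical output file

# Build the image only when acbuild is installed.
# The trailing "-" makes acbuild read the isolator value from stdin.
if command -v acbuild >/dev/null 2>&1; then
  acbuild begin &&
    acbuild set-name localhost/seccomp-remove-set-example &&
    acbuild dependency add quay.io/coreos/alpine-sh &&
    acbuild set-exec -- /bin/sh &&
    printf '%s\n' "$REMOVE_SET" |
      acbuild isolator add "os/linux/seccomp-remove-set" - &&
    acbuild write "$IMAGE"
  # Close the build context.
  acbuild end
fi
```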

Once properly built, this image can be run in order to check that ping usage is
now blocked by the seccomp filter. At the same time, the default blacklist will
also block other dangerous syscalls like umount(2):
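
One possible way to check this, assuming the image was written to the hypothetical file `seccomp-remove-set-example.aci`, is to start the container interactively and try both commands from its shell. With the default behavior each command is expected to be killed with SIGSYS, which shells typically report as "Bad system call".

```shell
IMAGE=seccomp-remove-set-example.aci   # hypothetical image file from the build step

# Start an interactive container; skipped if rkt or the image is missing.
# --insecure-options=image disables signature verification for the
# unsigned local image.
if command -v rkt >/dev/null 2>&1 && [ -f "$IMAGE" ]; then
  sudo rkt run --interactive --insecure-options=image "$IMAGE"
fi

# Commands to try inside the container shell:
#   ping -c 1 8.8.8.8    (needs socket(2))
#   umount /proc/bus/    (needs umount(2))
```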

This means that socket(2) and umount(2) have both been effectively disabled
inside the container.

Allowing specific syscalls

In contrast to the example above, this one shows how to allow some operations
only (e.g. network communication via ping), by whitelisting all required
syscalls. This means that syscalls outside of this set will be blocked.

First, a local image is built with an explicit "retain-set" isolator.
This set contains the rkt wildcard "default-whitelist" (which already provides
all socket-related entries), plus some custom syscalls (e.g. umount(2)) which
are typically not allowed:
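
A sketch of this build, mirroring the remove-set example, is shown below. The image name, the `quay.io/coreos/alpine-sh` dependency and the output file name are assumptions. Both `umount` and `umount2` are listed because some architectures only provide the latter syscall.

```shell
# Retain-set: the default whitelist plus the umount syscalls.
RETAIN_SET='{ "set": ["@rkt/default-whitelist", "umount", "umount2"] }'
IMAGE=seccomp-retain-set-example.aci   # hypothetical output file

# Build the image only when acbuild is installed.
if command -v acbuild >/dev/null 2>&1; then
  acbuild begin &&
    acbuild set-name localhost/seccomp-retain-set-example &&
    acbuild dependency add quay.io/coreos/alpine-sh &&
    acbuild set-exec -- /bin/sh &&
    printf '%s\n' "$RETAIN_SET" |
      acbuild isolator add "os/linux/seccomp-retain-set" - &&
    acbuild write "$IMAGE"
  # Close the build context.
  acbuild end
fi
```

The resulting image can then be run as in the previous example to check which commands the filter permits.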

Overriding Seccomp Filters

Seccomp filters are typically defined when creating images, as they are tightly
linked to specific app requirements. However, image consumers may need to further
tweak/restrict the set of available syscalls in specific local scenarios.
This can be done either by permanently patching the manifest of specific images,
or by overriding seccomp isolators with command line options.

Patching images

Image manifests can be manipulated manually, by unpacking the image and editing
the manifest file, or with helper tools like actool.
To override an image's predefined syscall set, just replace the existing seccomp
isolators in the image with new isolators defining the desired syscalls.

The patch-manifest subcommand of actool manipulates the syscall sets
defined in an image.
The actool patch-manifest -seccomp-mode=... -seccomp-set=... options
can be used together to override any seccomp filters by specifying a new mode
(retain or remove), an optional custom errno, and a set of syscalls to filter.
The command takes an input image, modifies any existing seccomp isolators, and
writes the changes to an output image, as shown in the example:
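
A hedged sketch of such a patch follows. The input and output file names are hypothetical, and the `mode,errno=NAME` form of the `-seccomp-mode` value is an assumption about the option syntax; only the two option names come from the text above.

```shell
IN_IMAGE=seccomp-retain-set-example.aci     # hypothetical input image
OUT_IMAGE=seccomp-retain-set-patched.aci    # hypothetical output image

# Run only when actool and the input image are available.
if command -v actool >/dev/null 2>&1 && [ -f "$IN_IMAGE" ]; then
  # Replace any existing seccomp isolator with a retain-set limited to the
  # default whitelist, returning ENOSYS for denied syscalls.
  actool patch-manifest \
    -seccomp-mode=retain,errno=ENOSYS \
    -seccomp-set=@rkt/default-whitelist \
    "$IN_IMAGE" "$OUT_IMAGE"

  # Print the patched manifest to inspect the resulting isolators.
  actool cat-manifest "$OUT_IMAGE"
fi
```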

Overriding seccomp filters at run-time

Seccomp filters can be directly overridden at run time from the command-line,
without changing the executed images.
The --seccomp option to rkt run can manipulate both the "retain" and the
"remove" isolators.

Isolators overridden from the command-line will replace all seccomp settings in
the image manifest, and can be specified as shown in this example:
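
The sketch below overrides the filter of a single application at run time. The image name `quay.io/coreos/alpine-sh` and the `mode=...,errno=...,<syscalls>` form of the `--seccomp` value are assumptions. With this override, a blocked socket(2) call is expected to fail with ENOTSUP rather than terminate the process.

```shell
APP_IMAGE=quay.io/coreos/alpine-sh                    # assumed image name
SECCOMP_OVERRIDE='mode=remove,errno=ENOTSUP,socket'   # assumed value syntax

# Run only when rkt is installed.
if command -v rkt >/dev/null 2>&1; then
  # The --seccomp option is placed after the image it applies to.
  sudo rkt run --interactive "$APP_IMAGE" --seccomp "$SECCOMP_OVERRIDE"
fi
```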

Seccomp isolators are application-specific configuration entries, and in a
rkt run command line they must follow the application container image to
which they apply.
Each application within a pod can have different seccomp filters.

Recommendations

As with most security features, seccomp isolators may require some
application-specific tuning in order to be maximally effective. For this reason,
for security-sensitive environments it is recommended to have a well-specified
set of syscall requirements and follow best practices:

Only allow syscalls needed by an application, according to its typical usage.

While it is possible to completely disable seccomp, this is rarely needed and
should generally be avoided. Tweaking the syscall set is a better approach.