Docker vs. PrivateTmp

While working with Docker the other day, I ran into an
undesirable interaction between Docker and systemd services that
utilize the PrivateTmp directive.

The PrivateTmp directive, if true, “sets up a new file system
namespace for the executed processes and mounts private /tmp and
/var/tmp directories inside it that is not shared by processes outside
of the namespace”. This is a great idea from a security
perspective, but can cause some unanticipated consequences.

It’s not just Docker

Although I ran into this problem while working with Docker, nothing
about it is particularly Docker-specific. You can replicate this
behavior by hand without involving either systemd or Docker:

When you create a new mount namespace as a child of the global mount
namespace, either via the unshare command or by starting a systemd
service with PrivateTmp=true, it inherits these private mounts.
When Docker unmounts the container filesystem in the global
namespace, the fact that the /var/lib/docker/devicemapper mountpoint
is marked private means that the unmount operation does not
propagate to other namespaces.
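
To see how a particular mountpoint is marked, one option is to read the
optional fields in /proc/self/mountinfo: a shared mount carries a
"shared:N" tag, a slave mount carries "master:N", and a private mount
carries neither. The following is only a minimal sketch of that idea;
the function name mount_propagation is made up for illustration, and it
deliberately ignores the "unbindable" tag (such mounts are reported as
private).

```shell
# Report the propagation type of a mountpoint from mountinfo-format input.
# Hypothetical helper, not part of Docker or systemd.
# Usage: mount_propagation /var/lib/docker/devicemapper < /proc/self/mountinfo
mount_propagation() {
    awk -v target="$1" '
        $5 == target {
            shared = 0; slave = 0
            # Optional fields start at field 7 and end at the "-" separator.
            for (i = 7; i <= NF && $i != "-"; i++) {
                if ($i ~ /^shared:/) shared = 1
                if ($i ~ /^master:/) slave = 1
            }
            # If the mountpoint is listed more than once, the last entry wins.
            if (shared && slave) result = "shared,slave"
            else if (shared)     result = "shared"
            else if (slave)      result = "slave"
            else                 result = "private"
        }
        END { if (result == "") exit 1; print result }
    '
}
```

If the function prints "private" for /var/lib/docker/devicemapper, you
are looking at the situation described above. Recent versions of
findmnt should report the same information with
findmnt -o TARGET,PROPAGATION.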

The solution

The simplest solution to this problem is to set the MountFlags=slave
option in the docker.service file:

MountFlags=slave

This will cause systemd to run Docker in a cloned mount namespace and
set the MS_SLAVE flag on all mountpoints; it is effectively
equivalent to:

# unshare -m
# mount --make-rslave /

With this change, mounts performed by Docker will not be visible in
the global mount namespace, and they will thus not propagate into the
mount namespaces of other services.
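
If you would rather not edit the packaged unit file, the same setting
can go into a systemd drop-in instead. This is a sketch under a couple
of assumptions: the function name write_mountflags_dropin and the file
name mountflags.conf are arbitrary choices of mine, and the default
path assumes the unit is named docker.service.

```shell
# Write a drop-in that sets MountFlags=slave for docker.service.
# The optional argument overrides the drop-in directory (handy for a dry run);
# the default is the standard systemd drop-in location for docker.service.
write_mountflags_dropin() {
    dropin_dir="${1:-/etc/systemd/system/docker.service.d}"
    mkdir -p "$dropin_dir" || return 1
    printf '[Service]\nMountFlags=slave\n' > "$dropin_dir/mountflags.conf"
}

# Usage (as root):
#   write_mountflags_dropin
#   systemctl daemon-reload
#   systemctl restart docker
```

The drop-in survives package updates that replace the original
docker.service file, which is the main reason to prefer it over
editing the unit in place.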

Not necessarily the solution

An attempted fix for this problem, committed to the Fedora
docker-io package, set MountFlags=private. This will prevent
the symptoms I originally encountered, in which Docker is unable to
remove a mountpoint because it is still held open by another mount
namespace…

…but it will result in behavior that might be confusing to a system
administrator. Specifically, mounts made in the global mount
namespace after Docker starts will not be visible to Docker
containers. This means that if you were to make a remote filesystem
available on your Docker host:

# mount my-fileserver:/vol/webcontent /srv/content

And then attempt to bind that into a Docker container as a volume:

# docker run -v /srv/content:/content larsks/thttpd -d /content

Your content would not be visible. The mount of
my-fileserver:/vol/webcontent would not propagate from the global
namespace into the Docker mount namespace because of the private
flag.
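
One way to confirm this from the host is to compare what the global
namespace sees with what the Docker daemon sees, since each process
exposes its own mount table in /proc/PID/mountinfo. The helper below is
a small sketch along those lines; the name mount_visible_in is
hypothetical, and DOCKER_PID in the usage comment is a placeholder for
whatever PID the Docker daemon has on your system.

```shell
# Succeed if the given mountpoint appears in a mountinfo-format file.
# Hypothetical helper; field 5 of each mountinfo line is the mountpoint.
# Usage: mount_visible_in /proc/PID/mountinfo /srv/content
mount_visible_in() {
    awk -v target="$2" '$5 == target { found = 1 } END { exit !found }' "$1"
}

# Usage (DOCKER_PID is a placeholder you must fill in yourself):
#   mount_visible_in /proc/self/mountinfo /srv/content \
#       && echo "visible in the global namespace"
#   mount_visible_in "/proc/$DOCKER_PID/mountinfo" /srv/content \
#       || echo "not visible to Docker"
```

With MountFlags=private, I would expect the first check to succeed and
the second to fail for any filesystem mounted after Docker started.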