systemd-nspawn

systemd-nspawn is like the chroot command, but it is a chroot on steroids.

systemd-nspawn may be used to run a command or OS in a light-weight namespace container. It is more powerful than chroot since it fully virtualizes the file system hierarchy, as well as the process tree, the various IPC subsystems and the host and domain name.

systemd-nspawn limits access to various kernel interfaces in the container to read-only, such as /sys, /proc/sys or /sys/fs/selinux. Network interfaces and the system clock may not be changed from within the container. Device nodes may not be created. The host system cannot be rebooted and kernel modules may not be loaded from within the container.

This mechanism differs from Lxc-systemd or Libvirt-lxc, as it is a much simpler tool to configure.

Installation

Examples

Create and boot a minimal Arch Linux distribution in a container

First, install arch-install-scripts, which provides pacstrap. Next, create a directory to hold the container. In this example we will use ~/MyContainer.

Then use pacstrap to install a basic Arch system into the container. At minimum we need to install the base group.

# pacstrap -i -c ~/MyContainer base [additional pkgs/groups]

Tip: The -i option will avoid auto-confirmation of package selection. As you do not need to install the Linux kernel in the container, you can remove it from the package list selection to save space. See Pacman#Usage.

Note: The linux-firmware package, which is required by linux (part of the base group) and is not necessary to run the container, causes some issues with systemd-tmpfiles-setup.service when booting with systemd-nspawn. It is possible to install the base group while excluding the linux package and its dependencies by building the container with pacstrap -i -c ~/MyContainer base --ignore linux [additional pkgs/groups]. The --ignore flag is simply passed on to pacman. See FS#46591 for more information.

Once your installation is finished, boot into the container:

# systemd-nspawn -b -D ~/MyContainer

The -b option boots the container (i.e. runs systemd as PID 1) instead of just running a shell, and -D specifies the directory that becomes the container's root directory.

After the container starts, log in as "root" with no password.

The container can be powered off by running poweroff from within the container. From the host, containers can be controlled by the machinectl tool.

Note: To terminate the session from within the container, hold Ctrl and rapidly press ] three times. Non-US keyboard users should use % instead of ].

Create a Debian or Ubuntu environment

Note: systemd-nspawn requires that the operating system in the container has systemd running as PID 1 and that systemd-nspawn is installed in the container. This means Ubuntu before 15.04 will not work out of the box and requires additional configuration to switch from upstart to systemd. Also make sure that the systemd-container package is installed in the container system.

For Debian, valid code names are either the rolling names like "stable" and "testing" or release names like "stretch" and "sid". For Ubuntu, a code name like "xenial" or "zesty" should be used. A complete list of code names is in /usr/share/debootstrap/scripts. For a Debian image, the "repository-url" can be http://deb.debian.org/debian/. For an Ubuntu image, the "repository-url" can be http://archive.ubuntu.com/ubuntu/.
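The bootstrap step can be sketched as follows. This is only a minimal sketch, assuming debootstrap is installed on the host; the helper name debootstrap_cmd and the target directory myContainer are illustrative, and the function merely prints the command so that it can be reviewed before being run as root.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print (not run) a debootstrap command line.
# Arguments: code name, target directory, repository URL.
debootstrap_cmd() {
    local codename=$1 target=$2 repo_url=$3
    # --include asks debootstrap to also install systemd-container
    # into the new system.
    printf 'debootstrap --include=systemd-container %s %s %s\n' \
        "$codename" "$target" "$repo_url"
}

debootstrap_cmd stable myContainer http://deb.debian.org/debian/
```

For Ubuntu, the systemd-container package may live in the universe component, in which case --components=main,universe would be needed as well.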

Unlike Arch, Debian and Ubuntu will not let you log in without a password on first login. To set the root password, start the container without the -b option and set a password:

# systemd-nspawn -D myContainer
# passwd
# logout

If the above does not work, one can start the container and use these commands instead:

With --private-users=pick (or -U), systemd-nspawn checks whether the UID/GID of the directory's owner is already in use. If it is not, that UID/GID becomes the base of a range of 65536 IDs. If it is in use, systemd-nspawn randomly picks an unused range of 65536 IDs from 524288 - 1878982656 and uses that instead.

Note:

The base of the range chosen is always a multiple of 65536.

-U and --private-users=pick are equivalent if the kernel supports user namespaces. --private-users=pick also implies --private-users-chown, see systemd-nspawn(1) for details.
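The arithmetic behind such a range can be illustrated with a small sketch. The helper name uid_block is hypothetical and not part of systemd; it only shows that a valid base is a multiple of 65536 and that the range covers 65536 consecutive IDs.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the 65536-ID range that starts at a given base.
# Fails if the base is not a multiple of 65536.
uid_block() {
    local base=$1 size=65536
    if (( base % size != 0 )); then
        echo "base $base is not a multiple of $size" >&2
        return 1
    fi
    echo "$base-$(( base + size - 1 ))"
}

uid_block 524288    # prints 524288-589823
```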

You will need to set the DISPLAY environment variable inside your container session to connect to the external X server.

X stores some required files in the /tmp directory. In order for your container to display anything, it needs access to those files. To do so, append the --bind-ro=/tmp/.X11-unix option when starting the container.

Note: Since systemd version 235, the contents of /tmp/.X11-unix have to be bind-mounted read-only, otherwise they will disappear from the file system. The read-only mount flag does not prevent using the connect() syscall on the socket. If you also bind-mounted /run/user/1000, you might want to explicitly bind /run/user/1000/bus as read-only to protect the D-Bus socket from being deleted.
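Putting the two requirements together, a command line might look like the one printed by this sketch. The helper name x11_nspawn_cmd is hypothetical, and passing DISPLAY with --setenv is just one possible way to set the variable; the function only prints the command and does not start anything.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a systemd-nspawn command line that shares the
# host X socket read-only and sets DISPLAY inside the container.
# Arguments: container root directory, display (defaults to :0).
x11_nspawn_cmd() {
    local root=$1 display=${2:-:0}
    printf 'systemd-nspawn -D %s --bind-ro=/tmp/.X11-unix --setenv=DISPLAY=%s\n' \
        "$root" "$display"
}

x11_nspawn_cmd ~/MyContainer "${DISPLAY:-:0}"
```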

Avoiding xhost

xhost only provides rather coarse access rights to the X server. More fine-grained access control is possible via the $XAUTHORITY file. Unfortunately, just making the $XAUTHORITY file accessible in the container will not do the job: your $XAUTHORITY file is specific to your host, but the container is a different host.
The following trick, adapted from Stack Overflow, can be used to make your X server accept the $XAUTHORITY file from an X application run inside the container:

nsswitch.conf

To make it easier to connect to a container from the host, you can enable local DNS resolution for container names. In /etc/nsswitch.conf, add mymachines to the hosts: section, e.g.

hosts: files mymachines dns myhostname

Then, any hostname lookup for foo on the host will first consult /etc/hosts, then the names of local containers, then upstream DNS, and so on.
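Whether the module is already listed can be checked with a short sketch like the following. The helper name hosts_uses_mymachines is hypothetical; it only inspects the hosts: line of the given file and does not perform any lookup.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if the hosts: line of an nsswitch.conf-style
# file lists the mymachines module. Defaults to /etc/nsswitch.conf.
hosts_uses_mymachines() {
    local conf=${1:-/etc/nsswitch.conf}
    grep -Eq '^hosts:(.*[[:space:]])?mymachines([[:space:]]|$)' "$conf"
}

if hosts_uses_mymachines; then
    echo "mymachines is enabled"
else
    echo "mymachines is not enabled"
fi
```

With the module enabled, getent hosts foo on the host should resolve a running container named foo.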

Use host networking

To disable the private networking used by containers started with machinectl start MyContainer, add a MyContainer.nspawn file to the /etc/systemd/nspawn directory (create the directory if needed) with the following content:

/etc/systemd/nspawn/MyContainer.nspawn

[Network]
VirtualEthernet=no

Parameters set in the MyContainer.nspawn file will override the defaults used in systemd-nspawn@.service and the newly started containers will use the host's networking.

Virtual Ethernet interfaces

If a container is started with systemd-nspawn ... -n, systemd will automatically create one virtual Ethernet interface on the host, and one in the container, connected by a virtual Ethernet cable.

If the name of the container is foo, the name of the virtual Ethernet interface on the host is ve-foo. The name of the virtual Ethernet interface in the container is always host0.

When examining the interfaces with ip link, interface names will be shown with a suffix, such as ve-foo@if2 and host0@if9. The @ifN is not actually part of the name of the interface; instead, ip link appends this information to indicate which "slot" the virtual Ethernet cable connects to on the other end.

For example, a host virtual Ethernet interface shown as ve-foo@if2 connects to container foo, and inside the container to the second network interface (the one shown with index 2 when running ip link inside the container). Similarly, in the container, the interface named host0@if9 connects to the 9th slot on the host.
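The naming rules above can be summarised in a short sketch. Both helper names are hypothetical, and the sketch ignores any length limit the kernel places on interface names.

```shell
#!/usr/bin/env bash
# Hypothetical helpers illustrating the naming rules described above.

# Host-side interface name for a container: "ve-" plus the container name.
host_veth_name() {
    printf 've-%s\n' "$1"
}

# Strip the "@ifN" suffix that ip link appends, leaving the real name.
strip_peer_suffix() {
    printf '%s\n' "${1%%@*}"
}

host_veth_name foo            # prints ve-foo
strip_peer_suffix ve-foo@if2  # prints ve-foo
strip_peer_suffix host0@if9   # prints host0
```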

where my-container is the name of the directory that will be created for the container. After powering off, the newly created subvolume is retained.

Use temporary Btrfs snapshot of container

One can use the --ephemeral or -x flag to create a temporary btrfs snapshot of the container and use it as the container root. Any changes made while booted in the container will be lost. For example:

# systemd-nspawn -D my-container -xb

where my-container is the directory of an existing container or system. For example, if / is a btrfs subvolume one could create an ephemeral container of the currently running host system by doing:

# systemd-nspawn -D / -xb

After powering off the container, the btrfs subvolume that was created is immediately removed.

Run docker in systemd-nspawn

Docker requires read-write access to /sys/fs/cgroup to run its containers, but systemd-nspawn mounts it read-only by default because of the cgroup namespace. However, it is possible to run Docker in a systemd-nspawn container by bind-mounting /sys/fs/cgroup from the host OS and enabling the necessary capabilities and permissions.

Note: The following steps essentially share the host's cgroup namespace with the container, give it kernel keyring permissions and make it a privileged container, which is likely to increase the attack surface and lower security. Always weigh the actual benefits of doing so before following the steps.

First, disable the cgroup namespace for the container by adding a drop-in to its service unit:

systemctl edit systemd-nspawn@myContainer

[Service]
Environment=SYSTEMD_NSPAWN_USE_CGNS=0

Then, edit /etc/systemd/nspawn/myContainer.nspawn (create if absent) and add the following configurations.
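A minimal sketch of such a file, assuming the three settings this section describes (all capabilities, the add_key and keyctl system calls, and a bind mount of /sys/fs/cgroup), could be written with a helper like the one below. The function name write_docker_nspawn is hypothetical, and it overwrites the destination file, so merge by hand if the file already holds other settings.

```shell
#!/usr/bin/env bash
# Hypothetical helper: write the Docker-related settings to the given
# .nspawn file (any existing content is replaced).
write_docker_nspawn() {
    local dest=$1
    cat > "$dest" <<'EOF'
[Exec]
Capability=all
SystemCallFilter=add_key keyctl

[Files]
Bind=/sys/fs/cgroup
EOF
}

# Example (as root):
# write_docker_nspawn /etc/systemd/nspawn/myContainer.nspawn
```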

This grants all capabilities to the container, whitelists the two system calls add_key and keyctl (related to the kernel keyring and required by Docker), and bind-mounts /sys/fs/cgroup from the host into the container. After editing these files, you need to power off and restart your container for them to take effect.

Note: You might need to load the overlay module on the host before starting Docker inside the systemd-nspawn to use the overlay2 storage driver (default storage driver of Docker) properly. Failure to load the driver will cause Docker to choose the inefficient driver vfs which copies everything for every layer of Docker containers. Consult Kernel modules#Automatic module loading with systemd on how to load the module automatically.

Troubleshooting

root login fails

If you get the following error when you try to log in (e.g. using machinectl login <name>):

arch-nspawn login: root
Login incorrect

And journalctl shows:

pam_securetty(login:auth): access denied: tty 'pts/0' is not secure !

Add pts/0 to the list of terminal names in /etc/securetty on the container filesystem, see [2]. You can also opt to delete /etc/securetty on the container to allow root to login to any tty, see [3].
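The first fix can be scripted from the host with a sketch like this one. The helper name allow_pts0 is hypothetical, and the path in the example is a placeholder for the container's root directory.

```shell
#!/usr/bin/env bash
# Hypothetical helper: add pts/0 to a securetty file unless it is
# already listed there.
allow_pts0() {
    local securetty=$1
    if ! grep -qxF 'pts/0' "$securetty" 2>/dev/null; then
        printf 'pts/0\n' >> "$securetty"
    fi
}

# Example (as root on the host):
# allow_pts0 /path/to/container/etc/securetty
```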

Unable to upgrade some packages on the container

It can sometimes be impossible to upgrade some packages in the container, the filesystem package being a perfect example. The issue is due to /sys being mounted read-only. The workaround is to remount it read-write by running mount -o remount,rw -t sysfs sysfs /sys, do the upgrade, then reboot the container.

execv(...) failed: Permission denied

When trying to boot the container via systemd-nspawn -bD /path/to/container (or executing something in the container), and the following error comes up:

even though the permissions of the files in question (e.g. /lib/systemd/systemd) are correct, this can be the result of having mounted the file system on which the container is stored as a non-root user. For example, if you mount your disk manually with an fstab entry that has the options noauto,user,..., the files cannot be executed even if they are owned by root, because the user option implies noexec unless exec is given explicitly after it (see mount(8)).
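Since the user mount option implies noexec (see mount(8)), one quick check is whether the file system holding the container is mounted noexec. The following is a minimal sketch; the helper name opts_have_noexec is hypothetical and the path in the example is a placeholder.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if a comma-separated mount option string
# contains the noexec option.
opts_have_noexec() {
    case ",$1," in
        *,noexec,*) return 0 ;;
        *)          return 1 ;;
    esac
}

# Example: inspect the file system that holds the container.
# opts_have_noexec "$(findmnt -no OPTIONS -T /path/to/container)" \
#     && echo "mounted noexec"
```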