Quick systemd-nspawn guide

I switched to using systemd-nspawn in place of chroot and wanted to give a quick guide to using it. The short version is that I’d strongly recommend that anybody running systemd that uses chroot switch over – there really are no downsides as long as your kernel is properly configured.

Chroot should be no stranger to anybody who works on distros, and I suspect that the majority of Gentoo users have need for it from time to time.

The Challenges of chroot

For most interactive uses it isn’t sufficient to just run chroot. Usually you need to mount /proc and /sys, and bind-mount /dev so that you don’t have issues like missing ptys, etc. If you use tmpfs you might also want to mount the chroot’s /tmp and /var/tmp as tmpfs. Then you might want to make other bind mounts into the chroot. None of this is particularly difficult, but you usually end up writing a small script to manage it.
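A typical helper script of the sort described here looks something like the following; the chroot path is illustrative, not from the post:

```shell
#!/bin/sh
# Minimal chroot setup helper of the kind the text describes.
# CHROOT is a hypothetical path; adjust to taste.
CHROOT=/mnt/mychroot

mount -t proc proc "$CHROOT/proc"
mount -t sysfs sys "$CHROOT/sys"
mount --rbind /dev "$CHROOT/dev"
mount -t tmpfs tmpfs "$CHROOT/tmp"

chroot "$CHROOT" /bin/bash

# ...and everything has to be torn down again afterwards
# (umount -R needs a reasonably recent util-linux).
umount -R "$CHROOT/tmp" "$CHROOT/dev" "$CHROOT/sys" "$CHROOT/proc"
```

It's exactly this mount/unmount bookkeeping, and its interaction with backups and mount propagation, that the next paragraphs complain about.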

Now, I routinely do full backups, and usually that involves excluding stuff like tmp dirs, and anything resembling a bind mount. When I set up a new chroot that means updating my backup config, which I usually forget to do since most of the time the chroot mounts aren’t running anyway. Then when I do leave it mounted overnight I end up with backups consuming lots of extra space (bind mounts of large trees).

Finally, systemd now by default handles bind mounts a little differently when they contain other mount points (such as when using --rbind). Apparently unmounting something in the bind mount will cause systemd to unmount the corresponding directory on the other side of the bind. Imagine my surprise when I unmounted my chroot bind to /dev and discovered /dev/pts and /dev/shm no longer mounted on the host. It looks like there are ways to change that, but this isn’t the point of my post (it just spurred me to find another way).

Systemd-nspawn’s Advantages

Systemd-nspawn is a tool that launches a container, and it can operate just like chroot in its simplest form. By default it automatically sets up most of the overhead like /dev, /tmp, etc. With a few options it can also set up other bind mounts as well. When the container exits all the mounts are cleaned up.

From the outside of the container nothing appears different when the container is running. In fact, you could spawn 5 different systemd-nspawn container instances from the same chroot and they wouldn’t have any interaction except via the filesystem (and that excludes /dev, /tmp, and so on – only changes in /usr, /etc will propagate across). Your backup won’t see the bind mounts, or tmpfs, or anything else mounted within the container.

The container also has all those other nifty container benefits like containment – a killall inside the container won’t touch anything outside, and so on. The security isn’t airtight – the intent is to prevent accidental mistakes.

Then, if you use a compatible init (which includes systemd, and I think recent versions of openrc), you can actually boot the container, which drops you to a getty inside. That means you can use fstab to do additional mounts inside the container, run daemons, and so on. You get almost all the benefits of virtualization for the cost of a chroot (no need to build a kernel, and so on). It is a bit odd to be running systemctl poweroff inside what looks just like a chroot, but it works.

Note that unless you do a bit more setup you will share the same network interface with the host, so no running sshd on the container if you have it on the host, etc. I won’t get into this but it shouldn’t be hard to run a separate network namespace and bind the interfaces so that the new instance can run dhcp.

How to do it

So, getting it actually working will likely be the shortest bit in this post.

You need support for namespaces and multiple devpts instances in your kernel:
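The options in question are the namespace family plus multiple devpts instances; the symbol names below are from kernels of that era, so verify them against your own version. The invocation itself (which the next paragraph walks through) is as simple as chroot:

```shell
# Kernel .config fragment -- names as of kernels around that time;
# check your version, since config symbols occasionally change.
#   CONFIG_NAMESPACES=y
#   CONFIG_UTS_NS=y
#   CONFIG_IPC_NS=y
#   CONFIG_PID_NS=y
#   CONFIG_NET_NS=y
#   CONFIG_DEVPTS_MULTIPLE_INSTANCES=y

# With that in place, run it from inside the chroot directory:
cd /path/to/chroot
systemd-nspawn -D .
```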

That’s it – you can exit from it just like a chroot. From inside you can run mount and see that it has taken care of /dev and /tmp for you. The “.” is the path to the chroot, which I assume is the current directory. With no further arguments it runs bash inside.

If you want to add some bind mounts it is easy:

systemd-nspawn -D . --bind /usr/portage

Now your /usr/portage is bound to your host, so there’s no need to sync, etc. If you want to bind to a different destination, add a “:dest” after the source, relative to the root of the chroot (so --bind foo is the same as --bind foo:foo).
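For example, with hypothetical paths (both are illustrative, not from the post):

```shell
# Bind the host's distfiles cache to a different path inside the
# container; the part after ":" is resolved relative to the container
# root, so this shows up as /usr/portage/distfiles inside.
systemd-nspawn -D . --bind /var/cache/distfiles:/usr/portage/distfiles
```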

If the container has a functional init that can handle being run inside, you can add a -b to boot it:

systemd-nspawn -D . --bind /usr/portage -b

Watch the init do its job. Shut down the container to exit.

Now, if that container is running systemd you can direct its journal to the host journal with -j:

systemd-nspawn -D . --bind /usr/portage -j -b

Now, nspawn registers the container so that it shows up in machinectl. That makes it easy to launch a new getty on it, or ssh to it (if it is running ssh – see my note above about network namespaces), or power it off from the host.
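Assuming the container registered under the machine name derived from its directory, interacting with it from the host looks something like this; the machine name is illustrative, and the exact set of machinectl verbs depends on your systemd version:

```shell
# List containers registered with machined.
machinectl list

# Inspect a container named "mychroot" (hypothetical name).
machinectl status mychroot

# Power it off from the host. On versions without this verb, running
# "systemctl -M mychroot poweroff" or logging in on a getty works too.
machinectl poweroff mychroot
```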

22 Responses

Systemd-nspawn is the next thing on my list of potentially useful systemd features to look into. Altho I won’t really grok it until I sit down and actually do it, reading articles like this gives my subconscious a chance to pre-process much of the info so when I do decide I’m ready, uptake is MUCH faster. Plus I mentally bookmark articles on my “upcoming todo” topics and lookup and reread the most pertinent ones when I’m actually ready to jump in. Given that this one’s written from a gentoo angle it’s certain to be top of the list. =:^)

As for that mount --rbind thing, see the shared subtrees operations section of the mount manpage along with the kernel-doc it mentions (Documentation/filesystems/sharedsubtree.txt). This is another thing I’ve read and absorbed enough to have an idea when it might apply, then mentally bookmarked as potentially handy should I need it. Without double-checking that I’ve got it straight, I’m guessing systemd’s rbind mounts probably default to shared (or rshared), when you want (r)slave or possibly (r)private (which without the r, IIRC is the old behavior). Setting the appropriate mount flag in fstab should solve the problem.

I guess you and I /both/ have been around Linux long enough to be a bit bewildered occasionally by all these newfangled features that weren’t there when we learned the ropes. Linux certainly can feel a bit like a video on fast-forward, that you sometimes wish there was a way to slow down a bit or even backup and go over it again, as it just seems like things are changing too fast to keep up, sometimes, and I had a bit of that vertigo feeling when I first read about this stuff, for sure!

Tho of course you’ve solved it with nspawn, now, but someday knowing about the shared/slave/private/unbindable stuff might come in handy too, at least to the point of knowing it’s there and where to lookup the details.

Meanwhile, it might be nice to update, say, the gentoo/amd64 chroot guide, with nspawn, some day. That, modified slightly (actually building out a full image including kernel, tho of course I don’t run them in the chroot) for my 32-bit netbook build image, is my biggest use of chroot right there. I had idly thought of virtualizing instead of chrooting, but containerizing using nspawn seems an even better and now simpler idea, bringing that chroot into the modern era. (That’s actually on my todo list as well, as I’ve not updated the netbook in long enough it’s likely to be easier to almost start over, and I was planning on redoing the chroot already, tho I hadn’t quite linked it up with the nspawn investigation todo yet. Now I potentially have a more modern way to handle it, while killing both the nspawn todo and the chroot-redo todo with one stone! =:^)

Thanks, good info!
However I think you made a typo in journal example:
“… ournal to the host journal with -h:
systemd-nspawn -D . –bind /usr/portage -j -b ”
Is there a service for running containers on boot?

Does systemd-nspawn work seamlessly with qemu-user + systemd-binfmt for running an environment for a different arch? I have been using chroot for that with --rbind mounts, but as you wrote, it jacks up the host system when dismounting everything.

I don’t believe there is any capability for switching archs. I’m not quite sure how you’re switching archs just using chroot, unless you’re just talking about running x86 on an amd64 system (which can be supported by the kernel without virtualization). There is no virtualization, so the kernel+cpu has to be able to execute the binaries in your container.

That said, I suspect something like linux32 could be made to work with nspawn if that is what you’re getting at.

Using binfmt_misc, you can register a static qemu as an interpreter to handle executable file formats for another arch. I use this to mount an image I’ve built for ARM, and chroot into it (i.e. userspace virtualization). It allows faster building of packages using the processing power of an x86_64 processor, but then I unmount the image and flash it to a disk to run natively on the ARM platform. Much faster than building natively on ARM, but also much less fragile than cross-compiling. It’s just that I’ve been having to reboot every time I want to unmount the guest image because the virtual filesystems on the host system get clobbered if I try to unmount them from the mounted guest image.
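For reference, the binfmt_misc registration being described looks roughly like this. The magic/mask pair is the standard ELF-header match for 32-bit ARM (the kernel parses the \x escapes itself), and the qemu path is an assumption about where your static qemu lives:

```shell
# Mount the binfmt_misc filesystem if it isn't already.
mount -t binfmt_misc binfmt_misc /proc/sys/fs/binfmt_misc

# Register static qemu-arm as the handler for 32-bit ARM ELF binaries.
# Magic = ELF ident + e_type + e_machine (0x28 = EM_ARM); the mask
# ignores the OS/ABI byte and the low bit of e_type. printf is used so
# the backslashes reach the kernel literally.
printf '%s' ':qemu-arm:M::\x7f\x45\x4c\x46\x01\x01\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x02\x00\x28\x00:\xff\xff\xff\xff\xff\xff\xff\x00\xff\xff\xff\xff\xff\xff\xff\xff\xfe\xff\xff\xff:/usr/bin/qemu-arm-static:' \
    > /proc/sys/fs/binfmt_misc/register

# The static qemu binary must also exist at that same path *inside*
# the guest tree, so it can still be found after the root switch.
```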

Ok, I’ve been using it for a bit and it seems to work quite well. Tried the “-b” option to get the userspace systemd init to load but it failed. Would be nice if I can get it to work too. Perhaps contention for PID 1… but I would think that namespacing would deal with that. Will have to look into it more.

It wouldn’t be contention with PID 1. However, if anything in your container’s config would prevent systemd from booting in a VM/etc, then it won’t boot in a container either. Usually there isn’t much you need to do with it.

Note that if you’re trying to launch stuff like sshd then you need to either bind everything to different ports or use a separate network namespace.

I just found myself affected by the systemd default-shared thing too, and yes, it /is/ related to the make-shared thing as documented in the mount manpage and in $KERNDIR/Documentation/filesystems/sharedsubtree.txt.

While the kernel defaulted to private (it has a policy of not breaking existing userspace, and anything else certainly would, since private was the previous normal behavior), systemd has no such policy, and it defaults to shared, even tho that forces them to further restrict to slave when various namespace-related options are enabled, because they say it allows nspawn to work “out of the box”.

See the systemd.exec (5) manpage, Options section, under MountFlags=.

But I don’t think I’d want default-shared behavior even in nspawn namespaces, here, as at least for me it pretty much breaks the whole reason I’d bother running namespaces in the first place.

So now I’m debating the best way to switch that systemd default. Of course I could override it in every individual fstab entry, adding “private” to the mount options (as documented in the mount (8) manpage; modern mount (8) does the right thing, making repeated mount (2) syscalls if necessary), but that’s a lot of duplicate “private” entries added to pretty much every single fstab entry!

FWIW, looks like the related systemd source code is in src/core (search on shared). I’m trying to decide whether I patch it to default to private, or whether I simply set up a local boot job (a systemd unit, or taking advantage of the fact that gentoo’s systemd has a generator for the local.d stuff so it still runs when people switch from openrc) that does a mount --make-rprivate /, to run after all the local mounts are done.

(What systemd’s shared-default is interfering with here is the mount --bind / that I use to mount / without submounts (not --rbind) elsewhere, so I can simply backup /, all of / but _only_ /, without backing up anything else mounted on top of it, but so I properly backup stuff on / that might ordinarily not be visible due to over-mounts. Of course for that alone, I could simply make only the / mount private, but that doesn’t solve the root problem, that being that systemd is breaking otherwise working assumptions and overriding the kernel’s default private policy, as well as making insecure-by-default decisions I don’t want any part in, with no documented method that I’ve yet found to change systemd’s default for the entire system. As I said, the systemd devs say they do it to make nspawn work out-of-the-box, but even then, they have to further restrict their own default due to conflicts with other systemd namespace management functionality!)
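The two workarounds being weighed here can be sketched as follows; device names and mount points are illustrative:

```shell
# Option 1: override per entry in fstab by adding "private" to the
# options column, e.g. for the submount-free bind of / used for backups:
#
#   /    /mnt/backup-root    none    bind,private    0 0

# Option 2: a single late-boot command that flips the whole tree back
# to the kernel's old private default in one shot:
mount --make-rprivate /
```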

I’ll have to research the systemd FAQs and previous bugs a bit more before filing my own, requesting a way to change that default to something a bit more security-sane like the private the kernel defaults to, if I can’t find a way to do it with an existing global-mount config option.

Does this kind of containerization also work with foreign-arch files? E.g. I have a MIPS filesystem (already with systemd installed) which I use with qemu-user on an x64 host, chrooting into it to compile stuff “native” faster.

FYI – apologies to all for the delayed responses as for some reason I’m not getting notifications on new comments awaiting moderation. I need to look into that. Glad the article was useful!

I might do a follow-up on setting up a container with a network namespace. It is also pretty easy to do, and I’m now running two containers on their own IPs. The only thing that is a mystery to me is how MAC addresses get assigned to them – they seem to be persistent if you only run a single instance of a container, but how this works isn’t clear to me and as far as I can tell it is stateless.
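As a rough sketch of what that setup involves, systemd-nspawn can create the private network namespace itself; the interface names here are illustrative, and the exact options depend on your systemd version:

```shell
# Give the container its own network namespace plus a veth pair; the
# host side appears as ve-<machine-name> and still needs to be bridged,
# routed, or NATed by you before the container can reach the network.
systemd-nspawn -D . --network-veth -b

# Alternatively, hand an existing host interface to the container
# outright (it disappears from the host while the container runs):
systemd-nspawn -D . --network-interface=eth1 -b
```

With its own namespace the container can then run dhcp, sshd on port 22, and so on without colliding with the host.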

Essentially. Containers are just a way of using kernel namespaces, and lxc, docker, and nspawn are all tools that help you set them up. nspawn is a lot lighter than docker – it is about running containers, not maintaining images for them and all that (at least at the moment).

nspawn is nice – except one thing: I don’t get why it locks the directory tree to be nspawned only once. I mean, I understand there could be conflicts if you start the same stuff within – but that’s not my intention. Even with that volatile state switch it denies double starts. So it basically means I have to copy the whole FS tree for every app – or use an overlay FS.

So to ‘just’ get a fitting “C-environment” for apps to be able to start there is still that one advantage of plain old chroots and ‘manually’ taking care of proc, sys, … imho.

Or am I overlooking something, would be interesting to know if you guys have an immediate answer regarding the reason for that behaviour – or how to work around it.

Cheers & thanks!

PS: One pretty ugly way I found myself: create a new directory and mount all top level directories over, unshared, then start nspawn on it (the lock is based on the inode of the FS container directory).

I couldn’t agree more. I’ve run into this as well. At the very least it should have a command line option to disable this behavior. I can see it as a useful sanity check in case you do it by mistake. However, there is certainly no reason that you can’t run multiple containers out of the same path, especially if they’re ephemeral.