But perhaps the most … intense work this cycle was the addition of apparmor support. The thing we wanted most from apparmor was not yet available: the ability to mediate mounts in the container. If we want to say “the container cannot write to /proc/sysrq-trigger”, then for that to be useful we either need to say “/sysrq-trigger relative to a proc mount”, or we need to be able to prevent proc from being mounted anywhere else (like /mnt). John Johansen, in a huge effort, implemented the kernel apparmor functionality (in a way acceptable upstream!) along with a nice addition to the apparmor profile language, and was always helpful as we were shaking out bugs.
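To make the mount problem concrete, here is a toy sketch in Python. It is not how apparmor matches paths; it uses the standard library's `fnmatch` as a stand-in, and the names `DENY_PATTERNS` and `is_denied` are made up for illustration. It only shows why a rule keyed on the path /proc/sysrq-trigger stops helping once proc can be mounted somewhere else.

```python
from fnmatch import fnmatch

# Hypothetical, simplified stand-in for a path-based deny list.
# Real apparmor rules have their own syntax and matching semantics.
DENY_PATTERNS = ["/proc/sysrq-trigger"]


def is_denied(path, deny_patterns=DENY_PATTERNS):
    """Return True if the path matches any deny pattern."""
    return any(fnmatch(path, pattern) for pattern in deny_patterns)


# With proc mounted in the usual place, the rule catches the write.
print(is_denied("/proc/sysrq-trigger"))

# If the container could mount proc on /mnt, the same file is reachable
# under a path the rule never mentions, so the rule no longer applies.
print(is_denied("/mnt/sysrq-trigger"))
```

Hence the two options in the text: either the rule has to follow the proc mount wherever it goes, or the policy has to stop proc from being mounted anywhere but /proc.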

In the end it was tight, but 12.04 now has containers constrained by apparmor by default!

The apparmor support works as follows. First /usr/bin/lxc-start is automatically transitioned to its own profile, where it is only allowed to mount into the container’s tree. Then, just before executing the container’s init, lxc-start transitions to the container’s own profile. Each container configuration can specify a custom profile, whose name should start with “lxc-” (“unconfined” is also valid); if none is specified, “lxc-default” is used. The default policy attempts to protect the host from accidental container abuses – such as writing to /proc/sysrq-trigger and /proc/mem, changing its cgroup settings (including its devices whitelist), or mounting the host’s devpts instance and subsequently manipulating host ptys. The goal in 12.04 is not to protect the host from a malicious root user in a container, but from accidental abuses in the container.
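The profile selection rule described above can be sketched in a few lines of Python. This is only a model of the rule, not lxc's actual code. It assumes the profile is set with a `lxc.aa_profile = NAME` line in the container configuration, which is the key I believe the lxc releases of this period used but which is not spelled out in this post; the function names are hypothetical.

```python
DEFAULT_PROFILE = "lxc-default"


def resolve_profile(config_text, key="lxc.aa_profile"):
    """Return the profile named in the config, or the default if none is set.

    The key name is an assumption; adjust it to match your lxc version.
    If the key appears more than once, the last value wins in this sketch.
    """
    profile = None
    for raw_line in config_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        name, sep, value = line.partition("=")
        if sep and name.strip() == key:
            profile = value.strip()
    return profile or DEFAULT_PROFILE


def follows_naming_convention(profile):
    """Check the convention from the text: "lxc-" prefix, or "unconfined"."""
    return profile == "unconfined" or profile.startswith("lxc-")


example = """
# hypothetical container config
lxc.utsname = c1
lxc.aa_profile = lxc-custom
"""
print(resolve_profile(example))
print(resolve_profile("lxc.utsname = c2"))
print(follows_naming_convention("unconfined"))
```

A container whose configuration says nothing about apparmor ends up in “lxc-default”, which is the behaviour most users will see.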

An important apparmor feature missing in 12.04, however, is support for stacked profiles. Stacked profiles will implement a profile hierarchy. They will make it possible to have a container, running in its own restrictive profile, further load profiles. For instance, a container will be able to load the libvirt profiles – so that the container is protected from libvirt – but with that libvirt profile being subordinate to the container’s profile.

Since that support is not currently there, one must choose: either run the container in a profile and not allow it to load or transition to any further profiles, or run the container unconfined and allow it to load profiles. By default, the former is chosen, as that will usually be the best choice from the host’s point of view.

The next few releases then will be very exciting from a container security point of view. For 12.10, we hope to further protect the host from containers using seccomp2, which implements a per-process system call filter. We also intend to hook the high-level testsuite into a jenkins instance, and start a code rewrite which will better support good unit tests. For 13.04, we hope to be able to exploit user namespaces and support stacked apparmor profiles. Finally, other features we hope to complete by 13.10 (though getting all of them done is unlikely) include cgroup fake roots, a devices namespace, and a system log namespace.

In terms of general features over those same releases, we will add apport hooks for better debug support, container hooks at various states (e.g. post-create and pre-start), and greater scriptability by providing a liblxc API. And the user namespace should, before 14.04, allow us to support container use by unprivileged users.