Welcome!

Linux Container Internals 2.0 - Lab 4: Container Host

Difficulty:Advanced

Estimated Time:20-30 minutes

Background

This lab focuses on understanding how a particular host actually runs the container images - from the deep internals of how containerized processes interact with the Linux kernel, to the docker daemon and how it receives REST API calls and translates them into system calls to tell the Linux kernel to create new containerized processes.

By the end of this lab you should be able to:

Understand the basic interactions of the major daemons and APIs in a typical container environment

Internalize the function of system calls and kernel namespaces

Understand how SELinux and sVirt secures containers

Command a conceptual understanding of how cgroups limit containers

Use SECCOMP to limit the system calls a container can make

Have a basic understanding of container storage and how it compares to normal Linux storage concepts

Steps

Linux Container Internals 2.0 - Lab 4: Container Host

Step1 of 5

Container Engines & The Linux Kernel

If you just do a quick Google search, you will find tons of architectural drawings which depict things the wrong way or only tell part of the story. This leads the innocent viewer to come to the wrong conclusion about containers. One might suspect that even the makers of these drawings have the wrong conclusion about how containers work and hence propogate bad information. So, forget everything you think you know.

How do people get it wrong? In two main ways:

First, most of the architectural drawings above show the docker daemon as a wide blue box stretched out over the Container Host. The Containers are shown as if they are running on top of the docker daemon. This is incorrect - containers don't run on docker. The docker engine is an example of a general purpose Container Engine. Humans talk to container engines and container engines talk to the kernel - the containers are actually created and run by the Linux kernel. Even when drawings do actually show the right architecture between the container engine and the kernel, they never show containers running side by side:

Second, when drawings show containers are Linux processes, they never show the container engine side by side. This leads people to never think about these two things together, hence users are left confused with only part of the story:

For this lab, let’s start from scratch. In the terminal, let's start with a simple experiment - start three containers which will all run the top command:

docker run -td registry.access.redhat.com/ubi7/ubi top

docker run -td registry.access.redhat.com/ubi7/ubi top

podman run -td registry.access.redhat.com/ubi7/ubi top

Now, let's inspect the process table of the underlying host:

ps -efZ | grep -v grep | grep " top"

Notice that we started each of the top commands in containers. We started two with docker, and one with podman, but they are still just a regular process which can be viewed with the trusty old ps command. That's because containerized processes are just fancy Linux processes with extra isolation from normal Linux processes. Hack around a bit, and notice that the docker daemon runs side by side with the containerized processes. A simplified drawing should really look something like this:

In the kernel, there is no single data structure which represents what a container is. This has been debated back and forth for years - some people think there should be, others think there shouldn't. The current Linux kernel community philosophy is that the Linux kernel should provide a bunch of different technologies, ranging from experimental to very mature, enabling users to mix these technologies together in creative, new ways. And, that's exactly what a container engine (Docker, Podman, CRI-O, etc) does - it leverages kernel technologies to create, what we humans call, containers. The concept of a container is a user construct, not a kernel construct. This is a common pattern in Linux and Unix - this split between lower level (kernel) and higher level (userspace) technologies allows kernel developers to focus on enabling technologies, while users experiment with them and find out what works well.

The Linux kernel only has a single major data structure that tracks processes - the process id table. The ps command dumps the contents of this data structure. But, this is not the total definition of a container - the container engine tracks which kernel isolation technologies are used, and even what data volumes are mounted. This information can be thought of as metadata which provides a definition for what we humans call a container. We will dig deeper into the technical underpinnings, but for now, understand that containerized processes are regular Linux processes which are isolated using kernel technologies like namespaces, selinux and cgroups. This is sometimes described as "sand boxing" or "isolation" or an "illusion" of virtualization.

In the end though, containerized processes are just regular Linux processes. All processes live side by side, whether they are regular Linux processes, long lived daemons, batch jobs, interactive commands which you run manually, or containerized processes. All of these processes make requests to the Linux kernel for protected resources like memory, RAM, TCP sockets, etc. We will explore this deeper in later labs, but for now, commit this to memory...

Step by step creation of a container

In this step, we are going to investigate the basic construction of a container. The general contruction of a container is similar with almost all OCI compliant container engines:

Pull/expand/mount image

Create OCI compliant spec file

Call runc with the spec file

Pull/Expand/Mount Image

Let's use podman to create a container from scratch. Podman makes it easy to break down each step of the container construction for learning purposes. First, let's pull the image, expand it, and create a new overlay filesystem layer as the read/write root filesystem for the container. To do this, we will use a specially constructed container image which lets us break down the steps instead of starting all at once:

After running the above command we have storage created. Notice under the STATUS column, the container is the "Created" state. This is not a running container, just the first step in creation has been executed:

podman ps -a

Try to look at the storage with the mount command (hint, you won't be able to find it):

mount | grep -v docker | grep merged

Hopefully you didn't look for too long because you can't see it with the mount command. That's because this storage has been "mounted" in what's called a mount namespace. You can only see the mount from inside the container. To see the mount from outside the container, podman has a cool feature called podman-mount. This command will return the path of a directory which you can poke around in:

podman mount on-off-container

The directory you get back is a system level mount point into the overlay filesystem which is used by the container. You can literally change anything in the container's filesystem now. Run a few commands to poke around:

mount | grep -v docker | grep merged

ls $(podman mount on-off-container)

touch $(podman mount on-off-container)/test

ls $(podman mount on-off-container)

See the test file there. You will see this file in our container later when we start it and create a shell inside of it. Let's move on to the spec file...

Create Spec File

At this point the container image has been cached locally and mounted, but we don't actually have a spec file for runc yet. Creating a spec file from hand is quite tedious because they are made up of complex JSON with a lot of different options (governed by the OCI runtime spec). Luckily for us, the container engine will create one for us. This exact same spec file can be used by any OCI compliant runtime can consume it (runc, crun, katacontainers, gvisor, etc). Let's run some experiments to show when it's created. First let's inspect the place where it should be:

The above command errors out because the container engine hasn't created the config.json file yet. We will initiate the creation of this file by using podman combined with a specially constructed container image:

podman start on-off-container

Now, the config.json file has been created. Inspect it for a while. Notice that there are options in there that are strikingly similar to the command line options of podman. The spec file really highlights the API:

Podman has not started a container, just created the config.json and immediately exited. Notice under the STATUS column, that the container is now in the Exited state:

podman ps -a

In the next step, we will create a running container with runc...

Call Runtime

Now that we have storage and a config.json, let's complete the circuit and create a containerized process with this config.json. We have constructed a container image, which only starts a process in the container if the /mnt/on file exists. Let's create the file, and start the container again:

touch /mnt/on

podman start on-off-container

When podman started the container this time, it fired up top. Notice under the STATUS column, that the container is now in the Up state::

podman ps -a

Now, let's fire up a shell inside of our running container:

podman exec -it on-off-container bash

Now, look for the test file we created before we started the container:

ls -alh

The file is there like we would expect. You have just created a container in three basic steps. Did you know and understand that all of this was happening every time you ran a podman or docker command? Now, clean up your work:

Notice that each container is labeled with a dynamically generated Multi Level Security (MLS) label. In the example above, the first container has an MLS label of c228,c810 while the second has a label of c184,c827. Since each of these containers is started with a different MLS label, they are prevented from accessing each other's memory, files, etc.

SELinux doesn't just label the processes, it must also label the files accessed by the process. Make a directory for data, and inspect the SELinux label on the directory. Notice the type is set to "user_tmp_t" but there are no MLS labels set:

Finally, look at the MLS label set on the directory, it is always the same as the last container that was run. The :Z option auto-labels and bind mounts so that the container can access and change files on the mount point. This prevents any other process from accessing this data and is done transparently to the end user.

ls -alhZ /tmp/selinux-test/

Cgroups: Dynamically created with container instantiation

The goal of this exercise is to gain a basic understanding of how containers prevent using each other's reserved resources. The Linux kernel has a feature called cgroups (abbreviated from control groups) which limits, accounts for, and isolates the resource usage (CPU, memory, disk I/O, network, etc.) of processes. Normally, these control groups would be set up by a system administrator (with cgexec), or configured with systemd (systemd-run --slice), but with a container engine, this configuration is handled automatically.

Notice how each containerized process is put into its own cgroup by the container engine. This is quite convenient, similar to sVirt.

SECCOMP: Limiting how a containerized process can interact with the kernel

The goal of this exercise is to gain a basic understanding of SECCOMP. Think of a SECCOMP as a firewall which can be configured to block certain system calls. While optional, and not configured by default, this can be a very powerful tool to block misbehaved containers. Take a look at this sample:

Help

Katacoda offerings an Interactive Learning Environment for Developers. This course uses a command line and a pre-configured sandboxed environment for you to use. Below are useful commands when working with the environment.

cd <directory>

Change directory

ls

List directory

echo 'contents' > <file>

Write contents to a file

cat <file>

Output contents of file

Vim

In the case of certain exercises you will be required to edit files or text. The best approach is with Vim. Vim has two different modes, one for entering commands (Command Mode) and the other for entering text (Insert Mode). You need to switch between these two modes based on what you want to do. The basic commands are: