Introducing dumb-init, an init system for Docker containers

Chris K., Software Engineer

Jan 6, 2016

At Yelp we use Docker containers everywhere: we run tests in them, build tools
around them, and even deploy them into production. In this post
we introduce dumb-init, a simple init system written in C which we
use inside our containers.

Lightweight containers have made running a single process without normal init
systems like systemd or sysvinit practical. However, omitting an init system
often leads to incorrect handling of processes and signals, and can result in
problems such as containers which can’t be gracefully stopped, or leaking
containers which should have been destroyed.

dumb-init is simple to use and solves many of these problems: you just add it
to the front of any container’s command, and it takes on the role of PID 1
itself. It immediately spawns your command as PID ~2, then proxies any signals
it receives on to it. This shields your process from the special kernel
behavior applied to PID 1, while dumb-init handles the regular responsibilities
of an init system (like reaping orphaned zombie processes).

The motivation: modeling Docker containers as regular processes

What we really want is to be able to treat Docker containers just like regular
processes, so that we can slowly migrate our tools and infrastructure toward
Docker. Instead of forcing developers to unlearn their existing workflow, we
can move individual commands into containers without developers even realizing
they’re spawning a Docker container on each invocation.

It also lets us take a practical approach to Docker in development: rather than
require that everything live in a container, we can choose to use containers
when it makes sense from a business or technical perspective.

To achieve this goal, we want processes to behave just as if they weren’t
running inside a container. That means handling user input, responding the same
way to signals, and dying when we expect them to. In particular, when we signal
the docker run command, we want that same signal to be received by the process
inside.

Our quest to model Docker containers as regular processes led us to discover
more than we ever wanted to know about how the Linux kernel handles processes,
sessions, and signals.

Process behavior inside Docker containers

Containers are unique in that they often run just a single process, unlike
traditional servers where even a minimal install usually runs at least a
complex init system, cron, syslog, and an SSH daemon.

While single-process containers are quick to start and light on resources, it’s
important to remember that, for most intents and purposes, these containers are
full Linux systems. Inside your container, the process running as PID 1 has
special rules and responsibilities as the init system.

What is PID 1 inside a container? There are two common scenarios.

Scenario 1: A shell as PID 1

A quirk of Dockerfiles is that if you specify your container’s command without
using the recommended JSON syntax, Docker will feed your command into a shell
for execution.
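As a sketch (with my_server.py standing in for your app), the two forms look
like this in a Dockerfile:

```dockerfile
# Shell form: Docker wraps the command in "/bin/sh -c ...",
# so a shell becomes PID 1 inside the container.
CMD python my_server.py

# JSON (exec) form: the command is executed directly, with no shell
# in between, so your process itself becomes PID 1.
# CMD ["python", "my_server.py"]
```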

That results in a process tree that looks like:

docker run (on the host machine)
  └─ /bin/sh (PID 1, inside container)
       └─ python my_server.py (PID ~2, inside container)

Having a shell as PID 1 actually makes signaling your process almost
impossible. Signals sent to the shell won’t be forwarded to the subprocess, and
the shell won’t exit until your process does. The only way to kill your
container is by sending it SIGKILL (or if your process happens to die).

For this reason, you should always try to avoid spawning a shell. If you can’t
easily avoid that (for example, if you want to spawn two processes), you should
exec your last process so that it replaces the shell.
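A quick way to see what exec does is to compare the shell’s PID before and
after it (a minimal demonstration; in a real entrypoint script, exec would be
the last line, replacing the shell with your server):

```shell
# exec replaces the shell in place instead of forking a child,
# so the PID stays the same across the call
sh -c 'echo "pid before exec: $$"; exec sh -c "echo pid after exec: \$\$"'
```

Because the PIDs match, signals sent to the container’s PID 1 now reach your
process directly rather than being swallowed by an intermediate shell.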

Scenario 2: Your process as PID 1

When you use the recommended syntax in your Dockerfile, your process is started
immediately and acts as the init system for your container, resulting in a
process tree that looks like:

docker run (on the host machine)
  └─ python my_server.py (PID 1, inside container)

This is better than the first scenario; your process will now actually receive
signals you send it. However, being PID 1, it might not respond to them quite
as you expect it to.

Trouble signaling PID 1

The Linux kernel treats PID 1 as a special case, and applies different rules
for how it handles signals. This special handling often breaks the assumptions
that programs or engineers make.

First, some background. Any process can register its own handlers for signals
like TERM and use them to perform cleanup before exiting. If a process hasn’t
registered a custom signal handler, the kernel falls back to the default
behavior for a TERM signal: killing the process.

For PID 1, though, the kernel won’t fall back to any default behavior for
TERM. If your process hasn’t registered its own handler (and most processes
don’t), TERM will have no effect on it.

Since we’re modeling containers as processes, we’d like to just send SIGTERM
to the docker run command and have the container stop. Unfortunately, this
usually doesn’t work.

When docker run receives SIGTERM, it forwards the signal to the container and
then exits, even if the container itself never dies. In fact, the TERM signal
will often bounce right off of your process without stopping it because of the
PID 1 special case.

Even using the command docker stop won’t do what you want; it sends TERM
(which the Python process won’t notice), waits ten seconds, and then sends KILL
when the process still hasn’t stopped, immediately stopping it without any
chance to do cleanup.

Not being able to properly signal services running inside your Docker container
has lots of implications, both in development and in production. For example,
when deploying a new version of your app, you might have to kill the previous
version without letting it clean up (potentially interrupting it in the middle
of serving a request, or leaving connections open to your database). It also
leads to a common problem in CI systems (such as Jenkins) where aborted tests
leave Docker containers running in the background.

The same problem applies to other signals. The most notable case is SIGINT,
the signal generated when you press ^C in a terminal. Since this signal is
caught even less frequently than SIGTERM, it can be especially troublesome
trying to manually kill servers running in your development environment.

dumb-init to the rescue

To address this need, we created dumb-init, a minimal init system intended to
be used in Linux containers. Instead of executing your server process directly,
you prefix it with dumb-init in your Dockerfile, such as CMD
["dumb-init", "python", "my_server.py"]. This creates a process tree that looks
like:

docker run (on the host machine)
  └─ dumb-init (PID 1, inside container)
       └─ python my_server.py (PID ~2, inside container)
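In a Dockerfile, this can look like the following (a sketch which assumes a
dumb-init binary is available in your build context; the paths are
illustrative):

```dockerfile
# assumes a dumb-init binary sits next to the Dockerfile
COPY dumb-init /usr/local/bin/dumb-init
RUN chmod +x /usr/local/bin/dumb-init

# run every command under dumb-init by default
ENTRYPOINT ["/usr/local/bin/dumb-init", "--"]
CMD ["python", "my_server.py"]
```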

dumb-init registers signal handlers for every signal that can be caught, and
forwards those signals on to a session rooted at your process. Since your
Python process is no longer running as PID 1, when dumb-init forwards it a
signal like TERM, the kernel will still apply the default behavior (killing
your process) if it hasn’t registered any other handlers.

Using a regular init system also solves these problems, but at the expense of
increased complexity and resource usage. dumb-init is a simpler way to do
things properly: it spawns your process as its only child, and proxies signals
to it. dumb-init won’t actually die until your process dies, allowing you to do
proper cleanup.

dumb-init is deployed as a statically-linked binary with no extra dependencies;
it’s ideal as a simple init system and can typically be added to any container.
We recommend using it in basically any Docker container: not only does
dumb-init improve signal handling, it also takes care of other functions of an
init system, such as reaping orphaned zombie processes.

You can find much more information about the role of dumb-init and how to start
using it on its GitHub page.