Karl Matthias

Building Tiny, Reliable Docker Container Images

Building good, clean Docker container images is a bit of an art, and there is a lot
of conflicting advice out there about how to build them properly. I’d like to
share some thoughts gained from running Docker containers in production for two
years at New Relic. Some of this is discussed in the O’Reilly book Docker: Up
and Running, which I co-wrote with Sean Kane. There are all kinds of best
practices we could talk about. Here I’ll focus on a few aimed at making images
small and reliable.

Size Matters

At scale you’re going to be shipping around tens, hundreds, or thousands of
copies of your image. Docker’s distribution mechanism means that each host will
need to have a copy of all of the layers of your image locally. If your image
is 800MB then every host has 800MB of data to pull at least once. There
are a few prongs of attack to get that size down to a minimum. Some of these
also help make your images more reliable, as you’ll see.

Use Standard Base Images

Build all of your production containers on top of a simple set of base images.
If your base image is big (usually meaning it’s based on a full OS distro),
then with this pattern you still have to pull all of those layers, but you
only do it once per host rather than for every application on each deployment.
Since the layers are shared between apps, the overhead is spread across all of
them. There are other benefits to this commonality, too, so it’s a good
pattern even if you don’t need the space.
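
To put rough numbers on that layer sharing, here’s a minimal sketch in Python.
The image names and layer sizes are made up for illustration, and it models a
host as fetching each distinct layer exactly once, which is a simplification
of what Docker really does.

```python
MB = 1000 * 1000

# Hypothetical images: each is a list of (layer_id, size_in_bytes) pairs.
# All three apps sit on the same big "os-base" layer.
IMAGES = {
    "app-a": [("os-base", 700 * MB), ("app-a-code", 40 * MB)],
    "app-b": [("os-base", 700 * MB), ("app-b-code", 60 * MB)],
    "app-c": [("os-base", 700 * MB), ("app-c-code", 25 * MB)],
}


def bytes_pulled_shared(images):
    """Bytes one host downloads if every distinct layer is fetched once."""
    unique_layers = {}
    for layers in images.values():
        for layer_id, size in layers:
            unique_layers[layer_id] = size
    return sum(unique_layers.values())


def bytes_pulled_unshared(images):
    """Bytes one host downloads if no layers were shared between images."""
    return sum(size for layers in images.values() for _, size in layers)


print(bytes_pulled_shared(IMAGES) // MB)    # 825
print(bytes_pulled_unshared(IMAGES) // MB)  # 2225
```

With these made-up sizes, the shared base is paid for once and each additional
app only costs its own small layer.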

I recommend constructing an image hierarchy with a build job that rebuilds,
re-tags, and re-pushes all the affected base images whenever any upstream node
in the tree changes. This pattern has worked out well in my experience. Anyone
who builds an application image derived from one of those base images gets the
newest version of the base they depend on, even when the change was made
further upstream. It also means you detect changes that break downstream
images immediately, not when an application build fails down the road.

Here’s a linear example:

OS base -> Webapp base -> Ruby webapp -> Your Application

The behavior you want is that if security updates are applied to the “OS base”
image, your “Webapp base” and “Ruby webapp” images are automatically rebuilt,
re-tagged, and re-pushed. Or imagine that the Nginx config in “Webapp base” was
just improved. You want all future builds of “Your Application” to pick that
up even though it’s built from “Ruby webapp” and not “Webapp base” directly.
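
The core of such a build job is working out which base images to rebuild, and
in what order. Here’s a rough sketch in Python, not the job we actually ran. It
assumes the hierarchy is a tree stored as a parent-to-children mapping; the
“worker-base” image is a hypothetical sibling added to show branching. A real
job would then build, tag, and push each image in the returned order.

```python
from collections import deque

# Assumed layout: each base image maps to the base images built FROM it.
# "worker-base" is a hypothetical extra branch for illustration.
CHILDREN = {
    "os-base": ["webapp-base", "worker-base"],
    "webapp-base": ["ruby-webapp"],
    "worker-base": [],
    "ruby-webapp": [],
}


def rebuild_order(children, changed):
    """Return the changed image plus all of its descendants, parents first.

    A breadth-first walk of a tree always visits a parent before any of its
    children, so building in this order means every image is built on top of
    a freshly rebuilt base.
    """
    order = []
    seen = {changed}
    queue = deque([changed])
    while queue:
        image = queue.popleft()
        order.append(image)
        for child in children.get(image, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return order


print(rebuild_order(CHILDREN, "os-base"))
print(rebuild_order(CHILDREN, "webapp-base"))
```

A change to “webapp-base” leaves “os-base” and “worker-base” alone, while a
change to “os-base” rebuilds everything beneath it.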

Don’t Ship a Whole Linux Distribution

Savings: Huge.

If you’re building and deploying Docker containers FROM ubuntu or FROM
centos and the like, then you may be causing yourself a lot of unnecessary
pain. In some cases this is the right thing to do. But for many applications
you can get away with much less. I’ll talk about the most minimal option in
the next section. But let’s assume that you need a shell and maybe some other
tools to bootstrap your application, and that’s why you’re using one of the
big distro base images. The good news is that there are great alternatives out
there. I won’t go into all of them; I’ll just tell you about my favorite:
Alpine Linux. This is a tiny distribution, aimed at embedded systems and other
small installations. It’s perfect for containers because it has a full package
manager, a lot of available packages, and a well-maintained Docker base image.

So, next time try FROM gliderlabs/alpine:3.3 and see if it works for you.

Statically Link Your Applications

Savings: Potentially huge. In many cases no benefit.

If you can get away with it, you can ship the most minimal application of all:
your application binary and assorted supporting files. Rather than even using
Alpine Linux as your base, here you just declare FROM scratch. This is a
great way to ship Go applications, or Rust, or C, or other compiled languages
where the application artifact can be a statically linked binary. If you’re
running a JVM, or a Python, Ruby, or Node app, then this is probably not a
solution for you. But if you can get away with it, there is basically 12KB of
overhead here on top of your application. That’s pretty minimal! Your
Dockerfile then shrinks to a few lines, with FROM scratch and then adding
your application and its configs.
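
To show how short that Dockerfile gets, here’s a small Python sketch that
renders one. The binary name “myapp” and the config path are hypothetical;
FROM, COPY, and ENTRYPOINT are standard Dockerfile instructions. It assumes the
binary is fully statically linked and sits in the build context.

```python
import json


def scratch_dockerfile(binary, configs=(), args=()):
    """Render Dockerfile text for a static binary on the empty scratch base.

    configs is a sequence of (source_path, destination_path) pairs.
    """
    lines = ["FROM scratch", f"COPY {binary} /{binary}"]
    for src, dest in configs:
        lines.append(f"COPY {src} {dest}")
    # scratch contains no shell, so ENTRYPOINT has to use the exec
    # (JSON array) form rather than the shell form.
    lines.append("ENTRYPOINT " + json.dumps([f"/{binary}", *args]))
    return "\n".join(lines) + "\n"


# Hypothetical application name and config location.
print(scratch_dockerfile("myapp", configs=[("config.yml", "/etc/myapp/config.yml")]))
```

That prints a four-line Dockerfile: the FROM scratch line, two COPY lines, and
an ENTRYPOINT pointing at the binary.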

Process Management is Important

You need a program running at the top of your container that is meant to be run
as PID 1 on a Linux system. On the major distributions that process is usually
SysV init, Upstart, or systemd. You need something like that in your
container. Phusion wrote a good post explaining why this is important so I
won’t rehash that here. Suffice it to say that you need a real PID 1 process
at the top of your container’s process tree. But there are other reasons.

There is a very worthy goal in the Docker community of running as little in
your container as possible. I applaud that. But reality is a harsh master and
it turns out that in widespread production deployments lots of things go wrong
no matter how well you build them. And as deployments get more fluid, with
changes flying constantly, they break more often. The best solution here is to
use a platform that schedules your containers to hosts and manages their life
cycle. Examples are Mesos with Marathon, Kubernetes, Deis, and friends. Even
so, it’s often the case that you need to run more than one process in your
container, and then you must have something inside the container that makes
sure the processes that depend on each other all stay running. Docker and
outside schedulers can see the container, but if one process in the container
dies they may not notice. Don’t go overboard running lots of things in a
container. Your container should do one thing. But doing one thing doesn’t
mean you must have one process.

Enter the process manager. A lot of alternatives exist in this space. But you
want something that can both act as a real PID 1 and do active process
management. Phusion recommends runit, which is a perfectly fine solution. If
you are running a big distro, you can easily use Upstart or systemd. I have
used supervisor here extensively and can recommend its robustness, but it
doesn’t handle the other duties of PID 1 (reaping children, handling signals,
etc.). Another negative is that it needs a whole Python environment, which can
add 25-40MB to your container size.

My new favorite in this space is Skaware’s S6, a not-that-well-known
alternative that is dead simple to use and configure. It’s minimalist but
without giving up ability. The binaries are statically linked, very small, and
each does one thing (take that, systemd). Installing it in your container
image involves unpacking a single tarball and dropping in a config file.
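
Whichever tool you pick, the duties are the same. To make them concrete, here’s
a toy sketch in Python for Linux. It is not how S6 or runit are implemented and
it is not something to ship; the supervise function and its behavior are my own
illustration. It starts a set of commands, forwards SIGTERM and SIGINT to them
(the kernel applies no default signal actions to PID 1, so handlers have to be
installed explicitly), reaps any child that exits, including orphans adopted
by PID 1, and shuts the rest down as soon as one supervised process dies.

```python
import os
import signal


def _exit_code(status):
    """Translate a waitpid() status into a shell-style exit code."""
    if os.WIFEXITED(status):
        return os.WEXITSTATUS(status)
    return 128 + os.WTERMSIG(status)


def supervise(commands):
    """Run commands together; return the exit code of the first to die."""
    pids = {os.spawnvp(os.P_NOWAIT, cmd[0], cmd) for cmd in commands}

    def signal_children(signum):
        for pid in list(pids):
            try:
                os.kill(pid, signum)
            except ProcessLookupError:
                pass  # already gone

    def forward(signum, _frame):
        signal_children(signum)

    watched = (signal.SIGTERM, signal.SIGINT)
    previous = {s: signal.signal(s, forward) for s in watched}
    first_exit = None
    try:
        while pids:
            try:
                # Blocks until any child exits, supervised or adopted.
                pid, status = os.waitpid(-1, 0)
            except ChildProcessError:
                break  # no children left at all
            if pid not in pids:
                continue  # an adopted orphan; reaping it was the whole job
            pids.discard(pid)
            if first_exit is None:
                first_exit = _exit_code(status)
                # One supervised process died, so stop the others too.
                signal_children(signal.SIGTERM)
    finally:
        for s, old in previous.items():
            signal.signal(s, old if old is not None else signal.SIG_DFL)
    return first_exit
```

A wrapper script could call supervise with the commands for, say, an app server
and a log shipper, then exit with the returned code so that Docker or a
scheduler sees the container die instead of limping along half-broken.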

One final point in this section. If you’re not running a distributed scheduler,
you owe it to yourself to use a process manager, even if there’s only one other
process in your container. Without something managing your application health,
you’re relying on Docker to restart your dead container for you. My experience
says the odds of that are much worse than 50-50. A process manager will make
sure your container stays where you put it. Without one, you’re on an
upside-down roller coaster with no safety bar.

Conclusion

That’s some coverage of a few simple ideas that can have a big impact on the
reliability and sustainability of your Docker images. Docker is a great tool,
but getting the platform around it right is not trivial. This advice is based
on two years of real production Docker use where we shipped 75+ application
deployments on Docker per day. It should stand you in good stead.