Docker at Lyst

We don’t have a long history of using Docker at Lyst. On June 9th 2014
Docker 1.0 was released. On June 16th we started working on moving our entire
platform to it, and by entire platform we mean everything: development, test
and production. We’re not quite finished rolling it out to all our services,
but we’ve learnt a lot of lessons and it’s had a big impact on how we work.

Many discussions of Docker talk about how to deploy micro-services and other
simple applications. This isn’t one of those. This is how we deploy half a
million lines of our own code, written over almost 5 years, many times a day,
using almost identical tooling from our laptops to our production servers. It’s
also the story of how in the space of 2 months we went from deploying twice a
month to deploying twice an hour or more.

Lyst is not-small

Our stack is more or less your usual Django
website on AWS. We use Python 2.7 running inside
uwsgi, Django 1.5 and
PostgreSQL 9.3. We also depend on over 100 third-party
libraries, ranging from Boto to
SciPy, and communicate with at least 10 different data
systems. On top of that we merge over 35 pull requests a day. With all this
going on, deployment was non-trivial and it was often hard to keep track of
exactly what was going out. At the very least we wanted a way to get rid of the
Fortran compiler needed to build SciPy that was living on all our webservers.

Building

Avoiding build dependencies in your container image is a bit tricky with Python
applications. The obvious idea is to use Wheels to install binaries of
everything instead of using the source packages. Except
PyPI doesn’t have Linux wheels yet, so we have to
compile and store all our binaries in our own pip-compatible index server.

To do this we use devpi and a dedicated build
container running devpi-builder
to ensure we have the right wheels available. Our main application container
then installs its dependencies directly from this package server. This also has
the advantage of making the container build process very fast and more tolerant
of PyPI outages.
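
The flow looks roughly like this. The index URL and names are illustrative,
not our real setup, and the pip wheel line stands in for what devpi-builder
automates:

```shell
# In the dedicated build container: compile every requirement into a wheel.
# (devpi-builder does this for us and uploads the results to devpi.)
pip wheel --wheel-dir /wheels -r requirements.txt

# In the application container: install binaries from the internal devpi
# index, never from PyPI. Hostname and index path are hypothetical.
pip install \
    --index-url http://devpi.internal:3141/root/prod/+simple/ \
    -r requirements.txt
```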

One rough edge in this process is that getting any kind of configuration into
the container build environment is a bit tricky. Dockerfiles don’t provide a lot
of flexibility, so we use a Makefile to
manage the build context. This is a long way from being as nice as it could be,
particularly as Docker currently uses both a file’s content hash and its
modification time in its cache keys. This means every git checkout or file copy
invalidates our build cache. To work around this we use a script that scans the
git log to fix the timestamps.
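
A minimal sketch of the idea, assuming GNU touch (the real script may differ):

```shell
# Reset every tracked file's mtime to the time of its last commit, so that
# Docker's cache keys (content hash plus mtime) survive a fresh checkout.
# Sketch only; assumes GNU touch for the -d "@epoch" form.
reset_mtimes() {
    git ls-files | while read -r path; do
        ts=$(git log -1 --format=%ct -- "$path")
        if [ -n "$ts" ]; then
            touch -d "@$ts" "$path"
        fi
    done
}
```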

It’s not pretty but it does the job.
It looks like 1.4 will fix this,
which will make our build scripts much tidier.

As well as the build configuration we also wanted to tidy up our application
configuration. We adopted the pattern advocated by designs like
12factor: moving as much of our configuration
as possible into environment variables. This meant making things smarter in
places so that we reduced the total number of variables that needed setting.
It was definitely worth it though. We now have one good default for
development that is overridden for production and testing.
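
In shell terms the pattern looks something like this. The variable names and
defaults here are illustrative, not our real settings; production and testing
override them via `docker run -e`:

```shell
# One development-friendly default per setting; anything set in the
# environment (e.g. by `docker run -e`) wins over the default.
: "${DATABASE_URL:=postgres://localhost/lyst_dev}"
: "${DJANGO_DEBUG:=true}"
export DATABASE_URL DJANGO_DEBUG
```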

The other remaining wart in our build process is static assets. Our containers
install quite a lot of software that is only used to compile static assets, like
optipng (via imagemin) and
Uglify.js. The Django app doesn’t care
about these details and it would be quite nice to make this step more
independent. Unfortunately, getting the output of our static build into the
Django container is quite hard right now. The technique we use is to build a
Dockerfile that compiles our current assets and then use docker cp to copy the
compiled data out of a temporary container. This gives us some of the benefits
of the build cache but requires 4 separate steps to actually get the data into
the application.
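
The four steps look roughly like this (the image, container and path names
are made up for illustration):

```shell
# 1. Build an image whose build step compiles the static assets.
docker build -t lyst/assets assets/
# 2. Start a throwaway container from it (it only needs to exist, not run).
docker run --name assets-tmp lyst/assets true
# 3. Copy the compiled output from the stopped container onto the host.
docker cp assets-tmp:/build/static ./static
# 4. Remove the temporary container.
docker rm assets-tmp
```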

Development

We support both Linux and Mac OS X development environments. On OS X we do this
using boot2docker. It mostly works, but we are
considering moving to a Vagrant-based host using a commercial VM for better
performance. To get local code editing working, boot2docker mounts the
entire /Users folder inside the host VM. We then use Docker volume mounts to
expose the live source code to the container.

Getting the development containers up and running is currently done using a
Makefile. We originally planned to use Fig for this but at
the time there was a bug that caused
it to have no output in Jenkins. You should probably just use Fig though. It’s
much nicer and actually understands the dependency tree between your containers.

As part of standardising our application we also replaced the Django runserver
with uwsgi. This meant we needed to figure out how to cope with code reloading
and static files nicely. For static files we turned to the
Whitenoise WSGI middleware and hooked it
into the reload API in Django. This gave us a pretty close approximation of
runserver, but running on uwsgi instead. That makes it easy to test
changes to our WSGI configuration and gives us lots of confidence that it
will work in production.
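
A hedged sketch of what a development invocation can look like. The module
path and numbers are hypothetical, and uwsgi’s own --py-autoreload option is
shown here as a simpler stand-in for the Django reload hook we actually use:

```shell
# Hypothetical development invocation. WhiteNoise serves static files from
# inside the WSGI application itself; --py-autoreload is uwsgi's built-in
# (cruder) way to get code reloading.
uwsgi --http :8000 \
      --module lyst.wsgi:application \
      --master --processes 2 \
      --py-autoreload 1
```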

Testing

Like almost everyone, we use Jenkins to run our testing
pipeline. With the aid of a custom web service that talks to GitHub, we run
tests on every pull request both before and after merging. Docker makes
creating identical build environments for this easy to manage, particularly as
we now need to run large numbers of tests concurrently and test against a
complex array of services.

Whenever we get a clean build of our main branch we push updated images to our
private registry tagged with the git commit id. This makes our deploys very
transparent as it’s easy to tie specific Docker images back to the list of
changes that went into them.
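
The push step amounts to something like this (the registry hostname is
illustrative):

```shell
# Tag the freshly built image with the git commit id and push it, so any
# deployed image maps straight back to a point in history.
sha=$(git rev-parse HEAD)
docker build -t "registry.internal:5000/lyst/app:$sha" .
docker push "registry.internal:5000/lyst/app:$sha"
```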

Production

In production we currently deploy the two main services that make up our
website: the main Django application and the supporting
Celery workers. We deploy our application to
Docker using a simple Python script that finds Docker hosts using EC2 instance
tags. It talks to our nginx servers to carry out a rolling restart of the
application without interrupting users’ traffic.
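
In outline the rolling restart works something like this. The helper commands
and image tag below are hypothetical; the real tool is a Python script that
discovers hosts via EC2 instance tags and talks to nginx directly:

```shell
# Very rough sketch of a rolling restart across Docker hosts.
sha=$(git rev-parse HEAD)
for host in $(discover_docker_hosts); do   # hypothetical helper
    drain_host_in_nginx "$host"            # hypothetical: stop sending traffic
    ssh "$host" "docker pull lyst/app:$sha \
        && docker stop app && docker rm app \
        && docker run -d --name app lyst/app:$sha"
    restore_host_in_nginx "$host"          # hypothetical: back into rotation
done
```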

Thanks to recent additions like the
“exec” command,
troubleshooting has become a lot easier. There’s still lots of room to improve,
though; for example, the lack of good socket information inside /proc can make
tools like lsof and netstat quite frustrating to use.
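
For example, exec lets you open a shell inside a running container or run
one-off commands in it (the container name here is illustrative):

```shell
# Open an interactive shell inside a running container...
docker exec -it app /bin/bash
# ...or run a one-off command without attaching.
docker exec app ps aux
```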

While there are still a lot of sharp edges we are happy with Docker and the
direction it seems to be going in. It’s changing fast but more often than not
that pace is solving our problems, not adding to them. It’s helped us get a
large system under control and given us a much more flexible development
process for the future.

P.S.

Now we’ve had some time to let Docker settle into our environment we’ve started
thinking about how things could be better. We are looking forward to finding
solutions to these issues we still have with Docker.

The private registry is not great. It’s slow and has taken more tuning and
nginx configuration than it should. It also has no built-in authentication,
which makes exposing it to our remote developers yet another thing to work
around. We’ve heard rumours of some big changes coming here though.

Docker host daemons have no authentication, so you have to do everything using
either firewall rules or SSH. Hopefully 1.4 will start to solve this via
the new libtrust library.

Dockerfiles are great for simple cases but need a lot of external help to use
in more complex situations. Even simple variable expansion would be helpful.

The default Docker host logging setup works well but has some pretty glaring
problems. For example, it never rotates the log files, so they grow forever.
We’d much prefer to have Docker send our logs out over something like the syslog
protocol. Logging drivers sound
like they might solve this problem.
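
In the meantime, one stopgap is a logrotate rule over Docker’s per-container
JSON logs. The paths are Docker’s defaults; the stanza itself is illustrative:

```
/var/lib/docker/containers/*/*-json.log {
    daily
    rotate 7
    compress
    copytruncate
}
```

copytruncate matters here because Docker keeps the log file open and would
otherwise carry on writing to the rotated file.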

Devicemapper will do strange
things in some configurations.
For example, on our Jenkins servers we have had containers become impossible to
remove several times. This doesn’t make our ops team happy. Our workaround for
this is to just use AUFS. This has some performance issues when files become
invalidated, but it is at least reliable.
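
Pinning the storage driver is a one-flag change when starting the 1.x-era
daemon:

```shell
# Select AUFS explicitly instead of whatever the host defaults to.
docker -d --storage-driver=aufs
```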