At Stripe, we make extensive use of automated testing to help ensure
the stability and reliability of our services. We have expansive
test coverage for our API and other core services, we run tests on a
continuous integration server over every git branch, and we never deploy without green
tests.

The size and complexity of our codebase has grown over the past few years—and
so has the size of the test suite. As of August 2015, we have over 1400 test
files that define nearly 15,000 test cases and make over 130,000 assertions.
According to our CI server, the tests would take over three hours if run
sequentially.

With a large (and growing) group of engineers waiting for those tests
with every change they make, the speed of running tests is
critical. We’ve used a number of hosted CI solutions in the past,
but as test runtimes crept past 10 minutes, we brought
testing in-house to give us more control and room for experimentation.

Recently, we’ve implemented our own distributed test runner that
brought the runtime of our tests to just under three minutes. While
some of these tactics are specific to our codebase and systems, we
hope sharing what we did to improve our test runtimes will help
other engineering organizations.

Forking executor

We write tests
using minitest,
but we've implemented our
own plugin
to execute tests in parallel across multiple CPUs on multiple
different servers.

In order to get maximum parallel performance out of our build
servers, we run tests in separate processes, allowing each process to make
maximum use of the machine's CPU and I/O capability. (We run builds on
Amazon's c4.8xlarge instances, which give us 36 cores each.)

Initially, we experimented with using Ruby's threads instead
of multiple processes, but discovered that using a large number of
threads was significantly slower than using multiple processes. This
slowdown was present even if the ruby threads were doing nothing but
monitoring subprocess children. Our current runner doesn’t use Ruby threads at all.

When tests start up, we start by loading all of our application code into
a single Ruby process so we don’t have to parse and load all
our Ruby code and gem dependencies multiple times.
This process then calls fork a
number of times to produce N different processes that’ll each have
all of the code pre-loaded and ready to go.

Each of those workers then starts executing tests. As they execute
tests, our custom executor forks further: Each
process forks and executes a single test file’s worth
of tests inside the child process. The child process writes the
results to the parent over a pipe, and then exits.

This second round of forking provides a layer of isolation
between tests: If a test makes changes to global state, running the test
inside a throwaway process will clean everything up once that process
exits. Isolating state at a per-file level also means that running
individual tests on developer machines will behave similarly to the
way they behave in CI, which is an important debugging affordance.

Docker

The custom forking executor spawns a lot of processes, and creates a
number of scratch files on disk. We run all builds at Stripe inside
of Docker, which means we don't need to worry about cleaning up all
of these processes or this on-disk state. At the end of a build,
all of the state—be that in-memory processes or on
disk—will be cleaned up by a docker stop, every
time.

Managing trees of UNIX processes is notoriously difficult to do
reliably, and it would be easy for a system that forks
this often to leak zombie processes or stray workers (especially
during development of the test framework itself). Using a
containerization solution like Docker eliminates that
nuisance, and eliminates the need to write a bunch of fiddly cleanup
code.

Managing build workers

In order to run each build across multiple machines at once, we
need a system to keep track of which servers are currently in-use
and which ones are free, and to assign incoming work to available
servers.

We run all our tests inside
of Jenkins; Rather than
writing custom code to manage worker pools, we (ab)use a Jenkins
plugin called
the matrix
build plugin.

The matrix build plugin is designed for projects where you want a
"build matrix" that tests a project in multiple environments. For
example, you might want to build every release of a library against
several versions of Ruby and make sure it works on each of them.

We misuse it slightly by configuring a custom build axis, called
BUILD_ROLE, and telling Jenkins to build with
BUILD_ROLE=leader, BUILD_ROLE=worker1,
BUILD_ROLE=worker2, and so on. This causes Jenkins to
run N simultaneous jobs for each build.

Combined with
some other
Jenkins configuration, we can ensure that each of these builds
runs on its own machine. Using this, we can take
advantage of Jenkins worker management, scheduling, and resource
allocation to accomplish our goal of maintaining a large pool of
identical workers and allocating a small number of them for each
build.

NSQ

Once we have a pool of workers running, we decide which tests to run
on each node.

One tactic for splitting work—used by several of our previous test
runners—is to split tests up statically. You decide ahead of time
which workers will run which tests, and then each worker just runs
those tests start-to-finish. A simple version of this strategy just
hashes each test and take the result modulo the number of workers;
Sophisticated versions can record how long each test took, and try
to divide tests into group of equal total runtime.

The problem with static allocations is that they’re extremely prone
to stragglers. If you guess wrong about how long tests will take, or
if one server is briefly slow for whatever reason, it’s very easy
for one job to finish far after all the others, which means slower,
less efficient, tests.

We opted for an alternate, dynamic approach, which allocates work in
real-time using a work queue. We manage all coordination between
workers using an nsqd
instance. nsq is a super-simple queue that was
developed at Bit.ly; we already use it
in a few other places, so it was natural to adopt here.

Using the build number provided by Jenkins, we separate distinct
test runs. Each run makes use of three queues to coordinate work:

The node with BUILD_ROLE=leader writes each test
file that needs to be run into
the test.<BUILD_NUMBER>.jobs queue.

As workers execute tests, they write the results back to the
test.<BUILD_NUMBER>.results queue, where they
are collected by the leader node.

Once the leader has results for each test, it writes "kill"
signals to the test.<BUILD_NUMBER>.shutdown
queue, one for each worker machine. A thread on each worker
pulls off a single event and terminates all work on that node.

Each worker machine forks off a pool of
processes after loading code. Each of those processes independently
reads from the jobs queue and executes tests. By
relying on nsq for coordination even within a single machine, we
have no need for a second, machine-local, communication mechanism,
which might risk limiting our concurrency across multiple CPUs.

Other than the leader node, all nodes are homogenous;
they blindly pull work off the queue and execute it, and otherwise
behave identically.

Dynamic allocation has proven to be hugely effective. All of our
worker processes across all of our different machines reliably
finish within a few seconds of each other, which means we're making
excellent use of our available resources.

Because workers only accept jobs as they go, work remains
well-balanced even if things go slightly awry: Even if one of the
servers starts up slightly slowly, or if there isn't enough capacity
to start all four servers right at once, or if the servers happen to
be on different-sized hardware, we still tend to see every worker
finishing essentially at once.

Visualization

Reasoning about and understanding performance of a distributed
system is always a challenging task. If tests aren't finishing
quickly, it's important that we can understand why so we can debug
and resolve the issue.

The right visualization can often capture performance characteristics
and problems in a very powerful (and visible) way, letting operators
spot the problems immediately, without having to pore through reams of
log files and timing data.

To this end, we've built a
waterfall
visualizer for our test runner. The test processes record timing
data as they run, and save the result in a central file on the build
leader. Some Javascript d3 code can then assemble that
data into a waterfall diagram showing when each individual job started
and stopped.

Waterfall diagrams of a slow test run and a fast test run.

Each group of blue bars shows tests run by a single process on a
single machine. The black lines that drop down near the right show
the finish times for each process. In the first visualization, you
can see that the first process (and to a lesser extent, the second)
took much longer to finish than all the others, meaning a single
test was holding up the entire build.

By default, our test runner uses test files as the unit of
parallelism, with each process running an entire file at a
time. Because of stragglers like the above case, we implemented an
option to split individual test files further, distributing the
individual test classes in the file instead of the entire file.

If we apply that option to the slow files and re-run, all the "finished"
lines collapse into one, indicating that every process on every worker
finished at essentially the same time—an optimal usage of resources.

Notice also that the waterfall graphs show processes generally going
from slower tests to faster ones. The test runner keeps a persistent
cache recording how long each test took on previous runs, and
enqueues tests starting with the slowest. This ensures that slow
tests start as soon as possible and is important for ensuring an
optimal work distribution.

The decision to invest effort in our own testing infrastructure
wasn't necessarily obvious: we could have continued to use a
third-party solution. However, spending a comparatively small amount
of effort allowed the rest of our engineering organization to move
significantly faster—and with more confidence. I'm also optimistic
this test runner will continue to scale with us and support our
growth for several years to come.

If you end up implementing something like this (or have
already), send me a note!
I'd love to hear what you've done, and what's worked or hasn't for
others with similar problems.