Distributed Test Execution with GoCD + TLB

GoCD Team

Writing tests has finally become the norm. Consequently, running tests for every commit is central to & the most time consuming activity in any CI/CD setup. In a decent-sized production quality project you tend to have thousands of tests. That means the cycle time, i.e. the time it takes for a commit to reach deployable state (after running all unit, integration & functional tests), keeps growing.

One such example is GoCD's codebase. Just the "Common" & "Server" components of GoCD which comprises of unit & integration tests, together has ~6000 tests which will take about ~5 hours if run serially! The functional test suite is about 260+ tests with combined runtime of ~15 hours. That's close to a day & we haven't even run everything for a single commit!

Note that the number of tests is so huge that just putting in a powerful box & running test in parallel will not bring it down to acceptable limits. Also, a large number of other problems surface if you start running tests in parallel on same box (without sandboxed environment) like concurrency issues etc.

Solution [GoCD + TLB]

GoCD improves the cycle time of its own build by making test execution faster, distributing it across many agents (machines). After this "Common" + "Server" takes 20 minutes. All functional tests run in 45 minutes. Thats close to an hour! Still not ideal (a few minutes - constrained by resource availability), but better. :)

Test Load Balancer (TLB)

TLB is an open-source library which provides the ability to break up a test suite into pieces and run a part. It guarantees 'Mutual Exclusion' & 'Collective Exhaustion' properties that are essential to reliably running tests in distributed fashion.

TLB's strength lies in intelligent test distribution which is based on time, i.e. the tests will be distributed based on time they take to execute, making the jobs close to equal runs which leads to better resource utilization. It falls back on count based splitting if test times are not available. It also runs tests in 'Failed First' order, so if a test has failed in previous run it will be run before other tests which means faster feedback.

GoCD

GoCD is an open-source CI/CD tool. Its well known for its powerful modelling, tracing & visualization capabilities.

While TLB is doing all the distribution, GoCD does what it does best - orchestrate the parallel execution.

Run 'X' instances

Starting release 14.3 you can spawn 'x' instances of a job. So if you want to distribute your tests across 10 machines you just need to set run instance count to 10 & GoCD will spawn 10 instances of the job when scheduling.

Other features that helps with Test Parallelization

Wait for all jobs to finish

GoCD's modeling capability gives it the ability to run jobs in parallel but wait for all of them to finish before the next Stage / downstream Pipelines are triggered.

Stop the downstream flow

If any of the tests (and as a result the Job running the test) fails, the Stage is considered as failed. This causes the flow to stop as expected.

Consolidated Test Report

Once all the Jobs are done running, GoCD consolidates test reports & shows the result at stage level for easy consumption.

Drill down

You can drill down at job level to know more information like 'test count', 'console output' for the Job (test) etc.

Partition re-run

GoCD also provides ability to re-run a Job of a stage. This provides ability to run the partition that could have failed due to flaky test etc. The best part is, TLB runs the exact tests that it ran the last time making sure no test is missed out!

TLB Correctness Check

TLB provides an ability to check correctness, i.e. it will make sure all tests were run. You can configure to run this correctness check once all partitions are done executing, may be in next stage / pipeline.

Power of dynamic splitting

GoCD's one knob control to amount of parallelization means that when the number of tests increase/decrease all you will need to do is change the run instance count based on number of tests & resource availability & you are done!