Splitting Up A Large Test Suite

At Etsy, deployments are managed by engineers, not by ops or a release management team. The idea is: you write code, and then you deploy it to production. Even dogs deploy code. By the time 8am rolls around on a normal business day, 15 or so people and dogs are starting to queue up, all of them expecting to collectively deploy up to 25 changesets before the day is done.

If 15 Engineers Deploy 25 Changesets In 24 Hours…

Deploys generally take about 20 minutes. Any longer than that and the people at the back of the queue can wind up waiting a really long time before they get to deploy. In our world, “a really long time” means you waited two hours or more before you could make a production deployment.

We call the tests that get run before deployment “trunk tests” because they test Etsy’s production functionality. There are 7,000 trunk tests, and we’re adding more all the time. In truth, we have more tests than that, but we don’t run all of them when we deploy (more on that in just a moment).

If the trunk tests fail, deployment pauses while the engineers look for the source of the problem. Usually this takes under 5 minutes and ends with someone sheepishly making a fixing commit. The tests are then re-run, and if they pass, deployment continues.

Through trial and error, we’ve settled on about 11 minutes as the longest that the automated tests can run during a push. That leaves time to re-run the tests once during a deployment without going too far past the 20-minute time limit.
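As a back-of-the-envelope check, the budget can be worked out in a few lines of Python. The figures are the ones quoted in this post; the variable names are made up for illustration, and the sketch assumes a failed run is followed by exactly one full re-run and ignores the few minutes spent on the fixing commit.

```python
# Rough deploy-time budget, using the figures quoted in this post.
DEPLOY_TARGET_MIN = 20   # target length of one deployment
TEST_RUN_MIN = 11        # longest allowed automated test run
DEPLOYS_PER_DAY = 25     # changesets deployed on a busy day

# Assumed worst case: the tests fail once and are re-run in full.
two_runs_min = 2 * TEST_RUN_MIN
overrun_min = two_runs_min - DEPLOY_TARGET_MIN

# Total queue time if every deploy hits the 20-minute target.
queue_hours = DEPLOYS_PER_DAY * DEPLOY_TARGET_MIN / 60

print(f"two test runs: {two_runs_min} min ({overrun_min} min over target)")
print(f"{DEPLOYS_PER_DAY} deploys at target length: {queue_hours:.1f} hours")
```

Even at the target length, 25 deploys fill a whole working day, which is why a few extra minutes per test run matter so much.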

Run sequentially, the 7,000 trunk tests would take about half an hour to execute. We split these tests into subsets and distribute them across the 10 machines in our Jenkins cluster, where all the subsets can run concurrently. Splitting up our test suite and running many tests in parallel gives us the desired 11-minute runtime.

Keep Tests That Are Similar, Together

We decided to use PHPUnit’s @group annotations (really just a special form of comments) to logically divide the test suite into different subsets. Jenkins and PHPUnit group annotations turned out to be a simple and powerful combination.

For example, one of our PHPUnit jobs is configured to run tests tagged as dbunit, and not to run tests tagged database, network, flaky, sleep, or slow.
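The selection rule for that job can be modeled in a few lines of Python. This is only a sketch of the include/exclude semantics as we understand them (a test must carry at least one included group and no excluded group); it is not PHPUnit's implementation, and the test names are hypothetical.

```python
def should_run(test_groups, include, exclude):
    """Return True if a test with these groups would be selected.

    Simplified model: exclusion wins, and otherwise the test needs
    at least one of the included groups.
    """
    groups = set(test_groups)
    if groups & set(exclude):
        return False
    return bool(groups & set(include))


INCLUDE = {"dbunit"}
EXCLUDE = {"database", "network", "flaky", "sleep", "slow"}

# Hypothetical tests and the @group tags they carry.
tests = {
    "testLoadsShopFixture": ["dbunit"],
    "testSlowFixtureImport": ["dbunit", "slow"],
    "testSendsEmail": ["network"],
    "testFormatsPrice": [],  # untagged: a plain unit test
}

selected = [name for name, groups in tests.items()
            if should_run(groups, INCLUDE, EXCLUDE)]
print(selected)  # ['testLoadsShopFixture']
```

Note that a test tagged both dbunit and slow is left out: adding one more annotation is all it takes to pull a test out of the deployment run.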

Test Classification: the Good, the Fast and the Intermittent

If you feel like 11 minutes should be enough to run 7,000 unit tests, you’re right. But not all of our automated tests are unit tests. Or at least, they’re not what we call unit tests. Every shop seems to have its own terminology for describing kinds of tests, and… so do we.

Here’s how we classify the different kinds of tests that run in our CI system.

Unit Tests

We define a unit test as a test that covers one and only one class and has no database interaction at all, not even fixtures.

You may have noticed above that we didn’t define a PHPUnit annotation for unit tests. Unit tests are the default.

We run the unit tests on a server where MySQL and Postgres aren’t even available. That way we find out right away if a database dependency was added accidentally.

As of today, we have about 4,500 unit tests, which run in about a minute.

Integration Tests

For the most part when we say a test is an integration test, this implies that the test uses fixtures. Usually our fixtures-backed tests are built with the PHPUnit port of DBUnit. We’ve even provided some PHPUnit extensions of our own to make testing with DBUnit easier.

We also apply the term “integration tests” to test cases that depend on any external service (e.g. Memcache or Gearman).

The integration tests are the slowest part of our suite. If we ran them all sequentially, the integration tests alone would take about 20 minutes for every deployment. Instead, we spend about 8 minutes per deploy running them concurrently.

Network Tests

Some integration tests may also access network resources. For instance, a test might assert that our libraries can properly send an email using a third-party service.

For the most part, we try to avoid tests like this. When a test depends on a network request, it can fail just because the request was unsuccessful. That can occur for a number of reasons, one of which may be that the service being tested against is actually down. But you never really know.

Smoke Tests

Smoke tests are system level tests that use Curl.

For the most part, our smoke tests are PHPUnit test cases that execute Curl commands against a running server. As the response from each request comes back, we assert that the proper headers and other data were returned from the server.

Functional Tests

Like smoke tests, end-to-end GUI-driven functional tests are also run against a live server, usually our QA environment (for more on that see the comments). For these tests we use Cucumber and Selenium, driving a Firefox instance that runs in an Xvfb virtual desktop environment.

Since they are very labor-intensive to develop and maintain, end-to-end functional tests are reserved for testing only the most mission-critical parts of Etsy. We get a lot of confidence from running our unit, integration and smoke tests. But at the end of the day, it’s good to know that the site is so easy to use, even a robot user agent can do it.

Zero Tolerance for Intermittent Tests

Sometimes tests fail for no good reason. It’s no big deal, it happens. A test can be a bit flaky and still be helpful during development. And such tests can be useful to keep around for reference during maintenance.

But running tests before deployment is different than running tests during maintenance or development. A test that only fails 2% of the time will fail about every other day if run before each of our 25 deployments. That’s 2-3 failures a week. And in practice each failure of the trunk tests translates to around 10 minutes of lost developer time — for every developer currently waiting to make a deployment!

So a single test that fails only 2% of the time can easily incur a cost of about 5 wasted work-hours per week. We therefore provide a few other PHPUnit group annotations which anyone is free to use when they encounter a test that isn’t quite robust enough to block deployment.
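The numbers above can be reproduced with a short calculation. The failure rate, deploy count and ten-minute cost come from this post; the five-day week and the figure of 12 developers waiting in the queue are assumptions, chosen only to show how a total of about five hours can arise.

```python
FAILURE_RATE = 0.02        # the test fails 2% of the time
DEPLOYS_PER_DAY = 25
WORK_DAYS_PER_WEEK = 5     # assumption: deploys happen on business days
LOST_MIN_PER_FAILURE = 10  # lost time per waiting developer
WAITING_DEVELOPERS = 12    # assumption: people in the queue at the time

# Expected number of spurious failures.
failures_per_day = FAILURE_RATE * DEPLOYS_PER_DAY
failures_per_week = failures_per_day * WORK_DAYS_PER_WEEK

# Chance of seeing at least one spurious failure on a given day.
p_fail_day = 1 - (1 - FAILURE_RATE) ** DEPLOYS_PER_DAY

lost_hours = (failures_per_week * LOST_MIN_PER_FAILURE
              * WAITING_DEVELOPERS / 60)

print(f"expected failures: {failures_per_day:.1f}/day, "
      f"{failures_per_week:.1f}/week")
print(f"chance of a failure on any given day: {p_fail_day:.0%}")
print(f"lost developer time: {lost_hours:.1f} hours/week")
```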

Intermittent Tests

We use the annotation @group flaky to denote a test that has been observed to fail intermittently. Tests so annotated are automatically excluded from running during deployment.

This has worked out better for us than skipping or commenting out intermittent tests. Tests annotated as flaky can still be run (and are still useful) in some contexts.

Slow Tests

Observation has repeatedly shown that our slowest tests roughly exhibit a power law distribution: a very few tests account for a great deal of the overall runtime of the test suite.

Periodically we ask that engineers identify very long-running tests and tag them as @group slow. Identifying and “retiring” our top 20 slowest tests (out of 7,000) usually results in a noticeable speedup of the test suite overall.
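Finding those tests is mostly a matter of sorting. Here is a minimal sketch, assuming per-test timings are available as a name-to-seconds mapping; the function name and the sample timings are invented for illustration, and real numbers would come from the CI reports.

```python
def slowest_share(durations, top_n):
    """Return the top_n slowest tests and their share of total runtime."""
    ranked = sorted(durations.items(), key=lambda item: item[1], reverse=True)
    top = ranked[:top_n]
    total = sum(durations.values())
    share = sum(seconds for _, seconds in top) / total
    return [name for name, _ in top], share


# Hypothetical timings in seconds.
durations = {
    "testA": 0.5,
    "testB": 0.5,
    "testC": 1.0,
    "testD": 8.0,
    "testE": 30.0,
}

names, share = slowest_share(durations, top_n=2)
print(names, f"{share:.0%}")  # ['testE', 'testD'] 95%
```

When the share for a small top_n is large, tagging just those tests as slow buys back most of the runtime.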

Again, tests so annotated may still be run (and are still useful) in some contexts, just not before deployment.

Sleep Tests and Time Tests

Good tests don’t sleep(), nor do they depend on the system clock.

Sometimes getting test coverage at all means writing less-than-good tests (this is especially true when writing new tests for legacy code). We accept this reality — but we still don’t run such tests before deployment.
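One common way to keep a test off the system clock is to hand the code under test a clock it can be told to use. The sketch below shows that general technique in Python; the Coupon class is hypothetical and this is not a description of how any particular Etsy test is written.

```python
from datetime import datetime, timedelta


class Coupon:
    """Hypothetical class under test: a coupon that expires at a fixed time."""

    def __init__(self, expires_at, now=datetime.now):
        self.expires_at = expires_at
        self._now = now  # callable returning the current time; injectable

    def is_expired(self):
        return self._now() >= self.expires_at


# In a test, pass a fake clock instead of sleeping until the coupon expires.
start = datetime(2011, 4, 1, 12, 0, 0)
expiry = start + timedelta(hours=1)

coupon_before = Coupon(expiry, now=lambda: start)
coupon_after = Coupon(expiry, now=lambda: start + timedelta(hours=2))

assert not coupon_before.is_expired()
assert coupon_after.is_expired()
```

Both assertions run instantly and give the same answer on any machine at any time of day.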

Fast, Reliable Tests

Our CI system is still relatively new (most tests are under two years old and the Jenkins instance only dates back to July) so we still have a lot of work to do in terms of building awesome dashboards. And in the future we’d like to harvest more realtime data from both the tests and Jenkins itself.

But so far we’ve been very happy with the CI system that has resulted from using PHPUnit annotations to identify subsets of functionally similar tests and running the subsets concurrently on a cluster of servers. This strategy has enabled us to quadruple the speed of our deployment pipeline and given us fine-grained control over which tests are run before each of our 25 or more daily deploys. It’s a lightweight process that empowers anyone who knows how to write a comment to have input into how tests are run on Etsy’s deployment pipeline.

42 Comments

We took our automated tests from 15 hours down to ~12 minutes about 2 months ago by spawning new test workers on a cloud grid similar to EC2. We went from running them twice a day to running them once an hour and then on demand as we release and tag code. We caught up with Kellan out and about at SxSW and were able to preach our success there…

Wow, 15 hours! You must have a pretty large cluster to get that down to 12 minutes. And what consumes the most run time in your suite? Is it because you are running a lot of functional tests?

When I first joined Etsy, the CI system was a Buildbot cluster that ran on around 30 EC2 instances. In the end though we wound up migrating to a cluster of physical machines managed by our ops team.

One big reason for switching our CI from EC2 to physical boxen was the long wait time required to spin up the EC2 slaves. Another was that the 16-core Nehalem servers that CI runs on now are much more powerful than the c1 medium instances in the old EC2 cluster. I’d be really interested to hear more detail about how you have managed these limitations.

How does Etsy manage test data across the environments? Is there a build up/tear down framework associated to the automation test scripts or is the test data set restored after a test run of say the integration and/or functional tests?

For our PHP integration tests we use the PHPUnit port of DBUnit, which facilitates storing the test data in YAML files. The appropriate data is then loaded into the test DBs every time a test starts up. Our JUnit integration tests use the original Java DBUnit, and work in the same way. Loading fresh data every time we run the tests is my preferred approach, as dealing with “cleaning up” data during teardown is messy and unreliable.

We also have some legacy integration tests that use dumps of data from our dev DBs. While this approach works, a raw dump of data is much harder to read than a YAML file. As a result we had some test bugs caused by misunderstandings about what’s in (and missing from) the test data. That’s why all our newer integration tests use DBUnit.

Our functional tests run against the same dev DB that all our engineers use during development. Test users are created at the beginning of each test run, and all data for these users is loaded at that time as well. We’ve written Selenium automations that step through creating accounts, setting up shops and listing items. So the data-loading step of our functional suite performs double duty by also verifying that the registration and listing flows are working as intended. That data is not torn down when the tests finish. Since our dev DB is wiped out and refreshed periodically there’s no problem with leaving old, dirty data lying around (it’ll get deleted soon enough). But we never re-use that old data in the functional tests.

@noah – Can’t figure out how to reply to your question in the three seconds I looked. So this non-nested reply will need to suffice.

We are doing a lot of functional tests (~2500 currently if I remember correctly). We’re FDA regulated and need to provide loads of artifacts for proof of functionality in each release. This leads to a far larger number of tests than you’d see with a non-medical solution. Our testing cycle would have been weeks or more had we not committed to automation of our functional tests. Not fun times when your desired release cycle is measured in days instead of months.

For our instances, we are not spawning them on demand but rather have them continuously running. Something we can afford to do since it’s an internal computing cloud. We’re running two test workers per box and since we have a somewhat elastic cloud can afford to have up to 25 boxes running concurrently. Because the vast majority of our tests are functional, there are wait and commit times factored into them, contributing to the ~15 hours of run time (re-architecting some of the tests also contributed to the reduced times). I believe our longest running test is in our critical path causing a 12 minute run time, otherwise it would be lower.

I’ll have some of the architects of our testing framework chime in here to provide more details. While reducing our test run times has been tremendous, I’m proudest of the test run framework that these guys have created: automated artifact generation, top error reports, timing per test, measurement dashboards (thanks for the inspiration and StatsD, btw).

This all looks like a fantastic process for implementing an elaborate system of automated checks.

I am wondering if this represents the entirety of your testing efforts? Where in this process does the sapient exploratory testing occur? How does exploratory testing factor into your life cycle process?

We deploy in very small increments. Because our changesets are small, each deploy requires a comparatively small amount of manual testing. Further, since there are usually several engineers “riding the push train” we usually wind up with several people pooling their manual explorations, so that everyone is playing with each other’s changes while waiting for the tests to pass.

It’s also worth noting that almost everything we deploy is behind a config flag. So the vast majority of pushes have no visible impact on the production site.

New code runs for a long time in production before our members ever see it. However, the Etsy team has the ability to try out new features in production before they’re visible to everyone else. Thus we’re able to learn about and react to risk in production before exposing new features to the public.

It was great to see clear definitions for the different types of tests that you guys run. I think learning the difference between a unit test and an integration test is one of the first big steps for developers who are learning how to write tests (and testable code). I know it was for me at least.

Did you write your own script for loading the YAML data into the database, or can PHPUnit or Jenkins take care of that part of the process as well?

PHPUnit_Extension_Database, aka DBUnit, supports several formats for data sets. One of the formats is YAML. The only thing we wrote for ourselves was support for multiple database connections (with a data set for each) in a single test case. Please check out that extension if you are using a sharded database architecture or would like to write an integration or system test: https://github.com/etsy/phpunit-extensions/wiki/Multiple-Database

I was wondering if your functional tests were run in parallel using the same type of cleanup and test setup for each test case?

I would like to be able to use a single database for testing where each functional test sets up its own data and only tests a small piece of functionality. The more that I think about it, the harder it seems.

If the tests ran sequentially, we could clean the database and re-import baseline test data for each test to use. But since we are trying to run these tests in parallel, we cannot clean the db each time (the tests would interfere with each other) and managing the import scripts seems daunting, if the db can’t be cleaned for each test.

I am trying to avoid having a master baseline data file, but it seems like it may be the only solution. Does anyone have any suggestions as to how to overcome this obstacle?

Yes, our functional tests do run in parallel. Some of our tests use pre-created data (one test uses the same data set over and over) but our tests run independently and we rarely share data between tests. For registration and sign-in tests we generate unique data on the fly which is disposable once the test has run. Occasionally we need to run a DB cleanup script to unclog the DB but that’s only once every few months. My advice would be to make sure your tests have no dependencies between them. If you’re testing only small pieces of functionality you should be able to do this with unique data every time. Otherwise, you might want to look into creating a separate ‘data only’ set of scripts which generates unique, disposable test data to support your core functional tests. That should avoid having to revert to baseline test data each time you run your suite. Hope that helps!
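To make the "unique, disposable data" idea concrete, here is a minimal Python sketch. The helper name, the field names and the example.com address are all hypothetical; the only point is that every call yields a record that cannot collide with one created by a test running in parallel.

```python
import uuid


def make_test_user(prefix="functest"):
    """Build a unique, throwaway user record for a single test."""
    token = uuid.uuid4().hex  # random 32-character hex string
    return {
        "username": f"{prefix}_{token}",
        "email": f"{prefix}+{token}@example.com",
    }


# Two tests running at the same time each get their own user.
a = make_test_user()
b = make_test_user()
print(a["username"] != b["username"])  # True
```

Because nothing is shared, the records can be left behind after the run and swept up by an occasional cleanup script.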

Perhaps I’m misunderstanding something but you said that you define a unit test as a test for one-and-only one class. You also mention that, as of today, you have about 4,500 unit tests. Does that mean you have 4500 classes in your codebase?

Good catch. In retrospect, my language could have been clearer. The counts we use are Jenkins’ counts of test *methods*, so they don’t reflect the number of classes under test. Honestly, we don’t focus all that much on the number of tests or the percentage of coverage. We’re much more focused on having enough tests that we can deploy and refactor with confidence.

But thanks for pointing out the ambiguity. I’ll have to think about how to make that distinction clearer when I discuss this in the future.

We actually do not even attempt to equalize the test jobs (boxes). Unit tests run in a minute or less, and that is our largest suite. DBUnit tests run in 4 to 5 minutes, and our legacy test job takes 7 to 8 minutes. The different test jobs do (at least) two things for us: 1.) Let us more quickly assess whether it is a dependency (i.e. database, network, etc.) that is tripping up the tests and not a code change, and 2.) Encourage writing more fast, lean, deterministic unit tests instead of larger integration tests. When someone sees that twice as many tests can run in an eighth of the time, that is some pretty good encouragement to write a cleaner and simpler test.
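A quick sketch of why unequal jobs are still fine: when jobs run concurrently on separate boxes, the build waits only for the slowest one. The durations below are the upper bounds quoted in this comment; the dictionary and variable names are illustrative, and only these three jobs are modeled.

```python
# Upper-bound durations in minutes for three of the test jobs.
job_minutes = {"unit": 1, "dbunit": 5, "legacy": 8}

# Run back to back, the jobs cost the sum of their durations.
sequential = sum(job_minutes.values())

# Run concurrently, the build is bounded by the slowest job.
slowest_job = max(job_minutes, key=job_minutes.get)
concurrent = job_minutes[slowest_job]

print(f"sequential: {sequential} min, concurrent: {concurrent} min "
      f"(bounded by the {slowest_job} job)")
```

Balancing the jobs would only help if it shortened the slowest one, so speeding up or trimming that job is where the time is.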

Yeah, I was just wondering if the box that the unit tests get farmed to gets sent some DbUnit or legacy tests to run once it’s done with the unit tests, so that the whole build process is as fast as it can be — how do you coordinate that?

Ah, gotcha. So we use Jenkins (aka Hudson) for CI. Jenkins has support for slaves, and it has support for multiple executors on a single box. Unit tests and smokers do not depend on global state (i.e. per-box resources, potentially shared resources), so we have at least one machine that we can have multiple executors on to run through those tests. We have several other boxes that are single executor, so we don’t run tests that could potentially share these resources and cause non-deterministic failures due to unintended interactions. We use Jenkins slave labels to tie jobs to viable boxes, and beyond that, Jenkins takes care of waiting for the next eligible executor to run a test job on.

I’m curious about the job structure you use. Is there a single job that has all these parallel test runs configured against the nodes, or are they all separate jobs in their own right? Do you use multi-builds with axes, or some other setup?

Any specific plugins (beyond the SSH slave that I noted) or run configurations that you use would help me understand that part of your build system.

We began with all separate jobs. In fact, we still have a separate job for each test suite (i.e. unit, network, cache, dbunit, etc.). We looked at the multi-build with axes, but that seemed more applicable to running a single set of tests in multiple environments rather than running multiple sets of tests in a single environment. We did some simple stuff with trigger jobs, downstream builds, and list view, but we still didn’t have a good way to associate particular test runs with a particular move to the next deployment stage, so we wrote our own plugins: a Master Project plugin and a Deployinator plugin.

We have a Master Project plugin which adds a project type that has no builders, limited publishers, and a check list of every Jenkins project. We created a Master Project for each stage of deploy and just check off which jobs need to get triggered by each deploy stage. The job starts those sub-jobs and monitors the progress of each sub-job to aggregate a single result, send a single result e-mail, and send a single IRC message.
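The aggregation step can be pictured with a small Python sketch. This is not the plugin's code; the function name is made up, and it assumes the simplest possible rule, namely that the overall result is a success only if every checked-off sub-job succeeded.

```python
def aggregate(results):
    """Collapse sub-job results into one overall result.

    Assumed rule: the overall build passes only if every sub-job passed.
    `results` maps a sub-job name to True (passed) or False (failed).
    """
    failed = sorted(name for name, ok in results.items() if not ok)
    status = "SUCCESS" if not failed else "FAILURE"
    return status, failed


# Hypothetical outcome of one deploy stage.
stage_results = {"unit": True, "network": True, "cache": False, "dbunit": True}

status, failed = aggregate(stage_results)
print(status, failed)  # FAILURE ['cache']
```

A single status plus the list of failing sub-jobs is enough to drive one e-mail and one IRC message per deploy stage.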

Could you please describe how you set up the environment needed to perform your integration tests? Do you mock anything as part of your integration tests or just let them work against external dependencies that are being set up and torn down?

It would be very beneficial to learn about specific test scenarios and how they fit into your test category system.

Additionally, we have a collection of Pake tasks that control our PHPUnit builds. Among these are tasks that can toggle the availability of services on the build slaves. So for unit tests, we would stop MySQL, Postgres and Memcache at the beginning of the build. Integration tests have both DBs running but Memcache would remain off, and there is a Pake task to ensure this is the case.
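The service matrix described here is small enough to write down as data. The following Python sketch is only an illustration of the check, not the Pake tasks themselves; the names EXPECTED and violations are invented, and how the set of running services is discovered is left out.

```python
# Which services each build expects to be running, per the description above.
EXPECTED = {
    "unit": {"mysql": False, "postgres": False, "memcache": False},
    "integration": {"mysql": True, "postgres": True, "memcache": False},
}


def violations(suite, running):
    """List services whose actual state differs from what the suite expects.

    `running` is the set of service names currently up on the build box.
    """
    return sorted(
        service
        for service, should_run in EXPECTED[suite].items()
        if (service in running) != should_run
    )


print(violations("unit", {"mysql"}))                     # ['mysql']
print(violations("integration", {"mysql", "postgres"}))  # []
```

Failing the build when the list is non-empty is what turns an accidental database dependency in a unit test into an immediate, obvious error.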

What you’ve put together there is pretty cool. It’s nice to see you get the opportunity to put your vision into reality in such a complete way. I am looking at a few opportunities in places that do relatively small but frequent deploys, and this write-up is a great summary of how to think about scaling that.

Do you have a problem where tests don’t present themselves as obviously flaky until they’re already in a position to stop a deploy?

Additionally, most of our tests can be run locally during development, so a lot of errors get ironed out before code is even committed. Pretty much the easier we’ve made it to run tests locally, the fewer flaky tests we’ve encountered on CI.

Also, a lot of flakiness gets sorted out on the try server, again before code is even committed.

In the past we have discussed doing “nightly zergs” where we run all the tests over and over again and try to reproduce occasional intermittent behavior. IMVU does this and it sounds like it works well for them. In the end though we decided that we run our deployment tests at such a high rate (up to several times an hour) that doing yet more automated runs would just be overkill.

And in any case, just as with application code, there are always going to be intermittent tests whose misbehavior can’t be reproduced anywhere but in production. The @group flaky annotation continues to be a good communication tool when such production issues do arise.

Hmm maybe it would have been clearer for me to say that the functional tests run against the full Web application stack. Because in that context I was talking about testing prior to release, in a QA environment. Before a deploy goes out, functional tests run in an internal environment that has the full application stack plus a persistent DB seeded with data that looks very much like what’s in production. While some issues won’t manifest in such a sandbox environment, testing against the full stack before release has repeatedly enabled us to address and fix change-related issues before they could surface in production.

But we do also run end-to-end functional tests against the production Web site. That’s part of the larger strategy of obsessively monitoring production. Some of these are GUI-driven Selenium tests, some use a PHP wrapper we have written for Curl, some use Mechanize and some are Nagios checks leveraging Perl and LWP — whatever works. I think production monitoring is the key element that lets us move fast and take risks with our products. But also, specifically in the case of functional tests, running against the production site has been very valuable in terms of detecting service degradations that wouldn’t have been immediately detectable via Nagios or our graphs.

Also relevant is our “Princess” server. Princess is an intermediate environment, between QA and production (similar to the “latest” environment that they maintain at Facebook). It runs whatever code is in the HEAD revision of our git master branch. But, Princess uses production data. So on Princess you can use unreleased code to view and affect changes in production. We do exploratory testing on Princess before every release and a suite of automated functional tests runs against this environment as well.

[…] Divide and Concur (Noah Sussman). By reading this post, you’ll learn about all the inner workings of our automated testing setup: what software we use (with plenty of links), how we set it up, and the philosophy behind it all. […]

The engineers who make Etsy make our living with a craft we love: software. This is where we'll write about our craft and our collective experience building and running the world's most vibrant handmade marketplace.