If you have experienced a CI build turning red due to an intermittent failure, such as in a Capybara test, this post is for you. During my last encounter with such an issue, the team and I wanted to strengthen the meaning behind the bright red color: a red build should be an actionable failure indicating a real fault in our application.

### Source of inaccuracy
In the context of Rails applications, false positive build failures most often come from Capybara acceptance tests. In its implementation, Capybara relies on a variety of timeouts and waits, which may or may not delay long enough for the expected behaviour to take place. This is even more common when CI runs on slower hardware, such as a small EC2 instance.
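To make the timing issue concrete, here is a minimal sketch (my own simplification, not Capybara's actual source) of the retry-until-timeout pattern Capybara uses internally: a condition is re-checked until it holds or the allotted time runs out. On slow CI hardware, the deadline can expire before the application catches up, producing a failure that has nothing to do with application correctness.

```ruby
# Re-check a condition until it is true or `seconds` have elapsed.
# This mirrors the waiting behaviour behind Capybara matchers,
# in a deliberately simplified form.
def synchronize(seconds: 2.0, interval: 0.05)
  deadline = Time.now + seconds
  loop do
    return true if yield
    raise "timed out waiting for condition" if Time.now >= deadline
    sleep interval
  end
end

# A condition that only becomes true after a short delay -- on a fast
# machine this passes comfortably; with a tighter deadline or a slower
# machine, the very same test raises a timeout.
ready_at = Time.now + 0.1
result = synchronize(seconds: 1.0) { Time.now >= ready_at }
```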

### Step 1: rspec-rerun and the controversy

The first and most impactful step we took was to use the rspec-rerun gem which, in its rake task, automatically reruns failed tests until either they pass or a limit of retries is reached.

Using rspec-rerun is fairly simple:
In your Gemfile:

```ruby
group :development, :test do
  gem "rspec-rerun"
end
```

Then invoke the rake task from your CI:

```shell
RSPEC_RERUN_RETRY_COUNT=3 rake rspec-rerun:spec
```
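Conceptually, the task behaves like the retry loop below. This is a hypothetical sketch, not the gem's implementation (the real gem re-runs only the failed examples recorded in a persistence file); `run_specs` is a stand-in that returns the set of failing example ids.

```ruby
# Run the suite, then re-run the remaining failures up to
# `retry_count` more times, stopping early once everything passes.
def run_with_retries(retry_count, &run_specs)
  failures = run_specs.call
  retry_count.times do
    break if failures.empty?
    failures = run_specs.call  # re-run just the failures
  end
  failures
end

# Simulate a flaky suite: one example fails on the first run only.
runs = [["./spec/flaky_spec.rb:12"], []]
remaining = run_with_retries(3) { runs.shift || [] }
```

With a flaky-but-correct test, the second run passes and the build stays green; a genuine regression keeps failing through every retry and still turns the build red.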

##### Results:
rspec-rerun immediately lowered the number of false positive build outcomes. The impact of this was huge:

- The improved signal-to-noise ratio caused us to pay much more attention to the build.
- When a build was red, it felt more justifiable to stop the immediate task and investigate the failure, since it was more likely to be consistent and indicative of a quality issue.

##### Controversy
rspec-rerun's approach to flaky tests is similar to strategies other teams have tried. One such technique is to manually flag tests as flaky and run them in a separate build.

Given enough skill with Capybara, one can argue that most tests can be fixed instead of merely retried. On the other hand, even with that skill, fixing tests costs engineering time, and it may be worthwhile to pursue other changes to the system instead.

A measurable aspect of rspec-rerun and its impact on our tests is its cost in terms of total build duration. If we were going to rerun tests, we wanted at least some view into the impact of false positives on the build duration.

### Step 2: Tracking incidence of false positives over time

We wrote a gem to track how often a test shows up as a false positive by logging RSpec data. Here's how it works:

This indicates that the most unreliable test would, on average, slow down the whole suite by 10 seconds due to occasionally being re-run multiple times.
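To illustrate the kind of metric involved, here is a hypothetical version of the "average cost" calculation (my own illustration, not the gem's code): given logged runs of a flaky example, it computes the extra time spent on re-runs averaged over all builds.

```ruby
# Each logged run records the example's duration in seconds and how
# many re-runs it needed before passing. The average extra time per
# build is the total re-run time divided by the number of builds.
def average_rerun_cost(runs)
  extra = runs.sum { |r| r[:duration] * r[:reruns] }
  extra / runs.size.to_f
end

runs = [
  { duration: 5.0, reruns: 2 },  # failed twice before passing
  { duration: 5.0, reruns: 0 },  # passed on the first try
]
cost = average_rerun_cost(runs)  # => 5.0 extra seconds per build
```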

##### When to act on this data

What I like about this is that it collects the data and leaves it entirely to your team to act upon. In our case, we occasionally looked for the most costly examples and prioritized a task to make each one more reliable.

### Learn more & respond

The project is on GitHub. Please follow it there for new features and changes.

What do you think of this tool? Do you love it? Do you hate it? Let me know in the comments.