At Google, we run a very large corpus of tests continuously to validate our code submissions. Everyone from developers to project managers relies on the results of these tests to make decisions about whether the system is ready for deployment or whether code changes are OK to submit. Developer productivity at Google relies on the ability of the tests to find real problems with the code being changed or developed in a timely and reliable fashion.

Tests are run before submission (pre-submit testing), which gates submission by verifying that changes are acceptable, and again after submission (post-submit testing) to decide whether the project is ready to be released. In both cases, all of the tests for a particular project must report a passing result before code is submitted or the project is released.

Unfortunately, across our entire corpus of tests, we see a continual rate of about 1.5% of all test runs reporting a "flaky" result. We define a test result as flaky when the test exhibits both a passing and a failing result with the same code. There are many root causes of flaky results, including concurrency, reliance on non-deterministic or undefined behaviors, flaky third-party code, infrastructure problems, etc. We have invested a lot of effort in removing flakiness from tests, but overall the insertion rate is about the same as the fix rate, meaning we are stuck with a certain rate of tests that provide value but occasionally produce a flaky result. Almost 16% of our tests have some level of flakiness associated with them! This is a staggering number; it means that more than 1 in 7 of the tests written by our world-class engineers occasionally fail in a way not caused by changes to the code or tests.

When doing post-submit testing, our Continuous Integration (CI) system identifies when a passing test transitions to failing, so that we can investigate the code submission that caused the failure. What we find in practice is that about 84% of the transitions we observe from pass to fail involve a flaky test! This causes extra repetitive work to determine whether a new failure is a flaky result or a legitimate failure. It is quite common to ignore legitimate failures in flaky tests due to the high number of false positives. At the very least, build monitors typically wait for additional CI cycles to run the test again to determine whether it has actually been broken by a submission, adding to the delay in identifying real problems and increasing the pool of changes that could have contributed.
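As a rough illustration (not our actual CI code), a transition detector over a per-test result history might look like the following sketch; the function name and data shapes here are hypothetical:

```python
# Hypothetical sketch of pass->fail transition detection; names and data
# shapes are illustrative, not the real CI tooling.
def find_transitions(history):
    """history: chronological list of (change_id, passed) results for one test.
    Returns (last_passing_change, first_failing_change) pairs to investigate."""
    transitions = []
    for (prev_change, prev_ok), (change, ok) in zip(history, history[1:]):
        if prev_ok and not ok:
            transitions.append((prev_change, change))
    return transitions

# The transition at "c3" is worth investigating, but with a flaky test it may
# not correspond to any real breakage.
print(find_transitions([("c1", True), ("c2", True), ("c3", False), ("c4", True)]))
```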

In addition to the cost of build monitoring, consider that the average project contains 1000 or so individual tests. To release a project, we require that all these tests pass with the latest code changes. If 1.5% of test results are flaky, 15 tests will likely fail, requiring expensive investigation by a build cop or developer. In some cases, developers dismiss a failing result as flaky only to later realize that it was a legitimate failure caused by the code. It is human nature to ignore alarms when there is a history of false signals coming from a system. For example, see this article about airline pilots ignoring an alarm on 737s. The same phenomenon occurs with pre-submit testing. The same 15 or so failing tests block submission and introduce costly delays into the core development process. Ignoring legitimate failures at this stage results in the submission of broken code.
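To make the arithmetic concrete, here is a back-of-the-envelope calculation using the numbers above (1000 tests, 1.5% flaky results) and the simplifying assumption that results flake independently:

```python
# Back-of-the-envelope math; assumes each test result flakes independently.
num_tests = 1000
flake_rate = 0.015

expected_flaky_failures = num_tests * flake_rate   # 15 flaky failures per run on average
p_fully_green_run = (1 - flake_rate) ** num_tests  # chance that no test flakes at all
print(expected_flaky_failures, p_fully_green_run)  # 15.0 and roughly 3e-07
```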

We have several mitigation strategies for flaky tests during pre-submit testing, including the ability to re-run only failing tests and an option to re-run tests automatically when they fail. We even have a way to denote a test as flaky, causing it to report a failure only if it fails 3 times in a row. This reduces false positives, but encourages developers to ignore flakiness in their own tests unless those tests start failing 3 times in a row, which is hardly a perfect solution.
Imagine a 15-minute integration test marked as flaky that is broken by my code submission. The breakage will not be discovered until 3 executions of the test complete, or 45 minutes, after which someone still needs to determine whether the test is actually broken (and needs to be fixed) or just flaked three times in a row.
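A minimal sketch of that "fails 3 times in a row" policy, with illustrative names rather than the real tooling, could look like this:

```python
# Only report a failure if the test fails on every attempt in a row.
# run_test and max_attempts are illustrative, not the actual infrastructure.
def run_with_flaky_policy(run_test, max_attempts=3):
    for _attempt in range(max_attempts):
        if run_test():
            return True   # any pass is reported as a pass
    return False          # failed max_attempts times in a row: report a failure

# For a 15-minute integration test, a real breakage needs all 3 runs
# (45 minutes) before it surfaces as a failure.
```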

Other mitigation strategies include:

A tool that monitors the flakiness of tests and, if the flakiness is too high, automatically quarantines the test. Quarantining removes the test from the critical path and files a bug for developers to reduce the flakiness. This prevents the flakiness from becoming a problem for developers, but could easily mask a real race condition or some other bug in the code being tested.
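A minimal sketch of such a quarantine check, assuming a flakiness score computed from repeated runs at an unchanged code state; the threshold and the quarantine/file_bug hooks are illustrative, not the real tool:

```python
FLAKINESS_THRESHOLD = 0.05  # assumed cut-off; the post does not give the real value

def maybe_quarantine(test_name, recent_results, quarantine, file_bug):
    """recent_results: booleans (passed?) from repeated runs of one test."""
    flakiness = recent_results.count(False) / len(recent_results)
    if flakiness > FLAKINESS_THRESHOLD:
        quarantine(test_name)            # remove the test from the critical path
        file_bug(test_name, flakiness)   # ask the owners to reduce the flakiness
```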

Another tool detects changes in a test's level of flakiness and works to identify the code change that caused the shift.
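One plausible, purely illustrative way to search for such a culprit is to estimate the flake rate at each change by repeated execution and report the first large jump; run_test_at, the run count, and the jump threshold are all assumptions:

```python
def flake_rate(run_test_at, change, runs=50):
    """Estimate a test's flake rate at a given change by running it repeatedly."""
    failures = sum(1 for _ in range(runs) if not run_test_at(change))
    return failures / runs

def find_flakiness_culprit(changes, run_test_at, jump=0.2):
    """Scan changes in order and return the first one where the flake rate jumps."""
    previous = flake_rate(run_test_at, changes[0])
    for change in changes[1:]:
        current = flake_rate(run_test_at, change)
        if current - previous > jump:
            return change      # flakiness rose sharply at this change
        previous = current
    return None                # no clear culprit found
```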

In summary, test flakiness is an important problem, and Google is continuing to invest in detecting, mitigating, tracking, and fixing test flakiness throughout our code base. For example:

We have a new team dedicated to providing accurate and timely information about test flakiness so that developers and build monitors know whether they are being affected by it.

As we analyze the data from flaky test executions, we are seeing promising correlations with features that should enable us to identify a flaky result accurately without re-running the test.

By continually advancing the state of the art for teams at Google, we aim to remove the friction caused by test flakiness from the core developer workflows.

Comments:

I hear you. Same issues, same solutions. But we have another tool up our sleeve: we have a section called Reservoir that runs all newly added tests in a loop for a week to determine whether there is any flakiness in them; during that time they are not yet part of the critical CI path. Happy to hear we are not alone. Good day.

Thanks for the great blog post! It seems that you categorize flakiness as a test issue, but the cause of a flaky test result could be in the production code and therefore be a real issue. Did you investigate how many of the flaky tests are due to a real issue? And do you give flaky test results lower priority than tests that fail every time?

We do not currently keep an accurate count of the number of times that flaky tests are really masking bugs in the code. We see it as a testing issue mostly because it makes it more difficult to use the tests for their intended purpose: finding problems with the code. From the testing system's point of view, a test that fails reliably is far better than a test that is flaky! A persistently failing test gives a clear signal about what to do, even if that means fixing the test.

Today at Google, test authors and test infrastructure developers throughout the organization are responsible for creating/using service virtualization in their tests. We do not have a central framework, other than providing generic mocking frameworks like Mockito.

I repeat this almost every day: do not write many UI system tests; they should be rare. You need to build a pyramid (http://qala.io/blog/test-pyramid.html). There are almost always possibilities to write tests at a lower level.

Often it's the separation of AQA and Dev teams that leads to flakiness, since AQA usually write system tests only. Let Devs write all(!) the tests, and the proportion of flaky tests would drop to 1:1000.

It's not only GUI tests. There are many sources of flakiness, some of which Qingzhou Luo, Farah Hariri, Lamyaa Eloussi, and I analyzed in this paper: http://mir.cs.illinois.edu/marinov/publications/LuoETAL14FlakyTestsAnalysis.pdf

Great post. I guess we all experience the same issues when it comes to large-scale automation processes. @John Micco - do you consider versions/experiments/configurations between test cycles when deciding whether a test passed Beta or should be marked as flaky?

Nice article, John! We too have this kind of issue, and we came up with a rerun concept in which you can rerun a test up to 3 times if it fails. If the test fails more than 3 times, we mark it as a failure. We have configuration to set up the rerun count. And sometimes flaky tests are also down to the way the tests are written. -Surya

It's great to hear that you're working on this. I've been fighting against flaky tests in our C++ projects as well. I've noticed that some projects are adding flaky test information to the JUnit XML results used by Jenkins, but the googletest framework doesn't yet support this (https://github.com/google/googletest/issues/727). For projects that see lots of flaky test failures, we currently re-run failing tests one time and only report a failure if the test fails twice in a row.

Marking tests as flaky is addressing the problem from the wrong direction, and it will lose potentially valuable information.

Instead, have a test monitor itself for what it does. If it fails, look at the root cause from the available information. Then, depending on what failed (for example, an external dependency), do a smart retry. Is the failure reproduced? Then fail the test!

"Marking a test as flaky" gives one permission to ignore failures, but there is potentially important and potentially actionable information there.

Instead, *use* the information to manage quality risk and/or improve the quality of the product.

MetaAutomation has patterns that describe at a high level how to do this. Don't drop information on the floor that can have value for the team and for the product!

Currently, we execute reliability runs of all of the CI tests (we try for hundreds of executions, but it depends on automation system load levels) per build to generate consistency rates. Using those numbers, we push product teams to move all tests that fall below a certain consistency level out of the CI tests. We keep them in the reliability suite for the sake of coverage and issue discovery, but do not use them to gate submission into the main code branch.
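A small sketch of the gating policy described in this comment, with an assumed 0.99 consistency bar rather than their real number:

```python
CI_GATE = 0.99  # assumed consistency bar; the comment does not state the actual level

def consistency_rate(results):
    """results: booleans (passed?) from many repeated executions of one test."""
    return sum(results) / len(results)

def partition_tests(test_results):
    """Split tests into CI-gating vs. reliability-only suites by consistency rate."""
    ci_gating, reliability_only = [], []
    for name, results in test_results.items():
        if consistency_rate(results) >= CI_GATE:
            ci_gating.append(name)
        else:
            reliability_only.append(name)
    return ci_gating, reliability_only
```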

We likewise have difficulty accounting for the costs, but ballpark estimates show it is very expensive. I have done prior analysis to demonstrate that intermittent failures cause engineers to take longer to submit. Intermittent failures have a high duplicate bug rate, and ad hoc estimates from engineers are that we lose ~20 per duplicate bug for an engineer to determine there is duplication. The costs go way beyond all of that, though, particularly as process gates shut down team productivity (failing CI tests lock a branch from changes until the failure is resolved), but also from legitimate bug escapes that were ignored because of the noise.

It is my own opinion that even after tons of effort to reduce noise from tests, flaky tests are inevitable once the test conditions reach a certain complexity. There are more stable coding patterns (mostly in product code, but also in tests) that stabilize test results, but they can only take you so far. Once you have moved the tests (e.g. converted end-to-end tests to unit tests, moved pre-release tests to TIP methodologies), you still have a core set of test problems only discoverable in an integrated end-to-end system. And those tests will be flaky. If they are not flaky, they tend to never find bugs. This is not because the tests are bad. It is because the conditions of the test, the thing that makes them flaky, are EXACTLY what caused the bug to be introduced in the first place. These bugs are scarier, riskier, and harder to find. The secret, then, is to manage them appropriately. I prefer to rely more on repetition, statistics, and runs that do not block the CI process. I prefer to data mine the test results and feed the work backlog.

Did you look at correlated unreliability? We have a number of tests that are stable in themselves, but use some form of global state (/tmp files, other global state) that causes them to fail if run together with another test. We also use a test environment that preferentially runs failing tests first, with the rest after them. That leads to the situation that if such tests ever fail, they are then run first alongside other failing tests, making them more likely to fail again, and when they succeed they are run later, making them less likely to fail again.

Of course, this makes it even harder to know if you broke something, as the test will reliably fail on your machine, but only for you, and even when you revert any changes you made.
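A contrived example of this kind of correlated unreliability: two tests that are stable in isolation but share a file under /tmp, so the outcome depends on which other tests ran earlier in the same environment (names and paths are invented for illustration):

```python
import os
import tempfile

SHARED_FILE = os.path.join(tempfile.gettempdir(), "fixture_state.txt")  # shared global state

def test_writes_shared_state():
    with open(SHARED_FILE, "w") as f:
        f.write("initialized")
    assert open(SHARED_FILE).read() == "initialized"

def test_assumes_no_shared_state():
    # Passes when run alone or before the writer; fails whenever an earlier test
    # in the same environment has already created the file.
    assert not os.path.exists(SHARED_FILE)
```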

I have created an sbt plugin to detect flaky tests in our Java/Scala projects: https://github.com/otrebski/sbt-flaky. It runs tests many times and analyzes the JUnit reports. It can also calculate trends for tests. You can check an example HTML report: http://sbt-flaky-demo.bitballoon.com

I do not have the background on the 'process' approach for categorizing, grouping, and prioritizing your tests... Still, I would like to know if a combination of exploratory testing and CI has been considered. One of the basic premises for automation is to consider software candidates that are stable and are not changed too often.

We solved this problem by re-running the failed test cases three times and checking them to find what is causing the flakiness. It is not very time-consuming, but it is a more robust solution, since 90% pass on the first try.

TestProject conducted a survey that compares AngularJS vs. ReactJS and exposes the current front-end development technologies and unit testing tool preferences of software professionals! See the results here: http://blog.testproject.io/2016/09/01/front-end-development-unit-test-automation-trends2/

As dangerous "flaky" tests that give false negatives are, giving false positives is even more dangerous. Writing test cases for misunderstood requirements could lead of incorrect validation of production code and potentially dangerous bugs to remain undetected for a long time.

I experienced such a problem while working at Nokia's manufacturing facility in Fort Worth, TX in the late 90s. An incorrect calibration (adjustment) of a camera led to a number of low-quality displays being assembled into mobile phones. The problem was discovered by QC auditing and tedious examination of test data logged by the production test stations. The "false positive" led to an unusually high prime pass yield at the test station in question, which wasn't detected because it is almost impossible to sense a problem when all the tests are passing.

* Test Results Analyzer Plugin: Displays a matrix of subsequent runs of the same tests, so you can identify which tests are occasionally red. https://wiki.jenkins-ci.org/display/JENKINS/Test+Results+Analyzer+Plugin