When tests fail on code that was previously tested, this is a strong signal that
something is newly wrong with the code. Before, the tests passed and the code
was correct; now the tests fail and the code is not working right. The goal of a
good test suite is to make this signal as clear and directed as possible.

Flaky (nondeterministic) tests, however, are different. Flaky tests are tests
that exhibit both a passing and a failing result with the same code. Given this,
a test failure may or may not mean that there's a new problem. And trying to
recreate the failure, by rerunning the test with the same version of code, may
or may not result in a passing test. We start viewing these tests as unreliable
and eventually they lose their value. If the root cause is nondeterminism in the
production code, ignoring the test means ignoring a production bug.
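To make this concrete, here is a minimal, hypothetical sketch (the function name and the 1% failure rate are invented for illustration) of a flaky test whose root cause is nondeterminism in the production code:

```python
import random

# Hypothetical "production" function with a latent race condition.
# random.random() stands in for a thread interleaving the test cannot
# control from the outside.
def fetch_counter():
    if random.random() < 0.01:  # ~1% of runs hit the race
        return 41               # stale read
    return 42                   # correct value

def test_fetch_counter():
    # Passes ~99% of the time with the exact same code -- a flaky test.
    # Because the nondeterminism lives in the production code, ignoring
    # this test means ignoring a real (if rare) production bug.
    assert fetch_counter() == 42
```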

Flaky Tests at Google

Google has around 4.2 million tests that run on our continuous integration
system. Of these, around 63 thousand have a flaky run over the course of a week.
While this represents less than 2% of our tests, it still causes significant
drag on our engineers.

If we want to fix our flaky tests (and avoid writing new ones) we need to
understand them. At Google, we collect lots of data on our tests: execution
times, test types, run flags, and consumed resources. I've studied how some of
this data correlates with flaky tests and believe this research can lead us to
better, more stable testing practices. Overwhelmingly, the larger the test (as
measured by binary size, RAM use, or number of libraries built), the more likely
it is to be flaky. The rest of this post will discuss some of my findings.
For a previous discussion of our flaky tests, see John Micco's post
from May 2016.

Test size - Large tests are more likely to be flaky

We categorize our tests into three general sizes: small, medium and large. Every
test has a size, but the choice of label is subjective. The engineer chooses the
size when they initially write the test, and the size is not always updated as
the test changes. For some tests it doesn't reflect the nature of the test
anymore. Nonetheless, it has some predictive value. Over the course of a week,
0.5% of our small tests were flaky, 1.6% of our medium tests were flaky, and 14%
of our large tests were flaky [1]. There's a clear increase in flakiness from
small to medium and from medium to large. But this still leaves open a lot of
questions. There's only so much we can learn looking at three sizes.

The larger the test, the more likely it will be flaky

There are some objective measures of size we collect: test binary size and RAM
used when running the test [2]. For these two metrics, I grouped tests into
equal-sized buckets [3] and calculated the percentage of tests in each bucket
that were flaky. The numbers below are the r2 values of the linear best fit [4].
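As a rough sketch of this methodology, assuming we have a metric value and a flaky/not-flaky flag per test (the function names are my own; the post uses around 100 equal-width buckets):

```python
import numpy as np

def flakiness_by_bucket(metric, flaky, n_buckets=100):
    """Group tests into equal-width buckets of `metric` and return, per
    non-empty bucket, the bucket midpoint and the fraction of its tests
    that were flaky."""
    metric = np.asarray(metric, dtype=float)
    flaky = np.asarray(flaky, dtype=float)  # 1 = flaky, 0 = not
    edges = np.linspace(metric.min(), metric.max(), n_buckets + 1)
    which = np.clip(np.digitize(metric, edges) - 1, 0, n_buckets - 1)
    mids, rates = [], []
    for b in range(n_buckets):
        mask = which == b
        if mask.any():                      # skip empty buckets
            mids.append(edges[b:b + 2].mean())
            rates.append(flaky[mask].mean())
    return np.array(mids), np.array(rates)

def r_squared(x, y):
    """r2 of the linear least-squares fit of y on x."""
    slope, intercept = np.polyfit(x, y, 1)
    residuals = y - (slope * x + intercept)
    return 1 - np.sum(residuals ** 2) / np.sum((y - y.mean()) ** 2)
```

The per-bucket flakiness rates are what get fit with a line; an r2 near 1 means the linear trend explains almost all of the bucket-to-bucket variation.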

Correlation between metric and likelihood of test being flaky

Metric         r2
Binary size    0.82
RAM used       0.76

The tests that I'm looking at are (for the most part) hermetic tests that
provide a pass/fail signal. Binary size and RAM use each correlated quite well
with flakiness across our tests, and there's not much difference between them.
So it's not just that large tests are likely to be flaky; the larger the tests
get, the more likely they are to be flaky.

I have charted the full set of tests below for those two metrics. Flakiness
increases with binary size [5], but we also see increasing linear fit
residuals [6] at larger sizes.

The RAM use chart below has a clearer progression and only starts showing large
residuals between the first and second vertical lines.

While the bucket sizes are constant, the number of tests in each bucket is
different. The points on the right with larger residuals include far fewer
tests than those on the left. If I take the smallest 96% of our tests (which
ends just past the first vertical line) and then shrink the bucket size, I get a
much stronger correlation (r2 is 0.94). This perhaps indicates that RAM and
binary size are even better predictors than the overall charts show.

Certain tools correlate with a higher rate of flaky tests

Some tools get blamed for being the cause of flaky tests. For example, WebDriver tests
(whether written in Java, Python, or JavaScript) have a reputation for being
flaky [7]. For a few of our common testing tools, I determined the percentage of
all the tests written with that tool that were flaky. Of note, all of these
tools tend to be used with our larger tests. This is not an exhaustive list of
all our testing tools, and represents around a third of our overall tests. The
remainder of the tests use less common tools or have no readily identifiable
tool.

Flakiness of tests using some of our common testing tools

Category                       % of tests that are flaky   % of all flaky tests
All tests                      1.65%                       100%
Java WebDriver                 10.45%                      20.3%
Python WebDriver               18.72%                      4.0%
An internal integration tool   14.94%                      10.6%
Android emulator               25.46%                      11.9%

All of these tools have higher than average flakiness. And given that 1 in 5 of
our flaky tests are Java WebDriver tests, I can understand why people complain
about them. But correlation is not causation, and given our results from the
previous section, there might be something other than the tool causing the
increased rate of flakiness.

Size is more predictive than tool

We can combine tool choice and test size to see which is more important. For
each tool above, I isolated tests that use the tool and bucketed those based on
memory usage (RAM) and binary size, similar to my previous approach. I
calculated the line of best fit and how well it correlated with the data (r2). I
then computed the predicted likelihood a test would be flaky at the smallest
bucket [8] (which is already the 48th percentile of all our tests) as well as
the 90th and 95th percentile of RAM used.
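The prediction step could be sketched as follows, assuming per-bucket flakiness rates as before (the function name is my own; note that extrapolating the line below the data can produce negative "likelihoods", which is why some table entries below are negative):

```python
import numpy as np

def predict_at_percentiles(metric, bucket_x, bucket_rate, percentiles):
    """Fit a line to per-bucket flakiness rates, then evaluate it at the
    metric values (e.g. RAM used) corresponding to the given percentiles
    of all tests."""
    slope, intercept = np.polyfit(bucket_x, bucket_rate, 1)
    cutoffs = np.percentile(metric, list(percentiles))
    return {p: slope * c + intercept for p, c in zip(percentiles, cutoffs)}
```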

Predicted flaky likelihood by RAM and tool

Category                       r2     Smallest bucket (48th percentile)   90th percentile   95th percentile
All tests                      0.76   1.5%                                5.3%              9.2%
Java WebDriver                 0.70   2.6%                                6.8%              11%
Python WebDriver               0.65   -2.0%                               2.4%              6.8%
An internal integration tool   0.80   -1.9%                               3.1%              8.1%
Android emulator               0.45   7.1%                                12%               17%

This table shows the results of these calculations for RAM. The correlation is
stronger for the tools other than the Android emulator. If we ignore that tool,
the difference in predicted flakiness between tools at similar RAM use is
around 4-5 percentage points, while the difference from the smallest bucket to
the 95th percentile within a single tool is 8-10 percentage points. This is one
of the most useful outcomes from this research: tools have some impact, but RAM
use accounts for larger deviations in flakiness.

Predicted flaky likelihood by binary size and tool

Category                       r2     Smallest bucket (33rd percentile)   90th percentile   95th percentile
All tests                      0.82   -4.4%                               4.5%              9.0%
Java WebDriver                 0.81   -0.7%                               14%               21%
Python WebDriver               0.61   -0.9%                               11%               17%
An internal integration tool   0.80   -1.8%                               10%               17%
Android emulator               0.05   18%                                 23%               25%

There's virtually no correlation between binary size and flakiness for Android
emulator tests. For the other tools, the variation in predicted flakiness
between tools at a given size is greater than it was with RAM: up to 12
percentage points. But the differences from the smallest size to the largest
within a tool are wider still: 22 points at the max. This is similar to what we
saw with RAM use, and another of the most useful outcomes of this research:
binary size accounts for larger deviations in flakiness than the tool you use.

Conclusions

Engineer-selected test size correlates with flakiness, but within Google there
are not enough test size options to be particularly useful.

Objectively measured test binary size and RAM have strong correlations with
whether a test is flaky. This is a continuous function rather than a step
function. A step function would have sudden jumps and could indicate that we're
transitioning from one type of test to another at those points (e.g. unit tests
to system tests or system tests to integration tests).

Tests written with certain tools exhibit a higher rate of flakiness. But much of
that can be explained by the generally larger size of these tests. The tool
itself seems to contribute only a small amount to this difference.

We need to be more careful before we decide to write large tests. Think about
what code you are testing and what a minimal test would look like. And we need
to be careful as we write large tests. Without additional effort aimed at
preventing flakiness, there is a strong likelihood you will have flaky tests
that require maintenance.

Footnotes

[1] A test was flaky if it had at least one flaky run during the week.

[2] I also considered the number of libraries built to create the test. In a 1%
sample of tests, binary size (0.39) and RAM use (0.34) had stronger correlations
than number of libraries (0.27). I only studied binary size and RAM use moving
forward.

[3] I aimed for around 100 buckets for each metric.

[4] r2 measures how closely the line of best fit matches the data. A value of 1
means the line matches the data exactly.

[5] There are two interesting areas where the points actually reverse their
upward slope. The first starts about halfway to the first vertical line and
lasts for a few data points, and the second goes from right before the first
vertical line to right after. The sample size is large enough here that it's
unlikely to just be random noise. There are clumps of tests around these points
that are more or less flaky than I'd expect considering only binary size. This
is an opportunity for further study.

[6] Distance from the observed point to the line of best fit.

[7] Other web testing tools get blamed as well, but WebDriver is our most
commonly used one.

[8] Some of the predicted flakiness percentages for the smallest buckets end up
being negative. While we can't have a negative percentage of tests be flaky, it
is a possible outcome using this type of prediction.

35 comments

Interesting, yet not so surprising that larger tests are more flaky. Here are a couple of quick thoughts:

1) I'm not so sure that the linear trend is a good fit, given the large variance for larger tests (it looks rather heteroscedastic to me).

2) It's surprising to me that very small tests (unit tests) are flaky. Is there a pattern in these small flaky tests?

3) The analysis of the Android emulator is interesting. I'm not too familiar with the emulator: does it exhibit non-determinism, e.g. in the form of scheduling?

1) The large variance for larger tests is actually variance due to the buckets I've put the tests into. Buckets for the larger tests contain many fewer tests than those for smaller tests, so they are more likely to deviate - and to deviate by a greater amount. The graph with the smallest 96% of tests doesn't have this issue.

You bring up a good point. The linear trend is just a default, though I tried a few others and nothing looked significantly better. Without knowing the mechanism, it's hard to say what it really should be.

2) There are a few patterns - some tests rely on random numbers, some occasionally hit time outs in test and infra code, etc. I believe that many of these tests can and do get fixed quickly, but we have enough tests that if you take a snapshot at any given moment there will be a few that have issues.

3) I'm not too familiar with the emulator either so I don't have a good answer here. It's worth looking into it further certainly.

Jeff, thanks very much for your detailed response. As for the visualization, my first thought was explicitly representing uncertainty, e.g. as done by the lmplot function in seaborn (or ggplot, ... )- for an example, see the first plot at http://seaborn.pydata.org/generated/seaborn.lmplot.html#seaborn.lmplot

Have you drawn any correlations between test behavior and flakiness? For example, capture product telemetry during the test. Compare metrics like number of events during test, number of distinct events during test, degree of action variance between iterations of test (in terms of which events show per iteration and in what order), and then see if any of those correlate with test flakiness?

I have not done any research along those lines. I think that would be interesting though choosing the correct set of events and measuring them across a large enough set of tests would be tricky. Are there specific events you had in mind? I could see certain things related to threads and locking being correlated.

In this particular case, the events are application specific. They are part of the product's own telemetry. They track things like which commands a user chose, what activities they are doing, whether specific application events were triggered for things like timers, operating system events, etc. So you get a different set of events for a word processor than you might a spreadsheet, but you also get a core set of events common to shared code.

My own observation is that events at this level exhibit a surprisingly high level of variance. A test doing the same thing every time at the command level will typically yield different sequences of events at the application layer. About 90% of the events happen every time, but the remaining 10% or so vary. And as expected, this is highly application dependent. The more a given application is multi-threaded, asynchronous or event driven, the more variance you see in the actual telemetry signature.

Additional question: have you drawn any comparisons of bug discovery rates and bug priority as relates to test flakiness?

In my own analysis, I have found that for test suites where the tests have extremely low flakiness (between .1% and 1%), bug discovery is likewise very low. We get bugs, for certain, but not nearly as many as we get for tests with much higher flake rates. Certainly product crashes and unexpected exits are much higher on tests with higher flake rates.

As expensive as flaky tests are, I am finding that there is a trade-off between low flake and discovering product bugs. Stabilizing the test behavior seems to sanitize or, if one prefers a more colorful metaphor, neuter the test. My current hypothesis is that once you get past the obvious bad-test-code problems, you are left with an amount of flakiness that is not only unavoidable, but possibly even desirable. It is an indication you are exercising areas of the code that have legitimate problems for the same reasons the flake manifests.

I think it depends how you stabilize the test and whether you then lose some of the coverage you thought you had.

I think your last point is the key: "It is an indication you are exercising areas of the code that have legitimate problems for the same reasons the flake manifests".

You're right, a flake and a bug are different manifestations of the same problem. If a test consistently fails and the problem lies in production code, we consider it a bug. If a test flakily fails and the problem lies in production code, we should also consider it a bug. Code that is highly complicated and changes rapidly is likely to lead to both bugs and flakes.

The key part of writing and testing that code is to break it down into manageable chunks to reduce both of those (unpleasant) outcomes.

I believe that as we write tests, we carefully create an answer to the question: What's the goal of this test? We put that answer into a couple lines of source code. As we maintain it later (and perhaps later the code changes or the tests are flaky), we can choose more distinct tests that break down and still achieve the same intended goal.

I agree it's important to re-assess your tests as code changes. Breaking down a test into smaller pieces is key to having clean, well-maintained tests.

"What's the goal of this test?" is a great question to answer. An equally important question is: "What code are we running for this test?" I think many system / integration tests actually have a fairly narrow answer to the first question but a very large answer to the second.

The topic is enlightening. Currently, while we use JIRA/Zephyr in manual mode, and I have worked in other more automated environments, I'm prototyping an automated test system for 'systems testing.' Preventing flaky tests certainly is a priority.

I could not determine, however, whether these stats apply to "white box" testing (unit testing of the code done by developers) or to system-type testing, where the product is exercised by systems testers and has a server-client architecture with a web app interface to the user.

I also feel the size-to-flakiness relationship is a good indicator, as the data has shown, but is size related to complexity? Is it test code growth (meaning algorithm complexity) and/or is it simple code but bigger data? That is, the test logic is sound, but it runs out of time because the data set is 10 times bigger and the timer is set too low? It doesn't complete, so there is an incomplete result (in a sense). Yes, a test bug, but the test logic itself hasn't changed; how it is run has.

Versus the Google side of it, with a system already in place and having to deal with it: what is the best way to minimize risk when building a new system? My thought is to write smaller tests with single purposes while watching the growth of the data being processed.

The stats apply to continuously run, automated tests which are generally hermetic. Some of these are unit tests, others are system or even integration type tests.

The overall size is often due to the complexity of the code rather than the size of the data but I'm sure there are exceptions. If a timer is set too low and the test becomes flaky due to organic growth of test data it should be an easy problem to fix - though that doesn't mean that it is fixed immediately.

When building a system I think you always want to start with the smallest tests you can. You will need to add larger tests at some point, but you need to understand why they're necessary and think about whether you can test it in a different, smaller way.

Excellent post; thank you for sharing these findings with the larger community. As a future post, I would be very interested in hearing any correlation between fixing flaky tests and finding bugs in production code (as opposed to the test themselves). No doubt that flaky tests are a drain in engineering resources, but dedicating scarce engineering resources to fix large (and complex) flaky tests can be a challenge. If it was established that there is a real payback in finding bugs in production code by fixing flaky tests, it would help motivate teams to dedicate time to this exercise.

Also, are you aware of any tools/scripts that can help categorize tests results over time? For instance, if a test is run daily; then over a 14 day period this test may exhibit one of the following patterns: a) pass consistently; b) fail consistently; c) pass consistently then fail consistently -> new consistent failure; d) pass consistently, then fail intermittently -> new flaky test; e) etc...

Categorizing test results over time to identify patterns (e.g. finding new consistent failures) is currently a manual exercise for us; having this automated would eliminate another manual step and would help us identify new problems as they occur. If anybody has pointers on automating the test result categorization, that would be appreciated.

It seems you would need a shared test execution and result repository system to have any shared tools or scripts for measurement. Test result schema is all too often specific to the toolset and sometimes the system under test.

That said, I am not so sure this is an arduous problem. Assuming your system stores test results in some kind of database, then establishing consistency is a matter of querying passing executions over all executions on a per test basis. If one wants to be even more precise, also keep track of the number of distinct failures per test (some tests may fail more than once but in different ways). My own definition is if a test either always fails (with the same failure) or always passes then it is consistent - otherwise it is inconsistent. This is usually a trivial query in any repository.

I do recommend changing the execution practices. It is insufficient to have a daily run of a test (I assume "daily" approximates "new build"). You ought to have multiple iterations per test per build, ideally hundreds, so that you can establish granularity to at least the .01 level. In my own experience, I find that product teams can chase a test's flake factor down to .001 or better when needed (or feasible, depending on the type of test), and to establish that you need many iterations. We started what we call a "reliability run" several years ago, and the information value of that investment has paid back many times over.
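The consistency definition above (always passes, or always fails with the same failure), plus the patterns from the earlier comment, could be sketched as follows (the function name, labels, and failure-signature encoding are my own):

```python
def classify(history):
    """Classify a test's run history. Each entry is "pass" or a failure
    signature string, so identical failures compare equal."""
    outcomes = set(history)
    if outcomes == {"pass"}:
        return "consistent-pass"
    if "pass" not in outcomes and len(outcomes) == 1:
        return "consistent-fail"        # always the same failure
    # A run of passes followed only by one repeated failure: a new,
    # consistent failure rather than a flake.
    if len(outcomes) == 2 and "pass" in outcomes:
        first_fail = next(i for i, o in enumerate(history) if o != "pass")
        if all(o != "pass" for o in history[first_fail:]):
            return "new-consistent-failure"
    return "flaky"                      # intermittent, or distinct failures
```

For example, `classify(["pass", "pass", "fail:timeout", "fail:timeout"])` returns `"new-consistent-failure"`, while `classify(["pass", "fail:timeout", "pass"])` returns `"flaky"`.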

One team within Google gathered some data regarding this. They found that when a stable test became flaky, and we could track it to a specific code change, the problem was a bug in production code 1/6th of the time.

If the default is to ignore the flaky tests then you will eventually be ignoring a real bug.

Wayne - Thanks for your reply. Our tests are generally system-level tests with custom hardware; thus, there are multiple variables at play which can lead to flaky tests. We have multiple strategies in place to help tackle this problem: a) Testers need to ensure their tests pass consistently (run at least 10 times in succession without error); b) A failure in regression is automatically re-run 3 times to determine whether the failure is consistent; c) All test results from our regression system are logged into a CSV file; d) Automated tests typically run against each check-in, but the full regression takes at least half a day, so hundreds of test runs are simply not possible. That being said, I like what was said about having "reliability runs" and I think our system will allow these to run as "best effort" to fill up spare capacity.

Jeff - thanks for the data point. I agree that flaky tests need to be fixed; it's just a matter of the priority given to addressing the failure. As pointed out elsewhere, release velocity is important too and thus trade-offs are required.

Nice post. I also agree that as tests get larger, they get more flaky. I understand the overall concept of this post, but I am kind of unfamiliar with some vocabulary. I'm not sure of the exact meaning of "binary size" - is it the size of the test? And second, I'm not sure of the meaning of "bucket" - it seems like a measure of something but I am a bit confused. I'd be thankful if somebody explains it to me :)

The tests I looked at get compiled to a single binary that gets run. This contains all the code and data needed to run the test. Binary size is the overall size of that executable.

Bucket is a grouping of tests of similar size. Every test is either flaky (1) or not (0). By putting them into a bucket with tests of similar size, we can calculate the percentage of tests within that bucket which are flaky and come up with a continuous number between 0 and 1.

It's bizarre to me to read an article about "testing" that includes no insight into the relationship between testers and the automated fact checks that you are calling "tests." Testing is what people do. The closest our tools come to testing is the fact check.

It's as if I'm reading an article about noisy coffins, and you are wondering about trends and patterns of muffled screams coming from some buried coffins, but not asking what those screams might mean in any individual case. Surely the testers at Google either know the answer or think that answer is very important to discover for each and every "flaky" program you are referring to?

I write software that helps me test. When I do that, my code may behave in "flaky" ways. If so, then what I need to do is ask what that says about the product I am testing as well as my test strategy as a whole. I do not mind if one of my programs behaves in a flaky way, if I feel I am getting good information about the product-under-test by using it. If I am not getting good information, I need to tear that program down and rethink what I'm doing.

This goes back to the goal of testing. The goal is not certainty. The goal is not determinism. That's the way people think who actually hate testing and want to get past it as fast as they can, even if it means doing lousy fake testing. My goal, instead, is insight. I craft my tools to convey important information. If a "flaky" check is providing that, then my testing (which is a human process that accumulates and integrates the facts I glean from my automation) may be just fine.

I wholeheartedly agree, but we have to remember that not all insights are the same or serve the same purpose.

Some purposes have a low noise tolerance, and a higher escape rate tolerance. Some purposes have a higher noise tolerance and a lower escape rate tolerance. It is my belief that one cannot generally optimize for both with the same test. Thus we use different tests with different levels of noise and different levels of ability to discover based on the purpose.

The trend I see right now is that release velocity is dominating everybody's psychology, which puts pressure on the low-noise requirements. This is creating back pressure on the "oops, you missed something" problem (higher discovery requirements paid for with more noise), but velocity pressure is receiving more attention. Further, it's easier for people to just think of "tests" as a simple activity with only one purpose, so the entire focus yields to high-precision, low-noise tests. Not enough time has passed to force the balance to swing back.

I'm not quite sure I understand the point here but maybe we're talking about different things. These tests that are flaky are automated tests. By design, there is no human involved in running them. There are engineers that write the test, maintain the tests, and need to diagnose them when they fail. If you're testing that 1 + 1 = 2 and occasionally your program tells you the answer is 1 that's a problem that needs to be solved.

If you're talking about manual testing, there is some extra leeway for flakiness in the test tools. But even then, doesn't the flakiness at some point become an issue that needs to be fixed? If your test tools don't give you a clear signal then how do you trust them?

I cannot account for James' story, but in my case I am talking about automated tests.

The SUT, the environment, and the test conditions are too complex in end-to-end integrated tests to drive inconsistency to zero. In fact, variance is very much the true state of the system, and bugs derive from the complexity that derives from that variance. Some automated tests should be designed to exacerbate that variance, but my experience is that even tests that are invariant and simple in their own behavior will manifest underlying variance that is intrinsic to the SUT, and out of that come the "flaky" results.

"If your test tools don't give you a clear signal then how do you trust them?"

I worry we are conflating correctness and consistency. For both, though, it is a matter of probability and what you can afford. What we do is measure the tests for consistency and separate them into groups where some are for fast/automatic decisions (gating and release procedures) and others have more room for having to filter noise (discovering bugs).

The tests which drift farther from 100% consistent (which also tend to have correctness issues - so while the variables are independent, they often have a relationship) need more statistical analysis to "trust" the signal. Teams usually adopt frequency as their rule of thumb, although some attributes of the reported failure motivate more fixes (e.g. if the entire call stack at the time the failure was reported is all in test code that manages generic UI and web page navigation, there is a tendency to punt...).

I have been putting a lot of my recent work into using customer telemetry analysis to further mine value from intermittent test failures. Something that may get ignored because it doesn't occur consistently enough in test may have other signals we can relate to something customers do that will deserve more attention. It is new ground still, so I will have to come out of the mineshaft sometime later and talk about discoveries.

It is interesting to see a statistic on where the flaky tests come from, based on code analysis. I would argue that there is another layer: I tend to see more flaky tests from juniors or people less experienced with unit testing, as they tend to write the larger type of tests. And the discussion can go on and on.

I used to think about this a lot when I was doing TDD and was in charge of a relatively small team. My thinking was that if only I could teach people how to write proper code and proper tests (that's another dimension you can make a statistic on: usually if the code under test is large then the test is large as well, and it tends to lead to more "flakiness"), then we could avoid this altogether.

However, in recent years, being in charge of larger teams, I learned that flaky tests tend to be a fact of life. I've not given up hope that we can eliminate them, but I realized that we also need a way to live with them. Training and getting better takes time (and by the time you train those guys, they leave and other fresh colleagues come in and start making the same mistakes). My advice for people in similar positions would be:

1. Isolate the effects of flaky tests on your builds by running failed tests a few more times at the end. This is different than just marking the flaky tests as ignored in two ways. First, it will be automatically maintained: the process just picks up whatever test failed and runs it again. And second, even flaky tests can catch problems, and will then fail 100% of the time and point to a real issue that might have otherwise gone unnoticed.

2. Have a couple of processes around these tests. One for getting better and not writing them in the first place. And one for dealing with the ones in your codebase; these tests are trying to tell you something.

You're getting at issues with scaling - specifically the development team. Training and education is certainly important. We put a large emphasis on this and new hires go through training on testing. They also learn from senior engineers who hopefully already have the right practices.

Your item #1 should be a short-term solution for any one test. Tests can be flaky due to production issues and isolating/ignoring them for any length of time means possibly missing a bug.

Item #2 is the long-term fix. Avoid writing new flaky tests, and get better at fixing the ones you have. You'll still need to follow item #1 but don't treat that as the only choice. Many people do.
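The rerun approach from item #1 could be sketched as follows (the function name and labels are invented; a real system would record every attempt rather than a single label, and, per the caveat above, a pass-on-retry still deserves investigation):

```python
def run_with_retries(test_fn, retries=3):
    """Run a test; on failure, re-run it up to `retries` more times.
    Any pass after a failure is labeled flaky; failing every attempt
    points at a consistent, real failure."""
    for attempt in range(1 + retries):
        if test_fn():
            return "pass" if attempt == 0 else "flaky-pass"
    return "fail"
```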

Great post, thanks. Would you say that the majority of the WebDriver tests are against web apps? What do you see from a mobile apps flakiness perspective? I know that mobile has bigger challenges when it comes to testing especially when executing across devices, OS versions etc. - just written my blog about it to try and address one of the test optimization pains i see in the market lately - would appreciate your comments on this blog or other ideas around it. https://mobiletestingblog.com/2017/05/30/optimizing-android-test-automation-development/

- The plot line of "likelihood of being flaky" versus "binary size" looks like a pearl necklace. Why? I would expect a more irregular distribution around the linear correlation line. It looks like the different tests are not independent of each other.

- It's surprising to find a (more or less) linear correlation, as large systems become chaotic. Why? For chaotic systems one would expect an exponential dependency, not a linear one.

- There are some large test setups that perform significantly better than others. Why? What do they do better than the others?

re: pearl necklace - They aren't always independent. Some tests are similar - they test the same system, or are using much of the same framework. This may also cause them to have similar binary sizes and flakiness rates. I don't know if that's the entire explanation, but it's some part of it.

re: linear correlation - I'm not sure a linear correlation is correct, but it is the simplest way to view this. Exponential had slightly better r2 values in some cases but not enough for me to say it's right. It seems like without knowing the true mechanism, it's hard to make a claim one way or another.

re: Some better/worse - Aside from Android, most of the differences look fairly small when you factor in relative sizes. My belief is that much of the flakiness in these tests comes from absolute timing (you expect something to complete in XX time and it doesn't) or relative timing (one thread occasionally executes faster than another). To some extent, it's hard for the test framework / setup to deal with all cases where that can occur.