
Intermittent Oranges (tests which fail sometimes and pass other times) are an ever-increasing problem for test automation at Mozilla.

While there are many common causes of failures (bad tests, the environment/infrastructure we run on, and bugs in the product), we still do not have a clear definition of what we view as intermittent. Some common statements I have heard:

“It’s obvious, if it failed last year, the test is intermittent.”

“If it failed 3 years ago, I don’t care, but if it failed 2 months ago, the test is intermittent.”

“I fixed the test to not be intermittent, I verified by retriggering the job 20 times on try server.”

These imply very different definitions of what is intermittent. A definition will need to:

determine if we should take action on a test (programmatically or manually)

define policy that sheriffs and developers can use to guide work

guide developers to know when a new/fixed test is ready for production

provide useful data to release and Firefox product management about the quality of a release

Since I wanted a clear definition of what we are working with, I looked over 6 months (2016-04-01 to 2016-10-01) of OrangeFactor data (7330 bugs, 250,000 failures) to find patterns and trends. I was surprised at how many bugs had <10 instances reported (3310 bugs, 45.1%). Likewise, I was surprised that such a small number of bugs (1236) account for >80% of the failures. It made sense to look at things daily, weekly, monthly, and every 6 weeks (our typical release cycle). After much slicing and dicing, I have come up with 4 buckets:

Random Orange: this test has failed, even multiple times in history, but in a given 6-week window we see <10 failures (45.2% of bugs)

Low Frequency Orange: this test might fail up to 4 times in a given day, typically <=1 failure per day. In a 6-week window we see <60 failures (26.4% of bugs)

Intermittent Orange: fails up to 10 times/day or <120 times in 6 weeks (11.5% of bugs)

High Frequency Orange: fails >10 times/day on many days and is often seen in try pushes (16.9% of bugs, or 1236 bugs)

Alternatively, we could simplify our definitions (a classification sketch in code follows this list) and use:

low priority or not actionable (buckets 1 + 2)

high priority or actionable (buckets 3 + 4)
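To make the buckets concrete, here is a minimal sketch of the slicing and classification described above. It assumes a hypothetical CSV export of OrangeFactor data with ‘bug_id’ and ‘date’ columns (the real schema may differ), and the exact boundary conditions are my reading of the thresholds:

    import pandas as pd

    # Hypothetical CSV export of OrangeFactor data: one row per starred
    # failure, with 'bug_id' and 'date' columns (the real schema may differ).
    failures = pd.read_csv("orangefactor_failures.csv", parse_dates=["date"])

    # Per bug: the worst single day and the worst 6-week (42-day) window.
    daily = failures.set_index("date").groupby("bug_id").resample("D").size()
    windowed = failures.set_index("date").groupby("bug_id").resample("42D").size()
    worst_day = daily.groupby("bug_id").max()
    worst_window = windowed.groupby("bug_id").max()

    def classify(per_day_max, per_6_weeks):
        """Map failure counts onto the 4 buckets; boundary choices are mine."""
        if per_6_weeks < 10:
            return "1: random orange"
        if per_day_max <= 4 and per_6_weeks < 60:
            return "2: low frequency orange"
        if per_day_max <= 10 and per_6_weeks < 120:
            return "3: intermittent orange"
        return "4: high frequency orange"

    buckets = [classify(worst_day[b], worst_window[b]) for b in worst_window.index]
    print(pd.Series(buckets).value_counts())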

Does defining these buckets by the number of failures in a given time window help us with what we are trying to solve with the definition?

Determine if we should take action on a test (programmatically or manually):

ideally buckets 1/2 can be detected programmatically with autostar and removed from our view, possibly rerunning to validate it isn’t a new failure.

buckets 3/4 have the best chance of reproducing; we can run them in debuggers (like ‘rr’) or triage them to the appropriate developer when we have enough information

Define policy that sheriffs and developers can use to guide work

sheriffs can know when to file bugs (either bucket 2 or 3 as a starting point)

developers understand the severity based on the bucket. We will still need a lot of context, but understanding severity is important.

Guide developers to know when a new/fixed test is ready for production

If we fix a test, we want to ensure it is stable before we make it tier-1. A developer can use the math of ~300 commits/day to ensure we see enough passing runs (see the sketch after the note below).

NOTE: SETA and coalescing ensure we don’t run every test for every push, so a given test more likely sees ~100 runs/day
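As a back-of-the-envelope illustration (my arithmetic, not an established process), those run counts translate into how long we would have to watch a fixed test before trusting it:

    import math

    def runs_needed(p, confidence=0.95):
        """Consecutive green runs needed to be `confidence` sure the true
        failure rate is below p; the chance of n straight passes is (1-p)**n."""
        return math.ceil(math.log(1 - confidence) / math.log(1 - p))

    # To claim fewer than 1 failure per 6-week window (~100 runs/day * 42 days
    # = ~4200 runs), p is roughly 1/4200:
    print(runs_needed(1.0 / 4200))  # ~12500 green runs, i.e. months of data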

Provide useful data to release and Firefox product management about the quality of a release

Release Management can take the OrangeFactor into account

new features might be required to have a certain volume of tests at or below the Random Orange threshold

One other way to look at this is what gets posted in bugs (the War on Orange Bugzilla robot). There are simple rules:

15+ times/day – post a daily summary (bucket #4)

5+ times/week – post a weekly summary (bucket #3/4 – about 40% of bucket 2 will show up here)
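Restated as code (a sketch of the stated rules, not the robot’s actual implementation):

    def summary_to_post(failures_today, failures_this_week):
        """The posting rules above as a simple decision function."""
        if failures_today >= 15:
            return "daily summary"   # roughly bucket 4
        if failures_this_week >= 5:
            return "weekly summary"  # buckets 3/4, plus some of bucket 2
        return None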

Lastly, I would like to cover some exceptions and why some might see this as flawed:

missing or incorrect data in OrangeFactor (human error)

some issues have many bugs but a single root cause, so we could miscategorize a fixable issue

I do not believe adjusting the definition will fix the above issues; different tools or methods for running the tests might reduce those concerns.

With all this talk of intermittent failures and folks coming up with ideas on how to solve them, I figured I should dive head first into looking at failures. I have been working with a few folks on this, specifically :parkouss and :vaibhav1994. In this experiment (actually the second time doing so), I take a given intermittent-failure bug and retrigger it. If it reproduces, I go back in history looking for where it became intermittent. This weekend I wrote up some notes as I was trying to define what an intermittent is.

Let’s outline the parameters for this experiment first (a query sketch in code follows the list):

All bugs marked with keyword ‘intermittent-failure’ qualify

Bugs must not be resolved or assigned to anybody

Bugs must have been filed in the last 28 days (we only keep 30 days of builds)

Bugs must have >=20 tbplrobot comments (arbitrarily picked; ideally we want something that can easily be reproduced)
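Pulling these candidates out of Bugzilla might look like the sketch below, using the Bugzilla REST API; the robot’s account name and the unassigned-bug query are my assumptions:

    import requests
    from datetime import date, timedelta

    BZ = "https://bugzilla.mozilla.org/rest"

    # Open, unassigned bugs with the intermittent-failure keyword,
    # filed in the last 28 days.
    params = {
        "keywords": "intermittent-failure",
        "resolution": "---",
        "assigned_to": "nobody@mozilla.org",  # assumed default assignee
        "creation_time": (date.today() - timedelta(days=28)).isoformat(),
        "include_fields": "id,summary",
    }
    bugs = requests.get(BZ + "/bug", params=params).json()["bugs"]

    def robot_comments(bug_id, robot="tbplbot@gmail.com"):  # assumed account
        data = requests.get("%s/bug/%d/comment" % (BZ, bug_id)).json()
        comments = data["bugs"][str(bug_id)]["comments"]
        return sum(1 for c in comments if c["creator"] == robot)

    candidates = [b for b in bugs if robot_comments(b["id"]) >= 20]
    print("%d candidate bugs" % len(candidates))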

Here is what came out of this:

356 bugs are open, not assigned and have the intermittent-failure keyword

25 bugs have >=20 comments meeting our criteria

The next step was to look at each of the 25 bugs and see if it made sense to do this. In fact, I decided not to take action on 13 of the bugs (remember, this is an experiment, so my reasoning for ignoring these 13 could be biased):

2 bugs had patches and 1 had a couple of comments indicating a developer was already looking at it

This leaves us with 12 bugs to investigate. The premise here is easy: find the first occurrence of the intermittent (branch, platform, test job, revision) and re-trigger it 20 times (a number picked for simplicity). When the results are in, see if we have reproduced it. In fact, only 5 bugs reproduced the exact error in the bug when re-triggered 20 times on a specific job that showed the error.
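The decision logic looks roughly like this sketch; the actual retriggering (done with real tooling) is hidden behind a hypothetical helper, and the widening steps match the pattern described next:

    RETRIGGERS = 20  # picked for simplicity, as above

    def count_failures(revision, job, n=RETRIGGERS):
        """Hypothetical helper: retrigger `job` at `revision` n times and
        return how many runs hit the exact failure from the bug."""
        raise NotImplementedError("stand-in for the real retrigger tooling")

    def find_oldest_bad(pushlog, job):
        """Walk back through the pushlog (newest first), widening the step,
        until the failure stops reproducing; returns the oldest revision
        known to reproduce.  A real hunt would then narrow between that
        revision and the first green one."""
        step, i = 2, 0
        oldest_bad = pushlog[0]
        while i + step < len(pushlog):
            i += step
            if count_failures(pushlog[i], job) > 0:
                oldest_bad = pushlog[i]
                step += 2  # get more aggressive when no pattern emerges
            else:
                break
        return oldest_bad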

Moving on, I started re-triggering jobs back in the pushlog to see where the failure was introduced. I started off going back 2/4/6 revisions, but got more aggressive as I didn’t see patterns. Here is a summary of what the 5 bugs turned out like:

Bug 1161052 – Jetpack test failures. There are so many failures in the same test file that it isn’t very clear whether I am reproducing the failure or finding other ones. :erikvold is working on fixing the test in bug 1163796 and OK’d disabling it if we want to.

Bug 1161537 – OSX 10.6 Mochitest-other. Bisection didn’t find the root cause, but this is a new test case which was added. This is a case where the new test case could have run 100+ times successfully when it was added, then failed when it merged with other branches a couple of hours later!

Bug 1155423 – Linux debug reftest-e10s-1. This reproduced 75 revisions in the past, so I looked at what had changed in our buildbot-configs and mozharness scripts. It turns out we had just turned on this test job (we hadn’t been running e10s reftests on debug prior to this), and that caused the problem. This can’t be tracked down by re-triggering jobs into the past.

In summary, out of 356 bugs, 2 root causes were found by re-triggering. In terms of time invested, I have put in about 6 hours to find the root cause of the 5 bugs.

There are all kinds of great ideas folks have for fixing intermittent issues. In fact, each idea in and of itself is a worthwhile endeavor. I have spent some time over the last couple of months fixing them, filing bugs on them, and really discussing them. One question remains: what is the definition of an intermittent?

I don’t plan to lay out a definition; instead I plan to ask some questions and lay out some parameters. According to OrangeFactor, there are 4640 failures in the last week (May 3 -> May 10), all within 514 unique bugs. These are all failures that the sheriffs have done some kind of manual work on to star on Treeherder. I am not sure anybody can find a way to paint a pretty picture to make it appear we don’t have intermittent failures.

Looking at a few bugs, there are many reasons for intermittent failures:

Firefox code (we have actually introduced conditions that cause real failures, just not every time)

Real regressions (failures which happen every time we run a test)

There are a lot of reasons, and many of them have nothing to do with poor test cases or bad code in Firefox. But many of these show up many times a day, and for a developer who wants to fix a bad test, many are not really actionable. Do we need some part of a definition to cover whether a failure is actionable?

Looking at the history of ‘intermittent-failure’ bugs in Bugzilla, many occur once and never occur again. In fact, this is the case for over half of the bugs filed (we file upwards of 100 new bugs/week). While there are probably reasons for a given test case to fail, if it failed in August 2014 and has never failed again, is that test case intermittent? As a developer, could you really do anything about it, given that reproducing it is virtually impossible?

This is where I start to realize we need a way to identify real intermittent bugs/tests and not clutter the statistics with tests which are virtually impossible to reproduce. Thinking back to what is actionable: I have found that when filing bugs for Talos regressions, the closer the bug is filed to the original patch landing, the better the chance it will get fixed. Adding to that point, we only keep 30 days of builds/test packages around for our CI automation. I really think a definition of an intermittent needs some concept of time. Should we ignore intermittent failures which occur only once in 90 days? Maybe ignore ones that don’t reproduce after 1000 iterations? Some could argue for a smaller or larger window of time/iterations.

Lastly, when looking into specific bugs, I often find they are already fixed. Many of the intermittent failures are actually fixed! Do we track how many get fixed? How many have patches or already have debugging taking place? For example, in the last 28 days we have filed 417 intermittents, of which 55 are already resolved; of the remaining 362, only 25 have occurred >=20 times, and of those 25 bugs, 4 already have patches. It appears a lot of work is done to fix intermittent failures which are actionable. Are the ones which are not being fixed simply not actionable? Are they in a component where all the developers are busy and heads down?

In a perfect world a failure would never occur, all tests would be green, and all users would use Firefox. In reality we have to deal with thousands of failures every week, most of which never happen again. This quarter I would like to see many folks get involved in discussions and determine:

what is too infrequent to be intermittent? We can call this noise.

what is the general threshold where something is intermittent?

what is the general threshold where something is too intermittent and we need to back out a fix or disable a test?

what is a reasonable timeframe to track these failures such that we can make them actionable?

Thanks for reading; I look forward to hearing from many who have ideas on this subject. Stay tuned for an upcoming blog post about re-triggering intermittent failures to find the root cause.

For the last 2 weeks I have gone head first into a world of resolving some issues with our mochitest browser-chrome tests, with RyanVM, Armen, and the help of Gavin and many developers who are fixing problems left and right.

There are 3 projects I have been focusing on:

1) Moving our Linux debug browser-chrome tests off our old Fedora slaves in a datacenter and running them on EC2 slave instances, in bug 987892.

These are live and green on all Firefox 29, 30, and 31 trees! More work is needed for Firefox 28 and ESR 24, which should be wrapped up this week. Next week we can stop running all Linux unittests on Fedora slaves.

2) Splitting all the developer tools tests out of the browser-chrome suite into their own suite, in bug 984930.

browser-chrome tests have been a thorn in the side of the sheriff team for many months. More and more, the rapidly growing features and tests of developer tools have been causing the entire browser-chrome suite to fail, and in the case of debug jobs, to run for hours. Splitting these tests out gives us a small shield of isolation. In fact, we have this running well on Cedar, and we are pushing hard to have it rolled out to our production and development branches by the end of this week!

Of all the test suites run on tbpl, mochitests are the last to receive manifests. As of this morning, we have landed all the changes we can: all our tests are now defined in mochitest.ini files, and the entries in b2g*.json have been removed in favor of entries in the appropriate mochitest.ini files.
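For illustration, a manifest entry looks roughly like this (hypothetical test names; the skip-if line shows how a former b2g*.json exclusion is expressed in the manifest):

    [DEFAULT]
    support-files = head.js

    [test_example_1.html]
    [test_example_2.html]
    skip-if = buildapp == 'b2g'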

Ahal has done a good job of outlining what this means for b2g in his post. As mentioned there, this work was done by a dedicated community member, :vaibhav1994, as he continues to write patches, investigate failures, and repeat until success.

For those interested in the next steps, we are looking forward to removing our build-time filtering and starting to filter tests at runtime. This work is being done by billm in bug 938019. Once that lands, we can start querying which tests are enabled/disabled per platform and track that over time!
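That query could look something like this sketch, assuming the manifestparser package that backs these manifests (the keyword arguments mirror mozinfo values and are illustrative):

    from manifestparser import TestManifest

    # Load a manifest and evaluate skip-if/run-if conditions against one
    # platform's mozinfo values.
    manifest = TestManifest(manifests=["mochitest.ini"], strict=False)
    tests = manifest.active_tests(exists=False, disabled=True,
                                  os="linux", buildapp="b2g")

    disabled = [t["name"] for t in tests if "disabled" in t]
    print("%d of %d tests disabled on this platform" % (len(disabled), len(tests)))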