Category Archives: intermittents

I gave an update two weeks ago on the current state of Stockwell (intermittent failures). I mentioned that additional posts were coming; this is the second post in the series.

First off, the tree sheriffs, who handle merges between branches, tree closures, backouts, hot fixes, and many other actions that keep us releasing code, perform one task that is especially important here: they star failures against a corresponding bug.

Once we get bugs annotated, we work on triaging them. Our primary tool is Neglected Oranges, which gives us a view of all failures that meet our threshold and don’t have a human comment in the last 7 days. Here is the next stage of the process:
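The Neglected Oranges query can be thought of as a simple filter over annotated bugs. Here is a minimal sketch in Python; the record fields (`failures_per_week`, `last_human_comment`) are hypothetical illustrations, not the actual tool’s data model:

```python
from datetime import datetime, timedelta

FAILURE_THRESHOLD = 30           # failures per week that put a bug on our radar
NEGLECT_WINDOW = timedelta(days=7)  # "neglected" = no human comment in this window

def neglected_oranges(bugs, now):
    """Return ids of bugs at/above the failure threshold with no recent human comment."""
    neglected = []
    for bug in bugs:
        if bug["failures_per_week"] < FAILURE_THRESHOLD:
            continue
        last_comment = bug["last_human_comment"]  # None if never commented
        if last_comment is None or now - last_comment > NEGLECT_WINDOW:
            neglected.append(bug["id"])
    return neglected

now = datetime(2017, 10, 1)
bugs = [
    {"id": 1, "failures_per_week": 45, "last_human_comment": None},
    {"id": 2, "failures_per_week": 45, "last_human_comment": now - timedelta(days=2)},
    {"id": 3, "failures_per_week": 10, "last_human_comment": None},
]
print(neglected_oranges(bugs, now))  # → [1]
```

Only bug 1 surfaces: bug 2 has a recent human comment and bug 3 is below the threshold.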

As you can see, this is very simple, and it should be simple. The ideal state is adding more information to the bug, which makes it easier for the person we needinfo (NI?) to prioritize the bug and make a decision:

While there is a lot more we could do, and much more that we have done, this seems to be the most effective approach when looking across the 1000+ bugs we have triaged so far this year.

In some cases a bug fails very frequently and there are no development resources available to fix it. These bugs will sometimes cross our 200-failures-in-30-days policy threshold and get a [stockwell disabled-recommended] whiteboard tag; we monitor this tag and work to disable the offending tests on a regular basis:

This isn’t as cut and dried as disabling every test, but we do disable as quickly as possible, and we push hard on the bugs that are not as trivial to disable.
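The disable-recommendation policy above amounts to a threshold check plus a whiteboard tag. A minimal sketch, assuming a per-bug 30-day failure count (the field names are hypothetical, not Bugzilla’s actual API):

```python
DISABLE_THRESHOLD = 200  # failures in the last 30 days, per policy
WHITEBOARD_TAG = "[stockwell disabled-recommended]"

def recommend_disable(bug):
    """Add the whiteboard tag once a bug crosses the 30-day failure threshold."""
    if (bug["failures_last_30_days"] >= DISABLE_THRESHOLD
            and WHITEBOARD_TAG not in bug["whiteboard"]):
        bug["whiteboard"] += WHITEBOARD_TAG
    return bug

hot_bug = {"failures_last_30_days": 250, "whiteboard": ""}
quiet_bug = {"failures_last_30_days": 40, "whiteboard": ""}
print(recommend_disable(hot_bug)["whiteboard"])   # → [stockwell disabled-recommended]
print(recommend_disable(quiet_bug)["whiteboard"])  # → (empty, below threshold)
```

In practice a human still reviews each tagged bug before anything is disabled; the tag only queues it for attention.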

There are many new people working on Intermittent Triage, and a clear understanding of what they are doing will help you know how a random bug ended up with a NI? to you!

It has been 6 months since the last Stockwell update. With new priorities for many months and reduced effort on Stockwell, I overlooked sending updates. While we have been spending a reasonable amount of time hacking on Stockwell, it has been less transparent.

I want to cover where we were a year ago, and where we are today.

One year ago today I posted on my blog about defining intermittent. We were just starting to focus on learning about failures. We collected data, read bugs, interviewed many influential people across Mozilla, and came up with a plan, which we presented as Stockwell at the Hawaii all-hands. Our plan was to do a few things:

Triage all failures >=30 instances/week

Build tools to make triage easier and collect more data

Adjust policy for triaging, disabling, and managing intermittents

Make our tests better with linting and test-verification

Invest time into auto-classification

Define test ownership and triage models that are scalable

While we haven’t focused 100% on intermittent failures in the last 52 weeks, we did so about half the time, and we have achieved a few things:

While that is a lot of changes, it is incremental yet effective. We started with an Orange Factor of 24+, and we now often see <12 (although last week it was closer to 14). While doing that we have added many tests, almost doubling our test load, and the Orange Factor has remained low. We still don’t consider that success: we often have 50+ bugs in a “needswork” state, and it would be more ideal to have <20 in progress at any one time. We are also still ignoring half the problem: all the failures that do not cross our threshold of 30 failures/week.
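For context, Orange Factor is a failure rate normalized by push volume: the average number of intermittent failures per push. A minimal sketch of the calculation, with purely illustrative counts (not real data):

```python
def orange_factor(total_failures, total_pushes):
    """Orange Factor: average number of intermittent failures per push."""
    return total_failures / total_pushes

# Illustrative numbers only: 2400 failures over 100 pushes gives an OF of 24,
# roughly where we started; halving failures at the same push volume gives 12.
print(orange_factor(2400, 100))  # → 24.0
print(orange_factor(1200, 100))  # → 12.0
```

Normalizing by pushes is what lets the number stay meaningful even as we nearly double the test load: more tests per push can add failures, but the metric still reflects what a developer sees per landing.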

Some statistics about bugs over the last 9 months (Since January 1st):

Category     # Bugs
---------    ------
Fixed           511
Disabled        262
Infra            62
Needswork        49
Unknown         209
Total          1093

As you can see, that is a lot of disabled tests. Note that we usually only disable a test on a subset of the configurations, not 100% across the board. Another note: “unknown” bugs are ones that were failing frequently and, for some undocumented reason, have dropped in frequency.

One other interesting piece of data: we have tried to associate the fixed bugs with a root cause. We have done this for 265 bugs, and 90 of them are actual product fixes 🙂 The rest are harness, tooling, or infrastructure fixes or, most commonly, test case fixes.

I will be doing some follow-up posts on the details of the changes we have made over the year, including:

Triage process for component owners and others who want to participate

Test verification and the future

Workflow of an intermittent, from first failure to resolution

Future of Orange Factor and Autoclassification

Vision for the future in 6 months

Please note that the 511 fixed bugs were fixed by the many great developers we have at Mozilla. These were often unplanned requests in a very busy schedule, so if you are reading this and you fixed an intermittent, thank you!

Week of Jan 02 -> 09, 2017

Turning on leak checking (bug 1325148) – note that we did this Dec 29th and whitelisted a lot; much still remains, and many great fixes have taken place

some infrastructure issues, other timeouts, and general failures

I am excited for the coming weeks as we reduce the Orange Factor back down to <7 and get the count of high-frequency bugs to <20.

Outside of these tracking stats there are a few active projects we are working on:

adding BUG_COMPONENTS to all files in m-c (bug 1328351) – this will allow us to match up triage contacts for each component, so test case ownership has a path to a live person

retrigger an existing job with additional debugging arguments (bug 1322433) – easier to get debug information, possibly extend to special runs like ‘rr-chaos’

add |mach test-info| support (bug 1324470) – allows us to get historical timing/run/pass data for a given test file

add a test-lint job to linux64/mochitest (bug 1323044) – ensure a test runs reliably by itself and in --repeat mode

While these seem small, we are currently actively triaging all bugs that are high frequency (>=30 times/week). In January, triage means letting people know a bug is high frequency and trying to add more data to the bug.