This is the first post in a series where I will share some ideas. These are ideas, not active projects (although these ideas could be implemented within many active projects).

My first idea surrounds the concept of AutoLand. Mozilla has talked about this for a long time. In fact, a conversation I had last week got me thinking more about the value of AutoLand versus blocking on various aspects of it. There are a handful of issues blocking us from a system where we push to Try and, if it goes well, magically land the patch on the tip of a tree. My vested interest comes in the “if it goes well” part.

The argument here has been that we have so many intermittent oranges that until we fix those we cannot determine whether a job is green. A joke for many years has been that it would be easier to win the lottery than to get an all-green run on tbpl. I have seen a lot of cases where people push to Try and land on Inbound, only to be backed out by a test failure; a test failure that was already visible on Try (for the record, I personally have done this once). I am sure someone could write a book on the human behavior, tooling, and analysis behind why failures land on integration branches when we have a try server.

My current thought is this:

* push to try server with a final patch, run a full set of tests and builds

* when all the jobs are done [1], we analyze the results of the jobs and look for 2 patterns

* pattern 1: for a given build, at most 1 job fails

* pattern 2: for a given job [2], at most 1 platform fails

* if patterns 1 and 2 pass, we put this patch in the queue for landing by the robots

[1] – we can determine the minimal number of jobs, or verify with more analysis (i.e. 1 mochitest can fail, 1 reftest can fail, 1 other can fail)

[2] – some jobs are run in different chunks: on opt builds ‘dt’ runs all browser-chrome/devtools jobs, but on debug builds this is split into ‘dt1’, ‘dt2’, ‘dt3’
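The two patterns above can be sketched as a simple check. This is a hypothetical illustration, not real tbpl data: the result format and the build/job names are assumptions.

```python
def is_probably_green(results):
    """Check the two failure patterns for a try push.

    results: dict mapping (build, job) -> True for pass, False for fail,
             e.g. ("linux-opt", "mochitest-1") -> True.

    Pattern 1: for a given build, at most 1 job fails.
    Pattern 2: for a given job, at most 1 platform (build) fails.
    """
    failures = [key for key, passed in results.items() if not passed]

    fails_per_build = {}
    fails_per_job = {}
    for build, job in failures:
        fails_per_build[build] = fails_per_build.get(build, 0) + 1
        fails_per_job[job] = fails_per_job.get(job, 0) + 1

    pattern1 = all(n <= 1 for n in fails_per_build.values())
    pattern2 = all(n <= 1 for n in fails_per_job.values())
    return pattern1 and pattern2
```

The counting is deliberately coarse: any build with two or more failed jobs, or any job failing on two or more builds, disqualifies the push from the autoland queue.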

This simple approach would give us the confidence that we need to reliably land patches on integration branches and achieve the same if not better results than humans.

As a bonus, we could optimize our machine usage by not building and running all jobs on the integration commit, since we already have a complete set of results from the try server.

In my opinion we could let developers star their own Try push. If they star all their failures, then the push could be auto-landed (assuming it wasn’t simply a try-only run). This would also cause developers to look more carefully at their own failures, and they would be less likely to submit to integration when their push had a Try failure.

BenWa, are you suggesting that the only method for landing on a tree is to have a completely green (or annotated) push on Try? That would be an interesting concept and might make for a more streamlined model. Couple that with my idea for detecting “safe” patches and we could auto-land 75% of patches, leaving the remainder to require manual intervention or a secondary push.

Being super-strict about zero oranges in the face of known uncertainty seems like a form of cognitive bias. We know, as a fact, that on any given run there are highly likely to be some unexpected test results that don’t represent real regressions. We also know, as a fact, that our test coverage is imperfect and that we can regress many things without ever noticing. However because the former are visible and the latter are invisible, we worry much more about the risk that a scheme to automatically filter out suspected noise will temporarily hide some real regressions than we do about the real regressions that we can’t detect in an automated way. So, for example, the inability to consistently get a 100% green run is used to block progress on infrastructure that will help productivity, but the absence of 100% code coverage (or whatever metric you want to use) in our testsuites doesn’t block us from adding new features rather than adding new tests.

I’m pretty sure there’s a lot of ways that we can make reasonably reliable determinations of whether a given orange is likely to be noise or real. For example we can, as Joel suggests, correlate results across platforms. We can also match the failure to known bugs, or try rerunning the test. With better data retention we can look at the history of the test, and match against failure patterns when it found a real regression in the past. Probably with more thought there are more and better things that we could do. But we have to first accept the idea that there is a lot of uncertainty in our regression detection already, and we shouldn’t let the visibility of the particular piece of uncertainty in question affect our judgement of how to handle it.

Thanks James! A great view on this. I like the idea of retriggering a failure to see if it goes green. The danger here is that we ignore intermittent issues; we would need to track the number of retriggers to make sure we are not getting worse (i.e. we start with 5 retriggers/push and a year later we have 20 retriggers/push).
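Tracking the retrigger rate over time could be as simple as comparing a baseline window of pushes against a recent one. A minimal sketch, assuming we can collect a list of retrigger counts per push; the 1.5x threshold is an arbitrary illustration, not an agreed-upon limit:

```python
def mean_retriggers_per_push(retrigger_counts):
    """Average number of retriggers per push over a window of pushes."""
    if not retrigger_counts:
        return 0.0
    return sum(retrigger_counts) / len(retrigger_counts)

def retrigger_rate_regressed(baseline_window, current_window, threshold=1.5):
    """Flag when the current window's retrigger rate exceeds the
    baseline rate by more than the given factor."""
    baseline = mean_retriggers_per_push(baseline_window)
    current = mean_retriggers_per_push(current_window)
    return current > threshold * baseline
```

With a dashboard on top of this, a drift from 5 retriggers/push toward 20 would be caught long before it becomes the norm.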

What BenWa hints at is a way to force a developer to star all the oranges; once they are all starred, we can land on an integration branch. That has a lot of potential, though a concern is that developers will just pick whatever the magic auto-star suggestion is and not actually look at the errors.

Any way to reduce the human involvement is a win. Any way to fix our process is also a win.

It’s a good idea to land based on an optimistic estimate of whether a push is good. But your proposed criteria aren’t going to work. A real problem will often manifest itself as a failure on a single type of job. The special-purpose jobs are an obvious example (e.g. if you introduce a rooting hazard, only the SM(Hf) build will show it), but even if you exclude those, it’s not uncommon for a problem to be specific to a single platform (and opt- or debug-only). Heck, if it really *is* rare, then we’re wasting resources with redundant tests. 😉

I definitely like the general notion of not blocking autoland on a perfect determination of whether failures are intermittent or not. You could start out by saying only fully green pushes will land, which would basically never happen, and then add in the “green except for starred” rule, and/or the “everything green in at least one of the initial + retriggered runs” rule. Most things would still land as they do now, but gradually more and more would start using autoland.
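The staged rollout described above could be expressed as three increasingly permissive eligibility checks. A hypothetical sketch; the job-result structure (a list of run results per job, plus a starred flag) is an assumption for illustration:

```python
def eligible_strict(jobs):
    """Stage 1: every job green on its first run (fully green push)."""
    return all(job["runs"][0] == "green" for job in jobs)

def eligible_with_stars(jobs):
    """Stage 2: green on the first run, or the failure was starred."""
    return all(job["runs"][0] == "green" or job.get("starred")
               for job in jobs)

def eligible_with_retriggers(jobs):
    """Stage 3: every job green in at least one of its runs
    (the initial run plus any retriggers)."""
    return all("green" in job["runs"] for job in jobs)
```

Each stage strictly widens the set of pushes that qualify, so autoland can be rolled out gradually without ever landing something a stricter stage would have rejected for a new reason.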

Note that using starring requires implementing machine-understandable semantics for the stars: you don’t want to autoland something just because you starred it “will be fixed by backing out XXXX”. Stars should be categorized into “intermittent”, “fixed by this later push”, “infra”, etc. Or, for now, institute a rule of “no starring on Try unless you’re declaring the failure to be intermittent or otherwise not the fault of the push”, since we rarely star try pushes now anyway.
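As a sketch of what machine-readable star semantics could look like: a push only auto-lands when every failure carries a category that declares it not the fault of the push. The category names follow the comment above and are not an existing tbpl schema.

```python
# Categories whose presence on a failure should not block autoland;
# anything else ("fixed by this later push", free-form text, etc.)
# implies the push itself may be at fault.
SAFE_TO_LAND = {"intermittent", "infra"}

def push_can_autoland(failure_stars):
    """failure_stars: list of (failure_name, category) pairs, one per
    failed job on the push. An all-green push has an empty list and
    trivially qualifies."""
    return all(category in SAFE_TO_LAND
               for _failure, category in failure_stars)
```

A free-form star like “will be fixed by backing out XXXX” would not match any safe category, so the push stays out of the autoland queue.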

The other nasty bit is dealing with merges. By the time automation figures out that a try push is good, it’ll be based on an hours-old head. Solvable, but it requires coding and UI work.

Thanks for your thoughts on this, Steve. I am looking forward to seeing what we can do with autoland. In fact, I plan to experiment with this: look at a history of jobs and failures on the integration branches and compare that with how we star the oranges. Maybe for solo-style jobs like SM(Hf) we don’t apply the patterns. As for the full matrix of tests and the patterns I mentioned, I need to see evidence that we have issues which commonly fall outside of those patterns; it might need some adjustments after analyzing a few months’ worth of data.

I’m also very hesitant to ignore intermittent oranges. I think simply having an autoland mechanism will quickly change our culture to be less accepting of intermittent oranges, and the intermittent orange rate will go down substantially. Intermittent failures are often real bugs that we shouldn’t have, and we’re better off backing things out for causing new intermittent failures just like we back things out for causing new reliable failures.