Thanks for linking to my article, Henry. The well-discussed case for not reverting is a file that was not added to a changelist before the commit/push. Other than that, it comes down to how many committers the trunk in question has. 5 developers? Sure, take your time considering a roll forward. 70 developers? Probably roll back automatically, if you can identify which commit caused the break. 20,000 developers (Google)? Prevent any breaking commit from reaching the trunk at all (barring rare accidents of timing, in which case do the auto-revert thing).

The decision as to where you are on that spectrum rests on how speedy your CI daemon is. Does it do per-commit builds quickly? Can it truly pick out which commit (of a list that it's building in parallel) broke the build? If yes and yes (and the team laments stop-the-line time), then consider auto-revert.
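For illustration, here's a minimal sketch of that auto-revert step, assuming the CI server records the SHA of the last green build and has a ./run-tests.sh entry point (both names are hypothetical):

```python
#!/usr/bin/env python3
"""Hedged sketch: find the first commit in a batched run that breaks the
build, then revert it on trunk. LAST_GREEN and ./run-tests.sh are
placeholders for whatever your CI server actually records and runs."""
import subprocess

LAST_GREEN = "abc1234"  # placeholder: SHA of the last green build

def sh(*args: str) -> str:
    return subprocess.run(args, check=True, capture_output=True, text=True).stdout

def first_breaking_commit() -> str | None:
    # Walk the batch oldest first, so a failure points at exactly one commit.
    for sha in sh("git", "rev-list", "--reverse", f"{LAST_GREEN}..HEAD").split():
        sh("git", "checkout", sha)
        if subprocess.run(["./run-tests.sh"]).returncode != 0:
            return sha
    return None

if __name__ == "__main__":
    culprit = first_breaking_commit()
    if culprit:
        sh("git", "checkout", "master")
        sh("git", "revert", "--no-edit", culprit)
        sh("git", "push", "origin", "master")  # trunk goes green again
```

Walking the batch oldest first means the first failure is attributable to exactly one commit, which is the property auto-revert depends on.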

Your #7 "they loose the opportunity to debug and understand why that failure has occurred" - no they don't - they get to revert the revert on their workstation and debug there. Hell they don't even have to wait for that. They do have an isolated place to work (no shared infra) - right?

I hadn't thought of the problem as it scales with the number of contributors. For larger teams the cost of blocking the line with a red build surely isn't worth it. I suppose this is where http://martinfowler.com/bliki/PendingHead.html starts to provide some value.

Your second point about per-commit builds is, I guess, where our team had most of our issues, and it also ties in with the points Andrew mentioned below. It was common for a build to bundle multiple commits together. Isolating the revert and retesting it alongside the other commits sometimes came at a higher perceived cost in time/effort than rolling forward.

With regard to #7: isolated, yes; identical to a deployed environment, close but not equal. Dev env: Mac OS X, app in dev mode; deployed env: a Linux distro, app in prod mode. A failure would sometimes be an unstable test failing due to a config or setup that only exists in a deployed environment - replicating it locally was difficult but not impossible. We also had people reverting commits on a red build even when the build had failed due to a flaky test, because they were rushing to get it green again.

I think this is an important discussion and I'd like to play devil's advocate here. I'll give you feedback based on my experience with build pipelines and hopefully I can convince you that the issues raised aren't due to reverting red builds.

It is a simple fix to address the problem; the time to do the fix and push is less than the time to revert and push.

Unwarranted hubris from someone/a pair who just created a failing build. You're describing a situation that has two outcomes:

The "simple fix" solves the problem: You save an (arguably) small amount of time and "fixed" the build without reverting. You didn't create a regression test so you don't know if you found out the root cause and don't know if it will happen again, either in code or in team practice.

The "simple fix" does not solve the problem: Stress increases. You must either revert (net time loss compared to reverting at the outset) or try another "simple fix" and are now a developer/pair under stress and have aggravated the team and further delayed the pipeline. Either the "simple fix" works (go to 1) or it does not (repeat 2).

Database migrations have been run in a higher-level testing environment, and it may require manual effort to undo the migrations.

Can you provide an example? I might be misunderstanding, but I don't see how a build can fail after anything related to that commit has already happened in a higher-level environment (especially destructive data migrations, as described).

Multiple commits from various pairs and apps have bunched up together and failed together in automation; reverting your change may introduce another issue, while reverting everyone's changes back to the last green could have more adverse impacts and is a bigger/riskier commit.

Trunk-based development. I have no idea how this could occur if you're doing trunk-based development with a build on every commit.

The issue is not related to your commit; it could be environment-related, and reverting will not resolve it.

Your claim that it's an environment issue is a hypothesis; the scientific method will validate it if you revert. Until you do, you cannot be sure it isn't the code alone, or a combination of environment and code.

There were multiple commits in different repositories, and reverting would mean coordinating a revert across all of them.

There should never be multiple commits per run of the pipeline, especially over multiple repositories. When practicing trunk-based development properly on a pipeline, you will only ever have one commit on one repository to deal with. This is so that you can isolate the root cause efficiently if a build fails.
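If you control the git server, one way to nudge toward that is a hook that refuses multi-commit pushes to trunk. This is a sketch under assumptions (a plain git server whose hooks you own, trunk named master), not a prescription:

```python
#!/usr/bin/env python3
"""Hedged sketch of a server-side pre-receive hook that keeps trunk pushes
to a single commit, so every pipeline run maps to exactly one change."""
import subprocess
import sys

# git feeds pre-receive one line per updated ref: "<old> <new> <refname>"
for line in sys.stdin:
    old, new, ref = line.split()
    if ref != "refs/heads/master" or old == "0" * 40:
        continue  # only police trunk; skip ref creation
    count = int(subprocess.run(
        ["git", "rev-list", "--count", f"{old}..{new}"],
        check=True, capture_output=True, text=True,
    ).stdout)
    if count > 1:
        print(f"rejected: push contains {count} commits; push one at a time")
        sys.exit(1)  # non-zero exit refuses the whole push
```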

It may be an issue only reproducible in a deployed environment; it might not be possible to reproduce locally the failure you are seeing there.

Then your environments are different and you should resolve the discrepancy (and thank your build pipeline for pointing that out before you deployed to production :).

When a pair immediately reverts their commit when they see the build has broken, they lose the opportunity to debug and understand why that failure has occurred. I have been on teams where a pair spent half a day reverting commits, attempting to stabilize the build and get it back to green, not realizing that it wasn't their commits that broke the build in the first place. By having the knee-jerk reaction to "Revert on Red", they didn't take the time to debug and understand what actually happened.

Taking half a day to revert commits is a sign of not pushing early and often. Push often. The underlying condition was that the build pipeline was not providing the pair with fast feedback, so they did not know what might have caused the failure since their last build. Reverting for half a day without knowing whether their code is at fault is a direct sign that the pipeline is not doing what it is designed to do: efficiently isolate failure.

Great points about the "simple fix" quote. Typically it is outcome 1, and the regression test is already in there - it's what failed the build - hence the easy fix. These breaks are usually due to "poor hygiene", i.e. not running tests locally before pushing. I do agree with you on 2. though; I have seen that same scenario unfold.

A bit more context for your feedback on the latter points. These were notes from a discussion that our team had on this topic and some of the arguments are specific to our particular CI/CD pipeline.

We had multiple apps, each with its own repository, and we were doing TBD for each app. An app's pipeline ran in isolation up until it produced and published an RPM. We had limited VMs to deploy to and run functional tests against for the integrated apps, so it was typical for a change in one app to be published as an RPM and then wait for the functional-test VMs to free up before being deployed; by the time the VMs were free, there would typically be at least one other app bundled up with that pipeline run.

The above, while not an ideal setup, is typical of most organisations I've worked in, as we don't have the budget to provision enough infrastructure to run each commit in isolation through the entire pipeline.

I might make an amendment to the article later on to include this background context and to also capture some of your feedback.