Commit loss prevention

Now that the commits have been recovered and things are almost back to normal, I think it's time to think about how to prevent this kind of incident in the future.

Our open commit access policy was partly made possible by the idea that any bad commit can always be rolled back. But what I failed to think through was that changes to refs aren't themselves version controlled, and so it is possible to lose commits through incorrect ref manipulation, such as "git push -f", or by deleting a branch.

I still feel strongly that we should maintain the open commit access policy. This is how we've been operating for the longest time, and otherwise adding/removing developers to repositories would be prohibitively tedious.

So my proposal is to write a little program that uses the GitHub events API to keep track of push activity in our repositories. For every update to a ref in a repository, we can record the timestamp, the SHA1 before and after, and the user ID. We can maintain a text file for every ref in every repository, and the program can append lines to it. In other words, we effectively recreate a server-side reflog outside GitHub.
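
A minimal sketch of the per-ref log described above. The directory layout (`reflog_root/<owner>/<repo>/<ref>.log`), the tab-separated line format, and the function names are assumptions made for illustration; the proposal itself only says "a text file for every ref" with appended lines.

```python
import os
from datetime import datetime, timezone


def ref_log_path(reflog_root, repo, ref):
    """Map e.g. ('jenkinsci/jenkins', 'refs/heads/master') to a log file path."""
    # Assumed layout: <root>/<owner>/<repo>/<ref components>.log
    return os.path.join(reflog_root, *repo.split("/"), *ref.split("/")) + ".log"


def record_ref_update(reflog_root, repo, ref, before, after, user, when=None):
    """Append one 'timestamp, old SHA1, new SHA1, user' line for a ref update."""
    when = when or datetime.now(timezone.utc)
    path = ref_log_path(reflog_root, repo, ref)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    line = "\t".join([when.isoformat(), before, after, user]) + "\n"
    # Append-only: earlier entries are never rewritten.
    with open(path, "a", encoding="utf-8") as f:
        f.write(line)
    return path
```

Because the file is only ever appended to, the last line for a ref always gives its most recently recorded SHA1, and earlier lines give the history needed to restore an overwritten or deleted ref.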

The program should also fetch commits, so that it has a local copy of every commit that ever landed in our repositories. Doing this also allows the program to detect non-fast-forward updates. It should warn us in that situation, and it should also create a local ref on the overwritten commit to prevent it from getting lost.
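
The non-fast-forward check can be sketched as follows. This is an illustration under stated assumptions, not the actual tool: the commit graph is passed in as a plain dict mapping each SHA1 to its parent SHA1s, the `refs/rescued/` namespace is a hypothetical name, and `rescue_refs` is just a dict standing in for refs that would be created in the local mirror. Against a real repository, `git merge-base --is-ancestor` and `git update-ref` would presumably do the same job.

```python
ZERO = "0" * 40  # all-zero SHA1, used here to mean "ref did not exist"


def is_ancestor(parents, ancestor, descendant):
    """Return True if `ancestor` is reachable from `descendant` via parent links."""
    stack, seen = [descendant], set()
    while stack:
        sha = stack.pop()
        if sha == ancestor:
            return True
        if sha in seen:
            continue
        seen.add(sha)
        stack.extend(parents.get(sha, ()))
    return False


def check_ref_update(parents, ref, before, after, rescue_refs):
    """Classify a ref update; on a rewrite or deletion, pin the old commit."""
    if before == ZERO:
        return "created"
    if after != ZERO and is_ancestor(parents, before, after):
        return "fast-forward"
    # The old tip is no longer reachable from this ref, so keep it alive
    # under a rescue ref (hypothetical namespace).
    short_name = ref.split("/", 2)[-1]  # 'refs/heads/master' -> 'master'
    rescue_refs["refs/rescued/%s/%s" % (short_name, before[:12])] = before
    return "deleted" if after == ZERO else "non-fast-forward"
```

Anything other than "created" or "fast-forward" is the case that should trigger a warning.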

We can then make these repositories accessible via rsync to encourage people to mirror them for backup, or we can make them publicly accessible by hosting them on GitHub as well, although the latter could be confusing.

With a scheme like this, pushes can be safely recorded within a minute or so (and this number can go down even further if we use webhooks). If data loss occurs before the program gets to record newly pushed commits, we should still be able to record who pushed afterward, and so identify who has the commits that were lost. With such a small time window between the push and the record, the number of such lost commits should be low enough that we can recover them manually.

The only concern is throttling on the GitHub API: it would then be better to do the scripting on a local mirror of the GitHub repos. When the mirror receives a forced update, it still has all the previous commits and the full reflog.

However, as you said, triggering via webhook would reduce the number of API calls to the minimum.

I would submit a proposal to the Git mailing list for a "fetch by SHA1", which is a missing feature in Git IMHO.

Thanks to everyone, including GitHub, for the help and cooperation in getting this sorted out!

On 12/11/13 07:25, Kohsuke Kawaguchi wrote:
> I still feel strongly that we maintain the open commit access policy.
> This is how we've been operating for the longest time, and it's also
> because otherwise adding/removing developers to repositories would be
> prohibitively tedious.

I agree that the policy of allowing everyone to have a repo and to
commit relatively freely remains a good idea, but having the option to
give new developers push access to 1100 repositories due to how GitHub
teams and our IRC bot work is an issue that has been raised before:

Would it be reasonable to suggest that we remove the option to add
people to the "Everyone" team from IRC and, if GitHub still adds
newly-forked repos to every team by default, that we have some sort of
process to automatically clean up the teams, as mentioned in that thread?

I think part of the issue is that our canonical repositories are on github...

I would favour jenkins-ci.org being masters of its own destiny... hence I would recommend hosting canonical repos on project owned hardware and using GIT as a mirror of those canonical repositories... much like the way ASF uses GIT. That would allow us to implement policies such as preventing forced push to specific branches, etc...

Maven will then do the "right thing" for pushing releases *even if you checkout from github*... and we just have the canonical repos force push to github and put proper permission sets on the canonical repos... most developers will thus see no effective difference :-)

At CollabNet we have already implemented what we call History Protection. We have put some thought into this topic and come up with a solution for unintended force pushes and branch deletions. Maybe you can reuse some of our approaches. Here is a short description of the feature.

History Protection is an extension to Gerrit Code Review that creates a special ref whenever somebody deletes a branch or force pushes.

When a force push occurs, our plugin creates a special ref in refs/rewrites. This ref points to the old version of the branch, and its name contains additional information such as the rewritten branch name, the SHA-1 of the new head commit, a timestamp, and the name of the user who actually did the push.

The same goes for branch deletion, but in that case the new ref is created in refs/deleted.

Our plugin also blocks write access to refs/rewrites and refs/deleted (so nobody can modify them), but anybody can read those refs in order to recreate deleted or overwritten history.

Other than that, it sends an email to the Gerrit Administrators group and puts an entry in the audit log.

You can find more about this in my blog posts [1][2] and YouTube video [3].

When you say 'canonical' in this proposal, do you mean the repositories used for making releases, or the repositories where development (and especially, pull requests) would be handled?

If it's the former, I could see that being worthwhile, especially if *nobody* has permissions to push to the canonical repositories; if a developer pushes code to the master branch of their repo on GitHub, they'd have to wait a short time for that update to be mirrored to the release repo before they could make a release. Of course, this would put extra pressure on the people who are maintaining the project infrastructure, to be sure that this mirroring process is working reliably all the time.



> When you say 'canonical' in this proposal, do you mean the repositories used for making releases

I mean that they are the "official" repositories and all others are just mirrors... this is the way GIT at ASF works...

> , or the repositories where development (and especially, pull requests) would be handled?
>
> If it's the former, I could see that being worthwhile, especially if *nobody* has permissions to push to the canonical repositories;

No, I'd let developers push to the canonical repositories... just not `git push --force`. There is a set of git permissions that basically ensures you cannot rewrite the past, and those would be applied to the canonical repositories. I would then perhaps prevent developers from pushing to github... but there are possibly ways to permit that.

Pull requests, forking etc would still work at github though, so no major change there... this would just introduce a set of "one true repositories"

Well, that would mean that merging a pull request on GitHub (especially the quick way, using the web UI) wouldn't update the canonical repository; the repo maintainer would need to push that change to the canonical repository, potentially dealing with a second round of merge conflicts if that repo's master branch has moved on. Sounds a bit complex :-)

There's been some discussion about using Gerrit as a front-end for all the repository activity, and I'd definitely support that move. The GitHub repos would then just be a distribution/forking point, but the workflow would be through Gerrit.

On 11/11/2013 11:05 PM, Luca Milanesio wrote:
> Seems a very good idea, it is basically a remote audit trail.
>
> The only concern is the throttling on the GitHub API: it would be better
> then to do the scripting on a local mirror of the GitHub repos. When you
> receive a forced update you do have anyway all the previous commits and
> the full reflog.

With respect to throttling, the events API is designed for polling [1],
so we just need to poll the events for the entire jenkinsci org [2] and
we'll have the whole history.
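
As a rough sketch of what the polling program would do with each response, the function below pulls ref updates out of an events listing. It makes no network call and takes the JSON body as a string. The field names (`type`, `repo.name`, `actor.login`, `created_at`, `payload.ref`, `payload.before`, `payload.head`) reflect my understanding of GitHub's documented event format and should be checked against the current API documentation; the function name is hypothetical.

```python
import json


def ref_updates_from_events(events_json):
    """Extract ref-update records from the PushEvent entries of an events listing."""
    updates = []
    for event in json.loads(events_json):
        if event.get("type") != "PushEvent":
            continue  # stars, issues, comments etc. are not ref updates
        payload = event.get("payload", {})
        updates.append({
            "repo": event["repo"]["name"],
            "ref": payload["ref"],
            "before": payload["before"],
            "after": payload["head"],
            "user": event["actor"]["login"],
            "time": event["created_at"],
        })
    # Assumption: the API lists newest events first; reverse for log order.
    updates.reverse()
    return updates
```

Branch and tag deletions appear to be reported as a separate event type rather than as pushes, so a real implementation would need to handle those as well; this sketch ignores them.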

We already do an equivalent of local mirrors of the GitHub repos in
http://git.jenkins-ci.org/. The problem is that reflogs do not record
remote ref updates, so it will not protect against accidental ref
manipulations.

It does help however for the purpose of retaining commit objects, so we
need to keep this.

> However as you said by being triggered via web hook the number of API
> calls can be reduced to the minimum.
>
> I would submit a proposal to the Git mailing list of a "fetch by SHA1"
> which is a missing feature in Git IMHO.

My recollection is that this was intentional for security reasons, so
that if a push is made accidentally and then removed, those
objects are no longer accessible.

I think what's useful and safe is to allow us to create a ref remotely
on an object that doesn't exist locally. Again, the transport level
protocol allows this, so it'd be nice to expose this.

> Thanks to everyone including GitHub for the help and cooperation in
> getting this sorted out !!

Yes, it would be nice to allow people to remove themselves from push permissions on the repos they do not use.
For instance, I normally push to no more than 5-6 repos, so I should be able to restrict myself to those only.

We need to run some tests on the scalability of the events API because:
1) we need to monitor over 1000 repos (one call per repo? one call for all?)
2) when monitoring the entire jenkinsci org, 300 events might not be enough in case of catastrophic events

Working at the webhook level? I'll investigate the reliability / scalability of the API further (on a series of *test* repos *OUTSIDE* the Jenkins CI organisation)

Hmm, I don't fully understand the Maven implication of such a setup, but
there's a whole lot more to switching canonical repositories from one
location to another than mass-updating pom.xml, such as communicating,
infra managing, pull requests, access control and backup, that I'm
pretty certain it's not as easy as you make it sound...

And I'm not yet sensing the appetite in the community for moving away
from GitHub.

On 11/12/2013 02:16 AM, Stephen Connolly wrote:
> I think part of the issue is that our canonical repositories are on
> github...
>
> I would favour jenkins-ci.org being masters of
> its own destiny... hence I would recommend hosting canonical repos on
> project owned hardware and using GIT as a mirror of those canonical
> repositories... much like the way ASF uses GIT. That would allow us to
> implement policies such as preventing forced push to specific branches,
> etc...
>
> Of course that would be another pom.xml <scm> update change, namely the
> <developerConnection> would point to the canonical repo while the
> <connection> would point to the github repo... (with some use of
> http://developer.github.com/v3/users/keys/#list-public-keys-for-a-user
> we should be able to let users just register their keys in github)
>
> e.g. the <scm> details would look like:
>
> <scm>
> <connection>scm:git:git://github.com/jenkinsci/[plugin

> <developerConnection>scm:git:git.jenkins-ci.org:jenkinsci/[plugin
> name]-plugin.git</developerConnection>
> <url>http://github.com/jenkinsci/[plugin name]-plugin</url>
> </scm>
>
> Maven will then do the "right thing" for pushing releases *even if you
> checkout from github*... and we just have the canonical repos force push
> to github and put proper permission sets on the canonical repos... most
> developers will thus see no effective difference :-)
>
>
> On 12 November 2013 06:25, Kohsuke Kawaguchi <k...@kohsuke.org

I think this was an exception and we should treat it as such…
Sure, this could happen again, but by doing some backups we should be fine. Maybe we had better ask GH why they provide the feature to block forced pushes only in their enterprise solution.
/Domi

- Our IRC bot would put users into the "pre-approved" team, which by
itself doesn't grant access to any repositories, but is used to keep
track of who can add/remove themselves to other repositories.

- We'll improve http://jenkins-ci.org/account to allow people in the
"pre-approved" team to add/remove themselves to "Everyone" team
(which grants access to all the repos) and all the individual plugin
repos independently.

  So if, like me, you want to maintain access to all the repos,
  you can, but if you only want to work on a small number of repositories
  you can do it that way, too.

  This has the benefit of not getting bombarded by notification e-mails
  for repositories you don't care about.

I think this is actually tangential to the commit loss prevention, as I
can make the same mistake Luca did and mass update all the remote refs,
so we still need a measure to protect us from that.
--
Kohsuke Kawaguchi http://kohsuke.org/

On 11/13/2013 11:58 PM, Luca Milanesio wrote:
> We need to make some tests on the scalability of the events API because of:
> 1) need to monitor over 1000 repos (one call per repo ? one call for all ?)
> 2) by monitoring the entire jenkinsci org, 300 events could be not enough in case of catastrophic events

The good news is that a push that removes/alters refs also takes time.
I have the notification e-mails from your push to 186 repos, and they span
over an hour.

So I'm hoping that polling 300 events every minute would cover us pretty
well. And like you say, a webhook can help us reduce this window down
even further.

There's another reason I'm optimistic about this scheme.

Suppose you are maliciously trying to cause data loss. If we are
regularly recording refs, you have to mount an attack immediately after
some commits go in so as to overwhelm the 300 event buffer, then keep
that saturation going so that your ref updates/removals will also be
dropped from the event buffer. And even with this much effort you can
only cause the data loss of the commits that went in right before yours.

So I think it makes the attack so ineffective that we can tolerate that
risk, and I find it unlikely that any accident will look like this.

> On 11/13/2013 11:58 PM, Luca Milanesio wrote:
>> We need to make some tests on the scalability of the events API because of:
>> 1) need to monitor over 1000 repos (one call per repo ? one call for all ?)
>> 2) by monitoring the entire jenkinsci org, 300 events could be not enough in case of catastrophic events
>
> The good news is that the push that removes/alters refs also take time. I have the notification e-mail from your push to 186 repos, and it spans over an hour.

True, although possibly the notifications took an hour while the push itself was pretty fast, but still only around 25 / min. 300 events per minute should then be plenty :-)
The only way to go over that limit is parallel pushes by multiple accounts ... but that, I would say, is very unlikely.
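
The headroom argument can be written out as simple arithmetic. The figures (a 300-event buffer, roughly 25 pushes per minute, polling once a minute) come from this thread; the function name is hypothetical, and the sketch assumes that pushes are the only events filling the buffer, which is optimistic since other activity in the org shares it.

```python
def events_buffer_headroom(buffer_size, events_per_minute, poll_interval_minutes):
    """Return (events arriving between polls, longest safe poll interval in minutes)."""
    per_poll = events_per_minute * poll_interval_minutes
    # Past this interval, the oldest events would fall out of the buffer unseen.
    max_interval = buffer_size / events_per_minute
    return per_poll, max_interval


per_poll, max_interval = events_buffer_headroom(300, 25, 1)
print(per_poll, max_interval)  # 25 12.0
```

Under those assumptions a one-minute poll sees about 25 new events against a buffer of 300, and polling could lag by up to 12 minutes before anything is missed.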

>
> So I'm hoping that polling 300 events every minute would cover us pretty well. And like you say, a webhook can help us reduce this window down even further.

Yep.

>
>
> There's another reason I'm optimistic about this scheme.
>
> Suppose you are maliciously trying to cause data loss. If we are regularly recording refs, you have to mount an attack immediately after some commits go in so as to overwhelm the 300 event buffer, then keep that saturation going so that your ref updates/removals will also be dropped from the event buffer. And even with this much effort you can only cause the data loss of the commits that went in right before yours.
>
> So I think it makes the attack so ineffective that we can tolerate that risk, and I find it unlikely that no accidents will look like this.