Automatically Updating Git Repositories

By Doug Kelly on December 10, 2013 5:00 PM

Part of what I do for real life is managing several version control servers, including Gerrit and Gitolite. Now, this in itself has its own fun problems, but recently, in the midst of my routine auditing, I saw a lot of activity on the server on a holiday weekend. This had me thinking: who works on a holiday? Even more interesting, the requests appeared to come at regular intervals, which led me to think it was some automated process. I asked both users what was going on, and they explained they were using Atlassian SourceTree (a rather beautiful Git frontend, I might add). Apparently, it defaults to updating its repositories every 10 minutes.

Now, in a centralized VCS, you might ping the status of your files and see if you're outdated (and need to update before you commit), but with distributed version control, each of these updates is a fetch: you're actually pulling down the latest and greatest code all the time (all without changing your working tree). In a busy repository, this can show when you're working on an outdated file - maybe pretty useful. But, at the same time, I think there are a few negative sides that need to be considered.

Code Churn

Yep. If your repository is really so active you need to know where it is at all times, you're going to spend a lot of your time needlessly rebasing your change to keep it at the tip of development. Now, that's not to say you shouldn't do this sometimes, but particularly when I've been writing a feature in an active repository, I write it on an isolated version that I know is working well for what I need, test everything, then when my feature is complete, I can update and test again. If it's broken now or I have a merge conflict, I have two points I can easily bisect between to find and fix the problem.

By the same token, this is the reason why Git makes it so simple to branch and merge: you don't need to be working alongside everyone else - merging can be a much more coordinated effort, but when developing a feature, it can be done in isolation (for the most part). If you join the camp of Git Rebasers, they would say this is messy and you should always rebase your change (for a more detailed discussion, Git team workflows: merge or rebase? is a good explanation of the two sides).

Personally, if I'm doing a simple change that doesn't require an independent feature branch, I may write my change, and before submitting it for review on Gerrit, I'll rebase once just to make sure I'm in sync with the world and there are no conflicts before other developers spend time reviewing it. There, now I don't need to worry about a bunch of churn from my side, and I didn't ping the server continuously.
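
For what it's worth, that "rebase once, then submit" pass can be sketched in a few lines of Python. This is only an illustration under some assumptions: the remote is named origin, the target branch is master, and the change goes to Gerrit through its refs/for/<branch> push target. The function names (review_commands, sync_and_submit) are made up for this example.

```python
import subprocess
from typing import List


def review_commands(remote: str = "origin", branch: str = "master") -> List[List[str]]:
    """Return the git commands for one 'sync, then submit' pass.

    The remote and branch names are assumptions; adjust them for your setup.
    """
    return [
        # One fetch, on demand, instead of polling every few minutes.
        ["git", "fetch", remote],
        # Replay the local change on top of the freshly fetched tip.
        ["git", "rebase", f"{remote}/{branch}"],
        # Push to Gerrit's review ref for the target branch.
        ["git", "push", remote, f"HEAD:refs/for/{branch}"],
    ]


def sync_and_submit(remote: str = "origin", branch: str = "master") -> None:
    """Run the commands in order, stopping at the first failure."""
    for cmd in review_commands(remote, branch):
        # check=True raises CalledProcessError on a non-zero exit status,
        # so a rebase conflict stops the run before anything is pushed.
        subprocess.run(cmd, check=True)
```

Calling sync_and_submit() from inside a working tree contacts the server twice (one fetch, one push), and only when you ask it to.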

Destructive History

This is something I have to worry about less: we intentionally block all operations via Gerrit and Gitolite permissions that would be destructive to history. But, there have been a number of documented cases where users have inadvertently reverted months of history through a simple force push. For a few examples, Git Patterns and Anti-Patterns documents the specific case where Eclipse accidentally wiped out some history. Ironically, Luca Milanesio (the author of the Git Patterns and Anti-Patterns refsheet) accidentally force-pushed 186 Jenkins plugin repositories on GitHub. The force-push is the enemy of automatic updates, since your remote would be rewound to the point that the force-push took it, and you'd have no way of knowing the rewind happened (aside from looking at the reflog, but see the note under "Pattern #9" on the patterns/anti-patterns). In either of these cases, if you happened to fetch manually, at least you'd see that a forced-rewind happened, and you'd have a chance to investigate. In a scenario where data has been lost, sometimes a developer's repository may be useful in disaster recovery - after all, a Git repository is a Git repository. That said, you should still have other disaster recovery methods in place, but it's always good to have other options, too.
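
To make the "at least you'd see it" point concrete, here is a rough Python sketch that scans the human-readable summary git fetch prints and picks out forced updates. It assumes the usual summary format, where a forced update is flagged with a leading + and a three-dot range (a fast-forward has no flag and a two-dot range). That output is meant for people rather than scripts, so treat the parsing as illustrative; the function name and the sample text are invented for the example.

```python
import re
from typing import List, Tuple

# Matches a forced-update line such as:
#  + 9f8e7d6...0a1b2c3 master     -> origin/master  (forced update)
_FORCED = re.compile(
    r"^\s*\+\s+([0-9a-f]+)\.\.\.([0-9a-f]+)\s+(\S+)\s+->\s+(\S+)"
)


def forced_updates(fetch_output: str) -> List[Tuple[str, str, str]]:
    """Return (local_ref, old_sha, new_sha) for each forced update found."""
    found = []
    for line in fetch_output.splitlines():
        match = _FORCED.match(line)
        if match:
            old, new, _src, dst = match.groups()
            found.append((dst, old, new))
    return found


# Invented sample output: one fast-forward and one forced update.
sample = (
    "From example.com:project\n"
    "   1a2b3c4..5d6e7f8  stable     -> origin/stable\n"
    " + 9f8e7d6...0a1b2c3 master     -> origin/master  (forced update)\n"
)

for ref, old, new in forced_updates(sample):
    print(f"{ref} was rewound or rewritten: {old} -> {new}")
```

When fetching by hand, you would see that + line yourself. To feed real output into a script like this, note that git fetch writes the summary to standard error rather than standard output. A background fetch in a GUI gives you neither unless the tool chooses to surface it.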

Conclusions

At the end of the day, is the automatic update feature useful? I don't think so. While there are some limited scenarios where it could make things better, I also see a whole lot of cases where it can do more harm than good, even though the incremental fetches are rather inexpensive from a server perspective. Interestingly, I've pulled myself out of several pinches by looking at a Gerrit replication mirror and restoring content from that. That works because deletes are never pushed to the mirror (only new content), and non-fast-forwards are rejected. As a developer, you should probably not need to anticipate some of these "worst case" scenarios, but the time lost by rebasing your change indefinitely is real, and I would advise against even that unless you need to. Gerrit has options to resolve conflicts if it can (i.e. trivial rebase), or you may not even need to rebase at all. Tools like Gerrit and Jenkins are there to make your life easier by worrying about the code compiling and merging so you don't have to - second-guessing them by posting rebases all the time can only waste your time (and CI time).