Friday, March 02, 2012

The Git Workflow

When the PostgreSQL project decided to migrate to git, we decided not to allow merge commits. A number of people made comments, in a number of different fora, to the effect that we weren't following "the git workflow". While a few commentators seemed to think that was reasonable, many thought that it demonstrated our ignorance and stupidity, and some felt it outright heresy.

So I noted with some interest Julio Hamano's blog post about the forthcoming release of git 1.7.10, which is slated to include a change to the way that merge commits work: users will now be prompted to edit the commit message, rather than just accepting a default one. Actually, what I found most interesting where Linus Torvalds' comments on this change, particularly where he says this: "This change hopefully makes people write merge messages to explain their merges, and maybe even decide not to merge at all when it's not necessary." His comments are quoted more fully in the above-linked blog article; unfortunately I don't know how to link directly to his Google+ post. And Julio Hamano makes this remark: "Merging updated upstream into your work-in-progress topic without having a good reason is generally considered a bad practice. [...] Otherwise, your topic branch will stop being about any particular topic but just a garbage heap that absorbs commits from many sources, both from you to work on a specific goal, and also from the upstream that contains work by others made for random other unfocused purposes."In other words, if you merge more often than really necessary, the commit log will become a cluttered mess. So don't merge into your topic branches.

But, of course, there's a pretty obvious reason why people might want to merge into their topic branches regularly: if you develop your patch based on an older version of the source code, you might not find out until much later that you have merge conflicts. You might write a bunch of code that seems OK, and then only discover later that it can't be merged cleanly. Or, worse, the merge might succeed even though the code is wrong. We've had a number of cases, during PostgreSQL development, where a developer uses one piece of code as a model for a new one; the original code gets fixed before the new code is committed, but the new code still contains the bug. Developing against an older version of the source code increases the probability of errors of this type.

So, there's a trade-off here. In the comments from Linus and Julio, there's a tacit admission that too many merges are not a good thing, and yet there are obvious advantages to frequent merging. I understand and appreciate the critiques that have been made about the PostgreSQL process: rebasing and squashing commits loses history that someone might want to see; on the other hand, it also squeezes our irrelevant history that no one cares about, and makes the development history easier to follow. Part of the reason we can get away with is that Linux has had 50,071 commits in the last year, whereas we have 1,849. Had we twenty-five times as much activity as we do, it's likely that our workflow would have to change in a variety of ways, and this might well be one of them. Nevertheless, I can't help thinking that it's a mistake to believe that there is one perfect solution here. As good as git is, maybe there's a better technological solution here; or maybe this is just a hard problem.

> Developing against an older version of the source code increases the probability of errors of this type.

This should never happen. If you call `master` the branch containing the latest development, then you want to rebase the upstream master into your feature branch very regularly. I personally have a git alias for this called `git up` and it is simply `git fetch && git rebase origin/master`. This solves the divergence that occurs on a feature branch with regards to the main development of the project. I invoke this very often, pretty much after every commit I make, so that I'm not in a situation where large conflicts occur as part of a rebase, only small ones may occur, and those are very easy to fix incrementally because you were literally *just* working on the conflicting code.

During development of the feature branch it's useful to commit very often, but I agree it creates a very noisy commit history, although admittedly in most projects I prefer the raw history and hardly ever squash.

The postgres workflow is in my opinion not great when it comes to submitting and reviewing patches. Every patch loses all history of its development. Git branches would make that far easier and more convenient since it's trivial to view diffs, merge it, squash it, commit on top of it, or whatever you may desire before it's ready for mainline. Further, when patches go through several versions as part of the patch review process, each new version contains a monolithic patch where it is very hard to tell what has changed since the last review (patchV1, patchV2, ..., patchVn).

On projects like PostgreSQL, and others much smaller, I really value the clean history that comes from not using merge commits. With that workflow, the origin/master history is a straightforward record of the main community codebase as it was committed, not the sometimes circuitous history the developer took to get to that point.

But like you I don't understand why I wouldn't frequently merge or rebase my topic branch to make sure I'm not developing against a defunct codebase. No matter how messy my topic branch becomes, I can contribute a clean patch by doing a squashed merge and cherry-pick that onto master if I really want no record of "merging" at all.

In other words, it seems like there's not a single Git workflow, so much as several, and I like PostgreSQL Git workflow fine.

Well, nobody can force anyone to do a merge commit. I use commits on topic branches as a way of checkpointing code during development. Those intermediate commits should be of zero interest to anyone else, and are almost always of no interest to me once I'm done with the topic branch. I certainly don't see any reason to clog the PostgreSQL commit history with them, and if anyone wants to call that heresy or make any of the other silly remarks that have been referred to I don't care. Yes, being able to edit the message might be of value in some cases, but not enough to make me abandon my work practice.

I've switched entirely to rebasing feature branches for the same reason outlined - I much prefer checking in clean commits tested against the current HEAD rather than hoping everything went right with a long-running merge.

You can enable git pull --rebase by default per branch or by setting the global branch.autosetuprebase option.