2006-12-31

It seems that distributed version control has become a somewhat hot topic lately. Recent posts make the case that being able to work offline is extremely useful, both for road warriors and for users with less than ideal Internet access. Yes, this does seem like a pretty good motivation for distributed version control. Indeed, it was my laptop and my dialup connection (aside from sheer curiosity) that first got me using darcs two years ago. But now I have one of those fancy ADSL connections and less need to travel or hack offline while doing so. Yet I continue to use and love darcs. I'm sure this is something that bzr, git, mercurial, etc. users can attest to: yes, offline versioning is indeed a great feature, but there is something more.

Warning: this is a rather long post. My apologies to planet haskellers and other busy readers.

One mechanism, many benefits

The thing that attracts me to a system like darcs is its conceptual elegance. From one single mechanism, you get the following features for free:

Painless initialisation

Offline versioning

Branching and merging

Easier collaboration with outsiders

These are all the same thing in the darcs world, no fanciness at work whatsoever. I suppose it's not very convincing to sell simplicity in itself, so in the rest of this post I'm going to explore these four benefits of a distributed model and discuss what their implications might be for you, the developer.

Painless initialisation

Getting started is easier because you don't have any central repositories to set up. That might sound awfully petty. After all, setting up a central repository is nothing more than a mkdir and cvs checkout. But it's much more of an inconvenience than you might think.

Setting up a central repository means you have to think in advance about putting your repository in the right place. You can't, for instance, set something up locally, change your mind and decide that you want a server, then switch over instantaneously. You COULD tarball your old repository, move it to the server, and either fiddle with your local configuration or check out your repository again. But why should you? Why jump through all the hoops? The steps are simple, but they add friction. How many times have you NOT set up a repository for your code because it would have been a pain (a 30-second pain, but a pain nonetheless)? How many times have you put off a repository move because it was a pain? Painless initialisation means two things: (1) instant gratification and (2) the ability to change your mind. I would argue that such painlessness is crucial because it brings down the barrier of inconvenience to the point where you actually do the things you are supposed to do.
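To make the contrast concrete, here is roughly what the friction-free path looks like with darcs itself. The file names and the server path are invented for illustration, and the sketch skips itself gracefully on machines without darcs installed:

```shell
# skip gracefully if darcs is not installed
command -v darcs >/dev/null 2>&1 || exit 0

# start versioning on a whim: no server, no planning
mkdir myproject
cd myproject
darcs init                       # the whole repository lives here, in _darcs/
echo 'first draft' > notes.txt
darcs add notes.txt
darcs record -a -m 'initial import' -A me@example.net
cd ..

# change your mind later and want a server copy? just push the patches;
# nothing about the local repository needs to be reconfigured
# (the server path below is hypothetical)
# darcs push me@example.org:/srv/darcs/myproject
```

Nothing above required deciding in advance where the repository would ultimately live.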

Branching and merging

A well thought out distributed version control system does not need a separate notion of branching and merging. Why? Because a branch can simply be exactly the same concept as a repository, as a checkout. No need to learn two sets of concepts or operations, or two views of your version control universe. Just think of them as one and the same. Now, you might be worried about, say, the redundancy of all this (gee! wouldn't that mean that branches take up a lot of space?)... but eh... details.
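A minimal sketch of branch-as-repository, using invented repository names (and skipping itself where darcs is unavailable):

```shell
# skip gracefully if darcs is not installed
command -v darcs >/dev/null 2>&1 || exit 0

# a small repository to branch from
mkdir mainrepo
( cd mainrepo \
  && darcs init \
  && echo 'version one' > notes.txt \
  && darcs add notes.txt \
  && darcs record -a -m 'initial version' -A me@example.net )

# "branching" is just darcs get: the branch IS a full repository
darcs get mainrepo experiment
( cd experiment \
  && echo 'an experimental tweak' >> notes.txt \
  && darcs record -a -m 'try a tweak' -A me@example.net )

# and "merging" is just darcs pull
( cd mainrepo && darcs pull -a ../experiment )
```

The same two commands, get and pull, are also how you clone a project and fetch upstream changes; that is the whole point.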

For starters, disk space is cheap, at least much cheaper than it was in the past. There might be cases where you are trying to version very large binary files, but for many programming jobs we are only shuffling text around, so why worry? Besides, branches are supposed to be disposed of (merged) one day or another, right? It's not as if they're going to be that long-lived; otherwise it's just a fork. Moreover, worrying about disk space is the version control system's job, not yours. You could have a VCS that tries very, very hard to save space. For example, it could use hard links whenever possible, in much the same manner as "snapshot" backup systems. Disk space is not what you, the programmer, should be worrying about. It's similar to the case being made for having a second or third monitor: programmer time is more valuable than disk space.
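The hard-link trick is easy to see with plain Unix tools: two directory trees can share a single on-disk copy of every unchanged file. A tiny illustration, with invented file names:

```shell
# two "repositories" sharing storage for an identical file
mkdir -p repo-a repo-b
echo 'shared content' > repo-a/file.txt
ln repo-a/file.txt repo-b/file.txt    # hard link: same inode, no extra data

# POSIX test -ef is true when both names point at the same file on disk
if [ repo-a/file.txt -ef repo-b/file.txt ]; then
    echo 'one copy on disk, two places in the tree'
fi
```

A space-conscious VCS can do exactly this when it makes a local branch, breaking the link only when a file actually diverges.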

Offline versioning

Previous posts have discussed this at length. It's still useful, even if your Internet connection is superb. It's useful because it lets you hold on to things that aren't quite ready for the central repository, but worth versioning until you're more confident about your code.

Collaboration with outsiders

Open source and free software projects thrive on contributions from outsiders. For example, in the past year, 80% of the 360 patches to the darcs repository have come from somebody other than David Roundy, the original author of darcs. I'm cheating a little bit because many of these patches are also from known insiders. Then again, all of the known insiders were outside contributors at some point. The switch from outsider to insider has been for the most part informal: you send enough patches in and people eventually get to know you. And that's it; very little in the way of formal processes.

Outsider collaboration is made easier for two reasons: offline versioning and decentralisation.

By offline versioning, I mean that people can make their modifications locally, retrieve changes from the central repository and still have their modifications intact, versioned and ready to go. Consider a popular project, like the mail client mutt. Some mutt users have patches that are useful for a few people, but not appropriate for the central repository. So they make their changes available in the form of a Unix patch. If you're lucky, the patch applies to the version of mutt that you've downloaded. If you're not so lucky, you've got some cleaning up to do and a new set of patches. I'm not talking about merging or conflict resolution, per se. Assume the conflict resolution requires human intervention. You've fixed things so that it compiles against the new version. What do you do exactly? Make a patch to the patched version? "Update" the original patch so that it works with the new version of the repository? And what about the original author, what does s/he do with your patch? These kinds of things are not so hard in themselves, but they are a major source of friction. They gum up the works of free software development, or any large project, open source or closed.
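For concreteness, the manual workflow being described looks something like this with plain diff and patch (directory names invented for illustration):

```shell
# the upstream tree and your locally modified copy
mkdir -p mutt-orig
echo 'original line' > mutt-orig/init.c
cp -r mutt-orig mutt-mine
echo 'my local feature' >> mutt-mine/init.c

# the patch you would publish (diff exits nonzero when the trees differ)
diff -ru mutt-orig mutt-mine > feature.patch || true

# a downstream user applies it, hoping their tree matches yours;
# when upstream moves on, this is exactly the step that starts to break
mkdir -p downstream
cp -r mutt-orig downstream/mutt
( cd downstream/mutt && patch -p1 < ../../feature.patch )
```

Every party in this exchange is managing raw diffs by hand; none of them carries any version history.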

If you are a project maintainer, having a tool that handles offline versioning means that it is easier for you to accept changes from outside contributors (zero insertion force - no need to apply patches and re-commit them).

If you are a contributor, having an offline versioning tool means that it's easier for you to submit modifications to the project. You don't have to manually create patches: you don't have to keep around a clean and working copy of the project, you don't have to worry about where you do your diffs (so that the patch --strip options come out right), and you don't have to worry about what happens when the central repository changes and your patch no longer applies. Again, I'm not referring to conflict resolution. If there are conflicts, somebody will have to resolve them; but the resolution and versioning of these conflicts should involve as little bureaucracy as possible. For extra credit points, some version control systems even implement a "send" feature in which you submit your local modifications via email. The maintainers of the repository can then choose to apply the patch at their leisure. These aren't regular Unix diff patches, mind you; they are intelligent patches with all the version-tracking goodness built in.
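In darcs this is the darcs send command. A sketch with invented names: normally --to mails the bundle to the maintainer, but -o writes it to a file instead so you can see what actually travels (the sketch skips itself where darcs is unavailable):

```shell
# skip gracefully if darcs is not installed
command -v darcs >/dev/null 2>&1 || exit 0

# the maintainer's repository and your own copy of it
mkdir upstream
( cd upstream && darcs init )
darcs get upstream mine
( cd mine \
  && echo 'the fix' > fix.txt \
  && darcs add fix.txt \
  && darcs record -a -m 'fix the frobnicator' -A me@example.net )

# bundle everything upstream lacks; --to would mail it instead of -o
( cd mine && darcs send -a --dont-edit-description \
    -o ../frobnicator.dpatch ../upstream )
```

The resulting bundle is a real darcs patch, context and all, which the maintainer can apply directly rather than re-creating your change from a raw diff.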

Offline versioning adds convenience to the mix, a technical benefit. If you flip it around and look at it in terms of distributed control, you can see some pretty subtle social consequences as well. Since there is no need for a central repository, there is a lot less pressure for the central maintainer to accept patches or reject them outright, because you know that the outside contributors can get along fine with their modifications safely versioned in their local repositories. Worst comes to worst, the outside contributors can place their repositories online and have people work from there instead. It sounds like a fork, which sucks... but sometimes, fork happens. Look, sometimes you get forks from differences in opinion, disagreements between developers, or general unpleasantness. But sometimes you get more innocent forks: for example, the main developer suddenly gets a new job and is now working 60 hours a week. S/he is still committed to the project, but to be honest s/he hasn't been looking at patches for the last month. No big deal; the rest of us will just work from this provisional repository until the main developer gets back on his/her feet. There's a social and a technical aspect to forking. Distributed version control greatly simplifies the technical aspect, and that in turn mellows out the social one. Distributed version control means that life goes on.

Simplicity and convenience

I'm really only making two points here. Simplicity matters. It reduces the learning curve for newbies and removes the need for experienced users to carry mental baggage around. Convenience matters. It reduces the friction that leads you to put off the things you could be doing, and it removes some of the technical barriers to wide-ranging collaboration. I could always be mistaken, of course. Perhaps there is some bigger picture, some forest to my trees; and upon discovering said forest I will find myself deeply chagrined at having got all worked up over something so silly as patches. But until that time, I will continue to use darcs and love it for how much easier it makes my life.

2 comments:

Anonymous said...

Good article with good points.

I would like to add that some of the best features found in darcs only become visible after you get past the first part on distributed version control. For example, interactive and block-level commits help you structure your changes into patches for specific things like debugging code, logical changes, and trivial changes (whitespace, and the like). I can then push my patches out while omitting the debug stuff, or lend them to a friend or co-worker who might be interested in the patch.

You can then go steps further with the offline mode and use it as a nice mechanism for cleaning out unwanted patches. For example, I might be solving a bug. I will record (aka commit) patches for all my changes, but some will be test probes and others will be attempted fixes. Some of the fixes may not work, so I will then obliterate them w/o messing around in a main repo. Once I am ready, I can push a working patch set out and obliterate any remaining trash I might have, like debug code, or maybe just move those patches to an archive and get back to work with a clean repo.

Stuff like that only really works once you leave the old trunk behind. I've actually tried a number of different methods to get something similar with subversion but it just never works the same or takes a huge amount of effort and coordination.

Granted, darcs has its issues, but they seem to be advertised out of proportion. I've not had many problems with any of my darcs projects given my careful development practices... but I can say the following are notable cons:

* GHC to compile darcs. This isn't too bad for me, but for others it is a large and long wait to get this done. GHC is quite portable but still doesn't support all the platforms I come in contact with, most notably FreeBSD x86_64.

* Darcs doesn't warn you when it is going to go in deep for patch dependency tests, so for those rare cases where it does dive deep, you just sit and stare at zero output. I wish it would give some notes on what it is doing when it takes more than just a few ops to do something.

* Darcs doesn't store a whole lot about the file other than position, name, and location. It would be nice to have executable flags, symlinks, etc. Most of the time this isn't a problem, but it can be annoying.

* No subdirectory checkout. This is not as much a con as it is a trade-off. Darcs has one single location where it stores repo info, at the root of the project tree. This is unfortunate for those who want a modular repo.

* No server-side component yet. Something like HTTP push would be great.

Despite those points though, I consider darcs having far fewer cons for my own uses than many of the popular centralized systems.

Thanks for the comment, anonymous. You might want to know about darcs-server, which may help you to work around the lack of a server-side component.

Also, there are some proposals that might allow for subdirectory checkout, but you'll have to check the mailing list for them, and they aren't likely to move unless somebody cares enough to actually build them.

My top two wishlist items would be: (1) the new conflict handling code (for the well-known problems) and (2) HalFS, to protect against corruption from Unison, rsync, fancy IDEs, etc.

Seamless SVN integration of some sort would be nice, i.e. being able to treat an SVN repository as just another push-pullable location. But I haven't thought much about how such a thing might work.