Distributed Version Control Systems: A Not-So-Quick Guide Through

Since Linus Torvalds presentation at Google about git in May 2007, the adoption and interest for Distributed Version Control Systems has been constantly rising. We will introduce the concept of Distributed Version Control, see when to use it, why it may be better than what you're currently using, and have a look at three actors in the area: git, Mercurial and Bazaar.

What?

A Version Control System (or SCM) is responsible for keeping track of several revisions of the same unit of information. It's commonly used in software development to manage source code project. The historical and first project VCS of choice was CVS started in 1986. Since then many other SCM have flourished with their specific advantages over CVS: Subversion (2000), Perforce (1995), CVSNT (1998), ...

In December 1999, in order to manage the mainline kernel sources, Linus chose BitKeeper described as "the best tool for the job". Prior to this Linus was integrating each patch manually. While all its predecessors were working in a Client-(Central)Server model BitKeeper was the first VCS to allow a truly distributed system in which everybody owns their own master copy. Due to licensing conflicts, BitKeeper was later abandoned in favor of git (Apr, 2005). Other systems following the same model are available: Mercurial (Apr, 2005), Bazaar (Mar, 2005), darcs (Nov, 2004), Monotone (Apr, 2003).

Why?

Or a more precise question: Why Central VCS (and notably Subversion) are not satisfying?
Several things are blamed on Subversion:

Major reason is that branching is easy but merging is a pain (but one doesn't go without the other). And it's likely that any consequent project you'll work on will need easy gymnastic with splits, dev, test branches. Subversion has no History-aware merge capability, forcing its users to manually track exactly which revisions have been merged between branches making it error-prone.

No way to push changes to another user (without submitting to the Central Server).

Subversion fails to merge changes when files or directories are renamed.

The modern DVCS fixed those issues with both their own implementation tricks and from the fact that they were distributed. But as we will see in conclusion, Subversion did not resign yet.

How?

Decentralization

Distributed Version Control Systems take advantage of the peer-to-peer approach. Clients can communicate between each other and maintain their own local branches without having to go through a Central Server/Repository. Then synchronization takes place between the peers who decide which changesets to exchange.

This results in some striking differences and advantages from a centralized system:

No canonical, reference copy of the codebase exists by default; only working copies.

Disconnected operations: Common operations such as commits, viewing history, diff, and reverting changes are fast, because there is no need to communicate with a central server. Even if a central server can exist (for stable, reference or backup version), if Distribution is well used it shouldn't be as much queried as in a CVCS schema.

Each working copy is effectively a remoted backup of the codebase and change history, providing natural security against data loss.

You should also be aware that there are some disadvantages in opting for DVCS, notably in term of complexity; This decentralized view is very different from Central world and it might need some time to get used to for your developers. Changeset tracking instead of file tracking can also be confusing even if very powerful and making it theoritically possible to track method move through file.

Who?

The battle rages on! Some of the Good and the Bad.

The good and the bad essentially from an updated (because some old arguments are not true anymore) compilation of blogs and my personal experience.You should notice that it is a very short list of features (ie git has more than 150 commands), and some issues might be more critical than others.

With more than 150 binaries it's hard not to find the killing command you always dreamt of (even if this increases complexity)

Complexity

Bzr pretends to hide complexity by keeping a clean User Interface while adapting to the different collaboration workflows and their evolution in a team.

Revision Naming

Git revisions are SHA-1 making it less userfriendly when doing a diff between two revisions. This was chosen to guarantee safety and integrity of data and also happens to avoid collision when merging with other peers.

Simple naming

Simple revision id naming r1, r2, etc...

Commands

Familiar, with some specificities like rename command which differs from other SCM (won't be changed because of backward compatibility).
git is the most advanced SCM in term of commands but if you add all possible commands and their options, you end up with a huge number of possibilities that it's hard to master.
The fact that such tool like Easy Git exists means that Git can be considered quite complex.

Smallest Market Share: Apart from Canonical projects (Ubuntu, Launchpad), no big names are using it yet. Bazaar is also less well-known.

Linux kernel, Cairo, Wine, X.Org, Rails, Rubinius, Compiz Fusion

Xine, OpenJDK, OpenSolaris, NetBeans, (Part of) Mozilla

(Part of) Ubuntu, Launchpad

Documentation

Good documentation. Very good man pages (with a lot of examples)

Good documentation.

Good documentation.

Platforms

Poor Windows support

Cross Platform

Cross Platform

Additional Misc. Good Points

git is scriptable over pluginable which is a good and bad point (easy entry point through script, all tests are done in bash script by the way).
Very tunable for advanced administration: staging area, dangling objects, detached heads, plumbing vs porcelain, reflogs.
Local branches are possible.

git-stash (when interrupted for a quick bug-fix on another project), git-cherry-pick (for picking only single commits, rather that complete branches), git-rebase (to forward port your local commits. Quilt-like changeset like Mercurial Queues).git add -i (equivalent of Mercurial RecordExtension).
There's hardly a command you dreamt of that git doesn't have.

RecordExtension (it lets you choose which parts of the changes in a working directory you'd like to commit).Hg Shelve Extension (same as git-stash: to interactively select changes to set aside).

Shelf plugin (same as git-stash: when interrupted for a quick bug-fix on another project, latest rel on Jan. 2007).bzr-dbus (for broadcasting hooks and revisions).

Additional Misc. Bad points

Renaming not handled as good as bzr (Test Case).
Read-only static HTTP setup is a bit obtuse (--bare and update-server-info).
Handling of Unicode (UTF-16 encoded) files.
Storage Model. Git stores each newly created object as a separate file that can be packed into a single file delta compressed between each. Forces to do administration and launch pack command on a regular basis.
Mixture of C, Perl and bash script, which makes it far less obvious to port to other systems while maintaining the same feature set.

Renaming not handled as good as bzr. [FIXED]
Local branches are not possible, clone is used instead.
To avoid lost of space, Hg use hardlinks making problem when pull (and also under Windows).
Forest extension (submodules) not native and not well documented.

A quick and non-exhaustive look at performance

git is still leading the performance battle, but Hg and Bzr have made great improvements in the past year.

You should notice that Mercurial doubles the number of files in your repository (the historic is kept per file in .hg/store/data). It doesn't seem to be a good choice for Windows system running on NTFS.
It's also interesting to see that git takes a big advantage of the system when executing command. While Hg and Bzr do not spend a big proportion of time in system, Git can take up to 10-40% cpu time within system call, which raises the question as to how it will perform on Windows system where the git-developers won't have access to all the system performance trick they are used to with Linux.
Single Merges and Merge Queues should be tested, this is a tiedous part to benchmark.

Benchmarks should also be run on Windows as:

Even if your server is running on *nix, many developers are still having a Windows environment at work and DVCS transfered more processing on the developer station

Performance might be really different on Windows machine.

When?

Experience stories.

I had the chance to catch up with Kelly O'Hair from Sun about its choice for Hg for OpenJDK.

Sebastien Auvray: I read the reasons for migrating from TeamWare to Mercurial but had remaining questions. Did you simply follow OpenSolaris choice?

Kelly O'Hair: To some degree yes, but the OpenSolaris choice also became the Sun wide choice to any Sun Software teams having to convert. The OpenSolaris investigation was pretty complete and they had all the exact requirements we had. We had to convert for OpenJDK, because TeamWare was unacceptable for an open source project, the answer of Mercurial was pretty obvious for us.

Or did you do a refreshed tournament and tried the other DVCS again (git, ...)?

We did not do a detailed re-investigation, that seemed like a waste of time. The only other possible choice in my view was git, and since git wasn't giving Windows a priority, which we needed. Again the choice was obvious.

OpenSolaris reports took place in April 2006 which is 2 years ago.

Understood. Some things may have changed, git has improved, but the ball was rolling, and Mercurial was improving too.

Also did you encounter any specific problems in the migration?

File permissions and ownership can be a problem in sharing a repository vis a NFS or UFS file system, so we finally setup a server to handle the shared repositories, the better way. That could be made easier.
The other issue is that using hooks to rollback or filter pushes creates a window where someone could accidently pull changes that will be rolled back, so you have to use a pair of repositories, one for pushes and one for pulls, with an automatic sync after the hooks run to sync them up.
Using forests also introduces a problem because a forest push is just a set of individual pushes, and if one push failed, technically you would want to rollback all other pushes. Nobody is doing this, and just taking their chances. If the repositories in the forest are fairly independent, this is not a real problem.

In the day-to-day usage?

Remains to be seen. Change like this is easy for some, harder for others. Given time, I think most people have and will adapt and learn to love it.
The concept of "working set files" (having to do 'hg update') and having to merge changesets that don't seem to merge anything is confusing to people. Also, the idea that they are pushing changesets and not files is something people have a problem with, "Why can't I just bringover this one file?".

What is better than TeamWare?

Much much much faster than TeamWare. Our teams in China and Russia are looking forward to full deployment because they don't need to keep mirrors of integration areas. Refreshes (pulls) are very fast over slow connections.
The state of the repository in Mercurial is well-defined, unlike TeamWare which allowed for partial workspaces, TeamWare was just a loose bag of individually managed files (SCCS files).
The changeset concept was missing in TeamWare, along with the concept of well known simple state of the entire repository (a simple changeset id).

Is there anything you're missing from TeamWare?

People are missing the email notifications and putback/bringover transaction history, but the changeset provides much of that.
What may be missing is somekind of repository transaction history, but again, email archives of Mercurial events could provide this.

Is Hg becoming the VCS of choice for Sun including internal projects? Or is Sun using it only for public projects that need openness?

Both internal and external projects are converting, where it makes sense.
I've seen a big increase in interest from internal projects that are taking the plunge.

I also caught up with Pierre d'Herbemont from VLC to get their opinion about git.

Sebastien Auvray: Firstly what was the version control system you were using prior to using Git?

Pierre d'Herbemont: SVN and a git-svn mirror.

When did you migrate?

We opened a git mirror of the svn tree, to ease VLC Google Summer of Code projects. So that was back then. Then we totally migrated to git on March 1st-2nd 2008.

Why did you chose Git over its competitors?

Over SVN: Git is fast. Branch is cheap. Atomic Commits. Rebasing on top of an other tree.

Over other distributed system: Proven user base (Linux Kernel). I have been successfully using it while working on Wine. Git is sexy. And Some core developers had experiences with Git, whereas no one has with Mercurial and such. Nothing technical there.

Also did you encounter any specific problems in the migration? In the day-to-day usage?

We encountered some troubles with Trac and buildbot. Their support for Git is really minimal especially in their releases versions. We had to checkout Builbot latest trunk. For Trac we are using a crippled Git plugin. Trac Git Plugin needs Trac 0.11. But Trac 0.11 isn't stable and has some known memleak that prevent us from switching. So basically we are waiting for them to fix that...
It took some times for some committers to get accustomed with Git. But after two days, everything seemed fine. And some Git-beginners starts to really enjoy Git.

So what ?

Choosing between Distributed VCS and Central VCS is far from being easy. DVCS will definitely change the way you work and collaborate. Subversion, one of the Central VCS leader, has not resigned yet in the performance and features battle, and 1.5 version should come up with good compromises. It can count on its existing userbase and simplicity favor (at the cost of some pain). In very specific case like project dealing with large opaque binary files, Subversion would be better than DVCS because the client-side space usage stays constant. Also if you use partial checkout heavily, svn will perform better (but when massively used this reveals a problem in the setting of your modules).

Once you made the choice for either Distributed or Central solution, then it will also be hard to compare the competitors in their area as implementations/commands and at the end performances can be very different. And there is no real existing benchmarks for the common operations.
In this hard battle, Bazaar lost many new really influencing early adopters (Mozilla, Solaris, OpenJDK) because of its poor performance of the beginning. It also has to be said that Bazaar website is a lot more Marketing-oriented: by publicizing not-all-true differences with its competitors, or by publicizing benchmark comparison with its competitors only about Space efficiency while there's no timing benchmark comparison of daily commands: diff, add, ...
I feel that even though the 3 projects started out at nearly the same time, bzr did face a lot more performance and design problem at the early beginning making it a bit less mature than its competitors now.
Yet unseen phenomenon, it seems as if some choices have emerged based on the language used by the communities: Java / Sun related developments seem to be interested more in Mercurial while C / Linux / Ruby / Rails related projects are attracted by git.

Hope this article enlightened you and your experiences and feedbacks are always welcome!

Credits:
People who kindly accepted my interview: Kelly O'Hair, Pierre d'Herbemont.
Ian Clatworthy for his help and reactivity on the conversion of the Mozilla Hg Repository to Bzr.
#git, #mercurial, #bzr on Freenode IRC, #mozilla on Mozilla IRC.
Athletism Picture by Antonio Juez

Random quotes:
Linus Torvald: "Subversion has been the most pointless project ever started". "If you like using CVS, you should be in some kind of mental institution or somewhere else".
Mark Shuttleworth (Ubuntu / Canonical Ltd.): "Merging is the key to software developer collaboration."
Ian Clatworthy (Canonical / Bazaar): "By 2011-2012, I predict this technology will be widely adopted and many teams will wonder how they once managed without it."
Assaf Arkin in Git forking for fun and profit originally: "Apache built a great infrastructure around SVN, lots of sweat and tears went into making it happen, and at first I felt like we’re circumventing all of that. But the longer I thought about it, the more I realized that Git is just more social than SVN, and that’s exactly what Apache is about."

Apologies:darcs, Monotone were not taken into account in this comparison because it was already a hard work to gather all this information and to actually test those 3 DVCS. Strangely, even though they are the oldest in the DVCS scene, the focus is more on the DVCS I reviewed here (which doesn't help moving the focus I admit but darcs, Monotone users/developers are welcome to post comments and advertising here!).

Benchmark conditions.
Benchmark was done using AMD Athlon(tm) 64 Processor 3500+ 1GB RAM on Linux Kubuntu 6.10 Edgy x86_64 with ext3 fs.
Each command was run 8 times (and the best and worst time were cut out). They were done locally through the filesystem (other protocol tests should definitely be done as even if DVCS are not coupled with a central server, network communications when badly implemented can lower user performance).

Repository consists in a snapshot of 12456 changesets (from 20080303, 70853 total revisions from the hg Repository), ~30000 files from Mozilla Repository (originally hg formatted and translated into git repository thanks to hg-fast-export.sh for git and hg-fast-export.sh coupled with fast-import plugin for bazaar).
Default file formats were used and git repository size remained the same running git-gc (which can be considered normal for a freshly migrated repository). One file was modified (dom/src/base/nsDOMClassInfo.cpp) just like a benchmark test done by Jst 1.5 year ago.

About the Author

Sébastien Auvray is a senior software designer and technology enthousiast. After being forced to use CVS, svn now he has to suffer the daily usage of Perforce at work. Sébastien is also one of the Ruby editors of InfoQ.

The problem with this analysis, and those like it, is that they assume SVN as the state of the art. SVN is the state of the "free" art; but is missing a lot of features that perforce offers.

Now Perforce is pretty costly, but when broadly comparing "centralized vs. decentralized" it's probably not quite on target to use the "best free" rather than the "best." Many of the issues cited (and thereby implied to be issues with centralized solutions)

For example, perforce does not pollute your directories with control files. Perforce does a fantastic job of maintaining merges between branches and keeping merge history.

Indeed, most of the arguments put forward against "central" tend to be limitations in the SVN implementation of centralized SCM rather than limitations in the state of the art.

Hi Chris,I agree with you that Perforce got some good points that SVN do not support. But as you said 1) you need to pay for it 2) you really need to take a big care of your production scalability (as any central server...) else it's becoming a big bottleneck and you're blocked at each branch creation... Some Editors like Idea add some intelligence to that like adding the Offline mode.

DVCS's are young, but very capable. They are rather tools not systems to manage the code. Users need to come up with their own workflow and thats what a lot of them do not get.

It can be frustrating when someone tries out one of them and it seems complicated after CVS/SVN. The benefits are not so obvious (what are the benefits for a developer who has never ever did any branching and merging in CVS/SVN?).

What I don't like about Git:- is that it is a mess. c+perl+bash- my environment is polluted by it really hardly (at least in msys git, there are huge amount of aliases)- native windows port is really far from being production ready. (I can not even use it from behind firewall.)

What I don't like about Mercurial:- tree handling is not really good. It's not an easy call to implement it right, but it is important.- editing of the history is considered harmful and not really supported (can be done though through extensions, but not as nice as in git)

What I like about Mercurial:- hgbook is awesome- easy to extend (the included extensions are good starting point)- clean python- very flexible, you can build nice workflows around it- mq is evil but very handy once you understand what and how is it doing

Great article, nice comparison & research. I've been very interested in BitKeeper and git since reading about the controversy with Linux. Very cool idea, and if anyone's interested there is some very good doc out there explaining how BitKeeper works.

And I find it amazing that when Larry McEvoy pulled the plug on the Linux licenses, Linus (or someone) shook git out of their sleeve in a short time. Impressive.

Now that said, Torvalds has an incredible knack for saying incredibly insulting comments. SVN is pointless? When 59% of git users also use SVN or CVS? and anyone who uses CVS is stupid - uh - this coming from someone who didn't use *any* version control until forced to do so because it was impacting the work on Linux? Now that was nuts. Now, SCM is a given, CVS is probably one of the grandaddy, we stand on the shoulders of giants. And SVN's aim was to fix issues with CVS, nothing more, isn't git pretty much a clone of BitKeeper, like Linux a clone of Unix? Hopefully for Linus, the successors to Linux will have a more admirable view of their forebearers.

Nice overview, thanks. One thing, though -- the "major reason" you list for moving from Subversion is the difficulty of merging, but history-aware merge is the major improvement being delivered in Subversion 1.5. That's mentioned in the pages you link to at the end of your article, but given the importance attached to the feature, I thought it might be worth calling out.

Being a novice in the matter, I'd be very interested in having some feedback on how SVK compare to the mentioned DVCS.I had identified it as a potential solution to solve my SVN problems, mainly off-line commits and .svn file.

I'm very happy for the new perspectives the articles gave me. Thanks for the hard work!

This otherwise rather interesting article is for me and for more than 10% of readers fatally undermined by the choice to use generic icons differing only in red/green colour choice in the first comparison table. Simple check/dash/x symbology, or even a tooltip, would dramatically increase its usefulness.

Hello.A nice article, but I am a bit disappointed. First of all there is no mention of www.monotone.ca/ which is a well recognised DVCS now-a-days. It also fails to mention that git traditionally is fastest because on linux good old Linus Torvalds is using all the low level filesystem tricks you can possibly think off.

Developer's choices are not based on which is best, but what is my community using, as the author noticed :

it seems as if some choices have emerged based on the language used by the communities: Java / Sun related developments seem to be interested more in Mercurial while C / Linux / Ruby / Rails related projects are attracted by git.

But overall, the point is that your SCM tool should support your workflow and processes. It maybe be easier to change the tool than it is to change the processes.

My measurements have 'bzr clone master feature-1' coming in around 22 secs. A patch is available to reduce this to 16 secs. See this email for further details.

If you want to save further time and space when cloning in Bazaar, use the --hardlink option. It cuts the time to 11.2 secs (vs 11.1 secs for git on my computer) and reduces space usage across the working trees, which is where most of the disk space get consumed in tools as efficient at historical storage as these.

This is very impressive article and good analysis.Actually I want to know could anyone can explain about DVCS by UML diagram of its every process and steps so that I can understand all its micro process as visually.

There's been some confusion among people who think Mercurial doesn't use SHA1 for revision identification, because this article suggests its naming is simpler. While we gladly accept the notion that our revision identification scheme is simpler to use than that of other DVCSs, it's still fundamentally based on SHA1 hashes. Please note this in the table, if you can.

As much as I appreciated articles like this in the past, I must say we did ourselves a favor by skipping the interim solution of going from CVS and VSS to another open source tool and chose what we believe is the last tool we'll have to use, period. www.accurev.com I just don't get what all the excitement is over a 'free' tool that will only lead to more problems down the road?

Widely considered one of Git's killer features, git-gisect is worth mentioning.

Basically, it automates searching through the revision history for a version that introduced a regression. Git's speed lets you make frequent small commits, so the change that introduced the regression can be very small, and the result is the problem is easy to spot.

For distributed development, it's particularly nice because it lets even a relatively unskilled tester find the exact commit that introduced the regression and send the complaint straight to the relevant developer.

(It has, of course, been copied by Hg and bzr, so it's no longer unique to git. Still something well worth knowing about.)

But centralization is the main weakness of Perforce - you cannot work while you are offline as you need to check out files on the server, which is unneeded and annoying. That is why SVN beats Perforce (from developer point of view).