Thursday, November 5. 2009

This blog post is a by-product of my preparation work for an upcoming talk titled "Why you should be using a distributed version control system (DVCS) for your project" at SAPO Codebits in Lisbon (December 3-5, 2009). Publishing these thoughts prior to the conference serves two purposes: getting some peer review on my findings and acting as a teaser for the actual talk. So please let me know — did I cover the relevant aspects or did I miss anything? What's your take on DVCS vs. the centralized approach? Why do you prefer one over the other? I'm looking forward to your comments!

Even though several distributed alternatives have been available for some years now (with Bazaar, git and Mercurial being the most prominent representatives), many large and popular Open Source projects still use centralized systems like Subversion or even CVS to maintain their source code. While Subversion has eased some of the pains of CVS (e.g. better remote access, renaming/moving of files and directories, easy branching), the centralized approach itself has some disadvantages compared to distributed systems. So what are these? Let me give you a few examples of the limitations of a centralized system like Subversion and how they affect the possible workflows and development practices.

I highly recommend that you also read John Arbash Meinel's Bazaar vs Subversion blog post for a more elaborate description of the limitations.

Most operations require interaction with the central repository, which is usually located on a remote server. Browsing the revision history of a file, creating a branch or a tag, comparing differences between two versions: all these activities involve communication over the network. As a result, they are not available when you're offline, and they can be slow enough to disrupt your workflow. And if the central repository is down because of a network or hardware failure, every developer's work gets interrupted.

A developer can only checkpoint his work by committing his changes into the central repository, where they become immediately visible to everybody else working on that branch. It's not possible to keep track of your ongoing work by committing it locally first, in small steps, until the task is completed. Any local work that is not supposed to be committed into the central repository can therefore only be maintained as patches outside of version control, which becomes very cumbersome as the number of modifications grows. The same limitation affects external developers who want to join the project and work with the code. While they can easily obtain a checkout of the source tree, they are not able to put their own work into version control until they have been granted write access to the central repository. Until then, they have to contribute their work by submitting patches, which puts an additional burden on the project's maintainers, as they have to apply and merge these patches by hand.

Tags and branches of a project are created by copying entire directory structures around inside the repository. There are some recommendations and best practices on how to do that and how these directories should be arranged (e.g. by creating top-level branches and tags directories), but there are several variants and none of them is enforced by the system. This makes it difficult to work with projects that use a non-standard way of maintaining their branches and can be rather confusing (depending on the number of branches and tags that exist).

While creating new branches is quick and atomic in Subversion, it's difficult to resolve conflicts when merging or reconciling changes from other branches. Recent versions of Subversion added support for keeping better track of merges, but this functionality is still not up to par with what the distributed tools provide. Merging between branches used to drop the revision history of the merged code, which made it difficult to keep track of the origins of individual changes. This often meant that developers avoided developing new functionality in separate branches and worked on the trunk instead. Working this way makes it much harder to keep the code in trunk in a stable state.

Having described some downsides of the centralized approach, I'd now like to highlight a few notable advantages of using a distributed version control system for maintaining an Open Source project. These are based on my own personal experiences from working with various distributed systems (I've used Bazaar, BitKeeper, Darcs, git, Mercurial and SVK) and from following many other OSS projects that either made the switch from centralized to distributed or have been using a distributed system from the very beginning. For example, MySQL had already been using BitKeeper for almost two years when I joined the team in 2002. From there, we made the switch to Bazaar in 2008. mylvmbackup, my small MySQL backup project, is also maintained using Bazaar and hosted on Launchpad.

Let me begin with some simple and (by now) well-known technical aspects and benefits of distributed systems before I elaborate on what social and organizational consequences these have.

In contrast to having a central repository on a single server, each working copy of a distributed system is a full-blown backup of the repository it was created from, including the entire revision history. This provides additional security against data loss, and it's very easy to promote another repository to become the new master. Developers simply point their local repositories to this new location to pull and push all future changes from there, so this usually causes very little disruption.
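To make this concrete, here is a minimal sketch that assumes git as the DVCS (Bazaar and Mercurial should allow a comparable recovery). The directory names `central`, `dev` and `new-central.git` and the identity "Alice" are made up for illustration; everything happens in a throwaway temporary directory.

```shell
#!/bin/sh
# Sketch: a clone holds the full history, so it can replace a lost "central" repo.
# All names and paths below are hypothetical.
set -e
export GIT_AUTHOR_NAME="Alice" GIT_AUTHOR_EMAIL="alice@example.com"
export GIT_COMMITTER_NAME="Alice" GIT_COMMITTER_EMAIL="alice@example.com"

work=$(mktemp -d)
cd "$work"

# The original "master" repository with two revisions.
git init -q central
cd central
echo "first line" > README
git add README
git commit -q -m "First revision"
echo "second line" >> README
git commit -q -a -m "Second revision"
cd ..

# A developer's clone contains the complete history, not just the latest files.
git clone -q central dev

# Simulate losing the central server.
rm -rf central

# Promote the clone: publish it as a new bare repository ...
git clone -q --bare dev new-central.git

# ... and point the existing working repository at the new location.
git -C dev remote set-url origin "$work/new-central.git"
git -C dev fetch -q origin

revisions=$(git -C new-central.git rev-list --count HEAD)
echo "revisions preserved: $revisions"
```

Other developers would only need the `git remote set-url` step to follow the move.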

Disconnected operations allow performing all tasks locally without having to connect to a remote server. Reviewing the history, looking at diffs between arbitrary revisions, applying tags, committing or reverting changes can all be done on the local repository. These operations take place on the same host and don't require establishing a network connection, which also means they are very fast. Changes can later be propagated using push or pull operations - these can be initiated from both sides at any given time. As Ian Clatworthy described it, a distributed VCS decouples the act of snapshotting from the act of publishing.
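As a rough illustration of snapshotting first and publishing later, the following sketch again assumes git. The bare repository `published.git` merely stands in for a remote server, and the names `laptop`, `notes.txt` and `v0.1` are invented for the example.

```shell
#!/bin/sh
# Sketch: commit, inspect and tag locally; publish in a separate, later step.
# All names and paths below are hypothetical.
set -e
export GIT_AUTHOR_NAME="Alice" GIT_AUTHOR_EMAIL="alice@example.com"
export GIT_COMMITTER_NAME="Alice" GIT_COMMITTER_EMAIL="alice@example.com"

work=$(mktemp -d)
cd "$work"

# Stand-in for the remote server; nothing touches it until the final push.
git init -q --bare published.git

git init -q laptop
cd laptop
branch=$(git symbolic-ref --short HEAD)   # whatever git calls the default branch

# Snapshot the work locally, in small steps.
echo "step 1" > notes.txt
git add notes.txt
git commit -q -m "Start the task"
echo "step 2" >> notes.txt
git commit -q -a -m "Continue the task"

# History, diffs and tags are all local operations.
git log --oneline
git diff HEAD~1 HEAD
git tag v0.1

# Publishing happens whenever a connection is available.
git remote add origin "$work/published.git"
git push -q origin "$branch" v0.1

published=$(git -C "$work/published.git" rev-list --count "$branch")
echo "published revisions: $published"
```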

Because there is no need to configure or set up a dedicated server or separate repository with any of today's popular DVCSes, there is very little overhead and maintenance required to get started. There is no excuse for not putting your work into revision control, even if your project starts as a one-man show or you never intend to publish your code! Simply run "bzr|git|hg init" in an existing directory structure and you're ready to go!
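Spelled out for git, this could look like the sketch below (`bzr init` and `hg init` play the same role in the other two tools). The temporary directory stands in for an existing project, and the file names are made up.

```shell
#!/bin/sh
# Sketch: put an existing directory under version control, no server involved.
# The project contents below are hypothetical.
set -e
export GIT_AUTHOR_NAME="Alice" GIT_AUTHOR_EMAIL="alice@example.com"
export GIT_COMMITTER_NAME="Alice" GIT_COMMITTER_EMAIL="alice@example.com"

project=$(mktemp -d)          # stands in for your existing project directory
cd "$project"
mkdir src
echo 'echo "hello"' > src/hello.sh
echo "My project" > README

# The whole history lives in ./.git inside the project itself.
git init -q
git add .
git commit -q -m "Initial import"

tracked=$(git ls-files | wc -l | tr -d ' ')
echo "files under version control: $tracked"
```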

As there is no technical reason to maintain a central repository, the definition of "the code trunk" changes from a technical requirement into a social convention. Most projects still maintain one repository that is considered to be the master source tree. However, forking the code and creating branches of a project change from being an exception into being the norm. The challenge for the project team is to remain the canonical and relevant hub of the development activities. The ease of forking also makes it much simpler to take over an abandoned project while preserving the original history. As an example, take a look at the zfs-fuse project, which both got a new project lead and moved from Mercurial to git without losing the revision history or requiring any involvement by the original project maintainer.

Both branching and merging are "cheap" and encouraged operations. The role of a project maintainer changes from being a pure developer and committer to becoming the "merge-master". Selecting and merging changes from external branches into the main line of development becomes an important task of the project leads. Good merge-tracking support is a prerequisite for a distributed system and makes this a painless job. Also, the burden of merging can be shared among the maintainers and contributors. It does not matter on which side of a repository a merge is performed. Depending on the repository relationships and how changes are being propagated between them, some DVCSes like Bazaar or git actually provide several merge algorithms that one can choose from.
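A small sketch of how cheap this is in practice, assuming git: the branch name `feature` and the file names are invented, and `-s recursive` is just one of the strategy names that `git merge` accepts.

```shell
#!/bin/sh
# Sketch: branch, let both lines diverge, then merge with an explicit strategy.
# Branch and file names below are hypothetical.
set -e
export GIT_AUTHOR_NAME="Alice" GIT_AUTHOR_EMAIL="alice@example.com"
export GIT_COMMITTER_NAME="Alice" GIT_COMMITTER_EMAIL="alice@example.com"

repo=$(mktemp -d)
cd "$repo"
git init -q
trunk=$(git symbolic-ref --short HEAD)

echo "base" > app.txt
git add app.txt
git commit -q -m "Base revision"

# Branching is a local operation.
git checkout -q -b feature
echo "feature work" > feature.txt
git add feature.txt
git commit -q -m "Add feature"

# Meanwhile the main line moves on as well.
git checkout -q "$trunk"
echo "fix" >> app.txt
git commit -q -a -m "Fix on the main line"

# Merge the branch back; -s selects the merge strategy explicitly.
git merge -q --no-ff -s recursive -m "Merge feature" feature

# rev-list prints the commit followed by its parents: 3 fields = 2 parents.
fields=$(git rev-list --parents -n 1 HEAD | wc -w | tr -d ' ')
echo "parents of the merge commit: $((fields - 1))"
```

Both parents stay in the history, so the origin of each change remains traceable after the merge.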

Having full commit rights to one's own branch empowers contributors. It encourages experimenting and lowers the barrier for participation. It also creates new ways of collaboration. Small teams of developers can create ad-hoc workgroups to share their modifications by pushing/pulling from a shared private branch or amongst their personal branches. However, it still requires the appropriate privileges to be able to push into the main development branch.

This also helps to improve the stability of the code base. Larger features or other intrusive changes can be developed in parallel to the mainline, kept separate but in sync with the trunk until they have evolved and stabilized sufficiently. With centralized systems, code has to be committed into the trunk first before regression tests can be run. With DVCSes, merging of code can be done in stages, using a "gatekeeper" to review/test all incoming pushes in a staging area before merging them with the mainline code base. This gatekeeper could be a human or an automated build/test system that propagates code into the trunk based on certain criteria, e.g. "it still compiles", "all tests pass", "the new code adheres to the coding standards". While central systems only allow star schemas, a distributed system allows workflows where modifications follow arbitrary directed graphs.
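An automated gatekeeper of this kind could be sketched roughly as follows, assuming git. The `gatekeep` function, the `staging` branch, the candidate branches `good` and `bad` and the toy test script `run-tests.sh` are all invented for the example; a real setup would run the project's actual build and test suite.

```shell
#!/bin/sh
# Sketch: merge each candidate in a staging branch, run the tests there, and
# only advance the mainline if they pass. All names below are hypothetical.
set -e
export GIT_AUTHOR_NAME="Alice" GIT_AUTHOR_EMAIL="alice@example.com"
export GIT_COMMITTER_NAME="Alice" GIT_COMMITTER_EMAIL="alice@example.com"

repo=$(mktemp -d)
cd "$repo"
git init -q
trunk=$(git symbolic-ref --short HEAD)

# Mainline: a tiny script plus a test that must keep passing.
printf 'echo 4\n' > answer.sh
printf '[ "$(sh answer.sh)" = "4" ]\n' > run-tests.sh
git add .
git commit -q -m "Mainline with passing tests"

# Two incoming contributions: one harmless, one that breaks the test.
git checkout -q -b good "$trunk"
echo "docs" > README
git add README
git commit -q -m "Add README"
git checkout -q -b bad "$trunk"
printf 'echo five\n' > answer.sh
git commit -q -a -m "Break the answer"
git checkout -q "$trunk"

gatekeep() {
    candidate=$1
    # Start the staging area from the current mainline and merge the candidate.
    git checkout -q -B staging "$trunk"
    git merge -q --no-ff -m "Merge $candidate" "$candidate"
    if sh run-tests.sh; then
        # Tests pass: fast-forward the mainline to the tested merge.
        git checkout -q "$trunk"
        git merge -q --ff-only staging
        echo "accepted: $candidate"
    else
        # Tests fail: the mainline stays untouched.
        git checkout -q "$trunk"
        echo "rejected: $candidate"
    fi
}

gatekeep good
gatekeep bad
```

The mainline only ever moves to a revision that has already passed the tests in staging.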

Patches and contributions suffer less from bit rot. A static patch file posted to a mailing list or attached to a bug report may no longer apply cleanly by the time you look into it. The underlying code base has changed and evolved. Instead of posting a patch, a contributor using a DVCS simply provides a pointer to his public branch of the project, which he hopefully keeps in sync with the main line of development. From there, the contribution can be pulled and incorporated at any time. The history of every modification can be tracked in much more detail, as the author's name appears in the revision history (which is not necessarily the case when another developer applies a patch contributed by someone else).
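The following sketch, again assuming git, shows a maintainer pulling from a contributor's branch instead of applying a patch. The two identities and the repository names `mainline` and `contributor` are hypothetical, and a local path stands in for the contributor's public URL.

```shell
#!/bin/sh
# Sketch: pull a contribution from the contributor's branch; the author's name
# stays in the history. All names and paths below are hypothetical.
set -e
as_user() {
    GIT_AUTHOR_NAME=$1; GIT_COMMITTER_NAME=$1
    GIT_AUTHOR_EMAIL=$2; GIT_COMMITTER_EMAIL=$2
    export GIT_AUTHOR_NAME GIT_COMMITTER_NAME GIT_AUTHOR_EMAIL GIT_COMMITTER_EMAIL
}

work=$(mktemp -d)
cd "$work"

# The maintainer's main line.
as_user "Maintainer" "maintainer@example.com"
git init -q mainline
cd mainline
trunk=$(git symbolic-ref --short HEAD)
echo "core" > core.txt
git add core.txt
git commit -q -m "Core functionality"
cd ..

# The contributor clones it and commits under their own name.
as_user "Contributor" "contributor@example.com"
git clone -q mainline contributor
cd contributor
echo "plugin" > plugin.txt
git add plugin.txt
git commit -q -m "Add plugin"
cd ..

# Later, the maintainer pulls straight from that branch.
as_user "Maintainer" "maintainer@example.com"
cd mainline
git pull -q --no-rebase "$work/contributor" "$trunk"

author=$(git log -1 --format=%an -- plugin.txt)
echo "plugin.txt was written by: $author"
```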

A DVCS allows you to keep track of local changes in the same repository, while still being able to merge bug/security fixes from upstream. Example: your web site might be based on the popular Drupal CMS. While the actual development of Drupal still takes place in (gasp) CVS, it is possible to follow the development using Bazaar. This allows you to stay in sync with the ongoing development (e.g. receiving and applying security fixes for an otherwise stable branch) while keeping your local modifications under version control as well.
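The same idea can be sketched with git instead of Bazaar. The `upstream` repository below is only a stand-in for a mirror of the project you track, and `site`, `cms.txt` and `theme.txt` are invented names.

```shell
#!/bin/sh
# Sketch: keep a local modification under version control and still merge an
# upstream fix. All names and paths below are hypothetical.
set -e
export GIT_AUTHOR_NAME="Alice" GIT_AUTHOR_EMAIL="alice@example.com"
export GIT_COMMITTER_NAME="Alice" GIT_COMMITTER_EMAIL="alice@example.com"

work=$(mktemp -d)
cd "$work"

# Stand-in for the upstream project.
git init -q upstream
cd upstream
trunk=$(git symbolic-ref --short HEAD)
printf 'line 1\nline 2\nline 3\n' > cms.txt
git add cms.txt
git commit -q -m "Upstream release"
cd ..

# Your site: a clone with a local customisation committed on top.
git clone -q upstream site
cd site
echo "site theme" > theme.txt
git add theme.txt
git commit -q -m "Local: add site theme"
cd ..

# Upstream ships a security fix.
cd upstream
printf 'line 1\nline 2 (patched)\nline 3\n' > cms.txt
git commit -q -a -m "Security fix"
cd ..

# Merge the fix; the local change is preserved alongside it.
cd site
git pull -q --no-rebase --no-edit origin "$trunk"
git log --oneline
```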

With this blog post, I've probably just scratched the surface of the benefits that distributed version control systems provide. Many of these aspects and their consequences are not fully analyzed and understood yet. In the meantime, more and more projects make the switch, gather experience and establish best practices. If you're still using a centralized system, I strongly encourage you to start exploring the possibilities of distributed version control. And you don't actually have to "flip the switch" immediately: most of the existing systems happily interact with a central Subversion server as well, allowing you to benefit from some of the advantages without having to convert your entire infrastructure right away.

Here are some pointers for further reading on that particular subject:
