
Centralized vs. Distributed SCM

By Damon Poole, September 13, 2011

The model that works for open-source projects might not work in the enterprise, or is that a false dichotomy?

[This week, I invited Damon Poole to discuss the sometimes contentious issue of distributed vs. centralized source code management. Damon is the CTO of AccuRev, an SCM vendor whose product straddles both worlds. As a result, he's had the chance to view the pros and cons of both models in large-scale deployments. Ed.]

The value of any particular software configuration management (SCM) system depends entirely on your project's unique circumstances. In a corporate environment, there is likely to be a strong emphasis on security, process, and governance. In an open source project, there is likely to be greater emphasis on autonomy.

In any discussion of the architecture of tools, a common starting point is the various generations of architecture within the domain. It is difficult to use this approach with SCM tools because they frequently draw on multiple architectural models that were brought to market at different points in the past. For instance, the concept of distributed SCM is not new. The distributed Git and Mercurial version control systems (VCSs) were both offshoots of the Linux kernel's temporary use of BitKeeper, which first shipped in 2000. BitKeeper in turn was based on Sun's TeamWare, which was created in the early '90s.

Let me start with some of the basic attributes of SCM systems (a term that I use to encompass a wider variety of tools than pure VCSs).

Replication. One of the primary differences between centralized and distributed SCM systems is how they handle replication, and the options form a continuum. At one end, systems that provide no method of replication are commonly known as "centralized SCM" tools. That is, all access to project information is via a centralized repository. CVS and Subversion are two such examples. Near the middle of the continuum are site-based proxy caching replication systems such as Perforce, Team Foundation Server, and AccuRev. Peer-to-peer replication systems with branch-based mastership, such as ClearCase and Plastic SCM, also fall in this middle area.

The other end, known as "distributed SCM," relies on masterless peer-to-peer replication. That is, any replica of a given project can push or pull any information to or from any other replica; and there is truly no master repository that architecturally serves as the single source of truth. Examples of this distributed model include BitKeeper, Git, Mercurial, and Veracity.
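As a rough illustration of what "masterless" means, the following Python sketch models each replica as nothing more than a set of change IDs. The `Replica` class and the change names are hypothetical and are not taken from any of the tools named above; real systems exchange full history graphs rather than flat sets.

```python
class Replica:
    """A toy repository replica: a name plus the set of change IDs it holds."""

    def __init__(self, name):
        self.name = name
        self.changes = set()

    def commit(self, change_id):
        """Record a new local change."""
        self.changes.add(change_id)

    def pull(self, other):
        """Copy every change that `other` holds and this replica lacks."""
        missing = other.changes - self.changes
        self.changes |= missing
        return missing

    def push(self, other):
        """Pushing to `other` is the same as `other` pulling from us."""
        return other.pull(self)


# Three peers; none of them is architecturally special.
alice, bob, carol = Replica("alice"), Replica("bob"), Replica("carol")
alice.commit("a1")
bob.commit("b1")

carol.pull(alice)                 # carol learns a1 directly from alice
carol.pull(bob)                   # carol learns b1 directly from bob
bob.pull(carol)                   # bob learns a1 by way of carol
alice_received = alice.pull(bob)  # alice learns b1 from bob
```

After these exchanges all three replicas hold the same changes, even though no replica acted as a master and the changes took different routes to each peer.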

One benefit of a distributed SCM is the ability to work in an environment where network connectivity is unreliable or unavailable, such as on a plane (without WiFi). However, most other SCM systems do provide some degree of disconnected operation. For instance, Subversion allows for editing, diffing, reverting changes, and getting file status.

The value of disconnected operation depends on how many of the developers on your project regularly work disconnected, how frequently, and for how long. The more your project needs to work disconnected, the more value a distributed SCM system provides. But working disconnected is not the only benefit of having a local repository.

Integration and merging. Even in a distributed SCM system, there is generally a mainline or central repository. The pressure for this comes from the need to integrate everybody's changes together. For the Linux kernel, that's Linus Torvalds's repository. Another reason for a central repository when using a distributed SCM is Continuous Integration (CI).

Many projects are now using the practice of CI. A basic tenet of CI is that all changes are integrated into the "mainline" as frequently as possible. The benefit is there is then a configuration of the software that is known to have been fully integrated, built, and against which all automated tests have been run. To achieve this in the distributed SCM world, there must be a repository designated as the official CI repository.

When using a central repository, especially for a large project, congestion can occur when integrating with that repository. In this case, speed of operation depends on how many people are trying to integrate at the same time, the number of conflicts, and the strength of the system's merging capabilities.

Distributed SCM systems have gotten quite a bit of well-deserved publicity for their ability to merge from one configuration to another. But this comes from comparing them against systems like CVS and (until recently) Subversion, which did not include good merging. The point here is that good merging is orthogonal to the replication model and depends directly on the version ancestry design choices of a given system.
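To make the ancestry point concrete, here is a minimal Python sketch of how a system that records parent links for every version, including merges, can find the base for a three-way merge. The history, the commit names, and the functions `ancestors` and `merge_bases` are illustrative assumptions, not the algorithm of any particular tool.

```python
def ancestors(parents, commit):
    """Return `commit` plus everything reachable through parent links."""
    seen = set()
    stack = [commit]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def merge_bases(parents, left, right):
    """Return common ancestors that are not ancestors of another common ancestor."""
    common = ancestors(parents, left) & ancestors(parents, right)
    return {
        c for c in common
        if not any(c != other and c in ancestors(parents, other) for other in common)
    }


# Hypothetical history: mainline A <- B <- C, branch B <- D <- E.
history = {"A": [], "B": ["A"], "C": ["B"], "D": ["B"], "E": ["D"]}
first_base = merge_bases(history, "C", "E")

# Record the merge M with BOTH parents, then continue the branch with F.
history["M"] = ["C", "E"]
history["F"] = ["E"]
second_base = merge_bases(history, "M", "F")
```

Because the merge `M` records both of its parents, the second merge starts from `E` rather than from `B`, so work that was already merged is not reconsidered. Nothing in this calculation depends on whether the history lives in one central repository or in many replicas.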

Workflow. Until distributed SCM systems came along, stream-based systems had the upper hand when it came to setting up a flexible SCM-based workflow. Distributed SCM provides a new way to build a workflow by using repositories as the workflow building block. If you want to create team integration stages, production staging areas, code review gates, or CI servers, you can use separate repositories for each stage and you have an easy-to-use workflow.
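The idea of repositories as workflow building blocks can be sketched in a few lines of Python. The stage names, the `promote` function, and the gate callback are all hypothetical; in a real setup each stage would be an actual repository and promotion would be a push or pull between them.

```python
# Hypothetical stages, ordered from least to most trusted.
STAGES = ["developer", "team-integration", "code-review", "ci", "production-staging"]


def promote(repos, change_id, from_stage, gate=lambda change: True):
    """Copy a change into the next stage's repository if the gate approves it."""
    index = STAGES.index(from_stage)
    if index == len(STAGES) - 1:
        raise ValueError("already at the final stage")
    if change_id not in repos[from_stage]:
        raise KeyError(f"{change_id} is not in {from_stage}")
    if not gate(change_id):
        return False  # the change stays where it is
    repos[STAGES[index + 1]].add(change_id)
    return True


# One set of change IDs per stage stands in for one repository per stage.
repos = {stage: set() for stage in STAGES}
repos["developer"].add("fix-123")

moved = promote(repos, "fix-123", "developer")
rejected = promote(repos, "fix-123", "team-integration", gate=lambda change: False)
```

Here the first promotion succeeds and the second is blocked by a gate that always refuses, which stands in for a failed review or a red build.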

Management visibility. In a corporate environment, managers want to have visibility into the progress of changes. Distributed SCM tools provide a good framework for this via a workflow composed of related repositories, but are currently weak in providing access to this information. The information is scattered across multiple repositories and there is no easy way to know where to look to gather the necessary information, let alone structure it together into a coherent management view. In counterpoint, stream-based systems such as Perforce, AccuRev, and Team Concert enable management to visualize this information because the necessary information is both well-structured and centralized.

Security. Most distributed SCM systems do whole-repository replication. This complicates security when there are certain sub-trees to which you want only authorized people or groups to have access. In an open source project, this is generally not a problem. In a corporate, regulated, or national-security-related environment, however, security is a major concern.

The only way to guarantee security in this situation is to split your source base into multiple repositories. The more unique access types you need, the more repositories you will need to split your source base into. Users will then need to compose repositories together to create the particular source configuration that they need. This can prove to be challenging from an SCM perspective. Centralized VCSs, by comparison, often provide various access models to restrict access to certain parts of the code base.
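The relationship between access types and repository count can be sketched as follows, assuming each sub-tree carries a simple list of groups allowed to read it. The paths, the group names, and the function `partition_by_access` are invented for illustration.

```python
from collections import defaultdict


def partition_by_access(acl):
    """Group sub-trees by their exact set of allowed groups.

    Sub-trees with identical access sets can share one repository, so the
    number of distinct access sets is the number of repositories needed.
    """
    repos = defaultdict(list)
    for subtree, groups in acl.items():
        repos[frozenset(groups)].append(subtree)
    return dict(repos)


# Hypothetical access rules for four sub-trees.
acl = {
    "src/ui": ["developers"],
    "src/core": ["developers"],
    "src/crypto": ["developers", "security-cleared"],
    "deploy/keys": ["ops"],
}

layout = partition_by_access(acl)
repo_count = len(layout)
```

With these rules the four sub-trees collapse into three repositories, and every new combination of groups adds another repository that users must then compose into a working configuration.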

There are certainly other aspects, such as performance and degrees of autonomy, that go into the choice of an SCM system. What is clear, however, is that unless security or management visibility are paramount considerations (and both would favor the centralized model), both distributed and centralized architectures provide a similar set of features and capabilities, even if these are implemented differently. In a sense, the centralized vs. distributed model is a false dichotomy. What matters most is your ability to choose one of the numerous SCM packages available based on a clear, prioritized statement of project needs. Almost invariably, that process will point you to a set of options to evaluate, and the model will be only one of several important factors.
