Distributed VCS: Understanding the Paradigm Shift

Distributed Version Control Systems use a different model of check-out, check-in, branching, and merging than traditional, centralized, repository-based solutions. Understanding these differences is the only way to enjoy the benefits of distributed version control.

In addition, every team I know enforces the rule: Check that the code builds correctly (and ideally passes all tests) before check-in. That way, every "snapshot" you take records a real configuration: something that really existed in this form and could be built.

You'll notice that I'm stressing the concept of consistent check-ins versus dynamic configurations (where you download separate revisions of files to set up your workspace). The consistent model is a simplification of the previous one, and at first, you'll probably feel like you're losing power. But you're trading complexity for performance, simplicity, and consistency (and better merging, which will be discussed shortly).

By the way, labels (or tags) are much easier to understand: Instead of a set that groups together random revisions, a tag is just an alias for a changeset (Figure 9).

Figure 9: Applying a tag ("version 1.0") to a changeset (4).

The benefit is clear: The tag is easier to understand and much faster to apply. In some old version control systems, applying a label to a big codebase took ages because a new entry had to be set for each revision being labeled. Now it is just one new entry, independent of your project files.
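To make the idea concrete, here is a minimal sketch in Python of a tag as an alias for a changeset. It is a toy model, not how any particular DVCS stores its data: the `Repository` class, its methods, and the file names are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Union


@dataclass
class Repository:
    """Toy model: every changeset is a complete snapshot of the project."""
    changesets: Dict[int, Dict[str, str]] = field(default_factory=dict)
    tags: Dict[str, int] = field(default_factory=dict)

    def commit(self, files: Dict[str, str]) -> int:
        """Store a full snapshot and return its changeset number."""
        number = len(self.changesets) + 1
        self.changesets[number] = dict(files)
        return number

    def tag(self, name: str, number: int) -> None:
        """A tag is one name -> changeset entry, whatever the file count."""
        if number not in self.changesets:
            raise KeyError(f"unknown changeset {number}")
        self.tags[name] = number

    def checkout(self, ref: Union[int, str]) -> Dict[str, str]:
        """Resolve a tag name or a changeset number to its snapshot."""
        number = self.tags[ref] if isinstance(ref, str) else ref
        return dict(self.changesets[number])


repo = Repository()
for n in range(1, 5):
    # 1,000 hypothetical files per snapshot
    repo.commit({f"src/file{i}.c": f"contents at changeset {n}" for i in range(1000)})

# Tagging changeset 4 adds a single entry, not one entry per file.
repo.tag("version 1.0", 4)
```

Checking out "version 1.0" and checking out changeset 4 give the same snapshot, and the tag table holds one entry regardless of how many files the project contains.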

Things That Have Changed

Concerning the changes I've described so far, Table 1 shows the things you do and have in a DVCS and things you don't.

| You do or you have | You won't |
|---|---|
| Check the evolution of a project on a changeset-by-changeset basis. You'll check what files changed between two changesets. | Look at the history of a file so often, since you'll get used to thinking on a global basis. |
| Set your workspace to work on a given changeset, downloading a full configuration. | Hand-pick the files you will download to your workspace one by one or based on dynamic rules. |
| Fast labeling. | Wait for a label to be set. |

Table 1: The things you have in a DVCS and no longer use from traditional SCMs.

The Flexible Checkout Model vs. The Consistent and Rock Solid One

When I say "checkout," I mean it in the Subversion sense of checkout: Download files to your workspace (different from the "lock" or the "create new revision" that is performed in other systems). File-based version control systems allowed us to do things like that shown in Figure 10:

Figure 10: Checking out part of the repository.

You just download part of the repository to work on it. While this is doable in a DVCS, it is not normally the way things are done. In a DVCS, you don't follow the "monster-repo" model where you work on just a subset of the project. This means that you will not use the traditional model (Figure 11) for multiple projects.

Figure 11: Working on multiple projects from the same repo in the traditional SCM model.

In the traditional model, you define a big repository with all the code, then each developer must configure his or her working copy to work on the right parts of the code for a given project. The drawback is that the configuration of each workspace is dynamic and error-prone. Also, if you remember the "snapshot" approach I described earlier, where each changeset is meant to capture a complete configuration, you'll see that the "traditional" way of working is not truly compatible with it. When a developer working on "proj1" checks in, he creates a new changeset that also contains content (src/proj2, src/proj3, lib/comp2) unrelated to his change, and that content is incorporated in the "snapshot" along with it.
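The mismatch can be sketched in a few lines of Python. This is a toy model under stated assumptions: a changeset is a dictionary from path to contents, `check_in` is a hypothetical helper, and the file names under src/proj1, src/proj2, and lib/comp2 are invented for the example.

```python
def check_in(head, workspace_changes):
    """New changeset = the current head snapshot with the changes overlaid."""
    snapshot = dict(head)
    snapshot.update(workspace_changes)
    return snapshot


# One big repository holding several projects (hypothetical contents).
head = {
    "src/proj1/main.c": "v1",
    "src/proj2/main.c": "v1",
    "lib/comp2/util.c": "v1",
}

# The proj1 developer configures a partial workspace: only src/proj1.
workspace = {p: c for p, c in head.items() if p.startswith("src/proj1/")}

# Meanwhile, someone else checks in a change to proj2.
head = check_in(head, {"src/proj2/main.c": "v2"})

# The proj1 developer edits a file and checks in.
workspace["src/proj1/main.c"] = "v2"
new_changeset = check_in(head, workspace)

# Paths that are part of the new snapshot but were never in the workspace.
never_in_workspace = sorted(p for p in new_changeset if p not in workspace)
```

Here `never_in_workspace` lists the proj2 and comp2 files: the new changeset records them as part of the configuration even though the developer who created it never downloaded or built them.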

In a pure DVCS approach, things would be a little bit different, as shown in Figure 12.

Figure 12: The multiple projects from Figure 11 in a DVCS.

As you can see, some code reorganization is required, but the benefit is that every project directly maps to a repository in a one-to-one relationship. Every project is contained in a single repository, so the consistency of the changeset model is preserved. Also, configuring a workspace is as simple as downloading from the corresponding repository; there's no need for the developer to play with download rules or to decide what he needs. He has the whole repository. Gone is the error-prone part because the configuration is captured by the repository.

Multiple, consistent, and well-defined repositories are preferred, and the reason is deeply tied to two concepts:

Each changeset captures a fully consistent picture of the status of the project at a given time

Merge tracking in DVCSs is handled at the changeset level and not at file level. I'll cover this topic in the next section.

Merge Tracking in DVCS

The version control systems born before the age of DVCS implemented "item-based merge tracking," which means every single file had its own merge tracking history. Figure 13 shows the history of a single file in a system implementing item-based merge tracking.

Figure 13: Item-based branch and merge in traditional SCM systems.

The green arrow is a merge link, and the graphic shows how the file "foo.c" evolved: It started on the "main" branch, then "branch3" was created, two new revisions were checked in there, and they were merged back to the "main" branch. In this model, every file and directory has its own history tree with merge history.

Figure 14 shows how things happen in a changeset-based merge tracking system, such as those used in DVCSs:

Figure 14: Branch and merge in changeset-based DVCSs.

The main difference is that the merge tracking is not handled on an item-by-item basis (file-by-file or directory-by-directory), but on a global basis: The system keeps track of the merges on a changeset level. What are the important differences between the two models?

Changeset-based merge tracking enables much faster merging. Suppose you have to merge two branches: "main" and "feature130." The "feature130" branch was created from "main," but then the two branches evolved in parallel for a month. Major changes were made in "feature130," so a total of 20,000 files were modified there. In an item-based merge tracking system, the version control system will need to check the merge history of at least 20,000 items. You can optimize the merge code as much as you want, but it will never beat the speed of changeset-based merge tracking. The latter checks only one merge tracking tree (the changeset tree, of which there is only one), independent of the number of files in the repository. This is a significant advantage.

Changeset-based merge systems cannot do "partial merging." This is an important limitation that needs to be correctly understood. In practice, it means that when you merge a branch you will have to merge all the files or none of them. It is not possible (as it was with item-based merge tracking systems) to merge only a few files and then merge the rest of them later on. Merge tracking information is attached to the changeset, not kept on a file-by-file basis, and that's why merges can't be partial. At the end of the day, this isn't such a big restriction because, conceptually, you won't make changes on a branch that you don't intend to integrate, and a branch is handled as a unit of change that is merged as a whole.

This difference in merge tracking is one of the reasons why multiple cohesive repositories are preferred in DVCS over a single one with sparse checkouts.
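As a rough illustration of why the cost does not depend on the number of files, the following Python sketch finds the common ancestor of two branch heads by walking only the changeset graph. The graph, the changeset numbers, and the function names are assumptions made for the example, and real systems use more efficient algorithms than this brute-force search.

```python
def ancestors(parents, start):
    """Every changeset reachable from start via parent links, start included."""
    seen = set()
    stack = [start]
    while stack:
        cset = stack.pop()
        if cset not in seen:
            seen.add(cset)
            stack.extend(parents.get(cset, ()))
    return seen


def merge_bases(parents, ours, theirs):
    """Common ancestors that are not ancestors of another common ancestor."""
    common = ancestors(parents, ours) & ancestors(parents, theirs)
    return sorted(
        c for c in common
        if not any(c in ancestors(parents, other) for other in common if other != c)
    )


# Hypothetical history: 1 -> 2 -> 3 on "main"; "feature130" branches at 2 (4 -> 5).
parents = {1: [], 2: [1], 3: [2], 4: [2], 5: [4]}
first_base = merge_bases(parents, 3, 5)

# Merging 5 into 3 creates changeset 6 with two parents: that is the merge link.
parents[6] = [3, 5]
parents[7] = [6]  # "main" moves on
parents[8] = [5]  # "feature130" moves on
second_base = merge_bases(parents, 7, 8)
```

The first merge uses changeset 2 as its base. Because the merge link is recorded on changeset 6, the second merge uses changeset 5 as its base, so only the work done after the previous merge needs to be considered. Nothing in the search looks at individual files, whether each changeset touches two of them or 20,000.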

Merging the Full Changeset or Nothing

In branch3, we modified files foo.c and bar.c. What happens if, during the merge from changeset "4" to "3" (creating changeset "5" as a result), we only merge foo.c? In an item-based merge, the history of foo.c will include a new merge link, while bar.c will still not be merged. If we run the merge again to create changeset "6," then bar.c will be merged. In a DVCS, you cannot do this partial merge that leaves bar.c out.

The theory behind this way of implementing merge tracking states:

If you don't merge certain files from a branch, most likely it means you're abandoning them. What else would force you to create a changeset "5" where part of "branch3" is not merged? By doing that, you wouldn't be preserving the consistency rule.

If you use consistent configurations (a directory tree belongs to the same logic unit), you'll always be able to run complete merges. For instance: If you have a source tree with subdirectories "/src/tetris-game" and "/src/arcanoid-game," chances are you'll need to merge only the part of the "high-res-branch" related to "tetris-game" into the "tetris-2" branch. But, following this train of thought, shouldn't "src/tetris-game" and "src/arcanoid-game" belong to different repositories? In the DVCS world, the answer would be "yes" and this way, you avoid the need for most partial merges.
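The foo.c and bar.c scenario can be sketched in Python to show where the two models keep their merge records. This is a simplified model, not the data structure of any real system: the helper names are hypothetical, and it assumes changesets "3" and "4" both descend from a changeset "2," which the example does not state.

```python
# Files modified in branch3 (changeset 4).
changed_in_branch3 = ["bar.c", "foo.c"]

# Item-based tracking: one merge record per file.
item_merged = {path: set() for path in changed_in_branch3}


def item_merge(path, source):
    """Record that this file's changes from `source` have been merged."""
    item_merged[path].add(source)


def item_pending(source):
    """Files whose changes from `source` are still waiting to be merged."""
    return sorted(p for p in item_merged if source not in item_merged[p])


# Changeset-based tracking: one merge record per changeset (its parents).
parents = {3: [2], 4: [2]}  # assumption: both descend from changeset 2


def changeset_merge(new, dest, source):
    """Record the merge as a new changeset with two parents."""
    parents[new] = [dest, source]


def is_merged(tip, source):
    """True if `source` is reachable from `tip` through parent links."""
    stack, seen = [tip], set()
    while stack:
        cset = stack.pop()
        if cset == source:
            return True
        if cset not in seen:
            seen.add(cset)
            stack.extend(parents.get(cset, ()))
    return False


item_merge("foo.c", 4)                   # partial merge: foo.c only
pending_after_partial = item_pending(4)  # bar.c is still tracked as pending

merged_before = is_merged(3, 4)
changeset_merge(5, 3, 4)                 # changeset 5 = merge of 4 into 3
merged_after = is_merged(5, 4)
```

In the item-based model, bar.c remains visible as pending after the partial merge. In the changeset-based model, the only record is that changeset 4 is an ancestor of changeset 5, so there is nowhere to note that bar.c was left out. That is why the merge has to take the whole changeset or nothing.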

Conclusion

DVCSs present a new approach to version control that is not limited to their "distributed" nature: They also change the way in which repositories are designed and structured. Getting used to this new DVCS approach is the key to getting the best out of the next generation of version control systems and correctly structuring your projects.

Pablo Santos is the founder and chief engineer of Codice Software, the makers of PlasticSCM.
