I've seen many posts recently about SCM user interfaces and how one system
is easier to learn, more powerful than another or better supports a
particular development style. I submit that these arguments fail to capture
the most salient feature of any source code management system—how the
system manages the actual source code. This fundamental underpinning of the
system, the repository structure, limits the kind of information the system
can capture, the robustness and reliability of the data, and, to a great
extent, the kinds of repository interactions possible.

A few days ago, Havoc made a
push for Subversion as a reasonable choice for projects. His complaints
focus on the Git user interface, while repeating the mistaken assumption
that Git forces users to engage in distributed development.

I agree with Havoc that few projects are large enough in scale to require
the kind of hierarchy seen in the Linux kernel. In fact, most projects have
fewer than 10 developers working on them, and with close coordination,
rarely see the need for any branching and merging at all.

However, as far as I know, none of the SCMs that provide distributed
development insist that developers hide their work on long-lived branches
and send patches up to a master maintainer. The distributed SCMs all allow
either centralized or distributed development; it all depends on the
conventions used within a project and individual developer style.

At X.org, we migrated from CVS to Git and yet have retained our largely
centralized development model. There are few people publishing alternate
trees, and we grant direct repository access to the same set of developers
who used to have CVS access.

For really bizarre, experimental work, we occasionally publish a temporary
alternate repository to distance it from the mainline further than a branch
within the master repository would. Developers publish such trees on a
public server, visible through the same web interface as the master
repositories, so there remains a single central location to discover what
work is going on within a given module.

Git provides us with three principal functional advantages:

Offline repository access. Until you've used it, it's hard to understand
just how often one can commit changes to a repository if the operation
takes mere seconds. Havoc himself likes to save editor state every few
minutes; with Git, he would be free to commit that state to the
repository without significant additional delay.

The ability to make very fine grained changes to the code encourages
people to separate work into small comprehensible pieces. Both proactive
review and reactive debugging benefit substantially from this kind of
detail, allowing people to highlight significant small changes which
would otherwise be lost in large functionally-neutral restructuring.

Offline repository access is not the same as distributed development;
changes are still pushed to a single shared public repository and
included in a single line of development. Of course, simultaneous
offline development often results in conflicts, but we've had that with
CVS forever, and Git provides better merge-resolution tools than CVS
ever did.
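A minimal sketch of this workflow, with file names and commit messages
invented for illustration:

    # On an airplane: commit each small, comprehensible piece locally
    git add render.c
    git commit -m "render: fix composite clipping"
    git add glyph.c
    git commit -m "glyph: split cache update out of render fix"

    # Back online: publish the accumulated commits to the shared repository
    git push origin master

Each commit takes a fraction of a second because it touches only the local
repository; the network is involved only at push time.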

Private branches. For those of us with ultra-secret hardware plans, we
develop drivers for unreleased hardware in parallel with the development
of the public project. Git makes this supremely easy by allowing us to
keep the ultra-secret new hardware changes in a private repository while
still tracking the public repository. When we're allowed to release the
source code for the new hardware, we simply merge the private branch to
the upstream master and push that to the public repository. All of the
development history for the new hardware then becomes a part of the
public source repository.
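A sketch of that private-branch dance, assuming a hypothetical public
repository URL:

    # One-time setup: clone the public tree privately
    git clone git://git.example.org/xorg/driver.git secret-driver
    cd secret-driver
    git checkout -b secret-hw    # work for unreleased hardware lives here

    # Keep tracking the public repository as it moves
    git fetch origin
    git merge origin/master

    # On release day: fold the private history into master and publish it
    git checkout master
    git merge secret-hw
    git push origin master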

Distributed backups. Even given freedesktop.org's reasonably reliable
RAID disk array and daily tape backups, it's nice to know that around
the world there are hundreds of people with complete backups of our
source code repositories. If freedesktop.org is destroyed by earthquake,
fire, flood or volcano, we can be confident that somewhere on the planet
there will be complete and recent backups.

Alternatively, if the freedesktop.org administration becomes evil and
starts to manipulate source code to subvert users' machines, the
distributed nature of our system means that external developers will
detect such changes and can easily repair them.
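Detection falls out of Git's content-addressed design: every clone can
check every object against its name. A quick sketch:

    # Verify the connectivity and SHA-1 of every object in a clone
    git fsck --full

    # Independent clones can compare branch tips; a rewritten history
    # shows up as a mismatched commit id
    git rev-parse origin/master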

That's nice for us, but none of these may be compelling for people new to
the distributed revision control world. Similarly, Git provides some nice
tools to view and manage the repository (gitk, git-bisect, etc.); again,
useful but not compelling.
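For instance, git-bisect drives a binary search for the commit that
introduced a bug; the tag name below is hypothetical:

    git bisect start
    git bisect bad               # the current tree exhibits the bug
    git bisect good v1.0         # this older tag was known to work
    # ...build and test each revision Git checks out, then mark it...
    git bisect good              # or: git bisect bad
    git bisect reset             # return to where you started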

I would like to argue that none of the user-interface and high-level
functional details are nearly as important as the fundamental repository
structure. When evaluating source code management systems, I primarily
researched the repository structures and essentially ignored the user
interface details. We can fix the user interface over time and even add
features. We cannot, however, fix a broken repository structure without all
of the pain inherent in changing systems.

Given this argument, it should be clear that I think Git's repository
structure is better than the others, at least for X.org's usage model. It
has several interesting properties:

Files containing object data are never modified. Once written, every
file is read-only from that point forward.

Compression is done off-line and can be delayed until after the primary
objects are saved to backup media. This method provides better compression
than any incremental approach and allows data to be re-ordered on disk to
match usage patterns.

Object data is inherently self-checking; you cannot modify an object
in the repository and escape detection the first time the object
is referenced.
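The first and third properties follow from Git naming each object by the
SHA-1 hash of its content. A small demonstration, run inside any Git
repository:

    # The object's name is the SHA-1 of its content
    echo 'hello' | git hash-object -w --stdin
    # ce013625030ba8dba906f756967f9e9ca394464a

    # The stored file is written once and left read-only
    ls -l .git/objects/ce/013625030ba8dba906f756967f9e9ca394464a
    # -r--r--r-- ...

Changing an object's content would change its name, so any tampering is
caught the moment the object is looked up.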

Many people have complained about Git's off-line compression strategy,
seeing it as a weakness that the system does not compress automatically.
Admittedly, automatic is always nice, but in this case the off-line
process gains significant performance advantages (all objects, independent
of original source file name, are grouped into a single compressed file)
as well as reliability benefits (original objects can be backed up before
being removed from the server). From measurements made on a wide variety
of repositories, Git's compression techniques are far and away the most
successful in reducing the total size of the repository. The reduced size
benefits both download times and overall repository performance, as fewer
pages must be mapped to operate on objects within a Git repository than
within any other repository structure.
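The off-line step itself is a single command; git gc is the everyday form,
while git repack exposes the knobs directly:

    # Pack all loose objects into one compressed pack file and
    # delete the now-redundant loose objects
    git repack -a -d

    # Or let Git choose sensible defaults for packing and pruning
    git gc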

Subversion appears to me to have the worst repository structure of all,
worse even than CVS. It supports multiple backends, with two available as
open source and one (by Google) closed source. The old Berkeley DB-based
backend has been deprecated as unstable and subject to corruption, so we
will ignore it as obviously unsuitable. The new FSFS backend uses simple
file-based storage and is more reliable, if somewhat slower in some cases.

The FSFS backend places one file per revision in a single directory; a test
import of Mozilla generated hundreds of thousands of files in this
directory, causing performance to plummet as more revisions were imported.
I'm not sure what each file contains, but it appears that revisions are
written as deltas against an existing revision, so damage to one file can
propagate down through generations. The lack of strong error detection
means such corruption will go unnoticed by the repository. CVS used to
suffer badly from this when NFS would randomly zero out blocks of files.

The Mozilla CVS repository was 2.7GB; imported to Subversion, it grew to
8.2GB. Under Git, it shrank to 450MB. Given that a Mozilla checkout is
around 350MB, it's fairly nice to have the whole project history (back to
1998) in only slightly more space.

Mercurial uses a truncated forward delta scheme in which file revisions
are appended to a per-file repository file as a string of deltas, with
occasional complete copies of the file (to place a time bound on
reconstruction). This suffers from two possible problems. The first is
fairly obvious: a corrupted write of a new revision can affect old
revisions of the file. The second is more subtle: a system failure during
commit will leave the file contents half written. Mercurial has recovery
techniques to detect this, but they involve truncating existing files, a
corner of the Linux kernel which has constantly suffered from race
conditions and other adventures.

I was looking seriously at Mercurial for X.org development, and was
fortunate to spend a week last January with key developers from both
Mercurial and Git. Discussions with both groups led me to understand that
Git provided more of what X.org needed, in terms of repository flexibility
and stability, than Mercurial did. The key detractor for Git was (and
remains) the steep learning curve of the native Git interface, ameliorated
for some users by alternate interfaces (such as Cogito), but not for core
developers.

The other killer Git feature is speed. We've all gotten thoroughly spoiled
by Git; many operations which take minutes under CVS now complete fast
enough to leave you wondering whether anything happened at all. This alone
should be enough to sway anyone leaning towards Subversion or Bzr:
fine-grained commits are only reasonable if the commit operation takes
almost no time.

We were not particularly interested in the kind of massive distributed
development model seen in the kernel, but the ability to work off-line (some
of us spend an inordinate amount of time on airplanes) and still provide
fine-grained detail about our work makes a purely central model less than
ideal. Plus, the powerful merge operations that Git provides for the kernel
developers are still useful in our environment, if not as heavily exercised.

I know Git suffers from its association with the wild and woolly kernel
developers, but they've pushed this tool to the limits and it continues to
shine. Right now, there's nothing even close in performance, reliability and
functionality. Yes, the user interface continues to need improvements. Small
incremental changes have been made which make the tools more consistent, and
I hope to see those discussions continue. Mostly, the developers respond to
cogent requests (with code) from the user community; if you find the UI
intolerable, fix it. But, know that while the UI improves, the underlying
repository remains fast, stable and reliable.

And yes, Havoc, anyone seriously entertaining moving to SVN should have
their heads examined.