What People Love about their VCS - Part 3 of 4: git

The earlier posts in this series were light on detail - little more
than teasers - whereas this post goes into much detail on each new
feature. For this bias I offer no apology. There is no mistaking that
within the space of one year, I have gone from being an outspoken SVK
advocate to extolling the virtues of the content filesystem, git. And
I am not alone.

Content Addressable Filesystem

There are many good reasons that super-massive projects like the
Linux Kernel, XFree86, Mozilla, Gentoo, etc. are switching to git.
This is not just a short-term fad: git brings a genuinely new (well,
borrowed from Monotone) concept to the table - that of the
content-addressable filesystem.

In this model, files, trees of files, and revisions are all hashed
with SHA-1 (seeded with a small header giving each object's type and
size) to yield object identifiers that uniquely identify (to the
strength of the hashing algorithm) the type and contents of the
object. The full ramifications of this take some time to realise, but
include more efficient delta compression¹, algorithmically faster
merging, and less error-prone file history detection² - but chiefly,
much better identification of revisions. All of a sudden, it does not
matter which repository a revision comes from - if the SHA1 object ID
matches, you have the same object, so the model is naturally
distributed, with no requirement for URIs or surrogate repository
UUIDs and revision numbers.
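The blob object ID, for instance, can be reproduced outside git
entirely: it is the SHA-1 of a short "blob <size>" header followed by
the raw contents. A minimal sketch (tree and commit objects work the
same way, with their own headers):

```python
import hashlib

def git_blob_id(data: bytes) -> str:
    """Compute the git object ID of a blob: the SHA-1 of a
    "blob <size>\\0" header followed by the raw contents."""
    header = b"blob %d\x00" % len(data)
    return hashlib.sha1(header + data).hexdigest()

# The same contents always yield the same ID, in any repository.
print(git_blob_id(b"hello\n"))
# ce013625030ba8dba906f756967f9e9ca394464a
```

Because the ID depends on nothing but type and contents, two
repositories that have never communicated will still agree on it.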

Being content-keyed also means you are naturally transaction-safe.
In terms of the core repository, you are only ever adding new objects.
So, if two processes try to write the same object file, both can
safely succeed, because by construction they are writing the same
contents.

It also makes cryptography and authentication easy - you can sign
an entire project and its revision history just by signing a piece of
text that includes a commit ID. And if you recompute the object
identifiers using a stronger hash, you get a stronger guarantee.

The matter of speed

The design of the git-core implementation makes very efficient use
of the operating system. People might scoff at this as a key feature,
but consider this performance comparison:

Mirroring 6,511 revisions took just 115 seconds. The key bottleneck
was the network - which was saturated for almost all of the execution
time of the command - not a laborious, revision-by-revision dialogue
imposed by a server protocol that just didn't seem to think that
people might want to copy entire repositories³. The git server
protocol simply exchanges a few object IDs, then, using the merge
base algorithm to figure out which new objects are required, it
generates a delta-compressed pack that gives you just the new objects
you need. So git does not suffer on high-latency networks in the
same way that SVN::Mirror does.
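That negotiation can be sketched abstractly: the objects worth
sending are exactly those reachable from the heads the client wants
but not from the heads it already has. A toy model (the commit names
and the dict-of-parents graph are invented for illustration):

```python
def reachable(heads, parents):
    """All commits reachable from the given heads in a
    parent-pointer graph (dict: commit -> list of parents)."""
    seen, stack = set(), list(heads)
    while stack:
        c = stack.pop()
        if c not in seen:
            seen.add(c)
            stack.extend(parents.get(c, []))
    return seen

# a <- b <- c <- d   (client already has up to b; asks for d)
parents = {"b": ["a"], "c": ["b"], "d": ["c"]}
to_send = reachable(["d"], parents) - reachable(["b"], parents)
print(sorted(to_send))   # only the new commits: ['c', 'd']
```

A couple of round trips to agree on the "have" set, one set
difference, one pack - which is why latency barely matters.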

But it's not just the server protocol which is orders of magnitude
faster. Git commands overall execute in incredibly short periods of
time. The reason for this speed isn't (just) because "it's written in
C". It's mainly due to the programming style: files are used as
iterators, and iterator functions are combined by way of process
pipelines. As the computations
for these iterator functions are all completely independent, they
naturally distribute the processing, and UNIX with its pipe buffers
was always designed to make mincemeat of this kind of highly parallel
processing task.
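The same pattern is easy to reproduce in any language that exposes
real OS pipes. A small sketch - the two stages here are stand-ins,
not actual git commands - showing a producer and a filter connected
by a pipe so that both run concurrently:

```python
import subprocess
import sys

# Stage one emits numbers; stage two filters them. Each runs in its
# own process, connected by a real OS pipe, the way git chains
# commands such as rev-list into diff-tree.
producer = subprocess.Popen(
    [sys.executable, "-c", "for i in range(10): print(i)"],
    stdout=subprocess.PIPE)
consumer = subprocess.Popen(
    [sys.executable, "-c",
     "import sys\n"
     "for line in sys.stdin:\n"
     "    n = int(line)\n"
     "    if n % 2 == 0: print(n)"],
    stdin=producer.stdout, stdout=subprocess.PIPE)
producer.stdout.close()           # hand the pipe over to the consumer
out = consumer.communicate()[0].decode().split()
producer.wait()
print(out)   # even numbers only: ['0', '2', '4', '6', '8']
```

The kernel's pipe buffering lets both stages make progress at once,
and on a multi-processor machine they genuinely run in parallel.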

There is a lot to be learnt from this style of programming. The
habit has generally been to avoid unnecessary IPC in programs, to
make best use of traditional straight-line CPU performance, where
task switching is a penalty. Combining iterator-style programming
with real operating system filehandles can bring this speed
enhancement to any suitably built program. I expect it is only a
matter of time before someone produces a module for Perl 6 that
auto-threads many iterative programs using this trick. Perhaps one
day it will even be automatic.

But that aside, we have yet to touch on some of the further
ramifications of the content filesystem.

Branching is more natural

Branches are much more natural - instead of telling the repository
ahead of time when you are branching, you simply commit. Your commit
can never be invalid (there is no "transaction out of date" error) -
if you and somebody else commit different changes on top of the same
revision, you have simply branched the development.

Branches are therefore observed, not declared.
This is an important distinction, but is actually nothing new - it is
the paradigm shift that was so loudly touted by those
arch folk who irritatingly kept suggesting that systems like
CVS were fundamentally flawed. Beneath the bile of their arguments,
there was a key point of decentralisation that was entirely missed by
the Subversion design. Most of the new version control systems out
there - bazaar-NG, mercurial, codeville, etc have this property.

Also, the repository itself is normally kept alongside the
checkout, in a single directory at the top called .git (or wherever
you point the magic environment variable GIT_DIR, so you can have
your 'spotless' checkouts if you need them). As the files are stored
in the repository zlib-compressed and/or delta-compressed into a pack
file, with filenames that are essentially SHA1 hashes, the
'grep -r' problem that Subversion and CVS suffered from is gone.
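Concretely, a loose object lives at a path derived from its own hash
- two hex digits of directory, thirty-eight of filename - and is
zlib-compressed on disk, which is why a recursive grep never trips
over plain source text. A sketch of that layout for a blob (paths
shown POSIX-style):

```python
import hashlib
import zlib

def loose_object_path(data: bytes):
    """Where git would store a blob as a loose object:
    .git/objects/<first 2 hex chars>/<remaining 38>, holding the
    zlib-compressed "blob <size>" header plus contents."""
    store = b"blob %d\x00" % len(data) + data
    oid = hashlib.sha1(store).hexdigest()
    path = ".git/objects/%s/%s" % (oid[:2], oid[2:])
    return path, zlib.compress(store)

path, payload = loose_object_path(b"hello\n")
print(path)   # .git/objects/ce/013625030ba8dba906f756967f9e9ca394464a
```

Nothing under .git/objects looks like source code to grep; the
readable text only exists after decompression.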

It also means that to make a branch, you can simply copy the entire
checkout+repository:

$ cp -r myproject myproject.test

Not only that, but you can combine repositories back together just
by copying their objects directories over each other.

Now, that's crude and illustrative only, but these sorts of
characteristics make repository hacks more accessible. Normally you
would just fetch those revisions:

$ git-fetch ../myproject.test test:refs/heads/test

Merges are truly merges

Unlike in Subversion, the repository itself tracks key information
about merges. When you use `svn merge', you are actually
copying changes from one place in the repository to another. Git does
support this, but calls it "pulling" changes from one branch to
another. The difference is that a merge (by default) creates a
special type of commit - a merge commit that has two parents
(a "parent" is just a SHA1 identifier to the previous commit). Thus,
the two branches are truly converged, and if the maintainer of the
other branch then pulls from the merged branch, they're not just
identical - they are the same branch. Merge base
calculations can just look at the two commit structures, and find the
most recent commits that the two branches have in common.
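Finding that common ancestor needs nothing beyond the commit graph
itself. A naive sketch (real git walks the graph more cleverly, and
the commit names here are made up):

```python
def merge_base(a, b, parents):
    """Nearest commit reachable from both a and b in a
    parent-pointer graph (naive version: assumes no
    criss-cross merges)."""
    # Collect everything reachable from a.
    ancestors, stack = set(), [a]
    while stack:
        c = stack.pop()
        if c not in ancestors:
            ancestors.add(c)
            stack.extend(parents.get(c, []))
    # Walk breadth-first from b; the first commit that is also
    # an ancestor of a is the merge base.
    queue, seen = [b], set()
    while queue:
        c = queue.pop(0)
        if c in ancestors:
            return c
        if c not in seen:
            seen.add(c)
            queue.extend(parents.get(c, []))
    return None

# a <- b <- c         (mainline)
#       \ <- x <- y   (topic branch forked at b)
parents = {"b": ["a"], "c": ["b"], "x": ["b"], "y": ["x"]}
print(merge_base("c", "y", parents))   # 'b'
```

No repository-wide metadata is consulted - just the parent pointers
embedded in the commits themselves.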

To compare the model of branching and merging to databases and
transactional models, the Subversion model is like auto-commit,
whereas the model that a distributed SCM such as git provides is
akin to transactions, with the diverged branch's commits being like
SQL savepoints, and merges being like full "commit" points.

"Best of" merging - cherry picking

There is also the concept of piecemeal merging via cherry
picking. One by one, you can pluck out the individual changes that
you want instead of merging in all of the changes from the other
branch. If you later pull the entire branch, the changes which were
cherry picked are easily spotted by matching patch contents (the
cherry-picked copy gets a new commit ID, so it is the patches
themselves that are compared), and do not need to be merged again.
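That duplicate detection works roughly like git's patch-id: hash the
diff with the volatile details (the hunk offsets) stripped out, so
the same change is recognised even under two different commit IDs. A
toy version (the diff fragments are invented for illustration):

```python
import hashlib
import re

def patch_id(diff: str) -> str:
    """Fingerprint a patch by its content, ignoring hunk line
    numbers, roughly as git patch-id does."""
    normalised = re.sub(r"@@ [^@]+ @@", "@@", diff)
    return hashlib.sha1(normalised.encode()).hexdigest()

# The same change applied at two different offsets in the file:
upstream = "@@ -10,3 +10,4 @@\n context\n+added line\n"
picked   = "@@ -42,3 +42,4 @@\n context\n+added line\n"
print(patch_id(upstream) == patch_id(picked))   # True
```

Two commits whose diffs fingerprint identically need only be applied
once, however many times they were cherry picked around.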

The plethora of tools

Another name for git is the Stupid content tracker. This
is a reference to the fact that the git-core tools are really
just a set of small "iterator functions" that allow you to build
'real' SCMs atop it. So, instead of using the git-core -
the "plumbing" - directly, you will probably be using a "porcelain"
such as Cogito, (h)gct, QGit, Darcs-Git, Stacked Git, IsiSetup, etc. Instead of using
git-log to view revision history, you'll crank up Gitk, GitView or the
curses-based tig.

The huge list
of tools which interface with git already are a product of the
massive following that it has received in its very short lifetime.

The matter of scaling

The scalability of git can be grasped by browsing the many Linux
trees visible on http://kernel.org/git/. In fact, if
you were to combine all of the trees on kernel.org into one
git repository, you would find that the project as a whole sees
anywhere between 1,000 and 4,000 commits every month. Junio's OLS git
presentation contains this and more.

In fact, for a laugh, I tried this out. First, I cloned the
mainstream linux-2.6 tree. This took about 35 minutes to
download the 140MB or so of packfile. Then I went through the list of
trees, and used 'git fetch' to copy all extra revisions in
those trees into the same repository. It worked, taking between a
second and 8 minutes for each additional branch - and while I write
this, it has happily downloaded over 200 heads so far - leaving me
with a repository with over 40,000 revisions that packs down to only
200MB. (Update: Chris Wedgwood writes that he has a revision history of the Linux kernel dating all the way back to 2.4.0, with almost 97,000 commits, which is only 380MB)

Frequently, scalability is reached through distribution of
bottlenecks, and if the design of the system itself eliminates
bottlenecks, there is much less scope for overloaded central servers
like Debian's alioth or the OSSF's
svn.openfoundry.org to slow you down. While Subversion and
SVK support "Star" and "Tree" (or hierarchical) developer team
patterns, systems such as git can truly, both in principle
and practice, be said to support meshes of development teams.
And that is always going to be more scalable.

Revising patches, and uncommit

The ability to undo, and thus completely forget, commits is
sometimes scoffed at, as if it were "wrong" - as if version control
systems Should Not support such a bad practice, and therefore
having no way to support it is not a flaw but a feature. "Just
revert", they will say, and demand to know why you would ever want
such a hackish feature as uncommit.

There is a point to their argument - if you publish a revision then
subsequently withdraw that revision from the history without
explicitly reverting it, people who are tracking your repository may
also have to remove those revisions from their branches before
applying your changes.

However, this is not an insurmountable problem when your revision
numbers uniquely and unmistakably identify their history - and when
you are working on a set of patches for later submission, it is
actually what you want. Pride dictates that you only share the
changes once you've made each of them able to withstand the hordes
of Linux Kernel Mailing List reviewers (or wherever you are sending
your changes, even to an upstream Subversion repository via
git-svn).

In fact, the success of Linux kernel development can also be
attributed in part to its approach of only committing to the mainline
kernel patches that have been reviewed and tested in other trees,
that don't break the compile or add temporary bugs, and so on. As
they are refined, the changes themselves are modified before they are
eventually cleared for inclusion in the mainline kernel. This
stringent policy allows them to do things such as bisect
revisions to perform a binary search between two starting points to
locate the exact patch that caused a bug.
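Bisection itself is just binary search over the revision sequence:
with a known-good and a known-bad endpoint, each test halves the
suspect range. A sketch, where the test function is a stand-in for
actually building and testing a checked-out revision:

```python
def bisect(revisions, is_bad):
    """Return the first bad revision, given an ordered list where
    revisions[0] is known good and revisions[-1] is known bad.
    Tests O(log n) revisions instead of all of them."""
    lo, hi = 0, len(revisions) - 1   # invariant: lo good, hi bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(revisions[mid]):
            hi = mid                 # bug is at mid or earlier
        else:
            lo = mid                 # bug is after mid
    return revisions[hi]

revs = list(range(1, 101))              # 100 pretend commits
print(bisect(revs, lambda r: r >= 73))  # the culprit: 73
```

A hundred revisions means roughly seven test builds rather than a
hundred, which is what makes the practice feasible at kernel scale.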

Before git arrived, there were tools such as Quilt that
managed the use case of revising patches, but they were not integrated
with the source control management system. These days, Patchy Git
and Stacked Git layer this
atop git itself, using a technique that amounts to commit
reversal. In fact, the reversed commits still exist - it is just
that nothing refers to them - and they can still be seen with
git-fsck-objects until the next time the maintenance command
git-prune is run.

So, Stacked Git has a command called uncommit that takes a
commit from the head and moves it to your patch stack; refresh,
to update the current patch once it has been suitably revised; a
pair of commands, push and pop, to wind the patch stack; a
pick command to pluck individual patches from another branch;
and a pull command that picks up entire stacks of patches, which
is called "rebasing" the patch series. And of course, stgit being a
porcelain only, you can mix and match it with any other git
porcelain.

Far from being "so 20th century", patches are a clean way to
represent proposed changes to a code base, one that has stood the
test of time - and a practice of reviewing and revising patches
encourages debate of the implementation and makes for a tidier and
more traceable project history.

The polar opposite to reviewing every patch - a single head that
anyone can commit to - is more like a Wiki, and an open-commit
policy Subversion server suits this style of collaboration well
enough.
There is no "better" or "more modern" between these two choices of
development styles - each will suit certain people and projects better
than others.

Of course, those tools that made distributed development a key
tenet of their design make the distributed pattern more natural, and
yet it is just as easy for them to support the Wiki-style development
pattern of Subversion.

In fact, there are no use cases for which I can recommend Subversion
over git any more. In my opinion, those who attack it on the grounds
of "simplicity" (usually on the topic of the long, though
abbreviatable, revision numbers) have not grasped the beauty of the
core model of git.

Footnotes:

Many people, especially those with time, effort and ego invested in
their own VCS, judged the features of git in its very early days.
Without being able to see where it would be today, they each gave
excuses as to why this new VCS offered their users less
functionality. So a lot of FUD exists, a few points of which I
address here:

¹ git does do delta compression to save space (as a
separate step).

² git can track renames of files, though it does not record
this in the meta-data; pragmatically, the observation is that this
is, overall, just as good as tracking them with meta-data, if not
better.

³ git is not forced to hold the entire project history; it
is quite possible to have partial repositories using grafts,
though this feature is still relatively new and initial check-outs
cannot easily be made grafts. Patches welcome ;-).
