Git

6.1. Git in a Nutshell

Git enables the maintenance of a digital body of work (often, but not limited
to, code) by many collaborators using a peer-to-peer network of
repositories. It supports distributed workflows, allowing a
body of work to either eventually converge or temporarily diverge.

This chapter will show how various aspects of Git work under the covers
to enable this, and how it differs from other version control systems (VCSs).

6.2. Git's Origin

To understand Git's design philosophy better it is helpful to understand the
circumstances in which the Git project was started in the Linux Kernel
Community.

The Linux kernel was unusual, compared to most commercial software projects at that
time, because of the large number of committers and the high variance of
contributor involvement and knowledge of the existing
codebase.
The kernel had been maintained via tarballs and patches for years, and
the core development community struggled to find a VCS that
satisfied most of their needs.

Git is an open source project that was born out of those needs and
frustrations in 2005. At that time
the Linux kernel codebase was managed across two VCSs, BitKeeper
and CVS, by different core developers. BitKeeper offered a different
view of VCS history lineage than that offered by the popular open source VCSs at this
time.

Days after BitMover, the maker of BitKeeper, announced it would revoke the licenses
of some core Linux kernel developers, Linus Torvalds began development, in
haste, of what was to become Git. He began by writing a
collection of scripts to help him manage email patches to apply one after
the other. The aim of this initial collection of scripts was to be able to abort
merges quickly so the maintainer could modify the codebase
mid-patch-stream to manually merge, then continue merging subsequent patches.

From the outset, Torvalds had one philosophical goal for Git—to be the anti-CVS—plus
three usability design goals:

Support distributed workflows similar to those enabled by BitKeeper

Offer safeguards against content corruption

Offer high performance

These design goals have been accomplished and maintained, to a degree, as I
will attempt to show by dissecting Git's use of directed acyclic graphs
(DAGs) for content storage, reference pointers for heads, object model
representation, and remote protocol; and finally how Git tracks the merging of trees.

Despite BitKeeper influencing the original design of Git, it is implemented
in fundamentally different ways and allows even more distributed plus
local-only workflows, which were not possible with BitKeeper.
Monotone,
an open source distributed VCS started in 2003, was likely another
inspiration during Git's early development.

Distributed version control systems offer great workflow flexibility, often
at the expense of simplicity. Specific benefits of a distributed
model include:

Providing the ability for collaborators to work offline and
commit incrementally.

Allowing a collaborator to determine when his/her work is
ready to share.

Offering the collaborator access to the repository history when
offline.

Allowing the managed work to be published to multiple repositories,
potentially with different branches or granularity of changes visible.

Around the time the Git project started, three other open source
distributed VCS projects were initiated. (One of them, Mercurial, is
discussed in Volume 1 of The Architecture of Open Source Applications.) All of these dVCS tools
offer slightly different ways to enable highly flexible workflows, which
centralized VCSs before them were not capable of handling directly.
Note: Subversion has an extension named SVK maintained by different developers
to support server-to-server synchronization.

6.3. Version Control System Design

Now is a good time to take a step back and look at the alternative VCS
solutions to Git. Understanding their differences will allow us to explore
the architectural choices faced while developing Git.

A version control system usually has three core functional
requirements, namely:

Storing content

Tracking changes to the content (history including merge metadata)

Distributing the content and history with collaborators

Note: The third requirement above is not a functional requirement for
all VCSs.

Content Storage

The most common design choices for storing content in the VCS world are with
a delta-based changeset, or with directed acyclic graph (DAG) content
representation.

Delta-based changesets encapsulate the differences between two versions of
the flattened content, plus some metadata. Representing content as a
directed acyclic graph involves objects forming a hierarchy
which mirrors the content's filesystem tree as a snapshot of the commit (reusing
the unchanged objects inside the tree where possible). Git stores content as
a directed acyclic graph using different types of objects. The
"Object Database" section later in this chapter describes the different
types of objects that can form DAGs inside the Git repository.

Commit and Merge Histories

On the history and change-tracking front most VCS software uses one of
the following approaches:

Linear history

Directed acyclic graph for history

Again Git uses a DAG, this time to store its history. Each commit contains
metadata about its ancestors; a commit in Git can have zero or many
(theoretically unlimited) parent commits. For example, the first commit
in a Git repository would have zero parents, while the result of a three-way merge
would have three parents.

Another primary difference between Git and Subversion and its linear history
ancestors is its ability to directly support branching that will record
most merge history cases.

Figure 6.1: Example of a DAG representation in Git

Git enables full branching capability using directed acyclic
graphs to store content. The history of a file is linked all the way
up its directory structure (via nodes representing directories) to the root
directory, which is then linked to a commit node. This commit node, in turn,
can have one or more parents. This affords Git two
properties that allow us to reason about history and content in
more definite ways than the family of VCSs derived from RCS do, namely:

When a content (i.e., file or directory) node in the graph has the same
reference identity (the SHA in Git) as that in a different commit, the two
nodes are guaranteed to contain the same content, allowing Git to
short-circuit content diffing efficiently.

When merging two branches we are merging the content of two nodes
in a DAG. The DAG allows Git to "efficiently" (as compared to the
RCS family of VCS) determine common ancestors.

Distribution

VCS solutions have handled content distribution of a working copy to collaborators on a
project in one of three ways:

Local-only: for VCS solutions that do not have the
third functional requirement above.

Central server: where all changes to the repository must transact
via one specific repository for it to be recorded in history at all.

Distributed model: where there will often be publicly accessible
repositories for collaborators to "push" to, but commits can be made
locally and pushed to these public nodes later, allowing offline work.

To demonstrate the benefits and limitations of each major design choice,
we will consider a Subversion repository and a Git repository
(on a server), with equivalent content (i.e., the HEAD of the default
branch in the Git repository has the same content as the Subversion
repository's latest revision on trunk). A developer, named Alex,
has a local checkout of the Subversion repository and a local clone of the
Git repository.

Let us say Alex makes a change to a 1 MB file in the local Subversion
checkout, then commits the change. Locally, the checkout of the file mimics
the latest change and local metadata is updated. During Alex's commit in the centralized
Subversion repository, a diff is generated between the
previous snapshot of the files and the new changes, and this diff is stored
in the repository.

Contrast this with the way Git works.
When Alex makes the same modification to the equivalent file in the local
Git clone, the change will be recorded locally first, then Alex can "push"
the local pending commits to a public repository so the work can be shared
with other collaborators on the project. The content changes are
stored identically for each Git repository that the commit exists in. Upon
the local commit (the simplest case), the local Git repository will create a
new object representing a file for the changed file (with all its content
inside). For each directory above the changed file (plus the repository
root directory), a new tree object is created with a new identifier. A DAG
is created starting from the newly created root tree object pointing to
blobs (reusing existing blob references where the files content has not
changed in this commit) and referencing the newly created blob in place
of that file's previous blob object in the previous tree hierarchy.
(A blob represents
a file stored in the repository.)

At this
point the commit is still local to the current Git clone on Alex's
local device. When Alex "pushes" the commit to a publicly accessible
Git repository this commit gets sent to that repository. After the public
repository verifies that the commit can apply to the branch, the same objects
are stored in the public repository as were originally created in the
local Git repository.

There are a lot more moving parts in the Git scenario, both under the
covers and for the user, requiring them to explicitly express intent to share
changes with the remote repository separately from tracking the change as a
commit locally. However, both levels of added complexity offer the team
greater flexibility in terms of their
workflow and publishing capabilities, as described in the "Git's Origin" section
above.

In the Subversion scenario, the collaborator did not have to remember
to push to the public remote repository when ready for others to
view the changes made. When a small modification to a larger file is sent
to the central Subversion repository the delta stored is much more
efficient than storing the complete file contents for each version.
However, as we will see later, there is a workaround for this that Git
takes advantage of in certain scenarios.

6.4. The Toolkit

Today the Git ecosystem includes many command-line and UI tools on a number
of operating systems (including Windows, which was originally barely
supported). Most of these tools are mostly built on top of the Git core
toolkit.

Due to the way Git was originally written by Linus, and its inception within
the Linux community, it was written with a toolkit design philosophy very much
in the Unix tradition of command line tools.

The Git toolkit is divided into two parts: the plumbing and
the porcelain. The plumbing consists of low-level commands that enable
basic content tracking and
the manipulation of directed acyclic graphs (DAG). The porcelain is
the smaller subset of git commands that most
Git end users are likely to need to use for maintaining repositories and
communicating between repositories for collaboration.

While the toolkit design has provided enough commands to offer fine-grained
access to functionality for many scripters, application developers
complained about the lack of a linkable library for Git. Since the Git binary
calls die(), it is not reentrant and GUIs, web interfaces or longer
running services would have to fork/exec a call to the Git binary, which can
be slow.

Work is being done to improve the situation for application developers; see
the "Current And Future Work" section for more information.

6.5. The Repository, Index and Working Areas

Let's get our hands dirty and dive into using Git locally, if only to
understand a few fundamental concepts.

First to create a new initialized Git repository on our local filesystem
(using a Unix inspired operating system) we can do:

$ mkdir testgit
$ cd testgit
$ git init

Now we have an empty, but initialized, Git repository sitting in our testgit
directory. We can branch, commit, tag and even communicate with other local
and remote Git repositories. Even communication with other types of VCS
repositories is possible with just a handful of git commands.

The .git directory above is, by default, a subdirectory of the root working
directory, testgit. It contains a few different types of files and
directories:

Configuration: the .git/config, .git/description and
.git/info/exclude files essentially help configure the local repository.

Hooks: the .git/hooks directory contains scripts that can
be run on certain lifecycle events of the repository.

Staging Area: the .git/index file (which is not yet
present in our tree listing above) will provide a staging area for our
working directory.

Object Database: the .git/objects directory is the default
Git object database, which contains all content or pointers to local
content. All objects are immutable once created.

References: the .git/refs directory is the default location
for storing reference pointers for both local and remote branches, tags and
heads. A reference is a pointer to an object, usually of type tag or
commit. References are managed outside of the Object Database to
allow the references to change where they point to as the repository
evolves. Special cases of references may point to other references, e.g.
HEAD.

The .git directory is the actual repository. The directory that contains the
working set of files is the working directory, which is typically the
parent of the .git directory (or repository). If you were creating a
Git remote repository that would not have a working directory, you could
initialize it using the git init --bare command. This would create
just the pared-down repository files at the root, instead of creating the repository
as a subdirectory under the working tree.

Another file of great importance is the Git index: .git/index. It provides the
staging area between the local working directory and the local repository.
The index is used to stage specific changes within one file (or more), to
be committed all together. Even if you make changes related to various types
of features, the commits can be made with like changes together, to more
logically describe them in the commit message. To selectively stage
specific changes in a file or set of files you can using git add -p.

The Git index, by default, is stored as a single file inside the
repository directory. The paths to these three areas can be customized
using environment variables.

It is helpful to understand the interactions that take place between these
three areas (the repository, index and working areas) during the execution
of a few core Git commands:

git checkout [branch]

This will move the HEAD reference of the local repository to branch
reference path (e.g. refs/heads/master), populate the index with
this head data and refresh the working directory to represent the tree
at that head.

git add [files]

This will cross reference the checksums of the files
specified with the corresponding entries in the Git index to see if the
index for staged files needs updating with the working directory's
version. Nothing changes in the Git directory (or repository).

Let us explore what this means more concretely by inspecting the contents of
files under the .git directory (or repository).

What we are starting to see here is the content representation inside Git's
object database.

6.6. The Object Database

Figure 6.2: Git objects

Git has four basic primitive objects that every type of content in the
local repository is built around. Each object type has the following
attributes: type, size and content. The primitive object
types are:

Tree: an element in a tree can be another tree or a blob, when
representing a content directory.

Blob: a blob represents a file stored in the repository.

Commit: a commit points to a tree representing the top-level
directory for that commit as well as parent commits and standard
attributes.

Tag: a tag has a name and points to a commit at the point in
the repository history that the tag represents.

All object primitives are referenced by a SHA, a 40-digit object identity,
which has the following properties:

If two objects are identical they will have the same SHA.

if two objects are different they will have different SHAs.

If an object was only copied partially or another form of data
corruption occurred, recalculating the SHA of the current object
will identify such corruption.

The first two properties of the SHA, relating to identity of the objects, is
most useful in enabling Git's distributed model (the second goal of Git).
The latter property enables some safeguards against corruption (the third
goal of Git).

Despite the desirable results of using DAG-based storage for content
storage and merge histories, for many repositories delta storage will be
more space-efficient than using loose DAG objects.

6.7. Storage and Compression Techniques

Git tackles the storage space problem by packing objects in a compressed
format, using an index file which points to offsets to locate specific objects in the
corresponding packed file.

Figure 6.3: Diagram of a pack file with corresponding index file

We can count the number of loose (or unpacked) objects in the local
Git repository using git count-objects. Now we can have Git pack
loose objects in the object database, remove loose objects already
packed, and find redundant pack files with Git plumbing commands if desired.

The pack file format in Git has evolved, with the initial format storing
CRC checksums for the pack file and index file in the index file
itself. However, this meant there was the possibility of undetectable
corruption in the compressed data since the repacking phase did not involve any
further checks. Version 2 of the pack file format overcomes this problem by
including the CRC checksums of each compressed object in the pack index
file. Version 2 also allows packfiles larger than 4 GB, which the initial
format did not support. As a way to quickly detect pack file corruption the
end of the pack file contains a 20-byte SHA1 sum of the ordered list of all
the SHAs in that file.
The emphasis of the newer pack file format is on helping fulfill Git's second
usability design goal of safeguarding against data corruption.

For remote communication Git calculates the commits and content that need
to be sent over the wire to synchronize repositories (or just a branch), and
generates the pack file format on the fly to send back using the desired
protocol of the client.

6.8. Merge Histories

As mentioned previously, Git differs fundamentally in merge history approach
than the RCS family of VCSs. Subversion, for example, represents
file or tree history in a linear progression; whatever has a higher revision
number will supercede anything before it. Branching is not supported directly,
only through an unenforced directory structure within the repository.

Figure 6.4: Diagram showing merge history lineage

Let us first use an example to show how this can be problematic when
maintaining multiple branches of a work. Then we will look at a scenario to
show its limitations.

When working on a "branch" in Subversion at the typical root
branches/branch-name, we are working on directory subtree adjacent to
the trunk (typically where the live or master equivalent code
resides within). Let us say this branch is to represent parallel development
of the trunk tree.

For example,
we might be rewriting a codebase to use a different database. Part of the
way through our rewrite we wish to merge in upstream changes from another
branch subtree (not trunk). We merge in these changes, manually if necessary,
and proceed with our rewrite. Later that day we finish our database vendor
migration code changes on our branches/branch-name branch and merge
our changes into trunk. The problem with the way linear-history VCSs
like Subversion handle this is that there is no way to know that the
changesets from the other branch are now contained within the trunk.

DAG-based merge history VCSs, like Git, handle this case reasonably well.
Assuming the other branch does not contain commits that have not been merged
into our database vendor migration branch (say, db-migration in our
Git repository), we can determine—from the commit object parent
relationships—that a commit on the db-migration branch contained
the tip (or HEAD) of the other upstream branch. Note that a commit
object can have zero or more (bounded by only the abilities of the merger)
parents. Therefore the merge commit on the db-migration branch
knows it merged in the current HEAD of the current branch and the
HEAD of the other upstream branch through the SHA hashes of the parents.
The same is true of the merge commit in the master (the trunk
equivalent in Git).

A question that is hard to answer definitively using DAG-based (and linear-based)
merge histories is which commits are contained within each branch. For
example, in the above scenario we assumed we merged into each branch all
the changes from both branches. This may not be the case.

For simpler cases Git has the ability to cherry pick commits from
other branches in to the current branch, assuming the commit can cleanly
be applied to the branch.

6.9. What's Next?

As mentioned previously, Git core as we know it today is based on a toolkit
design philosophy from the Unix world, which is very handy for scripting
but less useful for embedding inside or linking with longer running
applications or services. While there is Git support in many popular
Integrated Development Environments today, adding this support and
maintaining it has been more challenging than integrating support for VCSs
that provide an easy-to-link-and-share library for multiple platforms.

To combat this, Shawn Pearce (of Google's Open Source Programs Office)
spearheaded an effort to create a linkable Git library with more permissive
licensing that did not inhibit use of the library. This was called
libgit2.
It did not find much traction until a student named Vincent Marti chose it
for his Google Summer of Code project last year. Since then Vincent and
Github engineers have continued contributing to the libgit2 project, and
created bindings for numerous other popular languages such as Ruby, Python,
PHP, .NET languages, Lua, and Objective-C.

Shawn Pearce also started a BSD-licensed pure Java library called JGit that
supports many common operations on Git repositories. It is now maintained by
the Eclipse Foundation for use in the Eclipse IDE Git integration.

Other interesting and experimental open source endeavours outside of the
Git core project are a number of implementations using alternative datastores
as backends for the Git object database such as:

libgit2-backends,
which emerged from the libgit2 effort to create Git object database
backends for multiple popular datastores such as Memcached, Redis,
SQLite, and MySQL.

All of these open source projects are maintained independently of the Git
core project.

As you can see, today there are a large number of ways to use the Git format.
The face of Git is no longer just the toolkit command line interface of
the Git Core project; rather it is the repository format and protocol to
share between repositories.

As of this writing, most of these projects, according to their developers, have
not reached a stable release, so work in the area still needs to be done
but the future of Git appears bright.

6.10. Lessons Learned

In software, every design decision is ultimately a trade-off. As a power
user of Git for version control and as someone who has developed software
around the Git object database model, I have a deep fondness for Git in its
present form. Therefore, these lessons learned are more of a reflection
of common recurring complaints about Git that are due to design decisions and
focus of the Git core developers.

One of the most common complaints by developers and managers who evaluate
Git has been the lack of IDE
integration on par with other VCS tools. The toolkit design of Git has made
this more challenging than integrating other modern VCS tools into IDEs
and related tools.

Earlier in Git's history some of the commands were implemented as shell
scripts. These shell script command implementations made Git less portable,
especially to Windows. I am sure the Git core developers did not lose sleep
over this fact, but it has negatively impacted adoption of Git in larger organizations
due to portability issues that were prevalent in the early days
of Git's development. Today a project named Git for Windows has been
started by volunteers to ensure new versions of Git are ported to Windows in
a timely manner.

An indirect consequence of designing Git around a toolkit design
with a lot of plumbing commands is that new users get lost quickly; from
confusion about all the available subcommands to not being able to
understand error messages because a low level plumbing task failed, there
are many places for new users to go astray. This
has made adopting Git harder for some developer teams.

Even with these complaints about Git, I am excited about the possibilities
of future development on the Git Core project, plus all the related open
source projects that have been launched from it.