Git - Revision Control Perfected

In 2005, after just two weeks, Linus Torvalds completed the first version
of Git, an open-source version control system. Unlike typical centralized
systems, Git is based on a distributed model. It is extremely flexible and
guarantees data integrity while being powerful, fast and efficient. With
widespread and growing adoption, and the increasing popularity
of services like GitHub, many consider Git to be the best version control
tool ever created.

Surprisingly, Linus had little interest in writing a version control tool
before this endeavor. He created Git out of necessity and frustration. The
Linux Kernel Project needed an open-source tool to manage
its massively distributed development effectively, and no existing tools were up to
the task.

Many aspects of Git's design are radical departures from the approach
of tools like CVS and Subversion, and they even differ significantly from
more modern tools like Mercurial. This is one of the reasons Git
is intimidating to many prospective users. But, if you throw away your
assumptions of how version control should work, you'll find that Git is
actually simpler than most systems, but capable of more.

In this article, I cover some of the fundamentals of how Git works and
stores data before moving on to discuss basic usage and workflow. I
found that knowing what is going on behind the scenes makes it much
easier to understand Git's many features and capabilities. Certain parts
of Git that I previously had found complicated became easy and
straightforward after I spent a little time learning how it worked.

I find Git's design to be fascinating in and of itself. I peered behind
the curtain, expecting to find a massively complex machine, and instead saw
only a little hamster running in a wheel. Then I realized
a complicated design not only wasn't needed, but also wouldn't add
any value.

Git Object Repository

Git, at its core, is a simple indexed name/value database. It stores
pieces of data (values) in "objects" with unique names. But, it does this
somewhat differently from most systems. Git operates on the principle of
"content-addressed storage", which means the names are derived from the
values. An object's name is simply the SHA1 checksum of its content,
a 40-character hexadecimal string like this:

1da177e4c3f41524e886b7f1b8a0c1fc7321cac2

SHA1 is a cryptographically strong hash function, which makes it
astronomically unlikely that two different pieces of data will ever
produce the same checksum (collisions must exist in principle, since
there are only finitely many checksums, but the chance of ever
encountering one in practice is negligible). The same chunk of data
always will have the same SHA1 checksum, which always will identify only
that chunk of data. Because object names are SHA1 checksums, they identify
the object's content while being, for all practical purposes, globally
unique: not just to one repository, but to all repositories everywhere,
forever.
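To see how such a name is derived, Git hashes a short header (the object
type and content length, separated by a space and terminated by a null
byte) followed by the raw content. A minimal sketch in Python (the helper
name is mine, not Git's):

```python
import hashlib

def git_object_name(obj_type, content):
    # Git hashes "<type> <size>\0" followed by the raw content.
    header = f"{obj_type} {len(content)}".encode() + b"\0"
    return hashlib.sha1(header + content).hexdigest()

# The same bytes always produce the same 40-character name:
print(git_object_name("blob", b"hello\n"))
# prints ce013625030ba8dba906f756967f9e9ca394464a,
# the same name `git hash-object` reports for a file containing "hello\n"
```

Because the name depends only on the bytes being stored, any two
repositories that contain this file will store it under exactly the same
object name.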

To put this into perspective, the example SHA1 listed above happens to be
the ID of the first commit of the Linux kernel into a Git repository by
Linus Torvalds in 2005 (2.6.12-rc2). This is a lot more useful than some
arbitrary revision number with no real meaning. Nothing except that commit
ever will have the same ID, and you can use those 40 characters to verify
the data in every file throughout that version of Linux. Pretty cool, huh?

Git stores all the data for a repository in four types of objects: blobs,
trees, commits and tags. They are all just objects with an SHA1 name and
some content. The only difference between them is the type of information
they contain.

Blobs and Trees

A blob stores the raw data content of a file. This is the simplest of
the four object types.

A tree stores the contents of a directory. This is a flat list of
file/directory names, each with a corresponding SHA1 representing
its content. These SHA1s are the names of other objects in the
repository. This referencing technique is used throughout Git to link all
kinds of information together. For file entries, the referenced object
is a blob. For directory entries, the referenced object is a tree that
can contain more directory entries, in turn referencing more trees to
define a complete and potentially unlimited hierarchy.
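To make the referencing concrete, here is a Python sketch of how a
one-file tree could be serialized and named. Each entry is a mode and
name followed by the referenced object's SHA1 as 20 raw bytes, which
mirrors Git's actual tree format, though this sketch skips details such
as the sorting rules for subdirectory entries:

```python
import hashlib

def object_name(obj_type, body):
    # Git hashes "<type> <size>\0" followed by the body.
    header = f"{obj_type} {len(body)}".encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()

# A blob is just the file's raw content:
blob_sha = object_name("blob", b"hello\n")

# A tree entry is "<mode> <name>\0" plus the referenced SHA1 in binary:
entry = b"100644 README\0" + bytes.fromhex(blob_sha)
tree_sha = object_name("tree", entry)

print(blob_sha)  # name of the blob the tree references
print(tree_sha)  # name of the tree itself
```

Note that the tree's own SHA1 changes whenever the blob's SHA1 changes,
because the reference is part of the tree's content.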

It's important to recognize that blobs and trees are not themselves
files and directories; they are just the contents of files and
directories. They don't know about anything outside their own content,
including the existence of any references in other objects that point
to them. References are one-way only.

Figure 1. An example directory structure and how it might be
stored in Git as tree and blob objects (I truncated the SHA1 names to
six characters for readability).

In the example shown in Figure 1, I'm assuming that the files MyApp.pm and MyApp1.pm
have the same contents, and so by definition, they must reference
the same blob object. This behavior is implicit in Git because of its
content-addressable design and works equally well for directories with
the same content.

As you can see, directory structures are defined by chains of references
stored in trees. A tree is able to represent all of the data in the files
and directories under it even though it contains only one level of names
and references. Because SHA1s of the referenced objects are within its
content, a tree's SHA1 exactly identifies and verifies the data throughout
the structure; a checksum resulting from a series of checksums verifies
all the underlying data regardless of the number of levels.

Consider storing a change to the file README illustrated in Figure 1.
When committed, this would create a new blob (with a new SHA1), which
would require a new tree to represent "foo" (with a new SHA1), which in
turn would require a new tree for the top directory (with a new SHA1).
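This structural sharing can be demonstrated with a toy content-addressed
store, which is a deliberate simplification of Git's real object format
(the helper names and simplified tree encoding are mine):

```python
import hashlib

def put(store, data):
    # Name each object by the SHA1 of its content (content-addressed).
    name = hashlib.sha1(data).hexdigest()
    store[name] = data
    return name

def put_tree(store, tree):
    # A "tree" here is a dict mapping names to bytes (files) or dicts (subdirs).
    entries = []
    for name in sorted(tree):
        node = tree[name]
        ref = put_tree(store, node) if isinstance(node, dict) else put(store, node)
        entries.append(f"{name} {ref}")
    return put(store, "\n".join(entries).encode())

store = {}
v1 = {"README": b"top\n", "foo": {"MyApp.pm": b"code\n", "README": b"one\n"}}
root1 = put_tree(store, v1)
before = len(store)          # 5 objects: 3 blobs + 2 trees

# Change only foo/README:
v2 = {"README": b"top\n", "foo": {"MyApp.pm": b"code\n", "README": b"two\n"}}
root2 = put_tree(store, v2)

# Only the changed blob, its parent tree and the root tree are new:
print(len(store) - before)   # prints 3
```

Everything unchanged between the two versions hashes to the same name
and is stored exactly once.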

While creating three new objects to store one change might seem inefficient,
keep in mind that aside from the critical path of tree objects from
changed file to root, every other object in the hierarchy remains
identical. If you have a gigantic hierarchy of 10,000 files and you
change the text of one file ten directories deep, 11 new objects allow
you to describe both the old and the new state of the tree.

Note:

One potential problem of the content-addressed design is that two large
files with minor differences must be stored as different objects. However,
Git optimizes these cases by using deltas to eliminate duplicate data
between objects wherever possible. The size-reduced data is stored
in a highly efficient manner in "pack files", which also are further
compressed. This operates transparently underneath the object repository
layer.
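Before objects are packed, Git stores each one individually as a "loose
object": the header-plus-content byte string is zlib-compressed and
written under .git/objects/ at a path derived from the object's SHA1. A
sketch of that layout:

```python
import hashlib, zlib

content = b"hello\n"
store_bytes = b"blob %d\0" % len(content) + content  # header + raw content
sha = hashlib.sha1(store_bytes).hexdigest()

# Loose objects live at .git/objects/<first 2 hex chars>/<remaining 38>:
path = f".git/objects/{sha[:2]}/{sha[2:]}"
compressed = zlib.compress(store_bytes)

print(path)
# Decompressing recovers the original header and content exactly:
assert zlib.decompress(compressed) == store_bytes
```

Pack files go further than this per-object compression by also storing
deltas between similar objects.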

