How git works

During that manly ritual of viewing a friend’s new car, there always comes a moment when he pops open the bonnet to show me the engine powering his new steed. I politely comment on its elegance and power, perhaps throwing in an admiring whistle if I feel his new fan belt and spark plugs really deserve that extra modicum of praise, but I am inevitably disappointed when he quickly closes the bonnet again and moves on to show me its hubcaps, or how capacious its boot is. It is disappointing because I don’t want to move on from the engine. I want to find out how it works. Can we not disassemble it? Compare the exhaust manifold to that in his previous car? See what improvements they have made? Maybe we could improve it even further? Sadly not.

Given this natural hacker reflex to probe and tinker, it is odd how little I knew about Git until recently. Perhaps my daily life is so reliant on Git doing its job that I don’t want to poke around lest I find a flaw. Until now I have been content with kicking the tyres and going for a spin around the block, rather than dismantling that mysterious .git directory. Nevertheless, I recently became curious and dug in. I was thrilled to discover that Git is even more beautiful internally than it is functional externally.

At heart a Git repository is a key-value object store where all objects are indexed by their SHA-1 hash value. All commits, files, tags and filesystem tree nodes are different types of objects living in this repository.

When an object is added to the repository it is hashed, and from then on it is referred to by its SHA-1 hash value. Effectively a Git repository is a large hash table with no provision made for hash collisions. Luckily, with SHA-1 the probability of an accidental hash collision is so vanishingly small that it is nothing to be concerned about in everyday use.
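To make the hashing concrete, here is a minimal sketch of how an object id can be computed by hand. It assumes the usual loose-object layout, in which Git hashes a short header (the object type, a space, the content length and a NUL byte) followed by the raw content. The function name `git_object_id` is made up for this illustration.

```python
import hashlib


def git_object_id(obj_type: str, content: bytes) -> str:
    """Return the SHA-1 id for an object of the given type and content."""
    # The header is "<type> <size in bytes>" followed by a single NUL byte.
    header = f"{obj_type} {len(content)}".encode("ascii") + b"\x00"
    # The id is the SHA-1 of header and content together, not of the content alone.
    return hashlib.sha1(header + content).hexdigest()


if __name__ == "__main__":
    # An empty file always maps to the same well-known blob id.
    print(git_object_id("blob", b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

If the assumption holds, the result for any file's bytes should agree with what `git hash-object` prints for that file.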

To see an example of some simple objects, initialise a super-simple git repository with the following commands.

The first line corresponds to an object with SHA-1 hash 02b365d4af3ef6f74b0b1f18c41507c82b3ee571. When stored, the first two hex digits determine the directory under .git/objects, and the remaining 38 digits determine the filename.
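The mapping from hash to file location is simple enough to sketch in a few lines. The helper name `loose_object_path` and its `git_dir` parameter are hypothetical, chosen only for this example.

```python
from pathlib import PurePosixPath


def loose_object_path(object_id: str, git_dir: str = ".git") -> PurePosixPath:
    """Return where a loose object with this 40-digit hex id would be stored."""
    if len(object_id) != 40:
        raise ValueError("expected a full 40-character SHA-1 hex digest")
    # First two hex digits name the directory, the remaining 38 name the file.
    return PurePosixPath(git_dir) / "objects" / object_id[:2] / object_id[2:]


if __name__ == "__main__":
    print(loose_object_path("02b365d4af3ef6f74b0b1f18c41507c82b3ee571"))
```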

To read the contents of these files you must decompress them. You can do this with a Python one-liner. For example, to read the 02b365d... object I type
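For readers who prefer a script to a one-liner, the following is a rough sketch of the same idea, assuming loose objects are zlib-compressed and begin with a `<type> <size>` header terminated by a NUL byte. `read_loose_object` and `write_demo_object` are names invented here; the second exists only so the example can run without a real repository.

```python
import os
import tempfile
import zlib


def read_loose_object(path: str):
    """Decompress a loose object file and split it into (type, content)."""
    with open(path, "rb") as f:
        raw = zlib.decompress(f.read())
    # Everything before the first NUL byte is the header, the rest is content.
    header, _, body = raw.partition(b"\x00")
    obj_type, size = header.decode("ascii").split(" ")
    if int(size) != len(body):
        raise ValueError("size in header does not match content length")
    return obj_type, body


def write_demo_object(directory: str, obj_type: str, body: bytes) -> str:
    """Hypothetical helper: write a stand-in object file for the demo."""
    raw = f"{obj_type} {len(body)}".encode("ascii") + b"\x00" + body
    path = os.path.join(directory, "demo-object")
    with open(path, "wb") as f:
        f.write(zlib.compress(raw))
    return path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        demo_path = write_demo_object(tmp, "blob", b"# Readme\n")
        print(read_loose_object(demo_path))
```

Pointing `read_loose_object` at a file under .git/objects in a real repository should work the same way, as long as the object is still loose rather than packed.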

This is a tree object. Git stores the file system structure in these tree objects. The first column shows the Unix permissions, the second column is either blob or tree depending on whether it is a pointer to a file or another directory, the third is the hash of the object pointed to, and the fourth is the filename. In this case there is only one file tracked by git, Readme.md, and you can see that this tree node reflects that by listing one file, and pointing to the blob holding its contents.
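The four-column listing is a pretty-printed view. The sketch below assumes the raw tree body is a sequence of binary entries, each made of the mode in ASCII octal, a space, the filename, a NUL byte and the 20-byte hash; the blob-or-tree column is then inferred from the mode. The function name `parse_tree` is illustrative, and the hash in the demo is a placeholder (the well-known id of an empty blob), not the hash of the real Readme.md.

```python
def parse_tree(body: bytes):
    """Parse a raw tree body into (mode, kind, hex hash, filename) tuples."""
    entries = []
    pos = 0
    while pos < len(body):
        space = body.index(b" ", pos)
        nul = body.index(b"\x00", space)
        mode = body[pos:space].decode("ascii")
        name = body[space + 1:nul].decode("utf-8")
        # The hash is stored as 20 raw bytes, not as 40 hex characters.
        object_id = body[nul + 1:nul + 21].hex()
        if mode == "40000":
            kind = "tree"      # a subdirectory
        elif mode == "160000":
            kind = "commit"    # a submodule reference
        else:
            kind = "blob"      # a regular file, executable or symlink
        # Directory modes are stored without a leading zero; pad for display.
        entries.append((mode.zfill(6), kind, object_id, name))
        pos = nul + 21
    return entries


if __name__ == "__main__":
    placeholder_id = bytes.fromhex("e69de29bb2d1d6434b8b29ae775ad8c2e48c5391")
    demo_body = b"100644 Readme.md\x00" + placeholder_id
    for entry in parse_tree(demo_body):
        print(*entry)
```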

Here you can see that not only are there five files, but there is also another tree node pointed to. This is a subdirectory. As with file snapshots, tree nodes are created, hashed, stored in the object database, then referred to from then on by their hash value.

Here you can see the format of the commit, with a header containing author, committer details and timestamp, followed by the commit message itself. If you type git log you will recognise that the commit number is just the hash of this commit object.

The first line of the commit is a pointer to the tree object that stores the snapshot of the files at this revision number. In this case, this is the tree object we just discussed.

Now you can reconstruct the entire repository at that commit: read the root tree object from the commit object, traverse that tree recursively if necessary to recreate all files and their permissions, and finally fill the files with the contents stored in the blobs that the tree objects point to.
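The recursive walk can be sketched against a toy in-memory object store. Everything here is a simplification for illustration: the store is a plain dictionary, the keys such as `"tree-root"` stand in for real SHA-1 hashes, trees are lists of `(mode, name, object_id)` tuples, blobs are byte strings, and symlinks and submodules are ignored.

```python
import os
import tempfile


def checkout_tree(store: dict, tree_id: str, dest: str) -> None:
    """Recreate the files described by a tree object under the dest directory."""
    os.makedirs(dest, exist_ok=True)
    for mode, name, object_id in store[tree_id]:
        path = os.path.join(dest, name)
        if mode == "40000":
            # A subdirectory: recurse into the tree it points to.
            checkout_tree(store, object_id, path)
        else:
            # A file: fill it with the contents of the blob it points to.
            with open(path, "wb") as f:
                f.write(store[object_id])
            os.chmod(path, 0o755 if mode == "100755" else 0o644)


# Hypothetical store; the keys are placeholders for SHA-1 hashes.
DEMO_STORE = {
    "tree-root": [("100644", "Readme.md", "blob-readme"),
                  ("40000", "src", "tree-src")],
    "tree-src": [("100755", "run.sh", "blob-run")],
    "blob-readme": b"# Demo\n",
    "blob-run": b"#!/bin/sh\necho hello\n",
}

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        checkout_tree(DEMO_STORE, "tree-root", tmp)
        for root, _dirs, files in os.walk(tmp):
            for filename in files:
                print(os.path.relpath(os.path.join(root, filename), tmp))
```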

An uncommon aspect of this commit is the lack of a parent commit; this is because it is the first commit in the repository. All other commits will have one or more parents specified in the header, where multiple parents imply a merge commit.
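As a rough sketch of how those header fields can be pulled apart, the parser below assumes a commit body is plain text: header lines of the form `key value`, a blank line, then the message. The sample commit is invented for the example; only the tree hash is taken from the text above, and the author, timestamp and message are hypothetical.

```python
def parse_commit(body: bytes) -> dict:
    """Split a decompressed commit body into its header fields and message."""
    text = body.decode("utf-8")
    # The first blank line separates the header from the commit message.
    header, _, message = text.partition("\n\n")
    commit = {"tree": None, "parents": [], "author": None,
              "committer": None, "message": message}
    for line in header.split("\n"):
        key, _, value = line.partition(" ")
        if key == "parent":
            # A root commit has no parent lines; a merge has two or more.
            commit["parents"].append(value)
        elif key in ("tree", "author", "committer"):
            commit[key] = value
    return commit


# Hypothetical commit body; names, email and timestamp are made up.
SAMPLE_COMMIT = (
    b"tree 02b365d4af3ef6f74b0b1f18c41507c82b3ee571\n"
    b"author A. Hacker <hacker@example.com> 1300000000 +0000\n"
    b"committer A. Hacker <hacker@example.com> 1300000000 +0000\n"
    b"\n"
    b"Initial commit\n"
)

if __name__ == "__main__":
    parsed = parse_commit(SAMPLE_COMMIT)
    print(parsed["tree"], parsed["parents"], parsed["message"].strip())
```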

If you change that Readme.md file and commit again, you will see three new objects in the database: a new blob containing a second snapshot of Readme.md, a new tree object updated for that snapshot, and a second commit object. You may wonder why it is a snapshot, not the diff you are familiar with seeing.

Don’t let Git’s interface fool you: all those diffs are calculated on the fly. When you commit, git stores snapshots; it does not store diffs from the previous commit.

Much of the compression in Git comes from the fact that if a file or tree node has not changed since the previous commit, it will have the same hash as before and will not take up space twice in the database. In fact, if you have multiple copies of the same file, the tree nodes may show different filenames and permissions, but they will all point to the same blob object. Add to all this the compression of the objects themselves and you can see that the repository is already remarkably compact. Nevertheless, Git has one further trick up its sleeve: packfiles.
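The deduplication falls out of content addressing almost for free, as this small sketch of an in-memory store shows. The class `ObjectStore` and its method `put_blob` are illustrative names, not part of Git; the hashing assumes the usual `blob <size>` header followed by a NUL byte.

```python
import hashlib


class ObjectStore:
    """A toy content-addressed store: identical content is kept only once."""

    def __init__(self):
        self.objects = {}

    def put_blob(self, content: bytes) -> str:
        header = f"blob {len(content)}".encode("ascii") + b"\x00"
        object_id = hashlib.sha1(header + content).hexdigest()
        # If the id is already present, nothing new is stored.
        self.objects.setdefault(object_id, content)
        return object_id


if __name__ == "__main__":
    store = ObjectStore()
    first = store.put_blob(b"same licence text\n")   # e.g. LICENSE
    second = store.put_blob(b"same licence text\n")  # e.g. docs/LICENSE
    print(first == second, len(store.objects))
```

Two files with identical contents yield the same id, so the store holds a single object no matter how many filenames refer to it.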

As a repository grows, the object count climbs from the hundreds to the thousands, and it clearly becomes inefficient to store each object in its own flat file. Instead, git can store these objects in a single, indexed pack file.

Run git repack -a -d to pack all objects so far into a pack file and remove the now unnecessary loose files. Running find .git/objects -type f again will yield something similar to

All the loose objects have been packed together in the .pack file, which is indexed via the .idx file. The repository contains the same objects; they are just packed in a single file to speed up access and reduce the repository’s disk space usage. You can see this with git verify-pack

You can see exactly the same objects are stored in the pack as were stored in the flat files, and the results of running git cat-file on the objects are unchanged.

An additional benefit of pack files is that they allow git to compress your repository even further. My statement earlier that git stores snapshots, not deltas, is not entirely true. The objects themselves are snapshots, but when they are stored in a pack file, git will compare each object to other similar objects; then, rather than store both objects in full, git will store one object in full and the other as a delta from that object. Thus a large file with a number of small changes will be stored internally as a single snapshot and a number of deltas from that snapshot (known as a delta chain). If you run git verify-pack on a less trivial repository you will see the details of these delta chains as well.
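To give a feel for what storing one object as a delta from another means, here is a toy illustration. It is not Git's delta algorithm or its binary encoding: it uses Python's `difflib` to find matching regions, and represents the delta as a list of "copy this range from the base" and "insert these new bytes" instructions. The function names `make_delta` and `apply_delta` are made up for the sketch.

```python
import difflib


def make_delta(base: bytes, target: bytes) -> list:
    """Describe target as copy-from-base and insert-new-bytes instructions."""
    matcher = difflib.SequenceMatcher(None, base, target, autojunk=False)
    ops = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # This region already exists in the base: record offset and length.
            ops.append(("copy", i1, i2 - i1))
        elif tag in ("replace", "insert"):
            # These bytes are new in the target: store them literally.
            ops.append(("insert", target[j1:j2]))
        # "delete" regions need no instruction; they are simply not copied.
    return ops


def apply_delta(base: bytes, ops: list) -> bytes:
    """Rebuild the target from the base object and a list of instructions."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, offset, length = op
            out += base[offset:offset + length]
        else:
            out += op[1]
    return bytes(out)


if __name__ == "__main__":
    base = b"line one\nline two\nline three\n" * 20
    target = base.replace(b"two", b"2", 1)
    delta = make_delta(base, target)
    literal_bytes = sum(len(op[1]) for op in delta if op[0] == "insert")
    print(apply_delta(base, delta) == target, len(target), literal_bytes)
```

The point of the sketch is only the shape of the idea: the second version needs a handful of literal bytes plus references into the first, rather than a full second copy.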

I hope you have enjoyed kicking the tyres of Git with me. There are many complexities beneath the surface, but I have been stunned to discover how simple Git really is, something that I am certain has contributed to its robustness, speed and success.

EDIT

Amended the post to clarify a few points as a result of feedback and the Reddit discussion. Additionally, please note that git does not support all permission modes; the only supported modes are: