Macs, Modularity and More

Git Tip of the Week: Objects and Packfiles

This week’s Git Tip of the Week is about objects and packs. You can subscribe to the feed if you want to receive new instalments automatically.

So far, we’ve talked about commits, trees and blobs. We’ve seen how they map to the logical object model and how they are represented on disk in the .git/objects directory.

But storing every version of every file in separate files (albeit compressed) is going to be a huge waste of space, right? Yes, there’s some sharing of identical content between commits, but Git would hardly be the efficient store that it’s known for with a storage structure like that.

Pack files

Fortunately, Git has the ability to merge together multiple objects into single files, known as pack files. These are, in essence, multiple objects stored with an efficient delta compression scheme as a single compressed file. You can think of it as akin to a Zip file of multiple objects, which Git can extract efficiently when needed.

Pack files are stored in the .git/objects/pack/ directory. For new projects, this is likely to be empty, because Git starts off adding all files as non-packed objects, or loose objects. One reason it does this is that as you’re working through changes, you’re quite likely to re-write various files (blobs) and directories (trees) before you commit. In fact, each time you do a git add to stage a file with new content, you’re creating a new object in the loose objects structure.
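To make the loose object layout concrete, here is a minimal Python sketch of what storing a blob amounts to. It assumes a SHA-1 repository, and the function name write_loose_object and the scratch directory are purely illustrative; this is not Git’s own code, just the same on-disk convention.

```python
import hashlib
import os
import tempfile
import zlib


def write_loose_object(objects_dir, content, kind="blob"):
    """Store content in the loose object layout and return its hex ID."""
    # An object is hashed as "<type> <size>\0" followed by the raw content.
    payload = f"{kind} {len(content)}".encode() + b"\x00" + content
    object_id = hashlib.sha1(payload).hexdigest()
    # The first two hex digits name the directory, the other 38 the file.
    directory = os.path.join(objects_dir, object_id[:2])
    os.makedirs(directory, exist_ok=True)
    with open(os.path.join(directory, object_id[2:]), "wb") as handle:
        handle.write(zlib.compress(payload))
    return object_id


# A scratch directory stands in for .git/objects (an assumption for the demo).
scratch_objects = tempfile.mkdtemp()
empty_id = write_loose_object(scratch_objects, b"")
print(empty_id)  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

Storing the empty file therefore produces a file under the e6 subdirectory, which is why that prefix turns up in almost every repository.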

Periodically (or on user demand), Git will run a compression on the loose objects. This is triggered either by a git gc request, or automatically after various thresholds have been met (by default, once there are roughly more than 6,700 loose objects, as set by gc.auto). Git will then create the pack file and remove the loose object files.
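The automatic trigger can be approximated with a few lines of Python. This is only a sketch: Git itself estimates the number of loose objects rather than counting every one, and the helper names count_loose_objects and should_pack are made up for illustration. The 6700 figure is the documented default of the gc.auto setting.

```python
import os
import re
import tempfile

GC_AUTO_DEFAULT = 6700  # documented default of gc.auto


def count_loose_objects(objects_dir):
    """Count files that look like loose objects (xx/ plus 38 hex digits)."""
    two_hex = re.compile(r"^[0-9a-f]{2}$")
    rest_hex = re.compile(r"^[0-9a-f]{38}$")
    total = 0
    for name in os.listdir(objects_dir):
        subdir = os.path.join(objects_dir, name)
        # Skip pack/ and info/, which do not hold loose objects.
        if two_hex.match(name) and os.path.isdir(subdir):
            total += sum(1 for f in os.listdir(subdir) if rest_hex.match(f))
    return total


def should_pack(objects_dir, threshold=GC_AUTO_DEFAULT):
    """Exact-count approximation of the 'too many loose objects' check."""
    return count_loose_objects(objects_dir) > threshold


# Demo on a scratch directory holding one loose object and an empty pack/.
demo_objects = tempfile.mkdtemp()
os.makedirs(os.path.join(demo_objects, "pack"))
os.makedirs(os.path.join(demo_objects, "e6"))
open(os.path.join(demo_objects, "e6",
                  "9de29bb2d1d6434b8b29ae775ad8c2e48c5391"), "w").close()
print(count_loose_objects(demo_objects), should_pack(demo_objects))  # 1 False
```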

In a small repository containing just an empty file, the loose objects include an ‘e6’ directory. You may recognise this as being the prefix of the empty file in Git, which we covered earlier and is identified by e69de29bb2d1d6434b8b29ae775ad8c2e48c5391. However, at this stage, there’s no content in the pack directory. What happens if we pack it?

The pack file’s contents on disk are smaller than the set of loose files on their own (though in trivial examples like this, there isn’t that much difference between them). The pack is actually made up of two files: the index (.idx) and the pack (.pack). Whilst the latter stores the data, the former stores a table-of-contents list of the objects contained within the pack itself.
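The start of a .pack file is simple enough to read by hand. The sketch below parses the 12-byte header (the PACK signature, a version number and an object count, all big-endian) and checks the trailing SHA-1 of everything before it. The function name read_pack_header is hypothetical, and the example pack is built in memory with zero objects rather than taken from a real repository.

```python
import hashlib
import struct


def read_pack_header(data):
    """Return (version, object_count, trailer_ok) for raw .pack bytes."""
    signature, version, count = struct.unpack(">4sII", data[:12])
    if signature != b"PACK":
        raise ValueError("not a pack file")
    # The final 20 bytes are a SHA-1 over everything that precedes them.
    trailer_ok = hashlib.sha1(data[:-20]).digest() == data[-20:]
    return version, count, trailer_ok


# Build a minimal, empty version 2 pack in memory for the demo.
header = struct.pack(">4sII", b"PACK", 2, 0)
empty_pack = header + hashlib.sha1(header).digest()
print(read_pack_header(empty_pack))  # (2, 0, True)
```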

In a hex dump of the index, you’ll recognise the ‘empty object’ stored in Git (e69d..5391), along with the tree containing the empty file (417c…67fb).

The index file really serves as a marker telling Git that the corresponding object is in this pack file. In this case, we’ve only got one pack file, but large repositories will have many such files. The index allows Git to load many small files to answer “Where are these objects?” so that it can extract them in the most efficient manner.
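As a rough illustration of how such a lookup works, the following Python sketch searches a version 2 index for an object name and returns its offset in the pack. It assumes the version 2 layout (magic bytes, a 256-entry fan-out table, sorted 20-byte names, CRC32s, then 4-byte offsets) and ignores the optional table of 8-byte offsets used by very large packs. The helpers build_idx_v2 and find_offset, the second object name and both offsets are invented for the demo; the synthetic index uses placeholder checksums and would not be accepted by Git itself.

```python
import hashlib
import struct

IDX_MAGIC = b"\xfftOc"
NAMES_START = 8 + 256 * 4  # header plus fan-out table


def build_idx_v2(entries):
    """Build demo index bytes from {20-byte name: pack offset < 2**31}."""
    names = sorted(entries)
    fanout = [0] * 256
    for name in names:
        fanout[name[0]] += 1
    for i in range(1, 256):
        fanout[i] += fanout[i - 1]  # cumulative counts per first byte
    body = IDX_MAGIC + struct.pack(">I", 2) + struct.pack(">256I", *fanout)
    body += b"".join(names)
    body += b"\x00" * (4 * len(names))  # CRC32 placeholders (demo only)
    body += b"".join(struct.pack(">I", entries[n]) for n in names)
    body += b"\x00" * 20  # pack checksum placeholder (demo only)
    return body + hashlib.sha1(body).digest()


def find_offset(idx, hex_id):
    """Return the pack offset for hex_id, or None if it is not listed."""
    if idx[:4] != IDX_MAGIC or struct.unpack(">I", idx[4:8])[0] != 2:
        raise ValueError("not a version 2 pack index")
    name = bytes.fromhex(hex_id)
    fanout = struct.unpack(">256I", idx[8:NAMES_START])
    total = fanout[255]
    # The fan-out table narrows the search to names sharing the first byte.
    lo = fanout[name[0] - 1] if name[0] else 0
    hi = fanout[name[0]]
    while lo < hi:
        mid = (lo + hi) // 2
        start = NAMES_START + 20 * mid
        candidate = idx[start:start + 20]
        if candidate == name:
            offsets_start = NAMES_START + 20 * total + 4 * total
            pos = offsets_start + 4 * mid
            (offset,) = struct.unpack(">I", idx[pos:pos + 4])
            if offset & 0x80000000:
                raise NotImplementedError("8-byte offsets not handled here")
            return offset
        if candidate < name:
            lo = mid + 1
        else:
            hi = mid
    return None


empty_blob = "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391"
other = hashlib.sha1(b"hypothetical second object").hexdigest()
demo_idx = build_idx_v2({bytes.fromhex(empty_blob): 12,
                         bytes.fromhex(other): 27})
print(find_offset(demo_idx, empty_blob))  # 12
```

Because the names are sorted and the fan-out table gives the range for each leading byte, Git can answer the question for one pack without reading the pack data at all.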

Summary

Whilst Git stores objects in loose form whilst you work on new changes, it will compress them into pack files to take greater advantage of delta compression. This happens when you run a git gc or when various thresholds are met automatically. It also explains why Git’s storage requirements follow a sawtooth-like structure; each time the ramp goes up, it’s because new objects are being created, and each time it goes down, it’s because a pack has been run and new pack files have been created (along with the corresponding loose objects being deleted).

Come back next week for another instalment in the Git Tip of the Week series.