Locating large objects in a git repository

13 Jun 2013

If you manage git repositories, it isn’t uncommon for someone, maybe even
yourself, to commit a large file into a repository. This is especially common
in repos with many committers, where you may not have visibility into every
branch or commit. Thankfully, git ships with several commands that will help
you locate these large files and the commits that introduced them.

How do I get a list of commits for a repo?

If you’re reasonably certain that there was a recent commit that added this new
large file, a quick way to identify it is using git rev-list. From
within the bare repo you would run something like the following:
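A minimal sketch of such an invocation, assembled from the three options explained in this section. The function name recent_commits is only a label for this example, and the braces around the date are omitted here; the date expression itself is unchanged.

```shell
# List every commit reachable from any ref that is less than a month old.
# Each output line has the form "<SHA1> <subject>".
# recent_commits is a hypothetical helper name used for illustration.
recent_commits() {
    git rev-list --all --pretty=oneline --since=1.month.ago
}
```

Run recent_commits from inside the repository you want to inspect.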

I’ll break down the command above for anyone not familiar with git rev-list:

--all tells git to operate on all refs in the refs/ directory, along with HEAD.

--pretty=oneline condenses the output to one line to make it easier to
digest.

--since={1.month.ago} tells git to limit the output to commits that are
less than 1 month old.

The two-column output above consists of the SHA1 and subject of all matching
commits. The bit we’re especially interested in is the SHA1 for the commit; the
subject may or may not have any information relevant to our search.
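Since the SHA1 is the part we care about, the subject column can be dropped when feeding the list to other commands. A small sketch, assuming input lines of the form "<SHA1> <subject>"; commit_ids is a made-up name for this example.

```shell
# Read "<SHA1> <subject>" lines on stdin and print only the SHA1 column.
# commit_ids is a hypothetical helper name used for illustration.
commit_ids() {
    cut -d' ' -f1
}
```

For example: git rev-list --all --pretty=oneline --since=1.month.ago | commit_ids. Leaving out --pretty=oneline gives the same result, because git rev-list prints bare SHA1s by default.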

How do I get blobs from a commit?

Now that we have a list of commits and their SHA1s, we can use another git
utility, git diff-tree, which compares the content of blobs between trees.
Given a commit SHA1 from the output above, we pass it to git diff-tree.

With the options passed to git diff-tree, we’re asking it to track
copies/renames, display all changes from the commit’s parents, traverse into
subtrees, and suppress the commit id. I won’t go into what all the output here
means, but mainly we’re interested in columns 4-6. These represent the SHA1 of
a blob object, the operation type, and the path to the file.
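The behaviours described above can be sketched as follows. The flag selection is an assumption on my part: -C for copies/renames, -m for changes against each parent, -r for subtrees, and --no-commit-id to suppress the commit id. The helper name commit_blobs is also made up for this example. The awk step keeps only columns 4-6 of the raw output.

```shell
# Print one line per blob touched by the commit given as $1:
#   <blob SHA1> <operation type> <path>
# Flags (assumed from the description, not a confirmed original command):
#   -r              traverse into subtrees
#   -m              for merge commits, show the changes against each parent
#   -C              detect copies as well as renames
#   --no-commit-id  do not print the commit id before the diff lines
# commit_blobs is a hypothetical helper name used for illustration.
commit_blobs() {
    git diff-tree -r -m -C --no-commit-id "$1" |
        # Raw lines look like ":<mode> <mode> <old SHA1> <new SHA1> <status>\t<path>".
        # For renames and copies there are two paths; $NF picks the new one.
        awk -F'\t' '{ split($1, meta, " "); print meta[4], meta[5], $NF }'
}
```

Note that git diff-tree prints nothing for a root commit unless --root is also given.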

How do I get the sizes of blobs?

Now that we have the SHA1 of a blob we can get its size using git
cat-file.

$ git cat-file -s e6556c4a65229fbad1b07ce36f5eeac42c069a4f
10240000

We’ve passed the -s option to git cat-file to have it output the size of
the object in bytes. The object in question is 10240000 bytes, or ~10MB. This
may or may not be the file we’re interested in, but we did manage to get the
file size of a blob from a particular commit.
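When there are many objects to check, running git cat-file -s once per SHA1 gets tedious. A sketch of a batch variant, using git cat-file --batch-check, which reads object names on stdin and prints "<SHA1> <type> <size>" for each; blob_sizes is a made-up name for this example.

```shell
# Read object SHA1s on stdin, one per line, and print "<SHA1> <size in bytes>"
# for every object that is a blob. Other object types are skipped.
# blob_sizes is a hypothetical helper name used for illustration.
blob_sizes() {
    git cat-file --batch-check | awk '$2 == "blob" { print $1, $3 }'
}
```

For example: echo e6556c4a65229fbad1b07ce36f5eeac42c069a4f | blob_sizes.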

How do I find which branches contain the offending commit?

We can also use the SHA1 of the commit from earlier to find out which branch or
branches this blob is referenced in. For that we use git branch as
follows:
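A minimal sketch of that lookup, wrapping the git branch -a --contains invocation in a helper; the name branches_containing is only a label for this example.

```shell
# List local and remote-tracking branches whose history contains commit $1.
# branches_containing is a hypothetical helper name used for illustration.
branches_containing() {
    git branch -a --contains "$1"
}
```

Pass it the commit SHA1 found earlier; each output line is a branch name, and the currently checked-out branch is marked with an asterisk.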

The output fields are: <commit SHA1> <file path> <size in bytes>.
Additionally, the output could be piped to sort -nk3 to sort on file size.
Alternatively, you could sum up the blobs for each individual commit, in case
the issue isn’t one large file, but several smaller ones.
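One way to produce output in that three-field format is to chain the three commands covered so far. This is a sketch under my own assumptions rather than a definitive script: the diff-tree flags are inferred from the earlier description, --root is added so that root commits are reported too, and the names blob_report and commit_totals are made up for this example.

```shell
# For every commit in the given time window (default: the last month), print
# one line per blob it added or changed:
#   <commit SHA1> <file path> <size in bytes>
# Deletions and submodule entries (mode 160000) are skipped, since they have
# no blob to measure. blob_report is a hypothetical helper name.
blob_report() {
    git rev-list --all --since="${1:-1.month.ago}" |
    while read -r commit; do
        git diff-tree -r -m -C --root --no-commit-id "$commit" |
        awk -F'\t' '{
            split($1, meta, " ")
            if (meta[5] != "D" && meta[2] != "160000") print meta[4], $NF
        }' |
        while read -r blob path; do
            printf '%s %s %s\n' "$commit" "$path" "$(git cat-file -s "$blob")"
        done
    done
}

# Sum the sizes per commit from blob_report output on stdin:
#   <commit SHA1> <total bytes>
# commit_totals is a hypothetical helper name.
commit_totals() {
    awk '{ total[$1] += $NF } END { for (c in total) print c, total[c] }'
}
```

For example, blob_report | sort -nk3 lists the largest blobs last, and blob_report | commit_totals | sort -nk2 does the same per commit. Sorting on the third field assumes paths without spaces, and with -m a merge commit may list the same blob once per parent.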

As mentioned before, you can take the commit SHA1 and pass it to git branch
-a --contains <SHA1> to see which branches this large file may have been merged
into.

Conclusion

There are many utilities that ship with git that allow you to see certain
pieces of information about your repository’s object store. Using these tools
in concert can yield data that would otherwise be challenging to obtain. There
is another method to achieve a similar result using the index file and git
verify-pack that I may write another blog post about.

There will be a follow-up blog post to this one about writing a server-side git
hook to prevent these large files from ever being committed to your repository.
That hook will make use of some of the same utilities that were covered in this
post.

I hope this has been helpful to anyone who has found themselves trying to
troubleshoot unwarranted repository growth. Feedback is certainly welcome; if
there’s an easier way to do what I’ve described here, I’d be excited to hear
about it.