How to shrink down a git(hub) repository

2017-09-10

Starting point

With my Vulkan C++ example GitHub repository approaching 200 MB in size, I decided it was about time to shrink it down to a reasonable size again. Shrinking a git(hub) repository isn’t just about deleting files that are present locally. It also requires cleaning up the history, because files that have been removed are still stored in the repository’s history and therefore still contribute to its size.

A big chunk of the repo’s size is caused by binary assets like textures and 3D models. When I started out with my Vulkan examples there were only a few assets, so I just added them to the repository. In hindsight this was the wrong decision, so one of my primary goals was to remove all those assets from the repository and its history. I had already stopped adding assets when I did some examples using HDR textures, and moved those into a separate asset pack that needs to be downloaded to actually run these examples. After removing the assets I’ll no longer add any of them to the repo, but rather put them into the separate asset pack.

So in this article I’ll try to describe how to shrink down a long-running repository without having to recreate it. For my Vulkan examples this resulted in a much smaller repository that’s a lot faster to clone.

One of the most important things I wasn’t sure about before starting with this: Will GitHub reflect my repository size changes? Yes!

They seem to run housekeeping tasks (git gc) at a pretty quick rate, so pushing after removing files from history will also shrink the repository on the GitHub server.

(Screenshots: repository size before and after the cleanup. I’m using a Chrome extension to get the size of a GitHub repository displayed on its landing page.)

Important note

This process involves rewriting the history of your repository, so everyone that is collaborating needs to rebase or (better) do a fresh clone before doing pull requests again!

Preparations

Once you’re ready to do this clean up on your actual repository consider the following:

Clean up your branches (and tags)

The fewer branches, the faster the clean up processes will run. So it’s a good idea to remove all branches that are no longer active and see which branches can be merged into master (and removed). In my case I finished work on the develop branch, merged it into master and removed it. The same goes for tags: remove those that you don’t need anymore.
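As a rough sketch of that clean up, the following shell functions merge a finished branch into master, delete it locally and on the remote, and do the same for a tag. The function names are made up for this example, and the remote is assumed to be called origin.

```shell
# Hypothetical helpers; assumes the remote is named "origin"
# and that the main branch is called "master".

# Merge a finished branch into master, then delete it locally and remotely.
# $1: name of the branch, e.g. develop
merge_and_delete_branch() {
    git checkout master &&
    git merge "$1" &&
    git branch -d "$1" &&
    git push origin --delete "$1"
}

# Delete a tag locally and on the remote.
# $1: name of the tag
delete_tag() {
    git tag -d "$1" &&
    git push origin ":refs/tags/$1"
}
```

For the develop branch mentioned above this would be called as merge_and_delete_branch develop. Since git branch -d refuses to delete branches that aren't fully merged, the chain stops before anything is removed on the remote if the merge didn't go through.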

Take care of open pull requests

As we are rewriting history, you should take care of all open pull requests. Either merge them before starting to clean up or close those that you don’t want to merge. For PRs you’re unsure about, drop a note that you’ll be rewriting the history and ask the author to resubmit afterwards.

Test run

As the changes we’re going to make can’t easily be reverted, it may be a good idea to test this on a copy of the repository first and then apply the changes to the live repository in one single run at a later point. Creating a copy of your repository (with a different name) on GitHub is pretty easy using the import function, which also works with GitHub repositories.

Tools used

I’m going to use rtyley’s BFG Repo-Cleaner to remove the files from the git history. The other option would be git filter-branch, but BFG is much faster and easier to use, especially on larger repositories. It also adds some safety checks and outputs detailed log files.

Setup

For the cleaning process we’ll be working with two versions of the repository we want to clean up. For this I created a separate folder with only these two repositories.

Clone the bare repository

Cleanup will be run on a bare repository that doesn’t contain the actual files but rather only the administrative and control files normally hidden in the .git sub folder of your full repository:

$ git clone --mirror repository_url

This will create a folder named repository.git.

Clone the full repository

We also clone a full copy of our repository so we can remove files that are still present and push the changes to the remote.
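A minimal sketch of that second clone, wrapped in a small function. The function name and the target folder are placeholders, and repository_url stands for your own repository, as in the mirror clone above.

```shell
# Hypothetical helper: clone a regular working copy next to the bare mirror.
# $1: repository URL
# $2: target folder, e.g. vulkan_slim
clone_working_copy() {
    git clone "$1" "$2"
}
```

Calling clone_working_copy repository_url vulkan_slim gives you a folder with the checked-out files that sits beside the bare clone.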

We also use the bare repository to check progress on shrinking the repository:

$ cd vulkan_slim.git
$ du -sh .
199M .

This gives us an initial size of 199M to start with.

This results in the following structure for my Vulkan cleanup test run:

cleanup/
    vulkan_slim/
    vulkan_slim.git/

Step 1: Removing files still present

Textures and 3D models currently make up a huge chunk of the repository size, so removing them is the first step in getting the size down. By default BFG protects the files in your latest commit, so it will only remove files from the history that are no longer present there.

Before we can run BFG to remove them from the history, we need to remove them locally in the full clone and push the changes to the remote.
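This step could look roughly like the sketch below, run inside the full clone. The function name, the commit message and the folder names in the usage example are assumptions for illustration, not the paths actually used in the Vulkan repository.

```shell
# Hypothetical helper: delete paths from the working copy, commit and push.
# "$@": files or folders to remove, relative to the repository root
remove_and_push() {
    git rm -r -- "$@" &&
    git commit -m "Remove binary assets from the repository" &&
    git push
}
```

With hypothetical asset folders this would be called as remove_and_push data/textures data/models. The files are then gone from the latest commit, but every earlier commit still carries them.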

Checking the size of the bare repo still returns the same size, as the files are still present in the history. So our next step is to remove them from the history using BFG. The current version of BFG doesn’t support removal by folder name but works fine with wildcard masks. As a positive side effect, this will also remove assets deleted at an earlier point.
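A sketch of such a BFG run with a wildcard mask, using BFG's --delete-files option. The jar file name and the file extensions in the comments are assumptions, so substitute the asset types your own repository contains.

```shell
# Assumption: BFG was downloaded as bfg.jar into the cleanup folder;
# override BFG_JAR if yours has a different name or location.
BFG_JAR="${BFG_JAR:-bfg.jar}"

# Delete every file matching a file name glob from the history of a bare repo.
# $1: glob such as '*.{ktx,dds,obj,dae}' (hypothetical extensions)
# $2: path to the bare repository, e.g. vulkan_slim.git
bfg_delete_files() {
    java -jar "$BFG_JAR" --delete-files "$1" "$2"
}
```

From the cleanup folder this would be called as bfg_delete_files '*.{ktx,dds,obj,dae}' vulkan_slim.git. The single quotes matter, as they keep the shell from expanding the wildcard before BFG sees it. The glob matches file names only, not paths inside the repository.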

BFG will now clean up and update all commits (including branches and tags). If a file with one of the given file extensions were still present in the latest commit, it wouldn’t get removed. When done, BFG outputs a small summary and also saves a full report to disk. As a result you should get a (partial) list of deleted files that includes the assets we just removed from the full repository.

If you now check the repository size you may notice that it hasn’t really changed. That’s because BFG doesn’t delete anything when cleaning the commits. To strip the no longer needed files from the repository we’ll use git reflog expire and git gc.
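The usual pair of commands for this, as also suggested in the BFG documentation, is sketched below. The function name is made up, and the function needs to be run from inside the bare repository.

```shell
# Hypothetical wrapper; run inside the bare repository after BFG has finished.
strip_unreferenced_objects() {
    # Drop all reflog entries so nothing references the old commits anymore.
    git reflog expire --expire=now --all &&
    # Delete the now unreachable objects and repack what is left.
    git gc --prune=now --aggressive
}
```

The --aggressive option makes git spend more time on recompressing, so on a large repository this can take a while.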

The git reflog expire command prunes all entries older than the current time while git gc removes unreachable files and recompresses the repository.

Checking the size of the bare repo:

$ du -sh .
103M .

Removing the assets reduced the size by ~48%, cutting the size almost in half!

If you’re sure about the changes push them to the remote repository via git push. This will force all refs (branches and tags included) to be updated, so it may take a while.

Step 2: Removing deleted files

With a long-running repository, chances are that you deleted files months or years ago. Even though these files are no longer present in your local working copy, they are still stored in the git database (as is the case with other version control systems too), adding to the repository’s size. Having binary files like textures, DLLs and static libraries stored in the repo isn’t of much value, so we want to get rid of those too.

So what we need is a list of file deletions in our repository, which can easily be created with git’s log feature.
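A sketch of such a listing, wrapped in a function. The output file name deleted_files.txt and the exact grep pattern are assumptions. The pattern relies on git's --summary output, which prints deletions as lines of the form "delete mode 100644 path/to/file".

```shell
# Hypothetical helper: write all file deletions found in the history to a text file.
# $1: output file (defaults to deleted_files.txt)
list_deleted_files() {
    git log --diff-filter=D --summary |
        grep 'delete mode' > "${1:-deleted_files.txt}"
}
```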

The --diff-filter=D option lists only commits with file deletions. grep is then used to keep only the lines that contain deleted file names, stripping away the commit messages. We redirect the output to a text file that we can then search for files we want to remove.
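To get a quick overview of that text file before reading it line by line, the deletions can be counted per file extension. This is only a sketch: the function name is made up, it assumes lines of the form "delete mode 100644 path/to/file.ext", and it doesn't handle paths containing spaces.

```shell
# Hypothetical helper: summarise a deletion listing by file extension,
# most frequent extension first.
# $1: text file with lines like " delete mode 100644 path/to/file.ext"
count_deleted_by_extension() {
    awk '{
        name = $NF
        sub(/.*\//, "", name)            # keep only the file name
        n = split(name, parts, ".")
        if (n > 1) count[parts[n]]++     # skip names without an extension
    }
    END { for (ext in count) print count[ext], ext }' "$1" | sort -rn
}
```

Extensions that show up often and belong to binary formats are good candidates for a file name filter.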

We can now go through that list, find the files we want to be permanently removed from our history and tell BFG to remove them. In my case I want all those pre-built libraries to be removed due to their file size. So I went through that list and made up a file name filter for BFG.
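Such a filter is a glob over file names that can group several extensions in braces. The helper below builds one from a list of extensions. Both the function name and the extensions in the usage example are illustrative, not the filter actually used for the Vulkan repository.

```shell
# Hypothetical helper: turn a list of extensions into a file name glob
# that can be passed to BFG's --delete-files option.
# "$@": extensions without the leading dot
extensions_to_glob() {
    if [ "$#" -eq 1 ]; then
        printf '*.%s\n' "$1"
    else
        # Join all arguments with commas inside a subshell, so the
        # changed IFS does not leak into the calling shell.
        joined=$(IFS=,; printf '%s' "$*")
        printf '*.{%s}\n' "$joined"
    fi
}
```

For example, extensions_to_glob lib dll a prints *.{lib,dll,a}. Passing that string (in quotes) to BFG's --delete-files option removes all matching files from the history in a single run.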

If you want to shave off a few more megabytes, walk through the list of deleted files and remove them using the same commands as above, either with single BFG runs or by putting them into grouped file name filters. In my case I got my repo down to 35 MB, which is about 18% of the initial repository size.

Wrapping it up

If you ran these commands on a separate copy of your repository, like I did, the next step is applying these changes to your actual repository. While doing this on a copy I saved all the commands I ran into a single script, so I can now run all this on my actual Vulkan repository once I’ve taken care of all the open pull requests.

Once that’s done it’s time to put the binary files that are still required somewhere else. There are multiple options here so go with the one that suits you best:

I did try git lfs, but aside from the technical troubles it gave me (like download errors, etc.), most hosters have tight limits and quotas on the large file storage that limit its usefulness unless you pay for it.

Another option would be storing the assets in a separate directory that is included as a submodule in the main repository.
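A sketch of the submodule variant, run inside the main repository. The function name, the commit message and the URL and folder in the usage example are placeholders.

```shell
# Hypothetical helper: attach a separate asset repository as a submodule.
# $1: URL of the asset repository
# $2: folder inside the main repository, e.g. data
add_asset_submodule() {
    git submodule add "$1" "$2" &&
    git commit -m "Add assets as a submodule"
}
```

This would be called as add_asset_submodule https://example.com/assets.git data. Anyone cloning the main repository afterwards needs to run git submodule update --init (or clone with --recursive) to actually get the assets.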