Here in the Hosted Operations team we have many small repositories, and some of them contain just a single script. Over time this approach produced duplicated code and effort, leading to considerable maintenance issues.

We then decided to create a repo in which we could consolidate many of those scripts and concentrate our refactoring efforts. After this refactoring, we ended up with a pretty big repo that was naturally divided into binaries and libraries. Every script that made use of these libraries was included in this repository, to maximize code reuse.

In the meantime, other big projects wanted to use that mighty pool of awesome libraries without carrying along the binaries included in the repo.

We finally decided to do the only logical thing: separate the libraries from the main repo, maintaining these in their own space.

Scenario & Goals

We will give the original repository of this story the codename base; this is the repository that will be split into two:

scripts – this will hold the binaries only

libraries – this will hold the libraries that many projects will end up using

The challenge here is that the history of these “wanna-be repositories” is all mixed together in that one big repo that we called base.

In our case, we had all the scripts in the bin directory and all the libraries in lib inside our base repo.
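The split starts from two fresh clones of base, one for each future repository. A minimal sketch, assuming base is reachable at a local path or URL; the helper name clone_for_split and both of its arguments are made up for illustration:

```shell
# Sketch: create one clone of base per future repository.
# clone_for_split and its arguments are hypothetical names.
clone_for_split() {
    base_repo=$1     # path or URL of the base repository
    target_dir=$2    # directory that will hold the two clones

    git clone "$base_repo" "$target_dir/scripts" &&
    git clone "$base_repo" "$target_dir/libraries"
}
```

For example, clone_for_split ~/src/base ~/src/split would leave ~/src/split/scripts and ~/src/split/libraries, both still carrying the full history of base.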

With one clone of base for each new repo, the next step is to filter out unwanted history from each of the two. Instead of tracking down individual files, we can use an amazing filter-branch switch: --subdirectory-filter. This rewrites the repo history, picking up only those commits that actually affect the content of a specific subdirectory.

Note that this switch also instructs Git to make the subdirectory the root of the whole repo.
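Run inside one of the clones, the rewrite could look like the sketch below. The helper name extract_subdir is hypothetical; FILTER_BRANCH_SQUELCH_WARNING only silences the notice (and the pause) that recent Git versions emit before filter-branch starts, and older versions simply ignore it.

```shell
# Sketch: keep only the history of one subdirectory and make it the new root.
# Must be run from inside the clone that is being reduced.
extract_subdir() {
    subdir=$1    # directory to keep, e.g. lib
    branch=$2    # branch to rewrite, e.g. master

    FILTER_BRANCH_SQUELCH_WARNING=1 \
        git filter-branch --subdirectory-filter "$subdir" -- "$branch"
}
```

In our scenario this would be extract_subdir lib master inside the libraries clone and extract_subdir bin master inside the scripts clone.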

Running filter-branch with this switch rewrites the current branch (master in this case), extracting only the history that belongs to the wanted folder.

Instead of rewriting a single branch (master in this case), you can also rewrite multiple branches and even tags. Obviously, not every tag can be successfully rewritten on the new history: the tagged commit must be among the rewritten ones for the tag to be reapplied.
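To rewrite every branch and tag in one go, the same switch can be combined with --tag-name-filter and --all. A sketch, with rewrite_all_refs as a hypothetical helper name; passing cat as the tag name filter keeps each tag's name unchanged while moving the tag to the rewritten commit:

```shell
# Sketch: apply the subdirectory filter to all refs, tags included.
# Must be run from inside the clone that is being reduced.
rewrite_all_refs() {
    subdir=$1    # directory to keep, e.g. lib

    FILTER_BRANCH_SQUELCH_WARNING=1 \
        git filter-branch --subdirectory-filter "$subdir" \
            --tag-name-filter cat -- --all
}
```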

As you might imagine, this operation can be harmful. For this reason, filter-branch creates a backup copy of every ref it modifies, under refs/original/.
Git rewrites commits by creating copies of them. The old commits are kept alive by the original references. To restore a reference, you can point it back to its backup (for master, that is refs/original/refs/heads/master).
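Assuming the branch was rewritten by filter-branch and its backup still exists, restoring it could look like the sketch below. restore_branch is a hypothetical helper; note that reset --hard also discards any uncommitted changes in the working tree.

```shell
# Sketch: move a rewritten branch back to the backup made by filter-branch.
restore_branch() {
    branch=$1    # short branch name, e.g. master

    git checkout -q "$branch" &&
    git reset -q --hard "refs/original/refs/heads/$branch"
}
```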

If you want to get rid of the old history right away, you have to delete every reference to it and also force those dead objects to expire from the reflog (yes, your safety net), which might otherwise prevent them from actually being pruned. Once you have deleted all the references, you can start a garbage collection cycle on the repo to permanently remove the old objects. This process is not detailed here since it is not actually necessary for our purposes; it may be covered by a separate article. Take a look at the manpages of git-reflog and git-gc if you are interested in how Git objects are kept consistent, resolved and cleaned.

The only thing left to do is to adjust the new repos’ remotes, since they will have the local copy of the base repo as their origin remote.
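Pointing a clone at its new home is a one-liner. In this sketch, retarget_origin is a hypothetical helper, and the URL in the usage line below is a placeholder rather than a real server:

```shell
# Sketch: make origin point at the new remote instead of the local base copy.
retarget_origin() {
    new_url=$1    # URL of the new, empty remote repository

    git remote set-url origin "$new_url" &&
    git remote -v    # print the remotes for a quick visual check
}
```

Something like retarget_origin git@git.example.com:ops/libraries.git, followed by git push -u origin master, would then publish the rewritten history.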

Moreover, in our case the scripts depended on the libs, and the split broke those dependencies. Since this is a Python project, we decided to tackle the problem with a very Pythonic approach: egg artifacts, requirements files and an internal pip server.

That was pretty easy after all, right?

Alternative approach: forgetting about the path

If we don’t want to leave the original repo abandoned after stripping every part of it into other repos, we can take a different approach.

For instance, after moving the libs out, we can turn the base repo into the sole holder of the binaries. To do so, we have to purge all history belonging to the lib directory.
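A sketch of such a purge, to be run inside the base repo (ideally on a fresh clone). purge_path is a hypothetical helper name, and the path is assumed to contain no spaces or shell metacharacters, since it is spliced into the filter command as is:

```shell
# Sketch: erase one directory from the history of a branch.
purge_path() {
    path=$1      # directory to erase from history, e.g. lib
    branch=$2    # branch to rewrite, e.g. master

    FILTER_BRANCH_SQUELCH_WARNING=1 \
        git filter-branch \
            --index-filter "git rm -r --cached --ignore-unmatch $path" \
            --prune-empty -- "$branch"
}
```

In our scenario the call would be purge_path lib master.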

This can be done with filter-branch and its --index-filter switch, which applies “git rm -r --cached --ignore-unmatch lib” to the index of each commit. That command, in turn, deletes lib from the index, leaving the working tree alone and not complaining about a missing path.
The rewrite will take a while on a big repo: Git has to go through every commit and remove any occurrence of lib from it.

The filter-branch switch “--prune-empty” strips the empty commits that the operation could leave behind.

As a non-vanilla solution, we could have treated subdirectories as submodules using the amazing subtree as explained by this awesome article.