Git LFS Sucks the Least: Prototyping and Version Control with Large Binary Assets

Here’s a story of my struggles with version control at Raktor as I push it to the limit for a variety of projects in the Unity engine. Pour a drink and commiserate with me.

I love git. My background is in handling large, complex codebases that go all the way down to the metal, so distributed version control with branching and rebasing is essential. As I juggle many different third party libraries and projects while pumping out MVPs, a robust, well-documented repo history is important to diagnose when bugs appeared and why. I use git-blame, git-cherry-pick and git-bisect regularly.

For our large repo, I’ve found git + git-lfs to be “good enough but still terrible at handling large binary assets” so here’s my experience over the past year and a half. It’s important to emphasize that this repo is intentionally messy; we’re moving at the speed of prototyping, and I’m not taking the time to worry about whether we’ll use an asset frequently before we add it. I’m also not taking time to cull assets that we haven’t used in a while, as we are often remounting old projects. We’re not worrying about a “shippable” state, we’re worrying about a “runnable” state as we move fast and break things for demos we are running ourselves.

Here’s a look into the repo, a total of 13 GB and 504 commits to date:

Partway through development, as the repo started to get very heavy, I reorganized it so that any asset content that was updated infrequently was moved to the “Dressing Room” folder, which weighs in at 10.7 GB. I fantasized that, at some point, I’d move this content out of git and manage it separately. This is mostly Unity Asset Store downloads.

Multi-platform
This repo is used to ship to 4 separate platforms (macOS, Windows, Android, iOS) and we use third party libraries with inconsistently and naively documented compatibility with different versions of Unity (at the moment, 5.6.0, 5.4.3xEditorVR-p3, 5.4.2, 5.3.4).

Since even individual projects need to be compiled on multiple platforms to run, I need to switch back and forth between these quickly to build and test. Often, switching platform or Unity version triggers an asset re-import. Unity Cache Server helps a bit with this. However, whether the asset is being imported from scratch, or “downloaded” from the cache server (I only ever used localhost), this can take up to 10 minutes on my faster Windows machine, or up to half an hour on my slower Macbook.

Whitespace
I switch back and forth between programming on macOS and Windows, and MonoDevelop and Visual Studio have different default attitudes toward whitespace. I haven’t dumped enough time into figuring out the smoothest way to do things. Also, I haven’t been able to get a handle on git’s autocrlf settings in a way that “just works”. One time a bunch of ^M showed up in my .gitignore file and I had no idea why, and didn’t want to touch it.
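The ^M that many tools display is a carriage return (byte 0x0D), the first half of a Windows-style CRLF line ending. Here is a minimal sketch for checking which line endings a file actually contains before touching any git settings. The function names are made up for illustration, and the script only reports; it changes nothing.

```python
from pathlib import Path


def count_line_endings(data: bytes) -> dict:
    """Count CRLF, bare LF, and bare CR line endings in raw file bytes."""
    crlf = data.count(b"\r\n")
    lf = data.count(b"\n") - crlf  # LFs that are not part of a CRLF pair
    cr = data.count(b"\r") - crlf  # CRs that are not part of a CRLF pair
    return {"crlf": crlf, "lf": lf, "cr": cr}


def has_mixed_endings(data: bytes) -> bool:
    """True when more than one style of line ending appears in the same data."""
    counts = count_line_endings(data)
    return sum(1 for n in counts.values() if n > 0) > 1


def scan(paths):
    """Yield (path, counts) for every file that contains a carriage return."""
    for path in paths:
        counts = count_line_endings(Path(path).read_bytes())
        if counts["crlf"] or counts["cr"]:
            yield path, counts
```

Calling scan([".gitignore"]) from the repo root would show whether that file picked up carriage returns, and has_mixed_endings flags files where two editors have disagreed within one file.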

Unity Cache Server
While Cache Server has been great, a different version ships with each version of Unity. It’s not clear to me what a given Cache Server version’s compatibility is going backwards and forwards. Note that since my machines move around physically, I’m only ever using a localhost cache server, and haven’t shared one between machines.
Scary Anecdote: I once had two copies of the big repo on the same computer, for two separate versions of Unity. (This was to handle another problem I’ll get to later.) Unity Cache Server was running, and both versions of Unity had been linked to it. I opened repo A with Unity version A, then closed it with no changes. Then, I opened repo B with Unity version B, then closed it with no changes. Then, I opened repo A with Unity version A again, and Unity downloaded changes from the cache server! What’s going on there!? I wish it was more transparent what the Cache Server was doing.

Special Characters
Special characters that appeared in an asset downloaded from the Unity store have been the bane of my existence and will not die.

Here’s the results of git status immediately after a fresh clone on macOS:
Here’s an asset appearing twice in Unity because it has a special character:

These files show as modified even when they haven’t been yet, and re-appear every time I have to git clone, or navigate forwards or backwards over the commit where I made changes to them. I don’t know how to fix this problem, or how much effort I should put into it. I’m guessing it’s a macOS/Windows compatibility issue, but to solve it once and for all, I think I’d need to go and edit git history to excise them from ever existing, right? For all I know, the special characters that refuse to die may also persist in the Library or Cache Server cache and resurrect themselves after I naively believe they are gone, like some cyberpunk version of The Thing. I’d love advice on this.
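One possible explanation, and it is only a guess, is Unicode normalization. The same accented character can be stored either as a single code point (NFC) or as a base letter plus a combining mark (NFD), and macOS file systems have historically preferred the decomposed form while other platforms usually keep names as given. The sketch below checks whether a set of file names contains such look-alike pairs. The function names and the example file name are hypothetical, not taken from the repo.

```python
import unicodedata


def is_ascii(name: str) -> bool:
    """True when every character in the name is plain 7-bit ASCII."""
    return all(ord(ch) < 128 for ch in name)


def normalization_report(name: str) -> dict:
    """Describe how a file name relates to its NFC and NFD forms."""
    nfc = unicodedata.normalize("NFC", name)
    nfd = unicodedata.normalize("NFD", name)
    return {
        "ascii": is_ascii(name),
        "is_nfc": name == nfc,
        "is_nfd": name == nfd,
        "forms_differ": nfc != nfd,  # True means the name is at risk
    }


def find_collisions(names):
    """Group names that differ as stored but are identical after NFC normalization."""
    groups = {}
    for name in names:
        groups.setdefault(unicodedata.normalize("NFC", name), set()).add(name)
    return {key: sorted(group) for key, group in groups.items() if len(group) > 1}
```

If find_collisions reports a pair for the asset that shows up twice in Unity, normalization is a likely culprit; if it reports nothing, the cause is something else and this sketch will not help.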

Git LFS: Large File Storage
Git-lfs is, in principle, a great idea: for big binary files that aren’t going to change often, keep them outside of the regular git tree and only download them as needed. Don’t store the entire binary files’ history in the .git directory. GitHub charges a small premium for Git-lfs bandwidth, and if it worked 100%, it would be totally worth it ($5 per month for 50 GB of bandwidth). Git-lfs is open-source and managed by GitHub themselves, and clearly aimed at keeping git-familiar devs like me using git instead of switching to a more game-tailored version control system.

Installing and running git and git-lfs on Windows is fucked. By way of explanation, I’m used to Unix-based systems where there seems to be one agreed-on method to install and access programs. On Windows, I had to resort to using the GUI app GitHub for Windows to install git because it sets up GitHub’s 2FA right, and I couldn’t get the keys (via PuTTY, etc.) working without it.
When uploading or downloading large assets, sometimes the network would hang, or the git operation would fail for some other reason. This appeared to leave the repo in a corrupt state. While git status would finish execution, files would show as changed even if they hadn’t been, and git checkout . would hang indefinitely, even if the files were relatively small, like a jpg. Poking around in the git lfs issues, it appears that this is due to smudge errors (smudging is the checkout-time step where the small git-lfs pointer stored in git history is replaced by the real file contents in the working tree; the reverse step, on commit, is called cleaning). I would end up with a repo that was corrupt due to an unrecoverable smudge error. Hey, take a look at how many corrupt repos I have, each of which is ~13 GB and required me to freshly download all of those hot gigabytes!
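For anyone diagnosing the same symptoms: git itself only stores a small text pointer for each LFS-managed file, and the real contents are supposed to be swapped in on checkout. A working-tree file that is still a pointer is a sign that the swap never happened. Below is a rough sketch for finding such files. It assumes the standard pointer layout (a version line, an oid line, and a size line); the function names are hypothetical.

```python
from pathlib import Path

LFS_VERSION_LINE = b"version https://git-lfs.github.com/spec/v1"
MAX_POINTER_BYTES = 1024  # pointers are tiny; anything larger is real content


def parse_lfs_pointer(data: bytes):
    """Return {"oid": ..., "size": ...} if data looks like an LFS pointer, else None."""
    if len(data) > MAX_POINTER_BYTES or not data.startswith(LFS_VERSION_LINE):
        return None
    fields = {}
    for line in data.decode("utf-8", errors="replace").splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    if "oid" not in fields or not fields.get("size", "").isdigit():
        return None
    return {"oid": fields["oid"], "size": int(fields["size"])}


def unsmudged_files(root):
    """Yield (path, pointer) for files under root that are still pointers on disk."""
    for path in Path(root).rglob("*"):
        if ".git" in path.parts or not path.is_file():
            continue
        if path.stat().st_size > MAX_POINTER_BYTES:
            continue  # too big to be a pointer, skip without reading
        pointer = parse_lfs_pointer(path.read_bytes())
        if pointer is not None:
            yield path, pointer
```

Running unsmudged_files(".") from the repo root lists the files that never got their real contents. The sketch only reads files; it does not attempt any repair.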

To avoid having to freshly re-download, I tried “backing up” my repo periodically by zipping it, but this seemed to cause even more problems with OS-specific files getting added on unzip. Zipping itself took ~15 minutes due to the sheer number of files (29,542) and folders (1,460).

On further investigation, git-lfs 2.0 supposedly handled smudge error recovery much better. However, git lfs version showed I was on 1.5.5. I upgraded to git-lfs 2.0 and then continued to diagnose issues, but kept having them. Imagine my gaslight-y horror when git lfs version revealed I’d been reverted to 1.5.5 somehow! Imagine how horrifying it was to discover this when I was also trying to diagnose other reasons why the repo was corrupt, and everything I tried had processing times from 15 minutes to an hour!
Turns out that the shell launched from GitHub for Windows uses git-lfs installed at %UserProfile%/AppData/Local/GitHub/lfs-amd64_1.5.5/git-lfs and if you update it to a later version, like I did, it reverts! So there’s no way to update the git lfs version with GitHub for Windows to a more stable version.
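Since each shell can resolve a different git-lfs binary, it helps to ask the current shell which one it is actually using. The sketch below looks up git-lfs on the PATH and parses the version it reports. It assumes the output begins with something like git-lfs/1.5.5, and the function names are made up for illustration.

```python
import re
import shutil
import subprocess

VERSION_PATTERN = re.compile(r"git-lfs/(\d+)\.(\d+)\.(\d+)")


def parse_lfs_version(output: str):
    """Extract (major, minor, patch) from version output, or None if absent."""
    match = VERSION_PATTERN.search(output)
    return tuple(int(part) for part in match.groups()) if match else None


def describe_git_lfs():
    """Report which git-lfs binary this environment resolves and its version."""
    binary = shutil.which("git-lfs")
    if binary is None:
        # No git-lfs on PATH: git operations here will not smudge anything.
        return {"path": None, "version": None}
    result = subprocess.run([binary, "version"], capture_output=True, text=True)
    return {"path": binary, "version": parse_lfs_version(result.stdout)}
```

Running describe_git_lfs() from each shell in turn makes a silent downgrade visible: the path shows which install won, and a path of None means that shell has no git-lfs at all.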

Next, I installed git-lfs via the terminal offered through Sourcetree. Somehow, first installing GitHub for Windows, and letting it make 2FA settings, and then installing Sourcetree, and then installing git-lfs 2.0 via Sourcetree’s terminal, made it work. Before, when I’d straight installed Sourcetree, I couldn’t get it to work without GitHub for Windows setting up 2FA right. Yes, I know about GitHub’s auth tokens and I know Sourcetree 1.8 and 1.9 sometimes cached server passwords in a buggy way.

(Let’s take a breath and remind ourselves that my goal in all this is to get to work, not diagnose git issues.)

As a final git-lfs puzzle, periodically, git-lfs seems to “discover” files that were already in commit history that should have been added to lfs a long time ago, but somehow have not been yet. Is there some git-lfs-doctor I can run? I’d love to know.
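In the absence of a doctor command, one way to narrow this down is to list which patterns .gitattributes assigns to LFS and compare them against the files in the repo. This is a rough sketch under a big simplifying assumption: it only handles patterns without a slash, matched against the file's base name, which is not the full set of git's attribute matching rules. The function names and example paths are hypothetical.

```python
import fnmatch
import posixpath


def lfs_patterns(gitattributes_text: str):
    """Return the pattern of every .gitattributes line that sets filter=lfs."""
    patterns = []
    for line in gitattributes_text.splitlines():
        parts = line.split()
        if not parts or parts[0].startswith("#"):
            continue  # skip blank lines and comments
        if "filter=lfs" in parts[1:]:
            patterns.append(parts[0])
    return patterns


def should_be_in_lfs(path: str, patterns) -> bool:
    """Rough check: does the file's base name match any slash-free LFS pattern?"""
    name = posixpath.basename(path)
    return any(
        "/" not in pattern and fnmatch.fnmatchcase(name, pattern)
        for pattern in patterns
    )
```

A file for which should_be_in_lfs is true but whose committed contents are the full binary rather than a small pointer is the kind of file that git-lfs later appears to “discover”, for example one committed before its pattern was added to .gitattributes.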

Alternatives: Plastic SCM
I know there are game-dev-oriented version control systems like Perforce, but I’ve been resistant because git has been so powerful, and anything I read about the others indicates that they aren’t as powerful.

I’ve had Plastic SCM strongly recommended by a developer I trust, so I gave it a shot over a game jam, taking a copy of my existing big repo and making 98 commits over 72 hours, as a solo dev.

Here’s a peek behind the curtain at my commit history:

Reactions:
– The output of cm diff is not helpful.
– Files are often labelled as “changed” even if they’ve only been “checked out”, and there are no actual changes, not even whitespace.
– I don’t like that commit labels are incrementing numbers, not hashes. I didn’t try branching and merging, but this doesn’t make me optimistic that the results will be easy to read.
– Pre-commit, there’s no git-like concept of staging. While I’m working in git, I use staging to indicate to self what parts of a current chunk of work are “good to go” versus “still messy/working on it”.
– The Plastic SCM client I used, as far as I can tell, allowed for only one “active workspace”, aka a repository, at once. This limitation is pretty insane. While I’m all for big mono-repos, when I’m diagnosing behaviour of external libraries who have their own git history, I need to be able to examine and operate on multiple histories at once.
– Plastic SCM’s ignore format is not as regex-friendly as git’s .gitignore, so I could not rename the same file and get going.

Even Other Alternatives
I refuse to make my own version control system like Jon Blow. My needs as a developer can’t be that unique and novel, right?

Other Question: What shell should I be using on Windows?
Like I said, I’m used to using macOS or *nix systems, which have a one-stop-shopping shell. On Windows, we have: cmd, PowerShell, PowerShell opened via GitHub for Windows (which adds GitHub for Windows’ git to its path), and the MINGW64 terminal launched by Sourcetree (which, oddly, is missing fundamentals like man and which). Finally, there’s Bash for Windows, which installs its own Unix environment. However, anecdotally, I’ve found any git operations via Bash for Windows take about 5x longer than via PowerShell. I’m not sure if this is due to some level of abstraction, but it makes it pretty unusable. Also, none of these shells support copy-and-paste as elegantly as macOS does, so I automatically feel disdain towards them.

Back to Git-LFS: As I was trying out different Windows shells, I once ran git checkout . on a repo using lfs in a git environment that didn’t have lfs. This corrupted the repo unrecoverably, so I had to download all 13 GB from scratch yet again. Please: I’d love a command like git-lfs-doctor or git-lfs-unbreak that can diagnose and repair repos.

4 Responses to Git LFS Sucks the Least: Prototyping and Version Control with Large Binary Assets

I’d like to add info to some of your objections with Plastic SCM (I’m a Plastic SCM developer, btw):

* Restricted to single repo/workspace: not really. You can have as many workspaces as you want, and many workspaces pointing to the same repo, something only recently added to Git. This has been always present in Plastic. Not sure why you don’t see it, but you can create all the workspaces you need :-)

* cm diff: well, yes, we don’t print unified diff stuff, but we provide super useful built-in GUIs, even supporting Semantic diff for C#, Java, C, C++ and more… I mean, it *understands* a method has been moved… simply a few steps ahead of any other system out there.

* Files labelled as “changed” just when they were checked-out: correct, this is how we work. But there is a setting to differentiate from real modified files. The reason why we rely on the timestamp by default is… speed! We want it to be super fast, so we don’t diff the actual file, we just see it was modified, so it is a “candidate” to be checked in. If you check in the file and it doesn’t have changes, it will be discarded.

* Incremental numbers in commits: every single commit (and branch, and label, anything) has a GUID too. You can refer to it all the time if you want to. That’s what I do :-)

* Staging area: correct, we don’t have that, so you can’t do ‘stash hunks’. No reason not to add it, but not yet there.