Converting FreeBSD to git

I've started a project to create a continually-updating mirror of FreeBSD in git. This is part of the ongoing discussion of a future revision control system for FreeBSD. At the least, I would like to see it replace the proprietary, slow, non-distributed p4 system we currently use for side-project development in FreeBSD.

Reasons to use git instead of the others

git's repository structure is solid

Unlike many other systems, git has a very simple, but reliable and efficient repository storage system. Even in the presence of a power outage resulting in losing or corrupting recently-written changes, the state of the repository will be trivially recoverable and verifiable (and since it's distributed, you've got massive backups anyway).
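
One reason the state is verifiable: git names every stored object by the SHA-1 of its contents, so a damaged object no longer matches the name it is stored under. A minimal sketch of that check for a file ("blob") object follows. It is written in Python rather than git's own C, and the function names are illustrative, not part of git.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the object name git assigns to a file with these contents."""
    # A blob is hashed as the header "blob <size>\0" followed by the raw bytes.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


def is_intact(object_name: str, content: bytes) -> bool:
    """True if the contents still hash to the name they are stored under."""
    return git_blob_id(content) == object_name


# The empty file hashes to git's well-known empty-blob name.
empty_id = git_blob_id(b"")
print(empty_id)  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

Because trees and commits are named the same way and refer to their children by name, checking one commit name transitively checks everything it points to.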

git's repositories are small

The packed repository you would download in your initial clone would probably be 1/3 to 1/4 the size of the CVS repository, which itself is smaller than the repository would be in svk. I don't know what hg's repo sizes are like.

git is incredibly powerful

One of the interesting problems with explaining git is that it allows so many development styles. You can use it almost like CVS. You can use it like a dVCS. You can use it like the Linux kernel community does if that's your style (cringe). Of course, this leads to trouble: many tasks can be completed through different sets of git commands, so you can't declare a canonical way to do things, and even saying "no you can't do that" is something I've never been able to do and be right.

git is distributed

Now, you can commit as you develop, then test, then push. If you find things in your testing that are wrong, you can commit fixes before pushing, or even go back and edit your local history to erase your mistakes, making you look even more ninja than you really are.

You can also push your changes up to a personal repository for others to access. They can merge it into a personal tree of their own, do repeated merges in all sorts of directions, and have it just Do The Right Thing.

easy cherry-picking

MFCs become easy: just find the sha1 of the commit to merge, and run git-cherry-pick <sha1> on the branch you are merging to. If you need to do a series, you could apply them all without committing, using git-cherry-pick -n <sha1> for each, and then make one final commit, or just repeat git-cherry-pick.

git is blindingly fast

Committing, diffing, branching, checking out other branches, and merging often go faster than it feels should be possible. I've searched the patches of the entire history of Mesa (8 years old, 180MB repository) in a couple of minutes, which was wonderfully useful at the time.

git is easily importable to FreeBSD

We could import a sufficient subset of the git tools into FreeBSD; that subset consists of shell and C code. There are a couple of tools in perl we might want, and I've heard that the community is open to accepting C replacements for the scripting-language tools, since the use of scripting languages is mostly an artifact of rapid development. Some of the shell script porcelain included with git does use perl one-liners, though. The python tools probably aren't important to us. The tk tool (gitk, graphical history viewer) is, but should only be inflicted on those who ask for it. These could all continue to be provided by the devel/git port. Cogito, on the other hand, has a large array of bashisms in it, so it probably wouldn't be a viable tool to import without a lot of hacking.

It is GPLed, though.

Reasons not to use git

No checkout-only mode

So far, git lacks a mode where you don't check out the full history of the repository. One hope would be that the repository is small enough that we don't care. However, there is active design work on the mailing list for a mode where your local tree retains a limited history to keep file size down, while keeping fast updates and the normal toolset. I hope this will be resolved soon.

In the meantime, if users are only concerned about download size for checking out their whole repository, the packed git repository is estimated to be about 25% larger than the expanded current tree, so some may not be concerned about the additional overhead where they would have been with CVS (the checkout is far faster, as well, since it needs no more than a couple of roundtrips).

Additionally, an rsync of a continually-updated checkout could be offered for those that don't desire history at all and don't make changes locally. It's been suggested that a shell wrapper around rsync and hardlinks could allow for local changes to be preserved with very little disk overhead.
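
The source does not describe that wrapper, so the following is only a guess at the idea, sketched in Python rather than shell: keep the rsynced tree pristine, build the working tree out of hard links to it, and break a link only when a file is edited. The names link_tree and edit_file and the sample paths are hypothetical.

```python
import os
import tempfile


def link_tree(pristine: str, work: str) -> None:
    """Recreate the layout of `pristine` under `work`, hard-linking each file."""
    for dirpath, _dirnames, filenames in os.walk(pristine):
        rel = os.path.relpath(dirpath, pristine)
        target_dir = os.path.normpath(os.path.join(work, rel))
        os.makedirs(target_dir, exist_ok=True)
        for name in filenames:
            os.link(os.path.join(dirpath, name), os.path.join(target_dir, name))


def edit_file(path: str, new_text: str) -> None:
    """Change a file in the working tree without touching the pristine copy."""
    # Writing a new file and renaming it over the old name breaks the hard link.
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        f.write(new_text)
    os.replace(tmp, path)


with tempfile.TemporaryDirectory() as top:
    pristine = os.path.join(top, "pristine")
    work = os.path.join(top, "work")
    os.makedirs(os.path.join(pristine, "sys"))
    with open(os.path.join(pristine, "sys", "param.h"), "w") as f:
        f.write("original\n")

    link_tree(pristine, work)
    # Before any edit, both names refer to the same file on disk.
    shared = os.path.samefile(os.path.join(pristine, "sys", "param.h"),
                              os.path.join(work, "sys", "param.h"))

    edit_file(os.path.join(work, "sys", "param.h"), "local change\n")
    with open(os.path.join(pristine, "sys", "param.h")) as f:
        pristine_text = f.read()
```

Only edited files take extra space, which is where the "very little disk overhead" would come from. The sketch ignores symlinks and assumes both trees live on one filesystem.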

git-tar-tree can also be used to provide snapshots, hooked up to a web script to provide a .tgz of arbitrary pieces of the repository.

However, these last 3 methods are just temporary solutions until the proper support is implemented upstream.

Note: git does have a checkout-only mode:

git clone --depth 1 SOMEREPO

will clone only the most recent commit of the repo. The resulting shallow repository is read-only, however.

Bad UI in git-core

The core tools (ports/devel/git) commit some serious POLA violations. They're all documented in the manpages, if you know to look for them, but you don't. The cogito tool adds some porcelain on top of the git plumbing to provide a better UI in many ways, but doesn't provide as much functionality on its own as git does.

Non-reasons to not use git

ACLs

Using the update hook (the same thing used to push out commit messages), you can examine a set of changes being pushed and reject them if desired by exiting with an error code.
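
A minimal sketch of such a hook, assuming a per-branch ACL. git runs the update hook once per ref being pushed, passing the ref name, the old object name and the new object name. The ACL table, the user names, and the way the pusher is identified are all hypothetical; a per-path policy would additionally have to diff the old and new objects.

```python
import fnmatch
import getpass
import sys

# Hypothetical ACL table: user name -> ref patterns that user may update.
ACL = {
    "alice": ["refs/heads/*"],
    "bob": ["refs/heads/projects/bob/*"],
}


def may_update(user, refname, acl=ACL):
    """True if `user` is allowed to update `refname` under `acl`."""
    return any(fnmatch.fnmatchcase(refname, pattern)
               for pattern in acl.get(user, []))


def main(argv, user=None):
    """Hook body: argv is [script, refname, old-sha1, new-sha1]; returns exit code."""
    if len(argv) != 4:
        print("usage: update <refname> <old-sha1> <new-sha1>", file=sys.stderr)
        return 2
    refname = argv[1]
    # Assumption: the pusher is the account the hook runs as.
    user = user or getpass.getuser()
    if may_update(user, refname):
        return 0
    print(f"{user} may not update {refname}", file=sys.stderr)
    return 1
```

Installed as hooks/update in the repository being pushed to, the script would end with sys.exit(main(sys.argv)); any non-zero exit status makes git refuse that ref update.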

How to get FreeBSD playing with git

I'm working on getting together a system suitable for active mirroring of FreeBSD's entire src and ports trees to a git repository. The main issue is RAM. On an i386 with 3GB of RAM and a pile of swap, it runs out of address space converting src. The current tool, called parsecvs, uses memory in O(revisions * files). We've got a plan for how to fix parsecvs to use memory in O(revisions + files) based on a few assumptions that we think are going to be good.
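
To give a feel for the difference between the two bounds, here is a back-of-the-envelope sketch. It does not reproduce parsecvs's actual data structures. It assumes (my guess at one of the assumptions alluded to above) that each revision touches only a handful of files, and the repository sizes used are made-up round numbers.

```python
def dense_cells(num_revisions, num_files):
    """Entries needed when every revision stores one entry per file in the tree."""
    return num_revisions * num_files


def sparse_cells(num_files, files_touched_per_revision):
    """Entries needed for one initial tree plus only what each revision changes."""
    return num_files + sum(files_touched_per_revision)


# Hypothetical sizes, for illustration only.
revisions = 100_000
files = 40_000
touched = [3] * revisions  # assume every revision changes exactly 3 files

dense = dense_cells(revisions, files)
sparse = sparse_cells(files, touched)
print(dense, sparse)
```

With a bounded number of files touched per revision, the second count grows as revisions + files, which is the difference between exhausting a 32-bit address space and fitting comfortably in RAM.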

Alternatively, cvs2svn and git-fastimport together may end up doing what we need. However, having not seen the quality of cvs2svn's conversions, I haven't jumped in to play with it. Figuring out what really happened from CVS history can be hard, and many tools get it wrong.

There are a few shortcomings related to both cvs2svn and git-svnimport.

cvs2svn seems to have difficulties when branches have more than one name (e.g. "WARNING: in 'src/contrib/binutils/ltcf-gcj.sh,v': branch '1.1.1' already has name 'FSF', cannot also have name 'GNU', ignoring the latter") and fails completely on src ("ERROR: Multiple definitions of the symbol 'isdn' in 'src/gnu/usr.sbin/isdn/ulaw2alaw/ulaw2alaw.c,v'").

Note: parsecvs reports a number of 'unnamed branch x' and bad branches, mainly related to the 'old_RELENGxxx' and RELENG_2_2_x branches, which should be taken into account if parsecvs is used. Also, the repack was done with git repack -a -d -f -l, and took longer than the conversion itself.

One thought to consider is whether the entire src should be converted to one git repo. Git now supports submodules, so it might make more sense to make each src/contrib/x an individual repo. Then selected parts of the tree could be obtained separately.

Speed Comparisons

These comparisons are done against a completely converted src repository.