gg - hybrid version control system

gg is a version control system based on git. It’s called "gg" so it’s easy to type, but I suppose the "good game" meaning is apt, as I wrote it for fun, not out of necessity.

I wanted to see how long it would take me to write a self-hosting version control system from scratch. I allowed myself to use Steve Reid’s public domain SHA1 code, but shunned shell scripts, restricting myself to C.

It took a few hours to achieve my goal. I was mildly disappointed. Despite following a well-trodden road, I tripped up many times. I highly recommend the exercise by the way. If you’re one of those programmers who gets a kick out of writing self-modifying, self-reproducing, self-replicating, self-interpreting or self-compiling programs, why not try your hand at a self-hosting version control system?

Afterwards, I couldn’t stop picking at it. The code is now my a playground for various version control ideas. I’m still sticking to C only, though copying code copiously from git.

Update 2/09: I haven’t worked on this in a long time as I have a short attention span. For example, since I stopped working on this, I was obsessed with continued fractions and implementing them with Communicating Sequential Processes (CSP). Shortly afterwards I had forgotten this too, because I was busy writing a Dalvik compiler. I’ve since dropped that on the floor to focus on an Android game.

for a standard clone. This fetches a lot of files because I haven’t implemented packs yet.

Grand Plans

Git is my version control system of choice. If imitation is the sincerest form of flattery, this code is proof of that. However, I still see plenty of room for improvement. Below are some of my hare-brained schemes.

The zero commit

Index improvements

Checking out the earliest commits of git project shows that Linus originally used one tree object per commit. The tree object was really a list object. Files in subdirectories were simply stored with "/" in the tree; its contents resembled the output of "find *". Accordingly, the index was also a list.

Soon a hierarchy was introduced to combat ballooning tree sizes. Tree objects now pointed to other trees. To release as rapidly as possible, the index was not similarly upgraded. To this day it remains a list. For a commit, git seeks out each "/" in each filename and figures out where to go on the main tree. For a checkout, git painstakingly walks the tree and reconstructs the list.

The index should also have a tree structure, which is how I implemented the index in gg. This naturally allows empty subdirectories to be tracked. Subdirectory stats can also be cached, though I have yet to exploit this. Furthermore, the index can be more efficient: in git, if you have ten files in "/some/deep/down/subdirectory/" then this string sppears ten times, while in a tree-based index, it would only appear once.

The infamous git index race condition has several solutions. I believe my implementation is the cleanest and cheapest. I also avoid the long-index-write race too, by using Linus' trick of reading the mtime when creating a new index, and using that as the timestamp. This allows index updates even as files are being modified.

I intend to look into inotify one day. Perhaps it can replace the current index mechanism entirely, at least on Linux.

Hashing sizes

The size of an object is fed into SHA1 to compute its hash. I don’t see the reason for this. It precludes optimizations when hashing sequences with common prefixes.

I know of one instance when you might want to feed the message length to the hash before the message itself, and that’s when constructing a MAC (Message Authentication Code). If you naively choose MAC(k, m) = SHA1(k || m), an attacker can construct bogus messages with valid MACs by extending the message. One solution is to put the key at the end, while another is to hash the length of the message. But for version control, we’re using the hash for identification, not authentication.

Permissions

I’ve taken a purer approach with gg and trees do not contain any mode or permission information. Trees and blobs truly track content only.

Each commit holds a single blob object which records the executable files in the commit tree. (I have plans for symlinks, owners, ACLs, etc.)

Purer still would be to have commits track content only, and make the permissions blob a regular file in the tree. I may do this in the future. However, since almost everyone would use this feature, I see no harm in removing one step and pointing to the blob in the commit.

Rename cache

This feature should be easy to add to git, as it concerns some of its utilities, not the core.

With certain options, rename detection can be slow. Smarter algorithms will be slower still.

I’d like to cache results from the rename detector, and avoid paying the cost for future operations. Furthermore, I’d like to edit the cache, and perhaps specify a few renames of my own, or insert renames from another rename detector. I envision rules like "file A in commit X renames to file B in commit Y".

These would be hints only, and options would control if they are to be used. One could even have several sets of hints, perhaps to benchmark and compare rename detection algorithms.

This would also give git the ability to do explicit renaming. Of course, typical users can ignore this feature, but there are some who insist on this capability. Renaming seems to be a flashpoint in version control debates. If git could do both, there’s nothing to complain about.

Pointing to predecessors

What if you want to use git on a gargantuan repository? A massive directory containing related projects?

A stock answer from git users (myself included) is to break up the directory into submodules, allowing partial checkouts: developers can checkout one component, with its complete history, yet all the components are still related.

But reorganizing a large project is no small feat. Perhaps the version control system can do it automatically for you, so that subdirectories containing more than a certain amount of content automatically become like submodules, in that they point back to the commit, as well as the previous versions of themselves.

In fact, what if we did this automatically with every subdirectory? The overhead seems slight for typical projects, yet the benefits are enormous. Partial checkout with history is easy.

Maybe even individual blobs could point to old versions of themselves, so particular files to be checked out quickly with their history. This would be ideal for a directory containing many large files.

There are hairy complications that need research. What if you rename a directory? Should you use the rename cache to store this information, or bake the rename into the commit? I lean towards to former possibility, as the latter implies there would be times when explicit renaming is mandatory.

Re-entrancy

I’m trying to minimize the usage of global variables. I believe this will force the project to be better organized. Also, it means it might be easier to add threading if needed later, and to provide a library interface.

Aggressive Abstraction

Git’s code is sprinkled with low-level implementations of lists and hash tables. I’m using ADTs instead, to reduce code duplication, to make it easier to maintain (unfortunately my aversion to macros means I use pointer casting), and to encourage different implementation strategies. For example, I use crit-bit trees extensively, which would be painful to implement multiple times.

It could be slightly slower, but I’m happy as long as the memory overhead is fairly small. Git is generally I/O-bound, so we can afford to burn a few cycles.

Git abstracts transports. I’m taking this further and abstracting object database operations. This should allow easy implmentation of object sharing, and lazy object fetching, and perhaps even bidirectional communication between different SCMs.

Compression

Git’s compression is already very good, but I want to experiment with this just out of interest.

Git uses zlib’s deflate on individual files or deltas, or in other words, LZ77 compression followed by Huffman encoding, with an Adler32 checksum. I want to try some sort of PPM across similar files combined with range encoding. And dispense with the checksum: SHA1 takes care of integrity.

SHAM1?

Now I’m really clutching at straws. For an efficiency boost, we could resort to cryptographic heresy and have an option to setup repositories that only use the first 64 bits of SHA1 (SHA1 Minus :). Heuristically, you’d need 2^32 commits, blobs and trees or so before collisions are likely, but you’d have to be fairly sure there will never be any malicious behaviour. Tampering could go undetected.

I haven’t thought of good applications for this yet. If there were a commercial Git-like product they could give away a truncated SHA1 version for free as a demo, and upgrade your repos when you buy.

Or perhaps you’d get bragging rights for coding the world’s most space-efficient somewhat reliable version control system.