Friday, 6 August 2010

CVS's problems resurface in Git

Although modern version control systems have improved a lot on CVS, I get the feeling that there is a fundamental version control problem that the modern VCSes (Git, Mercurial, Bazaar, and I'll include Subversion too!) haven't solved. The curious thing is that CVS had sort of made some steps towards addressing it.

In CVS, history is stored per file. If you commit a change that crosses multiple files, CVS updates each file's history separately. This causes a bunch of problems:

CVS does not represent changesets or snapshots as first class objects. As a result, many operations involve visiting every file's history.

Reconstructing a changeset involves searching all files' histories to match up the individual file changes. (This was just about possible, though I hear there are tricky corner cases. Later CVS added a commit ID field that presumably helped with this.)

Creating a tag at the latest revision involves adding a tag to every file's history. Reconstructing a tag, or a time-based snapshot, involves visiting every file's history again.

CVS does not represent file renamings, so the standard history tools like "cvs log" and "cvs annotate" are not able to follow a file's history from before it was renamed.
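To make the changeset-reconstruction problem concrete, here is a rough Python sketch of the kind of matching involved. The `FileRevision` record, the five-minute window, and the matching rule (same author and same log message) are assumptions chosen for illustration; this is not how any particular CVS tool does it, and it ignores the tricky corner cases mentioned above.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass(frozen=True)
class FileRevision:
    """One entry in a single file's history (hypothetical record layout)."""
    path: str
    revision: str
    author: str
    message: str
    timestamp: int  # seconds since the epoch


def reconstruct_changesets(revisions, window=300):
    """Guess changesets from per-file revisions.

    Revisions with the same author and log message are chained together
    as long as each one is at most `window` seconds after the previous one.
    """
    def by_commit(rev):
        return (rev.author, rev.message)

    ordered = sorted(revisions, key=lambda r: (r.author, r.message, r.timestamp))
    changesets = []
    for _, group in groupby(ordered, key=by_commit):
        current = []
        for rev in group:
            # A long gap means this is probably a separate commit.
            if current and rev.timestamp - current[-1].timestamp > window:
                changesets.append(current)
                current = []
            current.append(rev)
        changesets.append(current)
    # Present the guessed changesets in chronological order.
    return sorted(changesets, key=lambda cs: cs[0].timestamp)
```

Note that every file's history has to be read before a single changeset can be reported, which is the cost described above.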

In the DAG-based decentralised VCSes (Git, Mercurial, Monotone, Bazaar), history is stored per repository. The fundamental data structure for history is a Directed Acyclic Graph of commit objects. Each commit points to a snapshot of the entire file tree plus zero or more parent commits. This addresses CVS's problems:

Extracting changesets is easy because they are the same thing as commit objects.

Creating a tag is cheap and easy. Recording any change creates a commit object (a snapshot-with-history), so creating a tag is as simple as pointing to an already-existing commit object.
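The commit-graph structure described above can be sketched in a few lines of Python. This is a toy model under stated assumptions: the `Commit` class, the `snapshot` and `changeset` helpers, and the way IDs are hashed are all invented for illustration, and real systems such as Git use a different on-disk format.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Commit:
    tree: tuple     # sorted (path, content) pairs: a snapshot of the whole tree
    parents: tuple  # IDs of zero or more parent commits
    message: str

    @property
    def id(self):
        # Toy content hash; not Git's actual object encoding.
        data = repr((self.tree, self.parents, self.message)).encode()
        return hashlib.sha1(data).hexdigest()


def snapshot(files):
    """Turn a {path: content} dict into an immutable tree snapshot."""
    return tuple(sorted(files.items()))


def changeset(commit, store):
    """Paths that differ between a commit and its first parent."""
    new = dict(commit.tree)
    old = dict(store[commit.parents[0]].tree) if commit.parents else {}
    return sorted(p for p in new.keys() | old.keys() if new.get(p) != old.get(p))


store = {}
c1 = Commit(snapshot({"README": "hello", "main.c": "v1"}), (), "Initial import")
store[c1.id] = c1
c2 = Commit(snapshot({"README": "hello", "main.c": "v2", "util.c": "new"}),
            (c1.id,), "Add util")
store[c2.id] = c2

# A tag is just a name pointing at an already-existing commit.
tags = {"v1.0": c2.id}
```

Extracting the changeset for `c2` only needs that commit and its parent, and creating the tag touches no file history at all.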

However, often it is not practical to put all the code that you're interested in into a single Git repository! (I pick on Git here because, of the DAG-based systems, it is the one I am most familiar with.) While it can be practical to do this with Subversion or CVS, it is less practical with the DAG-based decentralised VCSes:

In the DAG-based systems, branching is done at the level of a repository. You cannot branch and merge subdirectories of a repository independently: you cannot create a commit that only partially merges two parent commits.

Checking out a Git repository involves downloading not only the entire current revision, but the entire history. So this creates pressure against putting two partially-related projects together in the same repository, especially if one of the projects is huge.

Existing projects might already use separate repositories. It is usually not practical to combine those repositories into a single repository, because that would create a repo that is incompatible with the original repos. That would make it difficult to merge upstream changes. Patch sharing would become awkward because the filenames in patches would need fixing.

This all means that when you start projects, you have to decide how to split your code among repositories. Changing these decisions later is not at all straightforward.

The result of this is that CVS's problems have not really been solved: they have just been pushed up a level. The problems that occurred at the level of individual files now occur at the level of repositories:

The DAG-based systems don't represent changesets that cross repositories. They don't have a type of object for representing a snapshot across repositories.

Creating a tag across repositories would involve visiting every repository to add a tag to it.

There is no support for moving files between repositories while tracking the history of the file.

The funny thing is that since CVS hit this problem all the time, the CVS tools were better at dealing with multiple histories than Git.

To compare the two, imagine that instead of putting your project in a single Git repository, you put each one of the project's files in a separate Git repository. This would result in a history representation that is roughly equivalent to CVS's history representation. That is, every file has its own separate history graph.

To check in changes to multiple files, you have to "cd" to each file's repository directory, and "git commit" and "git push" the file change.

To update to a new upstream version, or to switch branch, you have to "cd" to each file's repository directory again to do "git pull/fetch/rebase/checkout" or whatever.

Correlating history across files must be done manually. You could run "git log" or "gitk" on two repositories and match up the timelines or commit messages by hand. I don't know of any tools for doing this.
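The manual chore of committing in each repository in turn could be scripted roughly like this. The `commit_everywhere` helper and the directory names in the usage note are hypothetical; the Git invocations themselves (`git -C <dir> commit -a -m <msg>` and `git -C <dir> push`) are standard.

```python
import subprocess


def commit_everywhere(repo_dirs, message, run=subprocess.run):
    """Commit and push in each repository directory, one at a time.

    `run` defaults to subprocess.run; it can be swapped out to preview the
    commands without executing them.
    """
    commands = []
    for repo in repo_dirs:
        for args in (["commit", "-a", "-m", message], ["push"]):
            command = ["git", "-C", repo, *args]
            commands.append(command)
            run(command, check=True)  # stop at the first failing command
    return commands
```

For example, `commit_everywhere(["files/README", "files/main.c"], "Fix bug")` would run four Git commands. Nothing ties them together: if the second repository's push fails, the first repository's change has already been published, and no object anywhere records that the two commits belong together.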

In contrast, for CVS, "cvs commit" works across multiple files and (if I remember rightly) even across multiple working directories. "cvs update" works across multiple files.

While "cvs log" doesn't work across multiple files, there is a tool called "CVS Monitor" which reconstructs history and changesets across files.

Experience with CVS suggests that Git could be changed to handle the multiple-repository case better. "git commit", "git checkout" etc. could be changed to operate across multiple Git working copies. Maybe "git log" and "gitk" could gain options to interleave histories by timestamp.
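Interleaving histories by timestamp is simple to sketch. The example below assumes each repository's log has been dumped as tab-separated lines of Unix timestamp and subject, which is what `git log --format='%ct%x09%s'` prints; the `parse_log` and `interleave` helpers and the repository names in the usage note are made up for illustration.

```python
import heapq


def parse_log(repo, text):
    """Parse tab-separated 'timestamp, subject' lines into sortable tuples."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue
        timestamp, subject = line.split("\t", 1)
        entries.append((int(timestamp), repo, subject))
    return entries


def interleave(*histories):
    """Merge several histories into one newest-first timeline."""
    # Sort each history first, because log order is not guaranteed to be
    # strictly by timestamp (clocks can be skewed between commits).
    ordered = (sorted(h, reverse=True) for h in histories)
    return list(heapq.merge(*ordered, reverse=True))
```

Given one log for a hypothetical `libfoo` repository and one for `app`, `interleave(parse_log("libfoo", ...), parse_log("app", ...))` returns a single list of `(timestamp, repo, subject)` tuples. This only lines commits up in time; it does not know which commits in different repositories actually belong together.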

Of course, that would lead to cross-repo support that is only as good as CVS's cross-file support. We might be able to apply a textual tag name across multiple Git repos with a single command just as a tag name can be applied across files with "cvs tag". But that doesn't give us an immutable tag object that spans repos.
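To illustrate the difference between a textual tag name and an immutable object, here is one hypothetical shape such an object could take: a record that pins each repository to a commit ID and is itself identified by a hash of its contents. The function name, the record layout, and the placeholder commit IDs are all assumptions for the sake of the example; no existing Git feature is being described.

```python
import hashlib
import json


def make_cross_repo_snapshot(name, heads):
    """Build a content-addressed record pinning each repo to one commit ID.

    `heads` maps a repository name to a commit ID. Returns (snapshot_id, record).
    """
    # sort_keys makes the serialisation, and so the ID, independent of dict order.
    record = json.dumps({"name": name, "repos": heads}, sort_keys=True)
    snapshot_id = hashlib.sha1(record.encode()).hexdigest()
    return snapshot_id, record


# Placeholder commit IDs, not real ones.
snap_id, record = make_cross_repo_snapshot(
    "release-1.0", {"app": "aaa111", "libfoo": "bbb222"})
```

Because the ID is derived from the pinned commit IDs, moving any repository to a different commit yields a different snapshot ID, whereas a tag name applied separately in each repository can be moved in one of them without the others noticing.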

My point is that the fundamental data structure used in the DAG-based systems doesn't solve CVS's problem; it just pushes it up to a coarser level of granularity. Some possible solutions to the problem are DEPS files (as used by Chromium), Git submodules, or Darcs-style set-of-patches repos. These all introduce new data structures. Do any of these solve the original problem? I am undecided -- this question will have to wait for another post. :-)

11 comments:

We think that the Darcs approach could help with this problem; however, the current version of Darcs is fairly limited in what it will let you do.

In principle, you can merge repositories together by taking the union of sets of patches. You can also split repositories by taking the subset. I've done this for some small repositories, myself.

Now the bad news: Merging repositories that have overlapping filenames (say README) may lead to a conflict. Your ability to take a subset of patches is limited by Darcs' patch dependency enforcement. So if you have patches that straddle the two pieces of the repository that you want to split, you won't be able to select the patches that depend on them without pulling in the straddling patches too.

I think it should be feasible to use an Über-repository that contains a bunch of Git submodules for the tag-multiple-repositories-at-once problem. You still need to do individual commits in the respective repositories, but can then unify these into one commit in the Über-repository.

You should look at the subrepo feature in Mercurial. We basically have a single root repo and it keeps track of nested repositories *and the revision they're at*. There are some further use cases to shake out, but I think it's a very powerful feature.

As long as you import the standalone module as an independent branch before moving code, you can move stuff around at will without losing history. Just use git-merge with one of the 'no op' strategies, do your move, then commit.