
Another bite of the reposturgeon

Five weeks ago I wrote that direct Subversion support in reposurgeon is coming soon. I’m waiting on one final acceptance test before I ship an official 2.0; in the meantime, for those of you kinky enough to find the details exciting, description follows of why this feature has required such a protracted and epic struggle. With (perhaps entertaining) rambling through the ontology of version control systems, and at least one lesson about good software engineering practice.

Scene-setting: reposurgeon is a command interpreter for performing tricky editing operations on version-control histories. It gets a lot of its power from the fact that it knows almost nothing about individual version-control systems other than how to turn a repository into a git-style import stream, and an import stream back into a repository. And it expects to call helper programs to do those things – in the git case, git fast-export and git fast-import.

What looks like editing of repositories is actually editing of import streams. By leaving serialization and deserialization to be somebody else’s problem, reposurgeon avoids getting entangled in a lot of low-level hairiness about how individual VCSes store and represent files and versions.

The benefit of this strategy is that reposurgeon gets to concentrate on the interesting part: high-level surgical operations on the repository’s metadata and changeset DAG. The cost is that reposurgeon has a lot of trouble editing any metadata that won’t fit into the import-stream representation. In practice, because import streams do a pretty good job of capturing the right abstractions, reposurgeon can win big.

Well, that’s how the model looked two months ago. There have been some changes since. The reposurgeon 1.x model that I’ve just described fails on Subversion because the pre-existing tools for exporting a Subversion repository to an import stream are either weak or broken or both. But no blame attaches; it turns out that these tools suck because the problem is really quite difficult. Two months, people – two months of my concentrated attention.

The symptoms of the problem are these:

1. Subversion doesn’t have a native exporter to import-stream format. I have pretty good zorch with the Subversion developers (I’m a past code contributor with commit access, though never a core dev) and I’ve campaigned hard for them to write an official exporter, but it has never happened.

2. Many of the third-party export tools out there only handle linear repositories (no branching). There’s a strong tendency for people writing these to get to the point where they need to do the mapping from Subversion branches to gitspace branches, pull up short, leave an embarrassed “to be done” comment in the code or README, and disappear never to be heard from again.

3. There are a few export tools that do support branchy repositories; the git project’s own git-svn is probably overall the least bad of these. These require the user to pre-declare the repository’s branch structure rather than deducing it. Goddess help you if you skip those declarations or get them wrong.

4. Even if you do the right thing, the tools often don’t. They are brittle, slow, and lossy.

Example: in Subversion it is possible to commit a changeset that modifies files in multiple branches. This is a bad idea and people seldom do it on purpose, but it can happen by accident too – every sufficiently large and old Subversion repository has a few such accidents in it, and they confuse conversion tools horribly.
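To make the shape of the problem concrete, here is a schematic sketch of what such a cross-branch commit looks like in the dump format (paths, revision number, and byte counts are invented, and the property and text sections are elided): one revision record, followed by node records under two different branch directories.

```
Revision-number: 2317
...

Node-path: trunk/src/core.c
Node-action: change
...

Node-path: branches/1.4-maint/src/core.c
Node-action: change
...
```

A converter has to decide whether to split this into two gitspace commits, assign it to one branch, or give up; the import-stream model simply has no slot for "one commit on two branches".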

Example: The right way to create a Subversion branch is “svn copy”; the wrong way (which not infrequently happens by finger error) is an ordinary directory copy followed by an svn add of the directory. If you do this, later checkouts will look right but the internal information linking the new branch to the rest of the repository will be missing. When you try to convert such a repo (with anything other than reposurgeon), you’ll end up with a detached branch floating in midair.
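For comparison, this is roughly what a real “svn copy” branch creation looks like as a node record in the dumpfile (names and revision invented). The two Node-copyfrom headers are the ancestry link:

```
Node-path: branches/1.4-maint
Node-kind: dir
Node-action: add
Node-copyfrom-rev: 2310
Node-copyfrom-path: trunk
```

In the finger-error case the record is a plain add with no copyfrom headers, followed by explicit adds of every file under the new directory – which is exactly why the resulting branch floats free of the rest of the history.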

Example: In Subversion it is possible to delete a branch or rename it, then immediately create another branch with the same name. But if you feed such a repository to a conversion tool, the result is almost certain to unpleasantly surprise you.

I could go on at book length about more symptoms. Underlying them, Subversion’s model of how version control works is tricky and complicated, with edge cases that sneak up on you. It allows combinations of operations that are rather perverse (cross-branch mixed commits being one of the easier cases in point to understand).

The import-stream model is much simpler – there are fewer combinations of primitive operations and thus less to go wrong there. This is good, but moving content and metadata from one to the other in full generality is a stone bitch. My criticisms of the pre-existing tools may seem harsh, but having grappled with this problem myself I have nothing but sympathy for the people who failed at it previously. The difficulties are very like what hardware people call an impedance mismatch.

My first real step toward solving this problem was to let Subversion itself do as much of the work as possible. It has no import-stream exporter, but it does have the ability to serialize a repository into a dumpfile that is not totally dissimilar from an import stream, and is relatively easy to parse. What reposurgeon does is parse this dumpfile.
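To give a feel for how tractable the format is, here is a minimal sketch of a record reader for it. This is not reposurgeon’s actual parser, it assumes a well-formed dump, and it has none of the error handling real dumps demand:

```python
def read_records(stream):
    """Yield (headers, content) pairs from a Subversion dumpfile.

    A dumpfile is a sequence of records: RFC-822-style "Key: value"
    header lines, a blank line, then (optionally) a content section
    whose size is given by the Content-length header.  Properties and
    file text are concatenated inside that content section.
    """
    while True:
        line = stream.readline()
        if not line:
            return
        # Skip the blank separator lines between records.
        while line == b"\n":
            line = stream.readline()
            if not line:
                return
        # Header block: "Key: value" lines up to a blank line.
        headers = {}
        while line not in (b"", b"\n"):
            key, _, value = line.decode("utf-8").strip().partition(": ")
            headers[key] = value
            line = stream.readline()
        # Content-length covers both the property and text sections.
        content = b""
        if "Content-length" in headers:
            content = stream.read(int(headers["Content-length"]))
        yield headers, content
```

That a workable reader fits in a couple of dozen lines is the point: the dumpfile is simple enough to see all the way down to the bottom of.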

Actually, in a very early version of the code I didn’t parse the dumpfile; instead, I used Subversion’s own client tools to mine information from each repository. There turned out to be two problems with this: (1) piecing all the data together from different tool reports is complicated, and (2) the Subversion tools are horrifyingly slow. I finally gave up on this approach when I discovered that mining a 3K-commit repository this way took eight hours.

In retrospect I should have started with a Subversion dumpfile parser sooner; I was distracted from that approach by the prospect of building a general history-replaying framework that could be applied with minor changes to mining other repository types as well. I had to give up on that objective to make real progress with Subversion, though the replay framework still lives, unused, inside of reposurgeon. It might get used for something else someday, if it hasn’t bit-rotted first.

Those of you familiar enough with Subversion might be wondering, at this point, why I didn’t use Subversion’s client API rather than writing a dumpfile parser. Three reasons: (1) complexity, (2) stability, and (3) documentation.

(1) Holding the markup features and semantics of a relatively simple plain-text dump format in your head is generally easier than remembering all the ins and outs of a complicated API accessing the deserialized version of the same data. It’s more concrete; you can visualize things.

(2) APIs change. Subversion’s client API has changed on a faster scale than the dumpfile format, and that can be expected to continue. By parsing the dumpfile directly I avoid a whole class of completely artifactual version-skew problems with the API. (Actually, APIs – there are two competing ones for Python.)

(3) The APIs are thinly and poorly documented. So was the dumpfile format. But because it’s easy to see all the way down to the bottom of the latter, I liked my chances of coping with the poor documentation of the dumpfile better.

One of the things I ended up doing as a side effect of this project, actually, was writing much more complete and detailed documentation of the Subversion dumpfile format than had existed before – I plied the Subversion developers with questions in order to do this. The results now live in the Subversion repo.

If you take no other lesson from this essay, heed this one: Should you ever find yourself in a similar situation (exploratory parsing of a poorly-documented textual format), stop. Stop coding immediately. Document the format first. Check your conjectures with the host program’s developers, make sure you know what’s going on, push the resulting document upstream, and get it accepted.

This may sound like a lot of work, but I guarantee you that a few days of pre-documenting will save you weeks of arduous debugging time. Writing that documentation is not just a worthy service to other programmers in the future, it’s an implicit specification of what your parser has to do that will save you from flailing around in the dark.

Once I fully grokked the dumpfile format, syntax and semantics both, I could tackle the actual meat of the problem – mapping Subversion’s ontology to the ontology of import streams. This is where “impedance mismatch” starts to be a relevant concept. It’s where the pre-existing tools fall down badly.

The differences between these ontologies cluster around two large ones: (1) Subversion has flows, while import streams do not, and (2) the treatment of branching is quite different.

A “flow” is what internal Subversion documentation rather confusingly calls a “node” – a time series of changes to a single file or directory considered as a unit for purposes of change tracking. If you create a file, modify it several times, delete it, and then again create a file with the same name, that will be two different flows that happen to have the same path.

In the import-stream world there are no flows – just file paths pointing to blobs of content. The practical difference is that the semantics of some legal Subversion operations – notably directory deletes and renames – are difficult to translate into the language of import streams. In fact import streams don’t have any notion of “directories” in themselves; they’re expected to be automatically created when the creation of a file requires it, and to be garbage-collected when they become empty due to file deletions.
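One concrete consequence: a Subversion “delete this directory” node has to be expanded into one per-file delete operation on the import-stream side, since there is no directory object to delete. A toy sketch of that expansion – the path-set bookkeeping here is invented for illustration, not reposurgeon’s actual data structure:

```python
def expand_directory_delete(dirpath, live_paths):
    """Return the per-file delete ops implied by deleting dirpath.

    live_paths is the set of file paths alive at this revision; it is
    mutated in place, the way a stream translator tracks state as it
    walks the revision sequence.
    """
    prefix = dirpath.rstrip("/") + "/"
    doomed = sorted(p for p in live_paths if p.startswith(prefix))
    live_paths.difference_update(doomed)
    # "D" is the fast-import filedelete operation.
    return [("D", p) for p in doomed]
```

Directory renames are worse: they become a delete-expansion like this plus a matching set of re-adds under the new prefix.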

(I’m not actually very clear about why Subversion has flows; the obvious guess would be to help in deductions about history-sensitive merging, but that’s something Subversion has never actually handled very well. On the other hand, it’s only fair to note that nobody was handling it well when Subversion was designed.)

While the existence of flows in Subversion mainly just produces a few odd edge cases that you have to be careful of, the other major difference – in the semantics of branching – is a much bigger deal. Its effects are pervasive.

In Subversion, a branch is nothing more or less than a copy of a source directory, made with “svn copy” so it preserves an invisible link back to the source directory and the revision when the copy was done. All branches are always visible from the top level of the repository. Some branches are used to represent tags (release states of the code) and are never touched by commits after the copy; others represent lines of development and are changed by later commits. A branch copy is an operation that stays visible in the commit history.

Conventionally there is a “trunk” branch directly under the repo root that represents the main line of development, tags live under a “tags” subdirectory, and branches live under a “branches” subdirectory – but nothing enforces these conventions. Hello, cross-branch mixed commits!

In git (and other version-control systems that speak import streams) the model is completely different. Branches are always used for lines of development, never for tags – the import-stream model has real annotated tags instead. Only one branch is available for modification at any given time – there’s no possibility of a cross-branch mixed commit. Branch creations don’t show up as operations in the commit history.
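Here is what a commit and an annotated tag look like in fast-import stream format (names, dates, and marks invented; the blob referenced by mark :3 would have appeared earlier in the stream). Note that the branch exists only as the ref the commit lands on, and the tag is a first-class object rather than a copied directory:

```
commit refs/heads/master
mark :2
committer J. Random Hacker <jrh@example.com> 1339425000 +0000
data 21
Switch to new parser
from :1
M 100644 :3 src/parser.c

tag v2.0
from :2
tagger J. Random Hacker <jrh@example.com> 1339426000 +0000
data 17
Release tag v2.0
```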

It took quite a bit of time and thought to figure out how to map smoothly from the Subversion branch model to the import-stream one. One of my requirements was that the user should not have to declare the branch structure! You’ll be able to read the detailed rules on reposurgeon 2.0’s manual page; the short version is that if trunk is present, then trunk, branches/*, and tags/* are treated as candidate branches, and so is every other directory immediately under the repository root. But: a candidate branch is turned into a tag if there are no commits after the copy that created it.

If trunk is not present, no branch analysis is done – that is, the repo is translated as a straight linear sequence of commits. There’s an option to force this behavior if your repo’s Subversion branch structure is so weird that the above rules would mangle it. In that case you’ll need to do your own post-conversion surgery.
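The discovery rule as just stated can be sketched in a few lines. This is a simplification of what the real analyzer has to do (it ignores, among other things, the commit-history check that demotes a candidate to a tag, and stray files sitting directly under branches/ or tags/):

```python
def candidate_branches(paths):
    """Guess candidate branch prefixes from a set of repo file paths.

    Rule: if trunk/ is present, then trunk, every branches/* and
    tags/* subdirectory, and every other top-level directory is a
    candidate branch.  If trunk is absent, return no candidates and
    treat the history as linear.
    """
    tops = {p.split("/", 1)[0] for p in paths if "/" in p}
    if "trunk" not in tops:
        return set()
    candidates = {"trunk"}
    for p in paths:
        parts = p.split("/")
        if parts[0] in ("branches", "tags"):
            if len(parts) >= 3:          # a file inside branches/X or tags/X
                candidates.add(parts[0] + "/" + parts[1])
        elif parts[0] != "trunk" and len(parts) >= 2:
            candidates.add(parts[0])     # other top-level directory
    return candidates
```

The hard part is not this classification but tracing the copy operations that created each candidate, which is where the bugs live.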

The rules I ended up with are simple, but implementing them is not. Analysis of the Subversion copy operations is the trickiest and most bug-prone part of the dumpfile analyzer. I had a pretty fair idea going in how hairy this part was going to be, which is why I approached the Network UPS Tools project and asked to convert their Subversion repo for them.

As a past contributor, I knew that the NUT Subversion repo is large, complex in branch structure, and old enough to have begun life as a CVS repo. That last part matters because some of the ugliest translation problems lurking in the back history of Subversion projects are strange Subversion operation sequences (including combinations of branch copy operations) generated by cvs2svn.

So: I explained to the NUT crew what I’m doing. I told them they’d get a better quality conversion to git than any of the existing tools will deliver … eventually. I was up front about the conversion code being a beta that would probably break nineteen different ways on their repo before I was done, and that’s sort of the point. Fortunately, I got support from the project lead (thank you, Arnaud Quette!) and active cooperation from the project’s internal advocate for a git switchover (thank you, Charles Lepple!).

Setting up this real-world test turned out to be a Good Thing. Charles Lepple has pointed out more than a dozen bugs that turned out to be due to my code not handling strange cases in the Subversion metadata well enough, cases that a smaller and younger and less grungy repository history might never have exhibited. I fixed another one this morning. There is a realistic chance it will have been the last one…but maybe not.

I’ll ship reposurgeon 2.0 when Charles and Arnaud sign off on the NUT-UPS conversion. Besides stomping all the branch-analysis bugs we can find, there’s one more feature to make work. I want to add a surgical primitive that can find and perform merges back to trunk for Subversion branch tips that are in a mergeable state. This would not actually be a Subversion-specific feature, but applicable to any import stream.

If I shipped 2.0 today, reposurgeon would already blow every other utility for lifting Subversion repositories clean out of the water. I’m not satisfied with that; I want it to be bulletproof. But for those of you itching to get your hands on the beta:

git@gitorious.org:reposurgeon/reposurgeon.git

No warranties express or implied, etc. I have documented it all, however. Throw it at the gnarliest repo you can find and let me know if you spot bugs.

19 thoughts on “Another bite of the reposturgeon”

Am I the only one who smiles about the typo “reposturgeon”? Yes, I know it’s caused by writing “repost” so often (I have a similar problem writing “int” when I mean “in”), but the idea of naming a repo project after a mythical fish amuses my ever-adolescent mind more than it should.

> Writing that documentation is not just a worthy service to other programmers in the future,…

The process behind your writing of that documentation was also a worthy service to other programmers in the Present! You forced many of us Subversion old-timers to check and re-check ancient assumptions about the dumpfile format, and offered a learning opportunity for many of the newer devs who’ve not played in this corner of the codebase and APIs before. Thanks, Eric.

Because that’s the way Jim Blandy and Karl Fogel conspired to model the best of behaviors from single-file version control ala CVS, recognizing that files sometimes have a “change of address” path-wise, and all while simultaneously treating directories as first-class versioned objects (instead of merely dumb containers)? I know, weak answer. :-)

To be honest, the flow concept predates me by just a bit and my memory is pathetic anyway, so I can really only talk about our *uses* of the concept, which mostly boil down to determining the “relatedness” of objects for various reasons.

Some of those reasons are user-visible. Maybe they allow us to more precisely model the user’s actions rather than merely the results of them (for example, being able to distinguish a deleted-file-plus-newly-added-one-at-the-same-path from a mere content replacement of that file). Or they allow us to save the user some headache: quickly suggesting that the user has made a mistake when trying to merge two completely unrelated directories, or when trying to perform a local “switch” operation to an unrelated repository location.

Some of the reasons are more closely tied to the underlying storage implementations. For example, we track the chain of predecessors between “node-revisions” (the flow) because that helps our binary delta storage algorithm’s efficiency. We reason that if two node-revisions come from different “flows”, their contents are less likely to be similar than if they came from the same flow, which is knowledge we apply when transmitting tree deltas from the server to the client (for updates, merges, etc.)

that reminds me, when i was in grad school for cs (at the university of louisville) around 2004, there was a poem on the notice board outside the acm club room. it was also about debugging, and i think it was marked as having been printed in some journal decades ago. i believe it was in two parts, the first being about normal software practices, and the second, which i think was titled something like “a vision of the millennium”, was about ideal practices. the second part contained the only bit i remember at all close to verbatim: two lines that went something like “they have no need to take bugs out/who never put them in”.

unfortunately, i never wrote anything down about it, and i’ve been unable to find any trace of it online. by any chance, does anyone here either recognize it, or have the google-fu to dig it up?

Have you ever looked at Twisted (http://twistedmatrix.com/trac/browser)? I don’t know if they want to move to Git or Hg, or whether it has a CVS pre-history, but it surely looks gnarly and ancient enough.

There was once a Git Google Summer of Code project, though I don’t remember from which year, that attempted to add a “remote helper” for Subversion repositories to Git (the intent of “remote helpers” is to be able to treat foreign repositories as git repositories WRT fetching). The first step was translating svn dump format into fast-import stream; “remote helpers” are based on import streams, IIRC. One of the results was the svnrdump tool, to make dumps remotely.

I don’t know if you heard of this project, or whether you made use of discoveries made during the creation of this tool… I guess that it helped you at least indirectly, as the author was (from what I remember of the git mailing list discussion) corresponding extensively with Subversion developers.

> Have you ever looked at Twisted (http://twistedmatrix.com/trac/browser)? I don’t know if they want to move to Git or Hg, or whether it has a CVS pre-history, but it surely looks gnarly and ancient enough.

Just from looking here, it’s clear it was a CVS repository once, converted with cvs2svn (I don’t know when). And it looks rather nasty indeed ;)

Seems that Twisted has a Bazaar mirror ready, but not Git. Take a look at the Git Mirror page and see that it’s merely instructions on cloning via git-svn; what’s more, they don’t recommend a complete clone as is typical (git-svn is indeed very slow, and rather bandwidth-wasting; it might be faster to make a native SVN copy (svnrdump if they’re using Subversion 1.7 on the server side) and run git-svn against the local copy). Still seems to be a perfect testbed for reposurgeon to me.

I’m not familiar enough with Bazaar to make a judgment on the quality of that mirror. I’ve tried using bzr before and it is quite… bizarre.

> There was once a Git Google Summer of Code project (though I don’t remember from which year) which attempted to add a “remote helper” for Subversion repositories to Git (the intent of “remote helpers” is to be able to treat foreign repositories as git repositories WRT fetching). The first step was translating svn dump format into fast-import stream; “remote helpers” are based on import streams, IIRC. One of the results was the svnrdump tool, to make dumps remotely.

Have you at all considered making a svn remote helper for git, or at least writing an API for the subversion converter that someone else can use? Reposurgeon may not be designed as a mirroring mechanism between VCSs, but the subversion components can be used for that purpose. I personally can’t wait for the day I can git-clone a svn:// URL and be able to push any changes back as if it were just another git repository.