Looking for reposurgeon test cases

I just released reposurgeon 1.2 and am continuing to develop the tool. In order to test some of the newer features, I’m looking for repository conversions to do. If you run an open-source project that is still using CVS or Subversion, or some odd non-distributed VCS, I may be willing to lift it to git for you (and from git to any other DVCS you might prefer is a pretty small step). Details of this offer follow; limited time only, first come, first served.

(Why have me do it? Well…especially for older projects with a complex revision history, it’s a messy and daunting job. The tools are somewhat flaky, the difference between a sloppy conversion and a good one is significant, and good conversions require experience and judgment.)

The ideal test for reposurgeon is a Subversion repository of a project that was formerly CVSed and contains a lot of junk commits and artifacts generated by cvs2svn conversion. I’d also like to lift at least one project now in CVS so I can get a good feel for how cvs2svn behaves today (I know it has substantial improvements over older versions, because I wrote at least one of those improvements myself).

The conversion process will look like this:

1. If starting from CVS, I’ll make a preliminary conversion with git-cvsimport. If starting with Subversion, I’ll do the preliminary conversion with git-svn. If your repository is in something weird, I’ll need to either find a lifting tool, or possibly build one, or tell you it’s more work than I’m willing to do.
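In outline, that preliminary step looks something like the sketch below. This is a minimal illustration, not my exact procedure: the URLs, the module name, and the `esr` entry in the author map are all placeholders, and the actual import commands are commented out because they need a live repository to run against.

```shell
# Minimal sketch of the preliminary conversion (placeholder URLs and names).
# Both importers want an author map translating bare CVS/Subversion
# usernames into full git identities, one mapping per line.
cat > authors.map <<'EOF'
esr = Eric S. Raymond <esr@thyrsus.com>
EOF

# Subversion source: one-shot clone, assuming the conventional
# trunk/branches/tags layout (uncomment and point at the real URL):
# git svn clone --stdlayout --authors-file=authors.map \
#     http://svn.example.org/project project-git

# CVS source: import the module "project" from its CVSROOT:
# git cvsimport -A authors.map -C project-git \
#     -d :pserver:anonymous@cvs.example.org:/cvsroot/project project
```

Any username the importer meets that isn’t in the map will either be passed through raw or abort the run, depending on the tool, so it pays to build the map completely before starting.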

2. This is the interesting part: clean up the mess. Up-converted repos tend to be full of conversion artifacts. For example, many versions of cvs2svn mechanically generate commits to represent CVS release tags; a high-quality conversion should create actual tag objects corresponding to the junk commits and delete the junk. Also, any commit references in the change comments need to be fixed up (generally I convert things like Subversion revision numbers to committer + date stamp).
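As a concrete illustration of the tag cleanup, here is a sketch in plain git (not reposurgeon’s own command language, and with a made-up tag name). cvs2svn marks its manufactured commits with a recognizable phrase in the log message, so you can locate one and plant a real annotated tag on its parent; deleting the junk commit itself then requires a history rewrite.

```shell
# Sketch only: find one cvs2svn-manufactured commit and replace it with
# a real annotated tag on its parent.  RELEASE_1_0 is a made-up example.
junk=$(git log --all --grep='This commit was manufactured by cvs2svn' \
           --format='%H' -n 1)
if [ -n "$junk" ]; then
    git tag -a RELEASE_1_0 -m 'CVS release tag RELEASE_1_0' "${junk}^"
    # Removing $junk itself means rewriting history, e.g. with
    # git filter-branch or reposurgeon's surgical commands.
fi
```

A real cleanup has to do this for every manufactured commit, derive each tag name from the log message rather than hardcoding it, and then rewrite the history to drop the junk commits; that bookkeeping is exactly what reposurgeon is for.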

The result of a really good after-conversion cleanup looks as though the project had been using git from day one. I’ve done several of these now, mostly on my own projects but recently for the Roundup bug tracker. Each time I do one of these, reposurgeon gets better: more features, more bugs exposed and fixed. That’s the point; reposurgeon is a good tool, and I want to case-harden it into a great one.

There are some conditions on this offer.

First and most importantly, I want the result to be used. A conversion typically involves three to four days of hard work. If your repo has a kind of cruft or malformation in it that I haven’t seen before, teaching reposurgeon to deal with it is the point of the exercise, but it also means the conversion may take longer. A precondition for me to put in that kind of work is that the political ducks have to be lined up first: the project has to have decided to move and be willing to use the results. (Yes, the project should exercise due diligence to verify that I haven’t screwed up; that’s a different issue.)

I’m only willing to do a limited number of these, so if I get a flood of requests I’m going to be choosy. Preference will go to projects that are older and/or more important and/or larger. The ideal candidate would be an important piece of open-source infrastructure with a long, messy history rooted in CVS or RCS or SCCS.

If you want it, conversion from git to another DVCS (hg, bzr, whatever) is your problem. I’ll point you at tools, but the only part I’m interested in is already done when you have your git repo.

Again, the sort of capability I’m looking to improve in reposurgeon is automated recognition and cleanup of conversion cruft. I may experiment with features like branch merge detection if conditions seem right.

UPDATE: When you make your request, please have the following things ready:

23 thoughts on “Looking for reposurgeon test cases”

Hercules? 10-year history, current svn revision number is 7802, converted from CVS to Subversion about three years ago by cvs2svn. Final destination will be Mercurial, but as you say, I’ll deal with that.

There’s this little editor you may have heard of that still uses CVS. Its original maintainer keeps hanging around, despite having supposedly given it up two years ago. What was it called? Emu, no, that’s not it… Emmmm….. Macs? Emacs, yeah, that’s it!

When it’s time to make the conversion to a new VCS, why not just take HEAD, make it the initial commit, and start clean from there? And keep the old repo around in the (very) unlikely event you need to look at an old revision.

Sometimes a clean break is a good thing. And all the work to do a perfect conversion seems like it would have a very minimal return. After all, the interesting part of a project is the present and the future, the past is just past.

Not necessarily, Michael. Being able to reach back into the past for a bit of code is immensely valuable at times, and going back to an old version by switching VCSs can range from simple to nightmarish. It’s better to use just one tool to do the job.

@Michael Hipp: Not only what Jay Maynard said, but maintaining two VCSes creates unnecessary confusion. New developers stumbling onto a project and looking to contribute code or patches can often land at the wrong source repository. Similar story for package maintainers — many projects do not maintain packages for every distro and instead rely on each distro to package their release tarballs.

Not sure if this helps you greatly, but if you need to suck a repository out of SourceForge, they do have the advantage of providing rsync access to CVS and Subversion server directories. That’s especially helpful in the case of Subversion, where git-svn is a rather large bandwidth hog (it can fetch the same commit over the network dozens of times for large repositories). For Subversion, it’s rsync://${project}.svn.sourceforge.net/svn/${project}/ and for CVS, it’s rsync://${project}.cvs.sourceforge.net/cvsroot/${project}/
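Spelled out as commands, the mirror-then-import approach looks roughly like this; `hercules` is just a stand-in for a project’s unix name, and the actual transfer and import lines are commented out since they need network access and a real project.

```shell
# Sketch: mirror the repository files locally over rsync, then run the
# slow importer against local disk instead of the network.
project=hercules
svn_url="rsync://${project}.svn.sourceforge.net/svn/${project}/"
cvs_url="rsync://${project}.cvs.sourceforge.net/cvsroot/${project}/"

# rsync -av "$svn_url" "${project}-svnrepo/"
# git svn clone --stdlayout "file://$PWD/${project}-svnrepo" "${project}-git"

echo "$svn_url"   # → rsync://hercules.svn.sourceforge.net/svn/hercules/
```

Because the rsync target is the raw repository directory (not a working copy), the `file://` URL hands git-svn a complete local Subversion repository, so repeated fetches cost nothing in bandwidth.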

@Morgan Greywolf: I wasn’t speaking of leaving the old repo exactly where it sits. Stash it on a r/o fileshare and provide access only to the very few people who might need it and would know that it is for historical reference only. Keeping a binary of ‘svn’ in your path isn’t much of a workload :-)

I’m still unsure of the use case that would have someone constantly referring to old versions. I’ve only needed such in the rare case of investigating some odd regression or pondering some long gone feature … those are easily served by the archived repo. But YMMV I s’pose.

I think Morgan’s point is that, for FOSS, you don’t actually know up-front who will be among the “very few people” who might need it. Users will sometimes submit detailed bug reports and patches, and having the history and ability to check for regressions available helps to enable that.

Is RCS actually an interesting case? I remember that back in the good old days (for me, that was the mid 90s), I didn’t trust CVS and so kept everything under RCS. I’m sure I could dig something up with sets of files spanning multiple directories, but it wouldn’t be something that I still work on and I don’t think RCS is capable of providing “a complex revision history”.

Well, speaking of ancient VCSs, I actually just did think of something in SCCS that I’ve tried to tackle in the past (last time, I tried converting it to git in 2008, which took many hours and ended up with a repository not exactly in the best shape…). That would be none other than UCB’s SCCS tree for BSD, as available on the fourth CD-ROM in this set: http://www.mckusick.com/csrg/index.html

The web page does explicitly grant permission to redistribute it, and in a single squashfs volume, it is only 345MB (able to fit on just one CD!). If you’re up to the challenge, I wouldn’t mind sharing it ;)

Though, just like RCS and CVS, SCCS didn’t track file renames or deletions, so the history is rather… messy at best. Early revisions won’t be a functional replication of what early BSD was actually like. From all the files archived on that CD-ROM set, it might be possible to fudge them around with best guesses at proper filenames for the various points in time, but that endeavor would likely take a few months of dedicated work, at least.

+Patrick Maupin: Yeah. You never know who’s going to need what, or who’s going to find what where. You also risk people getting confused about the old VCS vs. the new one. Web sites don’t always get updated in a timely fashion, etc. Too messy.

+Mike Swanson / +Patrick Maupin: I wouldn’t even know where to begin converting an SCCS repository. BitKeeper uses SCCS internally, so expect all the problems the kernel developers encountered in migrating to git, and more.

Eric, since you were one of the co-authors, do you know of any plans for a new edition of O’Reilly’s Learning GNU Emacs? There have been a lot of changes in the last decade, and that book is much easier to use as a reference than the FSF’s manual, either paper or on-line (I truly hate texinfo and emacs’s info-mode).

I find bzr’s branching model weird and broken compared to git or hg. When you clone a bzr repo you don’t typically get the whole repo; instead you get a branch (or possibly multiple branches) that is bound back to the parent. It’s not really peer-to-peer the way git and hg are, and that restricts the kinds of workflows you can set up.