Sunday, June 17, 2012

Source control is an area of software development in need of reform. There is need for a clean, clear semantic model. To the extent that existing source control systems have some sort of model, each system is different. Each has its own terminology, usually entangled with the mechanics of file systems and directories. As with IDEs, the use of files and text has spread in this domain because it is a lowest common denominator.

Semi-tangent:Well, almost a common denominator; it doesn’t cover Smalltalk, but this can be fairly viewed as Smalltalk’s fault, not source control’s.

As with IDEs, the tie to files is unfortunate because low level abstractions like text files and file systems are completely extraneous to the problem at hand.

The major advantage of the text file based approach is that, rather than invent a source control system for every language, we can build one system that assumes source code consists of text files and go from there. A big disadvantage is that such a system has no understanding of the source code. It doesn’t understand the structure of a program - be it functions or classes or prototypes or procedures or what-have-you.

The mainstream approach also has another advantage: it can integrate artifacts from multiple languages. And another: we can go even lower than text files, and just consider files, so we can manage binaries and resources as well. In general, while we would like programming language-specific understanding, we also want to deal with multiple languages, and with artifacts that go beyond source code.

Again, we return to the lack of a semantic model: not just for understanding the sources, but for the language-independent part of the system. What are versions, what are differences, what are repositories exactly? The answers differ from system to system, and are hard to disentangle from the mechanics of files and directories.

People have addressed parts of the problem, but I don’t know of a completely satisfactory solution. For example DARCS has a model of differences that is rather interesting. However, it doesn’t tackle other issues, and the experience of the Haskell community using it has been mixed at best.

In the Smalltalk world, Monticello (and more recently, Metacello) provide a language aware source code management (SCM) system. I’ve explained some of the problems with that approach above. We tried to mitigate these somewhat in the Hopscotch IDE, where we mated Monticello with svn and a new GUI. The idea was to use a mainstream standard tool with a language specific front end. No need to reinvent the entire wheel, only select parts. That too has been a mixed experience.

On the one hand, we’ve enjoyed a nice GUI. For example, the changes presenter displays semantically meaningful diffs - the system tells us what classes have changed, and what methods within them, resorting to textual diffing only within methods. The diff is displayed side-by-side in the traditional manner; the key difference is that we get individual diffs for each unit of program structure. For example, if you’ve changed two methods in a class with 30 methods, you’ll only see the diffs for those two methods.

In the screenshot, we can see that the startup: method in the class ObjectiveCAlien is the only thing that has changed.

On the other hand, the cost and effort of rolling one’s own VCS tool is considerable even when it is done on top of a standard VCS that does the heavy lifting. Because of this, we have not yet been able to realize all of the advantages we could from such a system. We could potentially show you time-machine like views of individual classes or methods, since concepts like versions and history apply to these entities.

The system shown above is based on svn, and svn doesn’t support distributed development well; this became an acute problem when the project went open source.

So - do we need to build variants of such a system for other SCMs? Given N languages and M SCMs, you get N x M systems. Unattractive. One can see why people have stuck with the standard tools.

If we had a uniform abstraction of an SCM then we could implement the abstraction once on top of every real SCM we wanted to use. We could then implement language specific functionality on top of the abstract model. Now you get N+M pieces you need to build.

This is what Matthias Kleine set out to do in his masters thesis. The result is Pur, which defines a model that is general enough to describe several of the leading SCMs (mercurial, git, svn). MemoryHole, a Newspeak specific version control tool, has been built on top of Pur using a binding to mercurial. Since August 2011, we’ve been using MemoryHole instead of the svn-based tools. One nice thing is that MemoryHole can work with git as well, and potentially even with old-fashioned svn. Here’s a screenshot of MemoryHole in action:

We see two columns listing top level classes that differ between the running system and a repository. Each class is presented as a tree view of parts that differ. At the level of individual methods, we revert to a text diff. The configDo: method of class VCSMercurialBackedProvider`Backend`LocalRepository`Commands`NonCachingCommand is expanded to show what’s changed (a flush was added, highlighted in red).

Tangent:We see 4 levels of class nesting here, which is as deep as I’ve seen in any Newspeak program.

MemoryHole gives us a distributed source control GUI application that is language-aware, but can work with differing standard SCMs. And of course, it is written in Newspeak so it is modular and extensible.

Keeping a language specific VCS running in sync with an evolving language was a problem in the early days of Newspeak. This is again part of the price one pays for dedicated language support. When the language is stable I think it is well worth the cost, just like any other language-aware tooling, be it an Eclipse plugin, an emacs mode, or something better.

Indeed, the situation is analogous to what happens with text editors versus IDEs. The text editor is a lowest common denominator: the same tool can handle any programming language, and many other things as well. The IDE needs to be tuned extensively to each and every language, but in the end can give you a better experience. It took a long time for people to appreciate IDEs with their language specific support for editing, and to this day not everyone does. I imagine we’ll see a similar evolution in the area of source control.

Most intriguing to me is the connection to the general problem of synchronizing data, including programs (being a special case of data), across the network. In the past, I’ve discussed the idea of objects as software services and full-service computing. I see source control as just a special case of that. Something to discuss another time.

The story doesn’t ned here of course. There is a need for mathematical models and theory, and bindings of many languages to many SCMs. In general, more researchers should look at source control; no doubt they will have their own ideas. I hope we move the world a little bit forward, beyond files and text diffs.