One of the tough choices VCS designers make is “what do we REALLY care about”. If you can eliminate some use cases, you can make the tool better for the other use cases. So, for example, the Git guys chose not to care too much about annotate. By design, annotate is slow on Git, because by letting go of that they get it to be super-fast in the use cases they care about. And that’s a very reasonable position to take.

My focus today is lossiness, and I’m making the case for starting out a project using tools which are lossless, rather than tools which discard useful information in the name of achieving performance that’s only necessary for the very largest projects.

It’s a bit like saying “shoot your pictures in RAW format, because you can always convert to JPEG and downscale resolution for Flickr, but you can’t always get your top-quality images back from a low-res JPEG”.

When you choose a starting VCS, know that you are not making your final choice of tools. Projects that started with CVS have moved to SVN and then to Bitkeeper and then to something else. Converting is often a painful process, sometimes so painful that people opt to throw away history rather than try to convert properly. We’ll see new generations of tools over the next decade, and the capability of machines and the network will change, so of course your optimal choice of tools will change accordingly.

Initially, projects do best if they choose a tool which makes it as easy as possible to migrate to another tool. Migrating is a little bit like converting from JPEG to PNG, or PNG to GIF. Or PNG to JPEG2000. You really want to be in the situation where your current format has as much of the detail as possible, so that your conversion can be as clean and as comprehensive as possible. Of course, that comes at a price, typically in performance. If you shoot in RAW, you get fewer frames on a memory stick. So you have to ask yourself “will this bite me?”. And it turns out that, for 99% of photographers, you can get SO MANY photos on a 1GB memory stick, even in RAW mode, that the slower performance is worth trading for the higher quality. The only professional photographers I know who shoot in JPEG are the guys who shoot 3,000 to 4,000 pictures in an event, and publish them instantly to the web, with no emphasis on image quality because they are not the sort of pics anyone will blow up as a poster.

What’s the coding equivalent?

Well, you are starting a free software project. You will have somewhere between 50 and 500 files in your project initially, and it will take a while before you have more than 5,000 files. During that time, you need performance to be good enough. And you want to make sure that, if you need to migrate, you have captured as much of your history in detail as you can, so that your conversion can be as easy, and as rich and complete, as possible.

I’ve watched people try to convert CVS to SVN, and it’s a nightmare, because CVS never recorded details that SVN needs, such as which file-specific changes are a consistent set. It’s all interpolation, guesswork, voodoo and ultimately painful work that results often enough in people capitulating, throwing history away and just doing a fresh start in SVN. What a shame.

The Bazaar guys, I think, thought about this a lot. It’s another reason the perfect rename tracking is so important. You can convert a Bazaar tree to Git trivially, whenever you want to, if you need to scale past 10,000 files up to 100,000 files with blazing performance. In the process, you’ll lose the renaming information. But going the other way is not so simple, because Git never recorded that information in the first place. You need interpolation and an unfortunate goat under a full moon, and even then there’s no guarantee. You chose a lossy tool, you lost the renaming data as you used it, you can’t get that data back.

Now, performance is important, but “good enough performance” is the threshold we should aim for in order to get as much out of other use cases as possible. If my tool is lossless, and still gives me a “status” in less than a heartbeat, which Bazaar does up to about 7,000 files, then I have perfectly adequate performance and perfectly lossless recording. If my project grows to the point where Bazaar’s performance is not good enough, I can convert to any of the other systems and lose ONLY the data that I choose to lose in my selection of new tool. And perhaps, by then, Git has gained perfect renaming support, so I can get perfect renaming AND blazing performance. But I made the smart choice by starting in RAW mode.

Now, there are projects out there for which the optimisations and tradeoffs made for Git are necessary. If you want to see what those tradeoffs are, watch Linus describe Git here. But the projects which immediately need to make those tradeoffs are quite unusual – they are not multiplatform, they need extraordinary performance from the beginning, and they are willing to lose renaming data and have slow annotate in order to achieve that. X, OpenSolaris, the Linux kernel… those are hardly representative of the typical free software project.

Those projects, though, are also the folks who’ve spoken loudest about version control, because they have the scale and resources to do detailed assessments. But we should recognise that their findings are filtered through the unique lenses of their own constraints, and not let that perspective colour the decision for a project that does not operate under those constraints.

What’s good enough performance? Well, I like to think in terms of “heartbeat time”. If the major operations which I have to do regularly (several times in an hour) take less than a heartbeat, then I don’t ever feel like I’m waiting. Things which happen 3-5 times in a day can take a bit longer, up to a minute, and those fit with regular workbreaks that I would take anyhow to clear my head for the next phase of work, or rest my aching fingers.

In summary – I think new and smaller (<10,000 files) projects should care more about correctness, completeness and experience in their choice of VCS tools. Performance is important, but perfectly adequate if it takes less than a heartbeat to do the things you do regularly while working on your code. Until you really have to lose them, don’t discard the ability to work across multiple platforms (lots of free software projects have more users on Windows than on Linux), don’t discard perfect renames, don’t opt for “lossy over lossless” just because another project which might be awesomely cool but has totally different requirements from yours, did so.

This entry was posted on Tuesday, June 12th, 2007 at 11:50 am and is filed under free software, thoughts.

2. Why is the information of which file name changed where and when so important to you? Isn’t it most important that the software gets a clean merge and the people can work decentralized on what and when they want, according to their file names?

3. Why is it so important to switch SCM Systems? Switching SCM systems, I believe, is like currency conversion. You always lose, no matter what. So that is why you do not want to convert too many times back and forth.

4. What I like about the GIT approach – and Linus mentions that in his video as well – is that with GIT you can mess around with your code as much as you like and you can screw up as much as you like; the faults will never be found in the central repository because there is none.

1+4: bzr is distributed. Check out previous posts in this series.
2: Check out previous posts in this series; Mark explained it better than I can here.
3. After hanging out in the software industry for a while, you have to realize that the best tools change. Now in this “renaissance” of SCM tools, projects like bzr are realizing that it is inevitable that many users will eventually want to be able to migrate from bzr to another tool. I’m going to feel a lot more confident about switching to a new tool if it has a clear exit strategy.

I should also note that git and mercurial are also really good about interoperability with other popular SCM tools (distributed and centralized). Though Mark is right that they don’t have the same history info as bzr.

Err.. constructing parallels with other things is often useful to explain but most of the time it’s not accurate. Consider this: is the picture format lossy if it doesn’t store the geo position, the altitude and other environment-related information? I think not, because that information is outside of what “we” consider a picture. Now, if you try to translate that to DSCM, you’ll find that the definition of a DSCM is not that accurate.

IMHO storing lots of data is okay, but not necessarily relevant. To take the “renaming” issue: most of the time, you create a branch when you’re unsure of the fate of the code. You let it evolve and then live or die. Renaming is not really an issue because it’s not definitive. Also, most software has a short life span. Thus, history is not really vital after one year or even one week. Code gets rewritten, or the project dies, especially in the OpenSource sphere.

Sure, Bzr is great, and that’s not the point, but I’ve yet to see a case where renaming is an issue. Storing lots of information also doesn’t guarantee that you’ll be able to switch easily; to make that possible, you’d need ALL the kinds of information used by ALL SCMs. And then you’ll want to use SCM+1, which uses a new kind of information.

> 2. Why is the information of which file name changed where and when so important to you?

Because the SCM has to know somehow that changes to foo.rb in one person’s repo should be applied to bar.rb in another’s, because these files were the same some time ago. Also, I don’t want to lose a two year history of a file just because I renamed it yesterday.

> 3. Why is it so important to switch SCM Systems?

Because the newer ones are better.

> 4. […] with GIT […] the faults will never be found in the central repository because there is none.

So, bazaar for example has the capability to store SVN’s properties when importing an SVN repository? (And exporting those back.) Or, like git, can track code lines that were moved between different files? Does it know of symlinks? Are directories and file permissions being versioned?

I don’t know Bazaar well enough yet to answer these questions by myself (the time is near, though), but even in case all of these are true, I think it’s a bit unfair to demand all of those from a VCS. Especially when import and export is involved.

Those really are advanced features, and there will always be software that doesn’t support the full feature set of another software, and vice versa. If that were not the case, we wouldn’t have the need for competing software at all. Of course, this is especially true for VCS, because their features directly translate to data, and data is what distinguishes between “lossy” and “lossless”.

I don’t like CVS because of its many shortcomings, but it’s wrong to criticize it for not supporting features that more modern VCS possess. It was state of the art at the time when it was current, and people know better now. Who knows which requirements are still to come up in the next few years? Is there a guarantee that today’s Bazaar will have all the data that a then-current VCS needs for its full potential?

In my opinion, it’s pretentious to demand such a thing. Software is written for now, not for forever. It may be necessary to be able to import data in their original form, but demanding that other, less recent systems need to be able to import your new VCS’s data without loss is unfair.

Man Firefox sucks. It just froze once again before I could hit the Post button. It is about time for an improvement to Firefox. I will try to rephrase:

1. GIT is about software development. I remember Linus saying in the GIT mailing list that if he wants to rewind into the past he will load the complete repository and not the single file. Here GIT’s speed comes in handy.

2. The reason why I think GIT is the best SCM out there is because the software works (and lives) in repositories and the resulting merge. Who cares about the filename (why track one file name if you can reload the complete history from the repository) – what has to happen is that the software has to merge with any possible better patch out there.

3. Generally – and this is how I believe good OSS development works – every repository ends up in one developer’s hands and brain. So it is all about the developer and how easily he can work together with other repositories with whom he wants to merge (Linus mentions this in his presentation at Google as well).

4. Why would you want to track filenames? I believe it is about the software and not about the file name.

Zeno: the main use case people are talking about is something like the following:

1. Bob branches off Alice’s repository

2. both Bob and Alice modify a particular file in their respective branches.

3. one of Bob or Alice renames that file.

4. Bob tries to merge from Alice.

Ideally he should end up with both his and Alice’s changes and the file should have the new name (i.e. we are still talking about the code here). Bazaar handles this case through the use of file IDs that don’t change over renames. GIT tries to guess the move (which it does successfully most of the time).

Another case that people run into is renamed directories:

1. Bob branches from Alice.

2. Bob moves a directory in the project.

3. Alice adds another file to the same directory (which is still in its original location on her branch).

4. Bob merges Alice’s branch.

Here, we’d want to see the file from Alice’s branch created in the new location of the directory.
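
Both cases can be sketched in a few lines. This is only an illustration of the file-ID idea, not Bazaar’s actual data model: the inventory layout (a dict from file ID to a parent ID and a name), the function names and the IDs such as `dir-1` are all made up for the example.

```python
# Hypothetical inventory: file_id -> (parent_id, name).
# Paths are derived from IDs, so a rename only changes one entry.

def three_way(base, ours, theirs):
    """Take whichever side changed relative to base; prefer ours if both did."""
    if ours == base:
        return theirs
    return ours

def merge_inventories(base, ours, theirs):
    """Merge entry by entry, keyed on the stable file ID rather than the path."""
    merged = {}
    for file_id in set(base) | set(ours) | set(theirs):
        entry = three_way(base.get(file_id), ours.get(file_id), theirs.get(file_id))
        if entry is not None:  # None means the entry is absent or was deleted
            merged[file_id] = entry
    return merged

def path_of(inventory, file_id):
    """Rebuild a path by walking up the parent IDs."""
    parent_id, name = inventory[file_id]
    if parent_id is None:
        return name
    return path_of(inventory, parent_id) + "/" + name

# Bob renames the directory "lib" to "core"; Alice adds a file under "lib".
base = {"dir-1": (None, "lib"), "file-1": ("dir-1", "util.py")}
bob = {"dir-1": (None, "core"), "file-1": ("dir-1", "util.py")}
alice = dict(base, **{"file-2": ("dir-1", "extra.py")})

merged = merge_inventories(base, bob, alice)
print(path_of(merged, "file-2"))
```

Because Alice’s new file points at the directory’s ID rather than its name, it follows Bob’s rename without any guessing. A real tool also has to merge file contents and report conflicting renames, which this sketch ignores.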

I know it’s off topic but EVERYONE is waiting to hear your response and the one of ESR about Linspire.

Will you go into the Click N Jog adventure with collaborators or not?

Is extortion acceptable or not?

I felt queasy enough with the announcement of the Linspire/Ubuntu collaboration last year to have changed 2 out of 4 of my (K)(X)Ubuntu distros for others I experimented with since then.

7.04 is a blast.
My folks love it.
My folks also lived through WW2 where for every German soldier killed, 100 civilians would be executed. Their homeland suffered the greatest casualties of any country in the war because collaborating and accepting ultimatums is not in our DNA.

Vlad
– You can hold a bull by its horns but
you can only hold a man to his word.

Mark Shuttleworth says:

Neither Canonical nor the Ubuntu project have any interest in signing an agreement with Microsoft on the back of the threat of unspecified patents. We have consistently (but politely ;-)) declined to pursue those conversations with Microsoft, in the absence of any details of the alleged patent infringements.

Speaking for myself, I welcome Microsoft’s openness to the idea of improving interoperability between free software components such as OpenOffice and Microsoft Office, and believe that Microsoft’s customers, many of whom are now also Linux users, will appreciate Microsoft’s efforts in that regard. I have substantial reservations about the quality of the specification for Microsoft’s OpenXML document formats and do not believe that Microsoft will limit its own Office implementation to that specification, which makes the specification largely meaningless as a standard. A specification which Microsoft won’t certify as being accurate as a representation of Office 12’s behavior, and will not commit to keeping up to date in advance of future revisions to MS Office, is not a credible standard.

Instead of OpenXML, I would urge Microsoft to join the ODF working group. They are already a member of OASIS, I believe. Their participation in ODF would be genuine engagement with an open standards process. Microsoft would benefit from the innovation that comes from clean, well written standards that are widely implemented. They would have a large share of a larger market.

After many years of participating in the free software community I know that neither I nor any other free software programmer has any desire to infringe on any intellectual property (trademark, copyright or patent) of any other person or company. Many of us are motivated precisely to ensure that we work on platforms which DON’T cross that line. So it is somewhat offensive to be threatened with an allegation of an IP infringement. I’m sure Microsoft doesn’t realise that its actions are being received in that light, otherwise they wouldn’t continue. But it is getting rather tiresome. I would be very happy to see the details of any alleged patent infringement so that we can engage with Microsoft more constructively on the subject.

I am really off-topic, but the feeling of urgency pushes me to post here.

I have just read an article by Steven J. Vaughan-Nichols, a name I am sure you know, titled “Microsoft’s next Linux partner is…?”, where he reasons that Ubuntu might be the next distro to strike a deal with Microsoft.

I hope with all my heart that this is not the case, but since I have quite some respect for SJVN, I really feel the need for reassurance. I could surely use a comment on your part. And I am sure I am not alone in this situation.

Thank you.

Mark Shuttleworth says:

Hi Apolodor, please see my response to Vlad above. I hope that settles the rumours – there is no truth to them at all.

You discussed in your previous post some of the reasons why Ubuntu selected Bazaar as its VCS of choice. In the context of this and the present post, it would be interesting to have a more detailed description of how Ubuntu uses Bazaar-NG in their work—what are the good sides, what are the problems, and in particular how you interact with upstream. For example, do you use Bazaar to track all patches/packages or do you use other VCS for some (such as the kernel) where the demands are particularly heavy? How do you go about moving and tracking patches from your system to upstream and vice versa? How do you deal with the fact that upstream generally doesn’t use Bazaar?

Mark Shuttleworth says:

Good questions! I’ve made a note to try and answer them in the course of my blog series on VCS’s.

Thanks for letting us know your position. It seemed the speculation was building and building, and even I was beginning to ask questions! I am very glad to know you are not going to sign any deals with Microsoft. The reason I left Microsoft Windows was because they were telling me (not personally, but in general) that I owed them money. I would be sick if such a deal was made, since I am not willing to shell out more money to Microsoft, but I am very glad to know that is not going to happen.

Anyways, back to the topic of your post, I was wondering… is the VCS you are talking about a Version Control System? I am a programmer-in-training… and thought this would be something handy to know.

Mark – I wouldn’t want MS to join the ODF working group because if they ever “embraced” ODF, their goal would be to embrace AND EXTEND it – they would no doubt abuse their position of dominant Office software to create a “Microsoft ODF” with subtle incompatibilities with everyone else’s ODF, just like they did with Java and with HTML, with CSS, and so on and so forth – and because they are dominant, the industry would have to adapt to them again rather than them adapt to the standard, just like many Web developers coded only for IE for years. Interoperability would break and the purpose of ODF would be effectively mooted, and ODF would be damaged more so than if Microsoft just stays out of it and it becomes a real standard – it would remain on the fringe for much longer, but when the time comes it will be stronger.

Mark, please don’t ever sign an agreement with MS only because of the FUD they’re throwing around. Yes, IP is something to consider, but MS has a history of “embrace, copy, destroy” behind it. Good examples are OpenGL vs DirectX and Netscape vs IE, and I’m sure there are many others. I don’t want to see my favorite distribution, KUbuntu, going down the drain. It would be a shame since it has a huge potential. I honestly believe their FUD is a big pile of steaming [insert bad word here], but I’m not the one who needs to be convinced of this; the ones who are poorly informed are. They need to know! GNU / Linux does not infringe anything. And if it does, it surely was not by intent and surely there is a workaround.

I hate to re-open the question, but with regards to Linspire’s recent announcement that they have taken the Microsoft koolaid – will this have any impact on the inclusion of Linspire technology (click&run) in Ubuntu?

Or, if Linspire’s latest and greatest is Ubuntu-based, as I understand it, could that have any impact?

Mark, I appreciate your stance on MS and their alliances. I have been a Linux user for more than a year. I have it on both of my desktop computers. I believe that I will not put Ubuntu on both of them. I have helped others set up Ubuntu computers and have encouraged my students (I am a High School Science Teacher) to give it a try. I couldn’t be happier to know that you and your company will hold MS’s feet to the fire and make them prove their claims. Those of us who use this software do so because it is cost effective for us and gives us the freedom to truly master our computing lives. Without Linux (Ubuntu included) many of my online purchases would never have happened; I didn’t trust the MS software enough to do so very often, so it does have an economic impact. I read your blog often and appreciate your candor. In addition I hope that your venture with Dell is very profitable.

I, too, read the article and started to prematurely panic. Then I saw your response to Vlad, which quelled my fears.

Thank you, you’ve proven to me that you’re a class-act, and that you’ll not let anything happen to such a great body of work.

I do have my fears about M$, and time will show the results of the moves they are making now. I’m not one to be a fence-sitter, but in this case, where Linux and M$ are concerned, we’ll all have to wait and see.

Ballmer’s “. . . 235 patents . . .”, was (as my brother put it) the verbal equivalent of throwing chairs; a spoiled brat acting out a temper tantrum. Ever since that effusion, MS has been side-stepping, back-pedaling, bobbing and weaving trying to deny the intent of the statement while hoping to maintain its implications.

It’s a simple fact, MS doesn’t ‘get it’ when it comes to open source. They are trying, mind you, but they (and I mean Ballmer and Gates) just can’t wrap their heads around the concept. If they were smart, they’d offer Shuttleworth a position on their board. MS needs to adapt and for that they need insight they have been unable to produce internally.

Gates and Ballmer must realize that this is a pivotal moment in the evolution of our still emerging digital society. What they are struggling with is how to adapt. What they have to find is a new, more rational perspective.

Your decision was truly great; I personally respect your decision _not_ to play games with the devil (M$). I hope this whole fiasco by M$ will finally come to an end, especially the IP FUD that they spit out from time to time.

[…] Update: Mark Shuttleworth flatly denies Steven J. Vaughan-Nichols’ prediction of an Ubuntu-Microsoft patent deal in his blog here. Also, as was noticed by another blogger (and then on Slashdot), Shuttleworth also made almost identical statements in the comments to one of his previous posts. […]

“Lossless” is a relative term. If you define what the programmer comes in as 100% quality, then sure you’re right. However once you realise that they are human, and will make mistakes, then the whole “lossless” argument is undermined.

Then you realise you’re back with a warehouse of revisions, and looking back at history is not following the breadcrumbs left by the original developers as gospel but instead *data mining*.

I think you’ve seriously misunderstood git. It doesn’t track renames because that’s the wrong thing to do. It can easily reconstruct that information later though. This is not the “interpolation” you mention in your post; it is not guessing, git would know. How? Let me explain:

What git actually stores is the content of the files; it stores them in a file named as the hash of the contents – let’s pretend that hash is 12345ABCDEF. Git then keeps a list of hashes against filenames; so:

12345ABCDEF somefile.c

Now, let’s say you rename that file, from one commit to the next; git will store the new list as

12345ABCDEF newname.c

Notice that the hash hasn’t changed. So, when you compare these two lists, it’s really easy to see that somefile.c was renamed to newname.c, because the hash is common to both. Similarly for copies:

12345ABCDEF somefile.c
12345ABCDEF copy.c

Comparing this list with the original one, it’s easy to see that somefile.c hasn’t changed, but copy.c has been introduced and is a copy of somefile.c.

See? Git didn’t need to record the rename explicitly – it’s inherently available in what it does store.
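
As a rough sketch of that comparison (not git’s actual implementation), treat each commit’s list as a dict from path to content hash and compare two of them. The function name is invented for the example, and only exact hash matches are handled:

```python
def detect_renames_and_copies(old, new):
    """Compare two {path: content_hash} listings using exact hash matches only."""
    # Index the old listing by hash so sources can be looked up by content.
    old_paths_by_hash = {}
    for path, content_hash in old.items():
        old_paths_by_hash.setdefault(content_hash, []).append(path)

    renames, copies = [], []
    for path, content_hash in sorted(new.items()):
        if path in old:
            continue  # the path already existed, so it is not a new name
        sources = sorted(old_paths_by_hash.get(content_hash, []))
        if not sources:
            continue  # genuinely new content
        source = sources[0]
        if source in new:
            copies.append((source, path))   # the source is still present
        else:
            renames.append((source, path))  # the source path disappeared
    return renames, copies

before = {"somefile.c": "12345ABCDEF"}
after_rename = {"newname.c": "12345ABCDEF"}
after_copy = {"somefile.c": "12345ABCDEF", "copy.c": "12345ABCDEF"}

print(detect_renames_and_copies(before, after_rename))
print(detect_renames_and_copies(before, after_copy))
```

This only works while the content is byte-for-byte identical. As soon as the file is edited in the same commit as the rename, the hashes differ and something smarter is needed.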

What’s even better is that you get the copy free, because it’s got the same content, this list just references the same content twice, and on checkout git reads the same source object twice.

As it happens, git is even cleverer than I’ve described above and can make educated guesses about copies and renames that were changed during the revisions.

What’s really great about this, is that git figures it out on its own. So you don’t need special commands for copy, move, mkdir, rm, etc. Git knows what you’ve done because you did it, not because you told it you did it.

Being fair, there are currently a few UI issues with git erring on the side of speed by default, and not doing these detections as it parses history. However, they aren’t particularly expensive operations and so if you wish (as I do), you can make git always detect these copies and renames. However that is a user interface issue, and should not be used to say that git doesn’t track renames. Who cares that it doesn’t track them – it can show you, the user, them, which is all that matters.

Another point – only those systems which use revision identifiers that hash the content and history to that point (i.e. git, hg and monotone) are actually checking the integrity of the content by design as they go. It is therefore possible (unless you use Testaments for every revision) that your bzr repository could have historical corruption (more likely tampering) without you noticing.

So, historical tampering would result in what comes out of bzr not being the same as what went in. And people who copied the tampered repository would never know.
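
To make the idea concrete, here is a minimal sketch of a hash chain in which each revision ID covers both the content and the parent’s ID. It is not the format any of these tools really uses; the function names and the choice of SHA-256 are assumptions for the illustration.

```python
import hashlib

def revision_id(parent_id: str, content: bytes) -> str:
    """Derive a revision ID from the parent ID plus this revision's content."""
    digest = hashlib.sha256()
    digest.update(parent_id.encode("ascii"))
    digest.update(b"\0")
    digest.update(content)
    return digest.hexdigest()

def build_history(contents):
    """Return a list of (revision_id, content) pairs, oldest first."""
    history = []
    parent = ""
    for content in contents:
        rid = revision_id(parent, content)
        history.append((rid, content))
        parent = rid
    return history

def first_corrupt_revision(history):
    """Return the index of the first revision whose ID does not match, or None."""
    parent = ""
    for index, (rid, content) in enumerate(history):
        if revision_id(parent, content) != rid:
            return index
        parent = rid
    return None

history = build_history([b"v1", b"v2", b"v3"])
tampered = list(history)
tampered[1] = (history[1][0], b"evil")  # change old content but keep its ID

print(first_corrupt_revision(history))
print(first_corrupt_revision(tampered))
```

Anyone who knows the latest revision ID can re-run the check over a copied repository, so altered history shows up as a mismatch instead of going unnoticed.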

Anyone who does business as a partner with Microsoft should know that first and ensure that they have adequate legal safeguards.

Which of Microsoft’s joint business ventures with other companies did not leave Microsoft as the sole beneficiary of that arrangement?

Microsoft may want to tie others to a standard and thus be free to innovate and to create new ad hoc standards using their monopoly power.

Today, IT workers spend more of their effort using Microsoft’s software than creating new applications themselves. The overhead of using such tools saps the workforce. We were far more productive using less innovation some years ago.

Actually, git can only get copy 100% right if nothing is changed in the target file in the same commit. Otherwise, you have to give it a similarity threshold (70%? 80%?) for it to decide whether something is a copy. This is, of course, a guess, resulting from lost information. Nothing stops other systems from making the same guess if they have to (for instance, mercurial supports a similarity option for addremove), but it is better to have the information.
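
A small sketch of what such a threshold check looks like, using Python’s `difflib` as a stand-in. Git’s own similarity measure is computed differently, and the function names and the 0.7 default here are just for illustration.

```python
from difflib import SequenceMatcher

def similarity(source: str, target: str) -> float:
    """Line-based similarity between two file contents, from 0.0 to 1.0."""
    return SequenceMatcher(None, source.splitlines(), target.splitlines()).ratio()

def looks_like_copy(source: str, target: str, threshold: float = 0.7) -> bool:
    """Guess whether target was derived from source; this is a heuristic."""
    return similarity(source, target) >= threshold

original = "\n".join(["a", "b", "c", "d", "e", "f", "g", "h", "i", "j"])
edited = original.replace("e", "E")  # one line out of ten changed
unrelated = "\n".join(["x", "y", "z"])

print(similarity(original, edited))
print(looks_like_copy(original, edited), looks_like_copy(original, unrelated))
```

The answer depends entirely on where you set the threshold, which is the point: it is a guess reconstructed after the fact, not a recorded fact.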

git also pays a performance penalty for digging up copy and rename information, which is why you need flags like --find-copies-harder etc.

[…] loss when interchanging data with the core product. Mark Shuttleworth captures this point nicely in Choose lossless VCS tools if you have that luxury. Truly caring about integration goes even deeper in my opinion: it means explicitly making it […]
