What How Bazaar talks about

Posts tagged with 'bazaar'

As I'm sure most of you are aware, Launchpad hosts Bazaar branches. One early design decision that we had on Launchpad was that branches should be able to be renamed, moved between people and projects without limitation. This is one reason why each branch that was pushed to Launchpad had a complete history. We wanted to make sure that there weren't any problems where one person was blocking another pushing branches, or that people weren't able to get at revisions that they shouldn't be able to.

The Bazaar library, bzrlib, gives you awesome power to manipulate the internals giving you access to the repository of revisions through the branch. This can be a blessing and a curse, as if you have private revisions, then they can't be in a public repository.

Having a complete copy of the data for every branch became a severe limitation, especially for larger projects, of which Launchpad itself is one. A solution to this was a change in Bazaar itself that allowed a fallback repository which contained some of the revisions. This is what we call stacked branches. The repository for the branch on Launchpad has a fallback to another repository, which is linked to a different Launchpad branch. We ideally wanted all of this to be entirely transparent to the users of Launchpad. What it means is that when you are pushing a new branch to Launchpad, the bzr client asks for a stacked location. If there is a development focus branch specified for the project, this is then offered back to the client. The new branch then only adds revisions to its repository that don't exist in the development focus branch's repository. This makes for faster pushes, and smaller server side repositories.

The problem though was what do we specify the stacked on location to be? When we created the feature, we used absolute paths from the transport root. What the mean was that we stored the path aspect of the branch. For example, lp:wikkid gets translated to bzr+ssh://bazaar.launchpad.net/~wikkid/wikkid/trunk or http://bazaar.launchpad.net/~wikkid/wikkid/trunk depending on whether the bzr client knows your Launchpad id. The absolute path stored would be /~wikkid/wikkid/trunk. This information was then stored in the branch data on the file system.

The problem however was that the web interface allows you to rename branches. The actual branch itself on disk is referred to using a database id, which is hidden from the user using a virtual file system which has rewrite rules for http and at the bazaar transport level. However since the stacked on location refers to a full branch path, changing any part of that, whether it is the branch owner, branch name, or the project or package that the branch is for, would cause any branches stacked on that changed branch to break, bug 377519.

In order to fix this we had to change the location that the branch is stacked on to be independent of the branch path. The best solution here is to use the database id. I really didn't want to expose the user to this opaque id, but one opaque id is as good as another. Now when pushing branches to Launchpad, when it is creating a stacked branch you'll see a message like:

Created new stacked branch referring to /+branch-id/317141.

Existing branches still have their old branch paths saved for now. We'll run a migration script early next week to fix all these up, and hopefully we'll have seen the last of this bug.

As with any creation being released, I'm writing this with some trepidation. I'd like to announce the first release, 0.1, of Wikkid Wiki.

What is it?

Wikkid is a wiki that uses the Bazaar DVCS as an underlying storage model. Wiki pages are text files in a branch.

Why another wiki, surely there are enough already?

There is the obvious reason, because I felt like it. But this is not the primary reason. Since I started working for Canonical I've come to appreciate the whole culture of free, libre, and open source software more. During the day I work on Launchpad. Primarily in the area of integrating the Bazaar DVCS. Launchpad doesn't have a wiki integrated, and it is my plan to see wikkid be the wiki that is integrated.

I have to admit to having very strong opinions myself on what I wanted for wikis in Launchpad. I've tried to encapsulate that in the vision below. Since no one else was looking at it, I took it upon myself. Discussions started last year between a small group of Launchpad developers, but there was no traction. At the start of March I started writing the Bazaar backed wiki. It needed a name though. Thankfully after trying several I got the perfect name from Aaron Bentley - wikkid.

The wikkid vision

Any wikkid wiki can be branched locally for off line editing.

Any branch can be viewed using wikkid - not limited to branches created through wikkid

A local wikkid server can be run using a Bazaar plugin

Local commits use the local users Bazaar identity

Wikkid can be configured to operate in a stand alone, public facing mode where it has a database of users

Some people in the GNOME community have suggested that if Bazaar has nice usability, then GNOME can just use Git on the back-end, and Bazaar lovers can just use the Git back-end via Bazaar. It's true that Bazaar could support this — an experimental plug-in exists to do this right now. But this suggestion betrays several wrong assumptions.

People assume Git and Bazaar are the same. They're not. People assume that if Git and Bazaar have technical differences, then Git must have it right.

The problem with these assumptions is that usability begins at the ground level. Bazaar started with a focus on usability. Git began with a focus on speed. The data models of both Bazaar and Git reflect their initial focus. But Bazaar's model can also be fast. In fact, the Bazaar developers are currently optimising a number of key operations for speed.

Data retrieval

Git and Bazaar are both key/value mapping systems. When bytes are needed, they are requested with that key.

The big difference is that Git's keys are also the hashes of the bytes. This is why it's called a content-addressable file system. This allows git to offer a guarantee that if the value hashes to the key, it has not been modified, whether deliberately or by accident. The Bazaar team considered adopting this approach, but decided it was too constricting. Bazaar uses UUIDs instead.

Bazaar uses revision-signing. All revisions can be PGP-signed. No signed revision can be forged. And the hashed representation can easily be generated and passed around to ensure that exactly the same content is used.

If SHA-1 is broken, both Bazaar and Git will lose their ability to detect malicious modification. But since Bazaar uses UUIDs to identify revisions, users can re-sign their old revisions with whatever method proves to be secure. Changing the hash used by Git would make it incompatible with all existing repositories.

Data Integrity and Serialization formats

Bazaar stores hashes of every value, so it equally capable of detecting accidental modification. It can be useful to have different representations of a tree in different repositories. For example, when Git lists files, it divides this data by directory. This is a good approach, but not necessarily the best approach. An alternative approach would be to use a radix tree. This would ensure that Git performed quickly even if users put unreasonable numbers of files in a single directory. But Git's keys are hashes, upgrading Git's format to use radix trees would change the keys, which means that people could not use the commit-id from one repository to refer to the same tree in an other form.

Bazaar doesn't assume it has the perfect format. It provides an upgrade path, and does't change the commit-id of a revision if you change your format. What's more, Bazaar can even reference data it has never seen. This allows partial imports from other VCSes to be fully compatible with more complete imports. And if a VCS provides UUIDs (content hashes certainly qualify as UUIDs), Bazaar can refer to those UUIDs directly.

File and directory representation

Git refers to files by path. It makes no attempt to track renames in its data store.

Bazaar has an inode abstraction; files and directories both have ids. When a file is renamed, its id stays the same. Bazaar's core code refers to files by their id, so merging a renamed file requires no special effort.

Git's approach means that users are warned not to rename files while changing their content. But when files are renamed, those files that refer to the renamed files must have their contents changed as well. For example, if you rename foo.h and foo.c to bar.h and bar.c, you should update the contents of bar.c, or else you will break the build. With Bazaar, users can do whatever they want, and the VCS just works. While Git must always use heuristics to deduce renames, Bazaar does not have to. Of course, it can if it wants to. This is an example of why it is important to design a model for usability from the beginning.

Bazaar can import rename data losslessly from foreign VCSes. Some other VCSes support file-ids, and Bazaar can reuse those without change. For VCSes that support renames, but not file-ids, Bazaar's representation is also non-lossy. When data imports are deterministic and non-lossy, it's easy to export them back to their source VCS. Bazaar's Subversion integration is a great example of how this can work.

Choose the back-end with the right model

In any situation it makes sense to use a back-end that stores the richer dataset. It makes more sense to have a front end client that doesn't use all the functionality or data representation of the back-end than it does to have a richer client that isn't able to store the required information as the back-end is not able to represent it.

If a single back-end storage is going to be used, it makes more sense to use a Bazaar back-end as Bazaar is able to represent everything that Git does, but the reverse is not true.

Conclusion

The Bazaar developers focused on usability, which requires having a model that supports usability. Bazaar has improved its model to increase the usability of the system. We believe that Bazaar has the right model.

Yes there were more GIT users than Bazaar users at the BOF, but the numbers were more like 50% of the audience were GIT users, and about 40% were Bazaar users. Someone piped up and said "What about Mercurial?" and so the question was asked, and there were about five or six people. There was an overlap of the GIT and Bazaar groups, and there was by far the larger majority of the audience that had not used any DVCS.

What conclusions can we draw from this? Not much. Many people attending the pre-conference work for larger companies, like Red Hat, Novell, and Nokia, and many of those people work on some hard core linux stuff, many of which have chosen GIT. Many have chosen GIT because that is what the linux kernel is using. Is that a good reason to chose a DVCS? I don't feel that we can really answer that question as I am sure there are strong advocates for both sides.

An interesting question is "Which DVCS is easier for the casual contributor to use?" Surely one of the reasons that a project chooses a DVCS is to allow for more community contributions in an easy to merge way that has a clear contribution history. Bazaar just works. It works for the hard-core developers, but is also easy for those soft-core (?).

From the people I talk to, and I've tried to talk to many here, is that of those that use Bazaar it just works. Bazaar doesn't get in your way of developing the software that you are working on. It is just a tool that works.

One final point. The questions were "Who uses <insert DVCS>?", not "Who likes/loves using <insert DVCS>?".

I now officially have some of my code in an open source project where the work was actually done entirely on my own time. Despite being involved with professional software development for almost twenty years, I've normally kept my programming to work hours, or private projects.

This time though, it was different. This time it was truly scratching an itch. I've become somewhat of a Bazaar convert. Bazaar has for a long time allowed you to define command aliases in your bazaar.conf configuration file. I have always used bash aliases for commands that I do really often, so it seemed natural to me to define aliases for bazaar for commands that I used often as well.

Next came the internal conflict. I am an inherently lazy person. This is why I like aliases, less typing. One of the things that bugged me was having to actually edit the configuration file any time I wanted to add or modify an alias. This bugged me. It bugged me so much it caused me to actually do something about it. Luckily for me Bazaar is written primarily in python, and this just happens to be my current favourite language.

The code is now committed to trunk, and should be available generally in bzr 1.6.

bzr alias — lists your current aliases

bzr alias commit="commit --strict" — sets the alias for commit

bzr alias commit — print out the current alias for commit

bzr alias --remove commit — removes the commit alias

Obviously if you use alaises as much as I do, one of the first things you'll do is set the alias for unalias.

Launchpad offers many things to developers, and open source software developers in particular. One of these things is the ability to host Bazaar branches. For those that have looked a little deeper, they will have noticed that there are four types of branches in Launchpad: Hosted; Mirrored; Remote; and Imported. Hmm, this isn't really what I was intending to talk about at all, but I'm going to go with the flow.

Hosted branches are those where Launchpad is the primary public location of the branch. Hosted branches are normally created by pushing a branch directly to Launchpad. Before you do that though, you need to have registered on Launchpad, and supplied an SSH key. This is how Launchpad knows who you are. There are two ways you can push a branch to Launchpad: one is via SFTP; and the other using the Bazaar smart server (bzr+ssh).

As an example I'm going to use my alias-command bzr branch. The complete SFTP location would be sftp://thumper@bazaar.launchpad.net/~thumper/bzr/alias-command, and the smart server one bzr+ssh://thumper@bazaar.launchpad.net/~thumper/bzr/alias-command. These are a bit unwieldy, so we extended the lp type urls for bzr to support writing if the launchpad plug-in knows who you are. In order for you to do this you use the lp-login command. bzr lp-login will tell you the username that is currently set. If you have not done this yet, you'll see a message like "No Launchpad user ID configured." I set mine by saying bzr lp-login thumper. This stores thumper as the launchpad_username in the bazaar.conf file. This also means I can use bzr push lp:~thumper/bzr/alias-command to push to my hosted Launchpad branch.

Mirrored branches allow you to have your branches stored publicly in some location that you control, and you let Launchpad know where this is. Launchpad will then update its copy of your branch every six hours. This is handy if you don't have an SSH key, or you have a slow network connection, or you just like having your branches available on your own server.

Remote branches are a bit different. Remote branches were sort of created out of necessity. Some people were registering mirrored branches with unreachable locations. Some of these were possibly by mistake, but quite a few were obviously inaccessible. But more strange is that those branches were linked to bugs or blueprints. There was obviously a desire to have branch meta-data there, but not actually allow Launchpad to get access to the branches. So we have remote branches. You cannot get a copy of a remote branch from Launchpad as Launchpad does not have a copy of it.

Imported branches are those branches where Launchpad get the code from either CVS or Subversion, and puts it into a Bazaar branch. I was really wanting to talk about this as I saw two projects recently where we are importing code that I didn't know about. One is my favourite music player, Amarok, and the other was MPlayer. Just out of curiosity I looked at both of these branches on Launchpad. The Amarok one has 12195 revisions as I'm writing this, and the last revision was 11 hours old, and MPlayer had even more revisions, at 26761. However that isn't even the cool bit. What is really nifty is you can go bzr branch lp:amarok or bzr branch lp:mplayer to get the code. Just to check I did just that, and got a copy of the amarok source. It was the first bit of C++ I had looked at in a long time (it used to be all I did).

Recently Bazaar grew a branch directory service. This allows plug-in developers to define custom "protocols" that resolve the branch names into some other branch location.

The Launchpad plug-in defines a protocol "lp". Launchpad uses the other parts of the URL to relate to projects, series or individual's branches. The shortest valid URL for Launchpad is something like lp:do. You can also use an empty authority (or site part of a URL), so lp:///do is exactly the same, just longer. Personally I prefer the one without all the slashes.

The do part of the branch location relates to the GNOME Do project on Launchpad. There is a little magic (a.k.a. configuration) that is needed to make this work. Projects in Launchpad get an initial development focus series created for them. This is intended to relate to the branch of development that is where current or new work goes. In order to have the code available through the Launchpad directory service, the code has to be available through Launchpad as a normal branch.

Once a branch has been either hosted, mirrored or imported for the project, one of the people responsible for the project in Launchpad can relate the branch with the series. Once this is done the branch is easily accessible. People that have permission to make these links will be shown a link on the main code tab for their project (we don't taunt people who can't make the links with an invalid option).

If there is another series, say 1.0 for our project fooix, and we have branches associated with them, then we could get the 1.0 branch using lp:fooix/1.0. Normal Launchpad branches are also accessible using the lp protocol using lp:~username/project/branch-name.