Thursday, May 6, 2010

The quest for the perfect permalink

For SUSE Studio we are looking into adding nice permalinks to appliances. This turns out to be an amazingly difficult problem. The implementation is not too hard, but getting the scheme of the links right poses some interesting challenges.

So what do I actually mean by permalink? A permalink is a nice and convenient way to point to objects on a web site from outside of the web site itself. In our case this would be links which point to appliances on SUSE Studio. To make this nice and convenient the link needs to have a couple of attributes:

Permanent. The link should not change or depend on the state of the site or attributes of the user session. If you publish the link on another web site, e.g. in your blog, it should not break after a while or for other users.

Pretty. As the permalink is meant to be suitable for publication, it should have a pretty format, so that you can integrate it into text without completely destroying formatting and flow.

Expressive. When you see a permalink, it should be recognizable where it points to, so you don't have to click it to find out what it actually is about.

Short. A short link is easier to share. This is especially important when the length of the link is limited, for example when sharing it via Twitter.

Meeting all these requirements is not easy, but there are also some additional challenges:

Handling change. The objects permalinks point to are being worked on, so they change in various ways. For example the name of an object could change. Permalinks have to handle this conflict between permanence and change in some way.

Namespacing. A site might handle different types of objects, so they need to be addressed in a way which doesn't cause conflicts. As we are talking about user-provided content here, there also is a cause for conflict by different users trying to use the same names. So we need to do some namespacing to handle these conflicts.

Potential abuse. The permalinks are pointing to user-provided content. So depending on how much influence the user has on the link, there might be some potential for abuse by users who try to create links which misrepresent the site.

Non-ASCII characters. If you base permalinks on names, you have to deal with characters which are natural in names, but not in URLs. This can make it hard to create permalinks.

Let's use a fictional example to illustrate the requirements and challenges:

John Doe likes baking cakes. He also likes to share, so he publishes his recipes on example.com. His favorite recipe is the chocolate cake of his aunt Tilly. So he publishes it and the site creates the permalink example.com/chocolate_cake. John tweets the link. His friends get it, bake the cake, and everybody is happy. This permalink is short and pretty. It conflicts with all other recipes named chocolate cake, though. So the first to publish a recipe wins. This is good for John, but bad for other users, so not a perfect solution.

One way to avoid the problem of conflicting names would be to add a namespace for the user, so the permalink would become example.com/jdoe/chocolate_cake. It makes the link longer, though, and there still is the potential for conflicts in the user name. So when John's sister Jane Doe joins example.com, she won't be able to use her favorite user name, which also is jdoe, but has to choose something else. Still not perfect.

Now aunt Tilly is a modern lady. She reads the tweet and sends an email to John: "Hi John, I gave you this recipe. Be a good boy and mention this on your web site. All the best, Tilly". John is a good boy and changes the name of the recipe to "Aunt Tilly's Chocolate cake". The web site creates the permalink example.com/aunt_tillys_chocolate_cake. This makes aunt Tilly happy, and the link is still expressive, pretty and relatively short, but it breaks the link in John's tweet. So the site has to redirect the old link to the new one. At the very least it has to prevent the old URL from being used for something different. Either way, the nicer URL example.com/chocolate_cake becomes unavailable for other recipes. This is good for permanence, but bad for pretty and short links.
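To make the redirect idea a bit more concrete, here is a minimal Python sketch. The RecipeSite class, its method names and the in-memory dictionaries are all made up for illustration; a real site would keep this data in its database. It only shows the principle: a retired link stays reserved and redirects to the current one.

```python
class RecipeSite:
    """Toy in-memory permalink registry (hypothetical, for illustration only)."""

    def __init__(self):
        self.current = {}  # slug in use -> recipe id
        self.retired = {}  # old slug -> recipe id it used to point to

    def _is_taken(self, slug):
        # A slug stays reserved forever, even after the recipe was renamed.
        return slug in self.current or slug in self.retired

    def publish(self, recipe_id, slug):
        if self._is_taken(slug):
            raise ValueError(f"slug already taken: {slug}")
        self.current[slug] = recipe_id

    def rename(self, old_slug, new_slug):
        if self._is_taken(new_slug):
            raise ValueError(f"slug already taken: {new_slug}")
        recipe_id = self.current.pop(old_slug)
        self.retired[old_slug] = recipe_id
        self.current[new_slug] = recipe_id

    def resolve(self, slug):
        """Return (HTTP status, slug to serve or redirect to)."""
        if slug in self.current:
            return 200, slug
        if slug in self.retired:
            recipe_id = self.retired[slug]
            for candidate, rid in self.current.items():
                if rid == recipe_id:
                    return 301, candidate  # permanent redirect to the new link
        return 404, None


site = RecipeSite()
site.publish(1, "chocolate_cake")
site.rename("chocolate_cake", "aunt_tillys_chocolate_cake")
print(site.resolve("chocolate_cake"))
```

The link in John's tweet keeps working through the redirect, and nobody else can ever claim chocolate_cake, which is exactly the trade-off described above.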

Another problem which is illustrated by the name change is the handling of special characters. The apostrophe is hard to handle in a URL, so the site just removes it for the link. You can come up with all kinds of rules to handle these special characters, but they will eventually fail to generate pretty URLs, e.g. when somebody uses a Japanese name. This leaves three options: the user edits the link, which introduces lots of opportunity for change and the problems associated with it; you give up on pretty URLs for at least some cases; or you let users deal with the problems of encoding special characters in URLs and the issues you can run into, e.g. when using tools which don't properly handle all of this. Another stumbling block on our quest for the perfect permalink.
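Here is a small Python sketch of two of these options, using only the standard library. The function names ascii_slug and encoded_slug are my own invention, and the exact rules (drop apostrophes, turn everything else into underscores) are just one possible choice, not how any particular site does it.

```python
import re
import unicodedata
from urllib.parse import quote


def ascii_slug(name):
    """Rule-based slug: pretty for Latin names, empty for e.g. Japanese ones."""
    # Split accented letters into base letter + combining mark,
    # then throw away everything that is not ASCII.
    decomposed = unicodedata.normalize("NFKD", name)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    # Drop apostrophes so "Tilly's" becomes "tillys" rather than "tilly_s".
    ascii_only = ascii_only.replace("'", "")
    # Collapse every remaining run of other characters into one underscore.
    return re.sub(r"[^a-z0-9]+", "_", ascii_only.lower()).strip("_")


def encoded_slug(name):
    """Keep all characters and percent-encode them as UTF-8 instead."""
    return quote(name.replace(" ", "_"), safe="")


print(ascii_slug("Aunt Tilly's Chocolate cake"))
print(encoded_slug("Aunt Tilly's Chocolate cake"))
```

The first call yields aunt_tillys_chocolate_cake. For a name written entirely in Japanese characters ascii_slug returns an empty string, which is where the rules fail, while encoded_slug always produces a valid link, just not a pretty one.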

An easy solution to avoid most of these problems is to generate random permalinks. This also removes all complexity with user-editable content and changes, as the link is independent of the content. So aunt Tilly's yummy chocolate cake would be referenced by example.com/3hd63lbdxz. This is short and permanent, but not pretty or expressive.
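Generating such links is simple. The following Python sketch assumes a token of ten lowercase letters and digits, like in the example above; the names random_token and new_permalink and the set of taken tokens are hypothetical, and a real site would check uniqueness against its database.

```python
import secrets
import string

# Assumption: lowercase letters and digits only, as in example.com/3hd63lbdxz.
ALPHABET = string.ascii_lowercase + string.digits


def random_token(length=10):
    """Return a random token drawn from ALPHABET."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


def new_permalink(taken, length=10):
    """Draw tokens until one is unused, record it in `taken` and return it."""
    while True:
        token = random_token(length)
        if token not in taken:
            taken.add(token)
            return token


taken = set()
print("example.com/" + new_permalink(taken))
```

With 36 possible characters and a length of 10 there are 36^10, roughly 3.7 × 10^15, different tokens, so collisions are rare, but the check against already used tokens is still needed to guarantee uniqueness.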

You can think of many variations and combinations of these schemes, but meeting all requirements really is very hard. It seems like there is no perfect permalink. But let's look at some real-world examples.

Real-world examples

Wikipedia provides short and pretty permalinks. They have the advantage that the number of terms represented in links is limited, pretty well-defined and not completely up to users. They still have to deal with conflicts and do that with their disambiguation pages. They take on the challenge of encoding special characters, which is nice.

Gitorious lets users choose the permalink (or slug as they call it). They forbid special characters. The permalink becomes a top-level path, which is nice. You can't name your project login, though. You can change your slug, but this breaks old URLs.

GitHub takes a slightly different path. They prefix all projects with the user name. This is nice as it avoids conflicts and it also stresses the social aspect of the site. Some URLs become a bit ugly, e.g. github.com/rails/rails. They seem to clean up their users and projects from time to time, as I wasn't able to find a nasty URL today which used to exist.

s.opensu.se is a site which provides links to various openSUSE resources. You can for example reference repositories by short links like s.opensu.se/r?network:utilities. This is nice and short and reasonably expressive. Whether it's pretty is a matter of taste. It doesn't address changing links.

Markmail is an example of random permalinks. This is probably the only way to cope with the number of objects they manage as a mailing list archive, and the links are still short and relatively pretty: markmail.org/message/vyjutm3jkecxprzj. Change is not an issue for them as objects are static.

There are tons of other examples out there. If you know of one which solves the problems of permalinks in a particularly good or interesting way, please let me know.

What do you think?

My preliminary conclusion is that it's probably impossible to come up with a perfect scheme for permalinks, and we need to do some compromise. I like including user names in the URLs as it solves some of the conflict issues, is actually useful information, and makes for expressive URLs. But of course there are other solutions as well.

What do you think? What would you like permalinks to SUSE Studio appliances to look like?

9 comments:

This is just an idea I had when reading your post, don't know if it's sensible: hostname.tld/random_number_for_object/some_description

Example: hello.org/sdfg44532/chocolate_cake

The object is identified by the random number, the title is arbitrary. The URL hello.org/sdfg44532/xyz would lead to the same page. This would also solve the problem of encoded URLs, since it does not matter if an app misinterprets the last parts; it would still end up at the page identified by the random number.

Of course, you could also add the username at the end, if useful: hello.org/sdfg44532/jdoe/chocolate_cake

The benefits of this solution:

Expressive (if necessary)

Short (if necessary)

Permanent

So the user can actually choose between a short and an expressive version.
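A rough Python sketch of how such a lookup could work, assuming the ID is always the first path segment. The RECIPES dictionary and the lookup function are made-up names for illustration only.

```python
# Hypothetical data: random object ID -> recipe title.
RECIPES = {"sdfg44532": "Chocolate cake"}


def lookup(path):
    """Resolve a path by its first segment only; the rest is decoration."""
    segments = [s for s in path.split("/") if s]
    if not segments:
        return None
    return RECIPES.get(segments[0])


# All three variants end up at the same object.
for path in ("/sdfg44532", "/sdfg44532/chocolate_cake", "/sdfg44532/jdoe/chocolate_cake"):
    print(path, "->", lookup(path))
```

Because only the first segment is evaluated, a mangled or misencoded description part cannot break the link.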

Why not have a counter, i.e. example.com/jdoe/choccycake/1 for John and example.com/jdoe/choccycake/2 - depending on whether the user wishes to share their work, this would also allow people to look at alternative recipes!

The basic problem is that you will _never_ be able to cater to everyone. If I have the name RichiH, no one else can have it. Not even in 100 years when I am long gone, but the permalink is still supposed to work.

That being said, I would go with usernames in DNS chars, optionally a _$i to discern richih from richih_2. I.e.

example.com/richih/$my_tag

For completeness, most paste & URL shortening sites use random upper/lower/numeric. That is not really an option for what you want, though.

As an aside, maybe someone does not wish to use his or her username for a link. In this case, a default UUID or UID might make sense. UUIDs have the additional advantage that they are easy to find with a parser.

@flow that idea is almost a reductio ad absurdum of "expressive links" - the link can be made shortish and expressive if you want but you are aware that you are adding information that does nothing other than express what the link is, so it might as well be written beside the link rather than in it.

Of course it is "abusable" since you can put anything after the UUID but then I don't see a major problem with that.

can be replaced by anything (but must be of the format:...t5/sometext/sometext/m-p/...

I don't really like having formatting information in the URL like this. Markmail on the other hand is brilliant, with its AJAX interface that nevertheless has a one-to-one correspondence between the interface state and the URL.

I've seen the scheme described by flow used in several places, e.g. OpenDesktop.org sites (kde-look.org, kde-apps.org etc.), some video sites etc. The problem is that it's prone to one kind of abuse: people can have fun creating links with derogatory URLs such as example.com/sdfg44532/baby_meat_cake (instead of example.com/sdfg44532/chocolate_cake) which in some cases can discredit the author. This attack has been used recently against the French president Nicolas Sarkozy's website for much fun. The admins responded by changing the site to no longer match only based on the ID, but only accept the full exact URL.
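The difference between the two behaviours can be sketched in a few lines of Python. The CANONICAL table and the status_for function are hypothetical names; this is only meant to illustrate the idea, not what any of the mentioned sites actually run.

```python
# Hypothetical data: object ID -> the one full path the site itself generated.
CANONICAL = {"sdfg44532": "/sdfg44532/chocolate_cake"}


def status_for(path, strict=True):
    """Return the HTTP status a request for `path` would get."""
    segments = [s for s in path.split("/") if s]
    canonical = CANONICAL.get(segments[0]) if segments else None
    if canonical is None:
        return 404  # unknown ID
    if strict and path != canonical:
        return 404  # known ID, but the description part was tampered with
    return 200


print(status_for("/sdfg44532/baby_meat_cake", strict=False))
print(status_for("/sdfg44532/baby_meat_cake", strict=True))
```

In the lenient mode the derogatory link works, in the strict mode it is rejected. The price of the strict mode is that the description is part of the identity again, so renaming the object brings back the problem of broken links.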

@flow: This scheme is nice, but the abuse argument also is strong. It's a bit odd to add unneeded text to the link. I guess you can always do that anyway by adding an anchor fragment like #yummy_chocolate_cake.

@RichiH: The UUID idea is interesting. This would make it possible to discover objects without knowing a lot about the used scheme. It lacks in terms of prettiness, though.

@hads: I agree that the user/name URL is the nicest one. It has the problems of being affected by changes of user or object names, and conflicts for common names. So especially for big and active sites this will lead to strange URLs for many people and objects. Underscores in URLs are a matter of taste, I suppose, or is there any special reason why you dislike them?