Snook.ca

Building a URL Shortener

With all the talk of URL shortening services, I decided to add a quick service into Snook.ca, which is run on CakePHP, to redirect a short URL to a post. Because my static content already has short URLs and all I have are posts, creating a short URL handler for it was very easy.

To give you some context, I route my posts through a specific structure:

/archives/:category
/archives/:category/:articlename

In this case, I have a couple of routes that send everything to my Posts controller and its bycat or view actions. These actions take the named parameters and pull out the appropriate content. Easy peasy.

The key thing here is that my articles have two identifiers: one is the slug, the other is the post ID. The process for the short service just takes a post ID and redirects it to its fully expanded URL.

I grab the named parameter. I unbind my Comment model to prevent the findById call in the next line from returning too much. Then I find my post which will return the associated tags (which are my "categories"). I build the URL and then redirect the user onwards.
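A rough sketch of the URL-building step in plain PHP (not the CakePHP action itself). It assumes a post record shaped like the nested array CakePHP 1.x returns from findById, and the `slug` field names and the `build_post_url` helper are hypothetical:

```php
<?php
// Hypothetical helper: builds /archives/:category/:articlename from a post
// record. The array shape and the 'slug' field names are assumptions.
function build_post_url(array $post)
{
    // Use the first associated tag as the category segment.
    $category = $post['Tag'][0]['slug'];
    $article  = $post['Post']['slug'];

    return '/archives/' . rawurlencode($category) . '/' . rawurlencode($article);
}

// Example record, roughly as a find might return it with Comment unbound.
$post = array(
    'Post' => array('id' => 42, 'slug' => 'building-a-url-shortener'),
    'Tag'  => array(array('slug' => 'php')),
);

echo build_post_url($post), "\n"; // /archives/php/building-a-url-shortener
```

In a controller, the resulting path would then be handed to the redirect call. A post with no tags would need its own handling, which this sketch leaves out.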

I haven't exposed the short URL in any way, yet. For now, it's more to allow myself quick posts to Twitter without having to use another service and to see if people are retweeting the link.

Building your own

With a single model and a single action taking a single parameter, wiring up a URL shortener was very simple. How could you do it with a more complicated system?

Multiple Routes

Another easy way to extend this concept is to simply map a prefix to the view action of each model that needs to be shortened. You could have a Posts model on /p/ and a Comments model on /c/. The :id for each of them would simply point to the view page for each one. That offers up a little more flexibility but not much.
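As an illustration of the prefix idea outside of any framework, here is a minimal sketch that resolves a short path to a model name and an ID. The `$shortPrefixes` map and the `resolve_short_path` function are hypothetical names, and numeric IDs are assumed:

```php
<?php
// Hypothetical prefix-to-model map, following the /p/ and /c/ example.
$shortPrefixes = array('p' => 'Posts', 'c' => 'Comments');

// Returns array(model, id) for a path like "/p/123", or null if nothing matches.
function resolve_short_path($path, array $prefixes)
{
    // One lower-case prefix letter, then a numeric ID.
    if (!preg_match('#\A/([a-z])/(\d+)\z#', $path, $m)) {
        return null;
    }
    if (!isset($prefixes[$m[1]])) {
        return null;
    }
    return array($prefixes[$m[1]], (int) $m[2]);
}

print_r(resolve_short_path('/p/123', $shortPrefixes));
```

From there, each model's view action only has to look up the ID and redirect.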

Automatically Creating and Caching Short Links

In thinking this through, especially for an established site, you could have it automatically create a short link for any URL on your site once it has been visited once. First, create a new model called Short (or whatever you feel like it should be called). The Short model will consist of two fields: the primary key (id) and a character field to store the URL.

Within your AppController, grab the current URL (available via $this->url). With the URL being a unique key as well as the ID, you'll only have a single ID for each URL.

If you want to find the short URL for an existing URL, just look it up in the database.

$this->Short->findByUrl($this->url);

If a URL is not found, you'll need to create a new record for it. It'd be advantageous to create a method on your model that handles the whole find/not found/create process.
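To show the shape of that find/not found/create method, here is a minimal sketch that uses an in-memory array as a stand-in for the database table. The `ShortStore` class and its `findOrCreate` method are hypothetical; a real model would run a find and a save instead:

```php
<?php
// In-memory stand-in for the Short model, for illustration only.
// Class and method names are hypothetical.
class ShortStore
{
    private $idsByUrl = array();
    private $nextId = 1;

    // Returns the existing ID for $url, or creates a record and returns the new ID.
    public function findOrCreate($url)
    {
        if (isset($this->idsByUrl[$url])) {
            return $this->idsByUrl[$url];
        }
        $id = $this->nextId++;          // mimics an auto-increment primary key
        $this->idsByUrl[$url] = $id;
        return $id;
    }
}

$shorts = new ShortStore();
$first  = $shorts->findOrCreate('/archives/php/url-shortener');
$again  = $shorts->findOrCreate('/archives/php/url-shortener');
var_dump($first === $again); // bool(true)
```

Because the URL is a unique key, asking twice for the same URL always yields the same ID.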

You can use your ID as your short form (as I did) which, given most sites, will be quite small. If you have 4000 unique URLs, you're using 4 characters. What if you wanted to optimize that even further?

You could convert that integer into a hexadecimal value. Anything under 4096 items will only take 3 characters. That's not bad. At 4096 and beyond, you're back to 4 characters.
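PHP's built-in dechex() and hexdec() handle this conversion. A quick illustration of where the 3-character boundary falls:

```php
<?php
// dechex() and hexdec() are PHP built-ins.
echo dechex(4000), "\n";  // fa0
echo dechex(4095), "\n";  // fff
echo dechex(4096), "\n";  // 1000
echo hexdec('fa0'), "\n"; // 4000
```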

Creating a super-compressed URL

But what if you wanted to optimize that even further? The trick is to create your own base system with a custom set of characters. This next bit of code isn't CakePHP-specific.

It loops over the original number, converting it into the base that you want. In my particular example, it converts 300 into 4Q. With 62 characters you get 3,844 values before you need more than 2 characters, and over 238,000 before you get past 3 characters. Precious bytes.
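A minimal sketch of such a conversion loop. The 62-character alphabet and its ordering (digits, lower case, upper case) are an assumption, chosen because it turns 300 into 4Q; the function name `short_encode` is hypothetical, and intdiv() requires PHP 7 or later:

```php
<?php
// Converts a non-negative integer into a string in a custom base.
// The default alphabet (0-9, a-z, A-Z) is an assumption.
function short_encode($num, $chars = '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ')
{
    $base = strlen($chars);
    $out  = '';
    do {
        $out = $chars[$num % $base] . $out; // prepend the lowest-order digit
        $num = intdiv($num, $base);         // drop that digit
    } while ($num > 0);
    return $out;
}

echo short_encode(300), "\n";  // 4Q
echo short_encode(3843), "\n"; // ZZ
echo short_encode(3844), "\n"; // 100
```

This version works on native integers only, so it is limited by the platform's integer size.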

If you were setting up a route for this, you can use the following regex pattern:
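A pattern along these lines would match IDs drawn from digits plus lower- and upper-case letters. That character set is an assumption, and the `is_short_id` helper is a hypothetical plain-PHP check rather than route configuration:

```php
<?php
// Matches one or more characters from the assumed 62-character alphabet.
$shortIdPattern = '[0-9a-zA-Z]+';

// Anchored check in plain PHP; \A and \z pin the match to the whole string.
function is_short_id($value, $pattern)
{
    return preg_match('/\A(?:' . $pattern . ')\z/', $value) === 1;
}

var_dump(is_short_id('4Q', $shortIdPattern));  // bool(true)
var_dump(is_short_id('4-Q', $shortIdPattern)); // bool(false)
```

In a CakePHP route, a pattern like this would typically be supplied as the constraint for the ID parameter.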

Of course, feel free to customize the acceptable character list — some people drop 0, O, 1 and l to avoid confusion.

Converting the Compressed Version back to Decimal

Going back is straightforward. In the retrieve method of the shorts controller that we set the route up for, we need to take our compressed ID and uncompress it into an integer we can search the database for.

Each point in the string is multiplied by the base to the power of that position. Then it grabs the URL field for that item. Finally, it redirects them off to their final destination.
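The decoding step on its own can be sketched as follows, again assuming the 0-9, a-z, A-Z alphabet. The `short_decode` name is hypothetical, and the database lookup and redirect are left out:

```php
<?php
// Converts a string in a custom base back into an integer.
// The default alphabet (0-9, a-z, A-Z) is an assumption.
function short_decode($str, $chars = '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ')
{
    $base = strlen($chars);
    $len  = strlen($str);
    $num  = 0;
    for ($i = 0; $i < $len; $i++) {
        $digit = strpos($chars, $str[$i]);
        if ($digit === false) {
            throw new InvalidArgumentException("Invalid character: {$str[$i]}");
        }
        // Positions count from the right: the last character is base^0.
        $num += $digit * pow($base, $len - $i - 1);
    }
    return $num;
}

echo short_decode('4Q'), "\n"; // 300
```

Here 4Q works out as 4 × 62 + 52 = 300.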

Wrapping it up

This blog post twisted and turned but ended up in a great place. The principles of the shortening system could be applied to any system whether it's CakePHP or not. If you're a CakePHP fan, feel free to take this example and build it into a component or plugin.

@Ulf: I hadn't seen the lilURL project. Thanks for pointing it out. Admittedly, I prefer my approach as it leaves the database to what it does best: incrementing integers for primary keys. Converting from one base to another is fairly easy (although it did take me a bit to track it down and put together a JavaScript-based proof).

The other limiting factor with lilURL is that it only seems to take advantage of lower-case letters (from what I can tell). However, maybe I'll look to contribute my ideas back into the project.

On a separate note, I took a quick look at what other characters you could use to expand the character set and possibly compress your URL further. Here are the other unreserved characters: -_.!~*'(). I think that'd make the URL less readable but just wanted to share.

Just doing a little research after finding the lilURL library, I found the tighturl project, which is also open source. It does spam filtering, as well.

Finally, and this is something I wasn't aware of, PHP has a base_convert function that'll convert anything up to a base 36 [0-9a-z]. Being able to do upper and lower case, as I've done it, certainly gives you greater compression. Of course, use what works best for you.
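For comparison, a quick look at base_convert() with the same number used earlier. It takes and returns strings:

```php
<?php
// base_convert() is a PHP built-in; bases from 2 to 36 are supported.
echo base_convert('300', 10, 36), "\n"; // 8c
echo base_convert('8c', 36, 10), "\n";  // 300
```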

Having just made my own URL shortener for personal use a couple weeks ago (in action here: http://a.stronaut.com/z1), I thought it was important that I share a couple of things I learned:

1. Check out the PHP built-in function:

base_convert()

2. The default HTTP header that PHP throws when you do a Location redirect is 302, which is bad because you will end up with your pages possibly twice in Google search results, one of those listings being under the shortened URL. You want to throw a 301 code before the location code:
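One way to do that in a single call, rather than with a separate status header, is to pass the response code as the third argument of header(). A minimal sketch, where the `redirect_permanently` name and the example URL are placeholders:

```php
<?php
// Sends a permanent redirect. The third argument of header() sets the
// response code, so the 301 goes out together with the Location header.
function redirect_permanently($url)
{
    header('Location: ' . $url, true, 301);
}

// Usage (must run before any output is sent; the URL is a placeholder):
// redirect_permanently('http://example.com/archives/php/some-post');
// exit;
```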

Nicely done! I've also thought about building a URL shortener, and would've done it much the same way. Although, it'd be great to know: how often do URLs get shortened, and how often does a particular shortened URL get used?

@Zach: To prevent duplicate links, you can also use the canonical "link" tag that google endorses:
http://googlewebmastercentral.blogspot.com/2009/02/specify-your-canonical.html As per google, it's a hint that they "honor strongly". It's now also supported by Yahoo, Ask.com, and MS Live Search.

I agree we should be mindful of sending 301 v 302, but I haven't seen too many browsers that change behavior based on the redirect code.

Have you considered using rev="canonical"? Drew McLellan pointed me to it when I was recording my podcast - it's a semantic way of representing a shorter link. You simply embed a link tag into each page, with the target pointing to the shorter link.

PHP.net makes good use of it - if you view source on every single function page you can see the shorter URL. And if you visit http://revcanonical.appspot.com/, you can see it working in action.

I guess it's the next step for the semantic web - but what you've already accomplished is sweet, and I reckon rev="canonical" will make it even sweeter.

Anyway, if you've got any questions about rev="canonical", semantics or Microformats, give me an email - I'll be happy to help!

I made a similar post a day or two ago (with far less graceful code), but the whole setup is kind of useless on my site (the domain alone has seventeen characters). I may be able to fit up to 40 links in my bar, using the trick where you don't need a question mark. I spent hours looking at different ways of doing it, though.

Anyone interested in hosting their own URL Shortener should check out the project I manage, urlShort at http://urlshort.sourceforge.net. We're aiming to have it include all the good features of URL shorteners, but be free, open-source, and easy to use/setup.

Your writeup was great. The main gem for me was the base conversion code. I wrapped it in a static class and started experimenting. Your solution, base 58/60, is a much better solution than base 36 because the codes are shorter. With a 4-character code you can represent 0 to approximately 11.3M vs. only 1.7M in base 36. Things get crazy at 6 characters: 58^6 vs. 36^6.

I started experimenting with really large numbers to encode and your code fell over. It lost precision and was not able to accurately encode large integers, though the decoding process worked fine. Basically, past the maximum signed 32-bit integer (approximately 2.1 billion) it would fall over. Encoding and then decoding back would result in rounding errors.

Working in finance, I knew that I would need to convert your algorithm to use a big math library. PHP's floor blows up with big numbers, and so of course do modulo and division. I had to implement floor since bcmath doesn't come with that baked in. The class works with insanely huge numbers now. Like all-of-the-atoms-in-the-universe size numbers. You can use it as a basis for an arbitrary base encoding by just augmenting the character set. Base 60 is great if you really aren't worried about people ever having to read the codes over the phone. Base 61 basically adds a hyphen to the mix, though leading, trailing, and multiple hyphens would look weird to most people so it's probably a bad idea. "---" would be a valid number.
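A minimal sketch of the big-math idea, assuming the bcmath extension is installed. This is not the class described above: the `short_encode_big` name is hypothetical, and the default alphabet (0-9, a-z, A-Z) is an assumption that can be swapped for a 58- or 60-character set:

```php
<?php
// Arbitrary-precision encoder; requires the bcmath extension.
// $num is a non-negative integer passed as a decimal string.
function short_encode_big($num, $chars = '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ')
{
    $base = (string) strlen($chars);
    $out  = '';
    do {
        $digit = (int) bcmod($num, $base); // remainder picks the next character
        $out   = $chars[$digit] . $out;
        // Scale 0 truncates the quotient, which acts as floor for
        // non-negative values.
        $num   = bcdiv($num, $base, 0);
    } while (bccomp($num, '0') > 0);
    return $out;
}

// Usage: short_encode_big('300') gives "4Q" with the default alphabet.
```

Passing the number as a string keeps it out of PHP's native integer and float types entirely, which is what avoids the precision loss.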

I'm using the class in a project to include canonical short urls in content pages on my site. The added benefit is that I can just use the existing integer table ids. I tack on a prefix letter to identify the type of content (table type) and then I can convert the trailing code, get the id, and then lookup the SEO friendly long URL. I return that back to the browser with a 301 permanent redirect. Hope this helps peeps.

When I started trying the algorithm I saw that really large numbers were not working =/
After reading the PHP manual I came to the same conclusion as Travell Perkins.
But I also saw that numbers higher than the PHP integer limit are automatically converted to float, which has an even higher limit.
The problem with this conversion is that the % operator can't be used on float numbers.
So I came up with a solution that I THINK (only my opinion =P) is a simple way of solving this integer-float problem and working with large numbers.

This algorithm has the same idea as yours, but it has a larger range.
Testing it, I could see that the PHP FLOAT LIMIT is a number with 14 digits.
It's like 11,111,111,111,111 \o/
And the algorithm converts this 14-digit number to an 8-character key.
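A minimal sketch of a float-based variant of this kind, using the built-in fmod() in place of the % operator. The `short_encode_float` name and the 0-9, a-z, A-Z alphabet are assumptions, and the result is only exact while the float can still represent every integer in the range:

```php
<?php
// Float-based encoder: fmod() stands in for %, which only works on integers.
// The default alphabet (0-9, a-z, A-Z) is an assumption.
function short_encode_float($num, $chars = '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ')
{
    $base = strlen($chars);
    $num  = (float) $num;
    $out  = '';
    do {
        $digit = (int) fmod($num, $base);
        $out   = $chars[$digit] . $out;
        // Subtracting the digit first makes the division exact.
        $num   = ($num - $digit) / $base;
    } while ($num > 0);
    return $out;
}

echo strlen(short_encode_float(11111111111111.0)), "\n"; // 8
```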


Hi. My name is Jonathan Snook and this is my site. I write about what interests me, which is usually web design, development, and technology. I'm also in the middle of a food adventure.