Next up in the toolbox series is an idea so good it deserves an entire article all to itself: consistent hashing.

Let’s say you’re a hot startup and your database is starting to slow down. You decide to cache some results so that you can render web pages more quickly. If you want your cache to use multiple servers (scale horizontally, in the biz), you’ll need some way of picking the right server for a particular key. If you only have 5 to 10 minutes allocated for this problem on your development schedule, you’ll end up using what is known as the naïve solution: put your N server IPs in an array and pick one using key % N.
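To make the problem with that concrete, here is a sketch of the naive scheme in Python (the IPs and key names are made up for illustration). Watch what happens when you add a fifth server:

```python
# Naive cache-server selection: hash the key, then take it modulo N.
# IPs and keys here are illustrative, not from any real deployment.
import hashlib

def naive_server(key: str, servers: list) -> str:
    h = int(hashlib.sha1(key.encode()).hexdigest(), 16)
    return servers[h % len(servers)]

keys = [f"user:{i}" for i in range(10000)]
old = ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"]
new = old + ["10.0.0.5"]  # add one server, changing N from 4 to 5

moved = sum(1 for k in keys if naive_server(k, old) != naive_server(k, new))
print(f"{moved / len(keys):.0%} of keys moved")  # roughly 80% with these numbers
```

Going from 4 to 5 servers leaves a key in place only when its hash gives the same index mod 4 and mod 5, so the vast majority of your cache entries suddenly live on the wrong server.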

I kid, I kid — I know you don’t have a development schedule. That’s OK. You’re a startup.

Anyway, this ultra simple solution has some nice characteristics and may be the right thing to do. But your first major problem with it is that as soon as you add a server and change N, most of your cache will become invalid. Your databases will wail and gnash their teeth as practically everything has to be pulled out of the DB and stuck back into the cache. If you’ve got a popular site, what this really means is that someone is going to have to wait until 3am to add servers because that is the only time you can handle having a busted cache. Poor Asia and Europe — always getting screwed by late night server administration.

You’ll have a second problem if your cache is read-through or you have some sort of processing occurring alongside your cached data. What happens if one of your cache servers fails? Do you just fail the requests that should have used that server? Do you dynamically change N? In either case, I recommend you save the angriest tweets about your site being down. One day you’ll look back and laugh. One day.

As I said, though, that might be OK. You may be trying to crank this whole project out over the weekend and simply not have time for a better solution. That is how I wrote the caching layer for Audiogalaxy searches, and that turned out OK. The caching part, at least. But if I had known about it at the time, I would have started with a simple version of consistent hashing. It isn’t that much more complicated to implement, and it gives you a lot of flexibility down the road.

The technical aspects of consistent hashing have been well explained in other places, and you’re crazy and negligent if you use this as your only reference. But I’ll do my best. Consistent hashing is a technique that lets you smoothly handle these problems:

Given a resource key and a list of servers, how do you find a primary, secondary, tertiary (and on down the line) server for the resource?

If you have different size servers, how do you assign each of them an amount of work that corresponds to their capacity?

How do you smoothly add capacity to the system without downtime? Specifically, this means solving two problems:

How do you avoid dumping 1/N of the total load on a new server as soon as you turn it on?

How do you avoid rehashing more existing keys than necessary?

In a nutshell, here is how it works. Imagine a 64-bit space. For bonus points, visualize it as a ring, or a clock face. Sure, this will make it more complicated when you try to explain it to your boss, but bear with me:

That part isn’t very complicated.

Now imagine hashing resources into points on the circle. They could be URLs, GUIDs, integer IDs, or any arbitrary sequence of bytes. Just run them through a good hash function (e.g., SHA-1) and shave off everything but 8 bytes. Now, take those freshly minted 64-bit numbers and stick them onto the circle:

Finally, imagine your servers. Imagine that you take your first server and create a string by appending the number 1 to its IP. Let’s call that string IP1-1. Next, imagine you have a second server that has twice as much memory as server 1. Start with server #2’s IP, and create 2 strings from it by appending 1 for the first one and 2 for the second one. Call those strings IP2-1 and IP2-2. Finally, imagine you have a third server that is exactly the same as your first server, and create the string IP3-1. Now, take all those strings, hash them into 64-bit numbers, and stick them on the circle with your resources:

Can you see where this is headed? You have just solved the problem of which server to use for resource A. You start where resource A is and head clockwise on the ring until you hit a server. If that server is down, you go to the next one, and so on and so forth. In practice, you’ll want to use more than 1 or 2 points for each server, but I’ll leave those details as an exercise for you, dear reader.
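The whole scheme fits in a page of code. Here is a minimal sketch in Python, assuming SHA-1 truncated to 64 bits as described above; the IPs and point counts are made up, and a production version would want many more points per server:

```python
# A minimal consistent-hash ring. Server IPs and point counts are illustrative.
import bisect
import hashlib

def hash64(s: str) -> int:
    # SHA-1, keeping only the first 8 bytes (64 bits), as described above.
    return int.from_bytes(hashlib.sha1(s.encode()).digest()[:8], "big")

class Ring:
    def __init__(self):
        self._points = []   # sorted 64-bit positions on the circle
        self._owner = {}    # position -> server IP

    def add(self, ip: str, points: int):
        # One ring position per "IP-n" string, exactly as in the text.
        for i in range(1, points + 1):
            pos = hash64(f"{ip}-{i}")
            bisect.insort(self._points, pos)
            self._owner[pos] = ip

    def lookup(self, key: str) -> str:
        # Walk clockwise: take the first server point after the key's
        # position, wrapping around the end of the circle.
        i = bisect.bisect_right(self._points, hash64(key)) % len(self._points)
        return self._owner[self._points[i]]

ring = Ring()
ring.add("10.0.0.1", 100)   # server 1
ring.add("10.0.0.2", 200)   # server 2: twice the memory, twice the points
ring.add("10.0.0.3", 100)   # server 3
print(ring.lookup("resource-A"))
```

Note that the sorted list plus binary search makes each lookup O(log P) in the total number of points, which is plenty fast even with thousands of points per server.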

Now, allow me to use bullet points to explain how cool this is:

Assuming you’ve used a lot more than 1 point per server, when one server goes down, every other server will get a share of the new load. In the case above, imagine what happens when server #2 goes down. Resource A shifts to server #1, and resource B shifts to server #3 (Note that this won’t help if all of your servers are already at 100% capacity. Call your VC and ask for more funding).

You can tune the amount of load you send to each server based on that server’s capacity. Imagine this spatially – more points for a server means it covers more of the ring and is more likely to get more resources assigned to it.

You could have a process try to tune this load dynamically, but be aware that you’ll be stepping close to problems that control theory was built to solve. Control theory is more complicated than consistent hashing.

If you store your server list in a database (2 columns: IP address and number of points), you can bring servers online slowly by gradually increasing the number of points they use. This is particularly important for services that are disk bound and need time for the kernel to fill up its caches. This is one way to deal with the datacenter variant of the Thundering Herd Problem.
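A hypothetical sketch of that ramp-up, using an in-memory SQLite table standing in for your real server database (table and column names are invented; in production the steps would be minutes apart while you watch hit rates, not milliseconds):

```python
# Bring a new cache server online gradually by growing its point count.
# The schema and numbers here are made up for illustration.
import sqlite3
import time

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE servers (ip TEXT PRIMARY KEY, points INTEGER)")
db.execute("INSERT INTO servers VALUES ('10.0.0.1', 1000), ('10.0.0.2', 1000)")

# Start the new server at 10% of its target weight...
db.execute("INSERT INTO servers VALUES ('10.0.0.3', 100)")

# ...then step it up while its caches warm. In real life, each step would be
# minutes or hours apart, gated on hit rates and disk load looking healthy.
for points in (250, 500, 750, 1000):
    time.sleep(0.01)
    db.execute("UPDATE servers SET points = ? WHERE ip = '10.0.0.3'", (points,))
    db.commit()

print(db.execute("SELECT ip, points FROM servers ORDER BY ip").fetchall())
```

Each update only shifts the slice of keys covered by the newly added points, so the new server's miss traffic grows in small, controlled increments instead of one big thundering herd.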

Here I go again with the control theory — you could do this automatically. But adding capacity usually happens so rarely that just having somebody sitting there watching top and running SQL updates is probably fine. Of course, EC2 changes everything, so maybe you’ll be hitting the books after all.

If you are really clever, when everything is running smoothly you can go ahead and pay the cost of storing items on both their primary and secondary cache servers. That way, when one server goes down, you’ve probably got a backup cache ready to go.
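Finding that secondary is just a continuation of the same clockwise walk: keep going past the primary until you hit a point owned by a different server. A sketch, under the same made-up-IPs assumption as before:

```python
# Primary + secondary lookup on a consistent-hash ring: walk clockwise past
# the primary until a point owned by a *different* server appears.
# Server IPs and point counts are illustrative.
import bisect
import hashlib

def hash64(s: str) -> int:
    return int.from_bytes(hashlib.sha1(s.encode()).digest()[:8], "big")

def build_ring(servers, points=100):
    # Sorted list of (position, ip) pairs, one per "IP-n" string.
    return sorted((hash64(f"{ip}-{i}"), ip)
                  for ip in servers for i in range(1, points + 1))

def primary_and_secondary(ring, key):
    positions = [pos for pos, _ in ring]
    i = bisect.bisect_right(positions, hash64(key))
    primary = ring[i % len(ring)][1]
    # Skip over any further points owned by the primary itself.
    j = i + 1
    while ring[j % len(ring)][1] == primary:
        j += 1
    return primary, ring[j % len(ring)][1]

ring = build_ring(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
p, s = primary_and_secondary(ring, "resource-A")
print(p, s)  # two different servers
```

Write to both on the way in, and a failover read to the secondary will usually find a warm copy waiting.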

Pretty cool, eh?

I want to hammer on point #4 for a minute. If you are building a big system, you really need to consider what happens when machines fail. If the answer is “we crush the databases,” congratulations: you will get to observe a cascading failure. I love this stuff, so hearing about cascading failures makes me smile. But it won’t have the same effect on your users.

Finally, you may not know this, but you use consistent hashing every time you put something in your cart at Amazon.com. Their massively scalable data store, Dynamo, uses this technique. Or if you use Last.fm, you’ve used a great combination: consistent hashing + memcached. They were kind enough to release their changes, so if you are using memcached, you can just use their code without dealing with these messy details. But keep in mind that there are more applications to this idea than just simple caching. Consistent hashing is a powerful idea for anyone building services that have to scale across a group of computers.

45 Responses to “Programmer’s Toolbox Part 3: Consistent Hashing”

Would sticky sessions enabled on a load balancer help with the caching issue “If you want your cache to use multiple servers (scale horizontally, in the biz), you’ll need some way of picking the right server for a particular key.”? Good article by the way…thanks.

Your clear explanation and illustrations inspired me to write an open source implementation for PHP, as I couldn’t see anything decent around that fit the bill. I’ve put it on Google Code at http://code.google.com/p/flexihash/ in case anybody needs it.

I got half excited about your article… it hits most of the squares… However, your circle shows the servers nicely equidistant around the 360°, but if we leave their placement to the hash function then they could actually all appear in a very acute part of the circle, so that the services or resources could mostly be directed unevenly to one server….

Would it not be best to divide the available resources equally in the pie? (and one could take into account weighting on this also)

In the case that one server fails, then instead of applying round-robin, would it not be best to select the best server… i.e., the server with the most currently available processing capability (obtained from server size and current load)?

Good article, Tom.
For newbies, if you include active-standby also in this article, then the removal use-case will make more sense. I mean, when one server has to be removed, what will happen to the data it contained? Either (a) the standby server is assigned the same IP (say, using a virtual IP), or (b) a new server which has the original data of the failed node will be added to the hashing and the failed node’s IP will be removed.

Great article. I ran a few tests that started with 3 servers, each weighted with 3000-5000 points handling 2000 resources over several million random resource requests. Halfway through the test, I dynamically added another, much more powerful server (50000 points). The hashing worked great. The requests were distributed according to weight over the first half of the test and weighted heavily on the last server for the second half of the test.

The only outlier is if you have, say, 3 resources instead of 2000. In that case, server 1 will never be used unless it has exponentially more weight than the other servers.

Wow. Amazon really screwed the pooch with Dynamo by pretending that the “hash wheel” was a solution to anything (unless the problem that they were trying to solve was “How to create needlessly complex systems.”)

Now tech-lemmings everywhere are preaching this ludicrously bad design as the second coming of Microsoft Bob.

Look, if a design solves a problem poorly, and degrades drastically on just about every type of failure condition, then the design sucks. Logic alone is enough to cause this design to be discarded.

Cameron, it’d be cool if you’d state your case with some pointers to facts, research, or something helpful. You’ve essentially just said “You’re stupid.” which isn’t horribly useful to helping us become smarter.

But I still do not understand how to make the caching server balanced. As shown in the third figure, when IP2-1 and IP2-2 are added, since their positions are based on hash values and thus randomized, is it possible that the second server will only occupy a small fraction of the whole interval, instead of 1/3?

In reality, you would extend this idea by assigning thousands or more keys per server (IP2-1 … IP2-5000, for instance). If you do this for every server and use a uniform hash function, you’ll get a balanced load distribution.
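A quick simulation makes the point (a Python sketch; server IPs, point counts, and key names are made up):

```python
# Compare load distribution with 1 point per server vs. 1000 points per
# server. All names and numbers here are illustrative.
import bisect
import hashlib
from collections import Counter

def hash64(s: str) -> int:
    return int.from_bytes(hashlib.sha1(s.encode()).digest()[:8], "big")

def simulate(points_per_server: int, n_keys: int = 100000) -> Counter:
    servers = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]
    ring = sorted((hash64(f"{ip}-{i}"), ip)
                  for ip in servers
                  for i in range(1, points_per_server + 1))
    positions = [pos for pos, _ in ring]
    counts = Counter()
    for k in range(n_keys):
        i = bisect.bisect_right(positions, hash64(f"key:{k}")) % len(ring)
        counts[ring[i][1]] += 1
    return counts

print(simulate(1))      # one point per server: often badly skewed
print(simulate(1000))   # 1000 points per server: close to an even three-way split
```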

Really liked it. I had learned about the idea of consistent hashing a long time ago, and this article, especially the pictures, brought all the memories back, and in a few minutes, I could freshen up my memory. Thanks for writing a good article.