Search

Introduction

I’ve made a node.js 3rd party module to help you auto-shard your data to any number or type of data stores using an algorithm known as consistent hashing. If you understand that sentence, you can skip this article and check out the node-hash-ring module on github. If you want to read more, I’ve divided this post into the following sections:

Sharding

As a hot startup, you discover that your database is starting to slow down and/or that your memcache instance is no longer large enough to store what you ideally want in memory. You can solve this by adding a 2nd server and then letting half your data live on server 1 and the other half live on server 2. This is sharding.

In order to shard your data across 2 or more servers, you need to choose how to pick the right server for a particular key. For example, if you’re updating a user with an id of 101, you need to choose an algorithm that will determine on which server user 101 is stored or held in memory. There are several strategies to do this [1]. For example, you could keep a lookup table of all user ids and their corresponding servers. Another popular approach uses a concept known as consistent hashing.

Consistent Hasing

A naive way to shard data is to mod the incoming id with the number of servers. For instance, with 6 servers, a user with id 9 is sharded to server 3 (9 % 6 = 3), while a user with id 12 is sharded to server 0 (12 % 6 = 0). This will work…that is, until you need to add another server. Then, a lot of your mapping is invalidated. Now, with 7 instead of 6 servers:

User 6 now shards to server 6 instead of server 0.

User 7 now shards to server 0 instead of server 1.

User 8 now shards to server 1 instead of server 2.

User 9 now shards to server 2 instead of server 3.

etc. (you get the point)

As you see, you’ll have to end up moving around a lot of data so that when you lookup user 9, it’ll be where you expect it to be (on server 2).

Consistent hashing to the rescue. It enables you to grow or shrink your server cluster with a LOT LESS re-mapping of data. This is how it works:

For each server in your list, hash EACH server to several “virtual points” on that ring. The result is a circle with numbered virtual points that link back to the server from which it was hashed.

Mapping a key (e.g., user id 9) to a server

Hash your key onto the same ring using the same hashing function that mapped the servers onto the ring. This produces a number that you can pinpoint on the ring.

Find the next biggest number on the ring that’s a virtual point.

The server linked to from that virtual point is the server to which the key maps.

How does this reduce the amount of data that you have to remap whenever you expand or contract your cloud? I’ve borrowed the following images from Tom White’s consistent hashing post in order to illustrate exactly how. For simplicity, we’ll assume each server is hashed to only one virtual point.

Suppose you have 3 servers (A, B, C) and 4 keys of interest (1, 2, 3, 4). If they hash according to the image below, then 2 will belong to B; 3 will belong to C; 4 and 1 will belong to A.

Now suppose you remove server C. This is what occurs in the below image. Then, the only data that has to be re-mapped is key 3, which now belongs to server D.

Simple enough, right? Except, if we leave this example, and move to a real implementation of consistent hashing, we hash each server to more than one virtual point. Why do we do that? Because it distributes the incoming keys more evenly across all servers. For example, in the most recent image, you can see geometrically that the arc B->D is longer than the arc D->A, and arc d->A is longer than the arc A->B. Therefore, if we assume over the long-run that keys will map uniformly over the ring, then more points (and therefore keys) will end up on D than on A, and more points will end up on A than on B. To avoid this unequal distribution, we hash each server to over 100 virtual points. In practice, this achieves an even distribution of keys among servers.

node-hash-ring

node-hash-ring brings consistent hashing to node.js in an easy-to-use module. I’ve implemented it as a C++ add-on — in other words, it’s a node.js wrapper around C++ code. Here’s an example:

You instantiate a HashRing instance with a hash that lists the servers and their weights. Weights work in the following way. Server 127.0.0.2:8080 will store approximately 2 times as much data as either of the other 2 servers listed. Weights are useful in scenarios where not all servers in your cluster have similar power (e.g., 127.0.0.2 might have 2x the memory as either 127.0.0.1 and 127.0.0.3).

My implementation is based on:

libketama, which was written and open sourced by the team at last.fm and which was used in production there.

I’ll be releasing a library that depends on this soon, my anticipated redis client for node.js (that will support transactions, pipelining, and a distributed api), so stay tuned to see how node-hash-ring is used with a database. If you end up using this in your node project, let me know in the comments below, and I can link to implementations that use this from my post. Happy coding!

[1] Max Indelicatos wrote an excellent guide covering sharding strategies on his blog.

4 Responses to “Towards Auto-sharding in Your Node.js App”

Thanks! My redis client, node-redis, is live at http://github.com/bnoguchi/redis-node, if you want to take it for a spin. I’m currently finishing up an object mapper around it and will be releasing blog posts about both soon after.

This doesn’t solve the problem of actually re-mapping the data though when you need to add a new server. Not an issue for a cache like memcached (which is what libketema was written for), but it is an issue for data storage.

One way around that (not a perfect fix, but it helps) would be to store your hash value in the data store too, and then you can pull out everything that needs re-mapped much more easily.