RabbitMQ in Multiple AWS Availability Zones

When working with AWS, in order to have a highly available setup, one must have instances in more than one availability zone (AZ ≈ data center). If one AZ dies (which may happen), your application should continue serving requests.

It’s simple to set up your application nodes in multiple AZs (if they are properly written to be stateless), but it’s trickier for databases, message queues and everything else that has state. So let’s see how to configure RabbitMQ. The first steps are relevant not only to RabbitMQ, but to any persistent data solution.

First (no matter whether using CloudFormation or manual setup), you must:

Have a VPC. It might be possible without a VPC, but I can’t guarantee that, especially the DNS hostnames part, as discussed below

Declare private subnets (for each AZ)

Declare the RabbitMQ autoscaling group (recommended to have one) to span multiple AZs, using:

    "AvailabilityZones" : {
        "Fn::GetAZs" : {
            "Ref": "AWS::Region"
        }
    }

Declare the RabbitMQ autoscaling group to span multiple subnets using the VPCZoneIdentifier property

Declare the LoadBalancer in front of your RabbitMQ nodes (that is the easiest way to ensure even distribution of load to your Rabbit cluster) to span all the subnets

Declare the LoadBalancer with "CrossZone": true
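
The checklist above can be sketched as a template fragment. This is a minimal illustration rather than a complete template: it builds the "Resources" entries as a Python dict so they can be dumped to JSON. The logical names (RabbitVpc, RabbitLoadBalancer, RabbitGroup, RabbitLaunchConfig), the subnet names passed in, the CIDR block, the group sizes and the port are all assumptions made up for the example. The subnets and the launch configuration are referenced but not declared here.

```python
import json


def rabbitmq_template_fragment(subnet_refs, amqp_port=5672):
    """Build a CloudFormation "Resources" fragment as a plain dict.

    subnet_refs: logical names of the private subnets (one per AZ),
    assumed to be declared elsewhere in the same template.
    """
    subnets = [{"Ref": name} for name in subnet_refs]
    return {
        # VPC with DNS hostnames enabled (RabbitMQ relies on host names).
        "RabbitVpc": {
            "Type": "AWS::EC2::VPC",
            "Properties": {
                "CidrBlock": "10.0.0.0/16",  # hypothetical address range
                "EnableDnsSupport": True,
                "EnableDnsHostnames": True,
            },
        },
        # Internal load balancer spanning all subnets, cross-zone enabled.
        "RabbitLoadBalancer": {
            "Type": "AWS::ElasticLoadBalancing::LoadBalancer",
            "Properties": {
                "Scheme": "internal",
                "Subnets": subnets,
                "CrossZone": True,
                "Listeners": [{
                    "LoadBalancerPort": str(amqp_port),
                    "InstancePort": str(amqp_port),
                    "Protocol": "TCP",
                }],
            },
        },
        # Autoscaling group spanning the same subnets.
        "RabbitGroup": {
            "Type": "AWS::AutoScaling::AutoScalingGroup",
            "Properties": {
                "VPCZoneIdentifier": subnets,
                "LoadBalancerNames": [{"Ref": "RabbitLoadBalancer"}],
                # Hypothetical launch configuration, declared elsewhere.
                "LaunchConfigurationName": {"Ref": "RabbitLaunchConfig"},
                "MinSize": "2",
                "MaxSize": "4",
            },
        },
    }
```

Calling json.dumps(rabbitmq_template_fragment(["PrivateSubnetA", "PrivateSubnetB"]), indent=2) produces JSON that can be pasted under "Resources" and adjusted to your own names.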

Then comes the RabbitMQ-specific configuration. Generally, you have two options: clustering and federation.

Clustering is not recommended over a WAN, but the connection between availability zones can be viewed (maybe a bit optimistically) as a LAN. (This detailed post assumes otherwise, but this thread hints that using a cluster over multiple AZs is fine.)

With federation, you declare your exchanges to send all messages they receive to another node’s exchange. This is pretty useful in a WAN, where network disconnects are common and speed is not so important. But it may still be applicable in a multi-AZ scenario, so it’s worth investigating. Here is an example, with exact commands to execute, of how to achieve that using the federation plugin. The tricky part with federation is auto-scaling: whenever you add a new node, you have to modify the configuration of (some of) your existing nodes in order to set the new node as their upstream. You may also need to allow other machines to connect to RabbitMQ as guest ([{rabbit, [{loopback_users, []}]}] in your RabbitMQ configuration file), or find a way to configure a custom username/password pair for federation to work.
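
To make the auto-scaling problem concrete, here is a rough sketch of what has to be re-run on a node whenever the set of peers changes. It only builds the rabbitmqctl argument lists and does not execute them. The set_parameter federation-upstream and set_policy commands are the ones the federation plugin uses. The user name, password, upstream naming scheme, policy name and exchange pattern are placeholders invented for the example.

```python
import json


def federation_commands(peer_hosts, user="federation", password="change-me",
                        exchange_pattern="^app\\."):
    """Return rabbitmqctl argument lists that federate with every peer.

    peer_hosts: host names of the other RabbitMQ nodes (the upstreams).
    The credentials and the exchange pattern are hypothetical defaults.
    """
    commands = []
    for host in peer_hosts:
        # One upstream per peer; special characters in the password
        # would need URL-encoding, which this sketch does not do.
        uri = "amqp://{}:{}@{}".format(user, password, host)
        commands.append([
            "rabbitmqctl", "set_parameter", "federation-upstream",
            "upstream-" + host, json.dumps({"uri": uri}),
        ])
    # A single policy that federates matching exchanges with all upstreams.
    commands.append([
        "rabbitmqctl", "set_policy", "--apply-to", "exchanges",
        "federate-exchanges", exchange_pattern,
        json.dumps({"federation-upstream-set": "all"}),
    ])
    return commands
```

Each returned list can be passed to subprocess.check_call on the node itself. The point of the sketch is that this has to happen on the existing nodes every time the autoscaling group adds an instance, which is what makes federation awkward here.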

With clustering, it’s a bit different, and in fact simpler to set up. All you have to do is write a script that automatically joins a cluster on startup. This might be a shell script or a Python script using the AWS SDK. The main steps in such a script (which, yeah, frankly, isn’t that simple) are:

Find all running instances in the RabbitMQ autoscaling group (using the AWS API filtering options)

If this is the first node (the order is random and doesn’t matter), assume it’s the “seed” node for the cluster and all other nodes will connect to it

If this is not the first node, connect to the first node using rabbitmqctl join_cluster rabbit@{node}, where {node} is the instance’s private DNS name (available through the SDK)

Stop RabbitMQ while doing all the configuration, and start it after you are done
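
The steps above could look roughly like the following sketch, assuming boto3 as the AWS SDK. It is not a hardened script. Treating "first node" as "no other running node exists yet" is my interpretation, and sorting the peers so the choice is repeatable is an added assumption. The function names are made up, pagination of the API response is ignored, and several nodes booting at the same moment are not handled.

```python
import socket
import subprocess


def running_private_dns_names(group_name, region):
    """Private DNS names of running instances in the autoscaling group."""
    import boto3  # imported lazily so the helper below works without the SDK
    ec2 = boto3.client("ec2", region_name=region)
    response = ec2.describe_instances(Filters=[
        {"Name": "tag:aws:autoscaling:groupName", "Values": [group_name]},
        {"Name": "instance-state-name", "Values": ["running"]},
    ])
    return [instance["PrivateDnsName"]
            for reservation in response["Reservations"]
            for instance in reservation["Instances"]]


def plan_cluster_join(own_dns, group_dns_names):
    """Return the rabbitmqctl commands this node should run, in order.

    An empty list means there is no other node yet, so this node is
    the seed and has nothing to join.
    """
    # Keep only the ip-xxx-xxx-xxx-xxx part of each name.
    own = own_dns.split(".")[0]
    peers = sorted({name.split(".")[0] for name in group_dns_names} - {own})
    if not peers:
        return []
    return [
        ["rabbitmqctl", "stop_app"],  # the app must be stopped before joining
        ["rabbitmqctl", "join_cluster", "rabbit@" + peers[0]],
        ["rabbitmqctl", "start_app"],
    ]


def join_cluster_on_startup(group_name, region):
    """Look up the group members and run the planned commands."""
    names = running_private_dns_names(group_name, region)
    for command in plan_cluster_join(socket.gethostname(), names):
        subprocess.check_call(command)
```

On an instance this would be called once at boot, for example join_cluster_on_startup("rabbitmq-group", "eu-west-1"), where the group name is whatever your autoscaling group is actually called.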

In all cases (clustering or federation), RabbitMQ relies on domain names. The easiest way to make it work is to enable DNS hostnames in your VPC: "EnableDnsHostnames": true. There’s a little hack here when it comes to joining a cluster: the AWS API may return the fully qualified domain name, which includes something like “.eu-west-1.compute.internal” in addition to the ip-xxx-xxx-xxx-xxx part. So when joining the RabbitMQ cluster, you should strip this suffix, otherwise it doesn’t work.
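
Stripping the suffix is a one-liner. A small sketch, with a helper name of my own choosing, that turns whatever the API returns into the node name passed to rabbitmqctl join_cluster:

```python
def rabbit_node_name(private_dns_name):
    """Turn a private DNS name into a RabbitMQ node name.

    Keeps only the first label (the ip-xxx-xxx-xxx-xxx part), so both
    the fully qualified and the short form give the same result.
    """
    short_name = private_dns_name.strip().split(".")[0]
    if not short_name:
        raise ValueError("empty host name")
    return "rabbit@" + short_name
```

For example, rabbit_node_name("ip-10-0-1-5.eu-west-1.compute.internal") gives "rabbit@ip-10-0-1-5".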

The end result should be a cluster that keeps functioning properly when a node dies and another one is spawned (by the auto-scaling group).

Comparing the two approaches with PerfTest yields better throughput for the clustering option: about a third fewer messages were processed with federation, and latency was also a bit higher. The tests should be executed from an application node, against the RabbitMQ ELB (otherwise you are testing just one node). You can get PerfTest and run it with the amqp address set to the DNS name of the RabbitMQ load balancer.

Which of the two approaches you pick depends on your particular case, but I would generally recommend the clustering option. It is a bit more performant and a bit easier to set up and to support in a cloud environment, where nodes spawn and die often.