Friday, November 23, 2012

Background: I have mentioned earlier that we were baffled by Amazon's Elastic Load Balancer (ELB). We tested Nginx as a reverse proxy, and it seems to be hands down better than ELB. (I will cover Nginx performance in another post.) To scale this setup horizontally, we configured multiple Nginx servers behind a Weighted Round Robin (WRR) DNS configuration, to take some load off each Nginx server.

We later decided to have a redundant Nginx server in an N+1 setup, so that if one server goes down the redundant one takes over. I stumbled upon Heartbeat. Unfortunately, there is no good tutorial on how this can be done on Amazon EC2.

Warning: I will keep this tutorial brief and limited to Heartbeat configuration, as the wiki mentions:

In order to be useful to users, the Heartbeat daemon needs to be combined with a cluster resource manager (CRM) which has the task of starting and stopping the services (IP addresses, web servers, etc.) that the cluster will make highly available. Pacemaker is the preferred cluster resource manager for clusters based on Heartbeat.

Goal: We need to make sure the auxiliary node takes over as soon as the main node dies: the Elastic IP (EIP) should move to the surviving node, and Nginx should keep serving from there.

This configuration is the same on both Nginx servers except for one thing: the add_header X-Whom node_1; part. This identifies which Nginx load balancer is serving, which will help us debug the setup later. On the second Nginx, we have add_header X-Whom node_2; instead. This line tells Nginx to inject an X-Whom header into each response.
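For reference, a minimal sketch of what such a server block might look like. The upstream name and backend addresses here are assumptions; only the add_header line comes from this setup:

upstream backend {
    server 10.0.0.11;  # hypothetical application servers
    server 10.0.0.12;
}

server {
    listen 80;
    add_header X-Whom node_1;  # node_2 on the second load balancer

    location / {
        proxy_pass http://backend;
    }
}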

Configure Heartbeat: There are three things you need to configure to get things working with Heartbeat.

ha.cf: /etc/ha.d/ha.cf is the main configuration file. It holds the list of nodes and the features to be enabled, and the directives in it are order sensitive.

authkeys: This maintains the security of the cluster; it provides a means to authenticate the nodes in a cluster.

haresources: The list of resources to be managed by Heartbeat. A line looks like preferredHost service1 service2, where preferredHost is the hostname on which you prefer the subsequent services to run. service1 and service2 are init scripts that live under /etc/init.d/ and stop and start their service when invoked as /etc/init.d/service1 start|stop.

When a node takes over, the services are started from left to right: service1 first, then service2. And when a node is brought down, they are stopped in reverse order: service2 first, then service1.

---------
For the sake of simplicity, let's call one node the 'main' node and the other the 'aux' node.
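A minimal sketch of /etc/ha.d/ha.cf is below. The node names follow the haresources example later in this post; the peer's private IP, the interface, and the exact timing values are assumptions, and ucast is used because EC2 does not support broadcast:

logfile /var/log/ha-log
debugfile /var/log/ha-debug
keepalive 2                   # seconds between heartbeats
deadtime 10
warntime 5
initdead 60
udpport 694
ucast eth0 10.144.75.86       # assumed private IP of the peer node
auto_failback on
node ip-10-144-75-85
node ip-10-144-75-86          # assumed hostname of the second node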

where:

deadtime: seconds after which a host is considered dead if not responding.
warntime: seconds after which the late Heartbeat warning will be issued.
initdead: seconds to wait for the other host after starting Heartbeat, before it is considered dead.
udpport: port number used for bcast/ucast communication. The default value is 694.
bcast/ucast: interface on which to broadcast/unicast.
auto_failback: if 'on', resources are automatically failed back to their primary node.
node: a node in the HA set-up, identified by uname -n.
----

/etc/ha.d/authkeys must be the same on both machines. You may generate a random auth secret key using date | md5sum.

authkeys on both machines:

auth 1
1 sha1 1e8d28a4627ed7f83faf1d57f5b11645
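
Heartbeat refuses to start unless authkeys is readable by root only, so tighten its permissions on both machines:

chmod 600 /etc/ha.d/authkeys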

----

/etc/ha.d/haresources

haresources on both machines:

ip-10-144-75-85 updateEIP nginx

updateEIP and nginx are shell scripts that I wrote and stored under /etc/init.d/.
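The actual updateEIP script is not reproduced here, but a minimal sketch of the idea follows, assuming the EC2 API tools are installed and credentials are configured; the EIP value is a placeholder:

#!/bin/sh
# Sketch of /etc/init.d/updateEIP: claim the shared Elastic IP on start.
# The EIP below is a placeholder; the instance ID is read from EC2 metadata.
EIP="203.0.113.10"
INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)

case "$1" in
  start)
    # Associating the EIP with this instance automatically pulls it
    # away from whichever node held it before.
    ec2-associate-address -i "$INSTANCE_ID" "$EIP"
    ;;
  stop)
    # Nothing to do: the node taking over will claim the EIP itself.
    ;;
  *)
    echo "Usage: $0 {start|stop}"
    exit 1
    ;;
esac
exit 0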

Test: Time to test. Note that Heartbeat is just watching nodes, not the Nginx service. So we need to make the heartbeat service on 'aux' believe that the 'main' server is down.

Keep tailing the debug file on the 'aux' machine (tail -f /var/log/ha-debug, assuming the debugfile path from ha.cf); this will show you the transition. Now it's time to kill the 'main' node. On main, do this:

service heartbeat stop
/etc/init.d/nginx stop

You can see in the debug file on 'aux' that 'aux' is taking over from 'main'. Now, to see how the service switches back to the main node when it comes up, start just the heartbeat service on main. You will see:

a. aux: the EIP gets detached (not literally detached; main simply claims it back).
b. aux: nginx stops.
c. main: the EIP is assigned.
d. main: Nginx is started.
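Since each load balancer stamps its own X-Whom header, you can also confirm which node is serving without logging in. A quick check against the EIP (the address is a placeholder):

curl -sI http://203.0.113.10/ | grep X-Whom   # 203.0.113.10 is a placeholder EIP

While 'main' is down this should print the aux node's header (X-Whom: node_2, assuming main carries node_1), and it should flip back once failback completes.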

Note: While we flip EIPs, you may get disconnected from your SSH session. So keep looking at the AWS web console to get the newly assigned public DNS of the node we revoked the EIP from, and use the EIP to connect to the node it is assigned to.

Conclusion: So we have the basic HA configuration ready. We need to add toppings to this setup to be able to monitor services and respond to their failures; for that, you need to add a cluster resource manager (CRM) layer, such as Pacemaker, on top.