After some head scratching I noticed that the problem was only arising on those proxies that explicitly defined the IP address of a virtual interface that was being managed by Keepalived (or maybe Heartbeat for you).

This is because both of these High-Availability clustering systems use a rather simplistic design whereby the “shared” virtual IP is only installed on the active node in the cluster. While the nodes that are in a dormant state (i.e. the slaves) do not actually have those virtual IPs assigned to them during that state. It’s a sort of “IP address hot-swapping” design. I learnt this by executing a simple a command, first from the master server:

Unfortunately this behaviour can cause problems for programs like HA-Proxy which have been configured to expect the existence of specific network interfaces on the server. I was considering working around it by writing some scripts that hook events within the HA cluster to handle stopping and starting the HA-Proxy when needed. But this approach seemed clunky and unintuitive. So I dug a little deeper and came across a bit of a gem hidden away in the depths of the Linux networking stack. It is a simple boolean setting called “net.ipv4.ip_nonlocal_bind” and it allows a program like HA-Proxy to create listening sockets on network interfaces that do not actually exist on the server. It was created specially for this situation.

So in the end the fix was as simple as adding/updating the /etc/sysctl.conf file to include the following key/value pair:

net.ipv4.ip_nonlocal_bind=1

My previous experience of setting up these low-level High-Availability clusters was with Windows Server’s feature called Network Load Balancing (NLB). This works quite different from Keepalived and Heartbeat. It relies upon some low level ARP hacking/trickery and some sort of distributed time splicing algorithm. But it does ensure that each node in the cluster (whether in a master or slave position) will remain allocated with the virtual IP address(es) at all times. I suppose there is always more than one way to crack an egg…

Something I have learnt (and re-learnt) too many times to count is that one of the strange wonders of working for a startup company is that the most bizarre tasks can land on your lap seemingly with no warning.

We’ve recently been doing a big revamp of our data centre environment, including two shiny new Hyper-V hosts, a Sonicwall firewall and the decommissioning of lots of legacy hardware that doesn’t support virtualisation. As part of all this work we needed to put in place several capabilities for routing application requests on our SaaS platform:

Expose HTTP/80 and HTTPS/443 endpoints on the public web and route incoming requests based upon URL to specific (and possibly many) private internal servers.

Expose a separate and “special” TCP 443 endpoint (on public web) that isn’t really HTTPS at all but will be used for tunnelling of our TCP application protocol. We intend to use this when we acquire pilot programme customers that don’t want the “hassle” of modifying anything on their network firewalls/proxies. Yes, really. Even worse, it will inspect the source IP address and, from that, determine what customer it is and then route it to the appropriate private internal server and port number.

Expose various other TCP ports on public web and map these (in as traditional “port map” style as possible) directly to one specific private internal server.

Be easy to change the configuration and be scriptable, so we can tick off the “continuous deployment” check box.

Configuration changes must never tamper with existing connections.

Optional bonus, be source controllable.

My first suggestion was that we would write some PowerShell scripts to access the Sonicwall firewall through SSH and control its firewall tables directly. This was the plan for several months in fact, whilst everything was getting put in place inside the data centre. I knew full well it wouldn’t be easy. First there was some political issues inside the company with regard to a developer (me) having access to a central firewall. Second, I knew that creation and testing of the scripts would be difficult and that the whole CLI on the Sonicwall would surely not be as good as a Cisco.

I knew I could achieve #1 and #3 easily on a Sonicwall, like with any router really. But #2 was a little bit of an unknown as, frankly, I doubted if a Sonicwall could do it without jumping through a ton of usability hoops. #4 and #6 were the greatest unknown. I know you can export a Sonicwall’s configuration from the web interface. But it comes down as a binary file; which sort of made me doubt whether the CLI could do it properly as some form of text file. And of course if you can’t get the configuration as a text file then it’s not really going to be truly source controllable either, so that’s #6 out.

Fortunately an alternative (and better!) solution presented itself in the form of HA-Proxy. I’ve been hearing more and more positive things about this over the past couple years: most notably from the Stack Exchange. And having recently finally shed my long-time slight phobia of Linux, I decided to have a go at setting it up this weekend on a virtual machine.

The only downside was that as soon as you move some of your routing decisions away from your core firewall then you start to get a bit worrisome about server failure. So naturally we had to ensure that whatever we came up with involving HA-Proxy can be deployed as a clustered master-master or master-slave style solution. That would mean that if our VM host “A” had a failure then Mr Backup over there, “B”, could immediately take up the load.

It seems that Stack Exchange chose to use the Linux-HA Heartbeat system for providing their master-slave cluster behaviour. In the end we opted for Keepalived instead. It is more or less the same thing except that it’s apparently more geared towards load balancers and proxies such as HA-Proxy. Whereas Heartbeat is designed more for situations where you only ever want one active server (i.e. master-slave(s)). Keepalived just seems more flexible in the event that we decide to switch to a master-master style cluster in the future.

HA-Proxy Configuration

Here’s the basic /etc/haproxy/haproxy.conf that I came up with to meet requirements #1, #2 and #3.

It is rather straight forward as far as other Keepalived configurations go. It is effectively no different to a Windows Server Network Load Balancing (NLB) deployment, with the right options to give the master-slave behaviour. Note the only reason I’ve specified two virtual IP addresses is because I need to use the TCP port 443 twice (for different purposes). These will be port mapped on the Sonicwall to different public IP addresses, of course.

Mercurial, auto-propagation script for haproxy.conf

#!/bin/sh
cd /etc/haproxy/
#
# Check whether remote repo contains new changesets.
# Otherwise we have no work to do and can abort.
if hg incoming; then
#
#
echo "The HA-Proxy remote repo contains new changesets. Pulling changesets..."
hg pull
#
# Update to the working directory to latest revision.
echo "Updating HA-Proxy configuration to latest revision..."
hg update -C
#
# Re-initialize the HA-Proxy by informing the running instance
# to close its listen sockets and then load a new instance to
# recapture those sockets. This ensures that no active
# connections are dropped like a full restart would cause.
echo "Reloading HA-Proxy with new configuration..."
/etc/init.d/haproxy reload
else
echo "The HA-Proxy local repo is already up to date."
fi

I turned the whole /etc/haproxy/ directory into a Mercurial repository. The script above was also included in this directory (to gain free version control!), called sync-haproxy-conf.sh. I cloned this repository onto our central Mercurial master server.

It is then just a case of setting up a basic “* * * * * /etc/haproxy/sync-haproxy-conf.sh” cronjob so that the script above gets executed every minute (don’t worry it’s not exactly going to generate much load).

This is very cool because we can use the slave HA-Proxy server as a sort of testing ground of sorts. We can modify the config on that server quite a lot and test against it (by connecting directly to it’s IP rather than the clustered/virtual IP provided by Keepalived). Then once we’ve got the config just right we can commit it to the Mercurial repository and then push the changeset(s) to the master server. Within 60 seconds then the other server (or servers, in your case possibly!) will then run the synchronisation script.

One very neat thing about the newer versions of HA-Proxy (I deployed version 1.4.11) is that they have an /etc/init.d script that already includes everything you need for doing configuration file rebinds/reloads. This is great because what actually happens is that HA-Proxy will send a special signal to the old process so that it stops listening on the front-end sockets. Then it will attempt to start the new instance based upon the new configuration. If this fails it will send another signal to the “old”, but now resurrected process, that it can resume listening. Otherwise the old process will eventually exit once all its existing client connections have ended. This is brilliant because it meets and rather elegantly exceeds exceeds our expectations for requirement #5.

The fact that our HA-Proxy’s will contain far more meticulous configuration details than even our Sonicwall, I think that this solution based upon Mercurial is simply brilliant. We have what is effectively a test and slave server all-in-one, and a hg revert or hg rollback command is of course only a click away.

It’s still a work in progress but so far I’m very pleased with the progress with HA-Proxy.