Just a few weeks ago, Amazon Web Services announced Failover DNS records in Route 53. I’ve played with them a little bit, and so far they’re amazing! They’re useful because an outage on your primary server no longer has to take your site offline. If the primary server goes down, Route 53 can fail over to a server in a different AWS region (or at another hosting provider). Here’s my quick start guide:

1. I’ve created 2 new EC2 micro instances and installed nginx on both. Ideally, you’d want to host these in two different AWS regions. Both of mine are in N. Virginia (oops!). I’ve customized the index.html files on each to display which server (Primary or Failover) they are running on. In this example, the primary server’s IP is 54.*.*.* and the failover is 50.*.*.*

2. Create a new Route 53 hosted zone and add your domain to it.

3. Create a Route 53 “Health Check” (the link for this is in the left side navigation column) that monitors the primary server (54.*.*.*). For this test, I am going to check that index.html is available over HTTP on port 80.

4. Next, create an A record (or CNAME) on the domain for the primary server. Set the TTL to something shorter than usual; AWS recommends 60 seconds so the switch happens as soon as possible. Set the Routing Policy to “Failover”. Since this is the primary server, I’ve selected “Primary” and attached the health check we created in step 3.

5. Finally, add the secondary A record (or CNAME) for the failover server. The steps are just like the primary, except that under Routing Policy I’ve selected “Secondary”. You do not need to apply a health check to this one. (This won’t work if you apply the same health check to both records.)
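
If you’d rather script steps 3 through 5 than click through the console, here is a rough sketch of what the request payloads could look like. It only builds plain dictionaries in the shape the Route 53 API expects; nothing is sent to AWS. The domain name, the IP addresses (documentation placeholders, not my real servers), the health check ID, and all three function names are assumptions made up for illustration.

```python
def health_check_config(ip, path="/index.html", port=80):
    """Settings for an HTTP health check against the primary server."""
    return {
        "IPAddress": ip,
        "Port": port,
        "Type": "HTTP",
        "ResourcePath": path,
        "RequestInterval": 30,  # seconds between probes (assumed value)
        "FailureThreshold": 3,  # consecutive failures before "unhealthy" (assumed value)
    }


def failover_record(name, ip, role, health_check_id=None, ttl=60):
    """Build one failover A record; role is "PRIMARY" or "SECONDARY"."""
    if role not in ("PRIMARY", "SECONDARY"):
        raise ValueError("role must be PRIMARY or SECONDARY")
    record = {
        "Name": name,
        "Type": "A",
        # Records sharing a name and type need distinct identifiers.
        "SetIdentifier": f"{name}-{role.lower()}",
        "Failover": role,
        "TTL": ttl,  # short TTL so resolvers pick up the switch quickly
        "ResourceRecords": [{"Value": ip}],
    }
    if health_check_id is not None:
        record["HealthCheckId"] = health_check_id
    return record


def failover_change_batch(name, primary_ip, failover_ip, health_check_id):
    """Primary record carries the health check; the secondary has none."""
    records = [
        failover_record(name, primary_ip, "PRIMARY", health_check_id),
        failover_record(name, failover_ip, "SECONDARY"),
    ]
    return {
        "Changes": [
            {"Action": "UPSERT", "ResourceRecordSet": r} for r in records
        ]
    }


if __name__ == "__main__":
    # Placeholder values only.
    batch = failover_change_batch(
        "example.com.", "203.0.113.10", "198.51.100.20", "hypothetical-check-id"
    )
    for change in batch["Changes"]:
        rrset = change["ResourceRecordSet"]
        print(rrset["Failover"], rrset["ResourceRecords"][0]["Value"])
```

With boto3 installed and credentials configured, the first dictionary would go to `create_health_check` (as `HealthCheckConfig`, alongside a unique `CallerReference`) and the batch to `change_resource_record_sets` (as `ChangeBatch`, alongside your `HostedZoneId`). I haven’t wired that part up here, so treat it as a starting point.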

Alright, everything is set up. Browsing to davekz.com shows us that we are hitting the primary server. Now let’s pretend something bad happens to the primary server. For this example, I’m going to stop the nginx service on the primary box.

If you were to quickly jump back to the site, you’d see that the request times out. Hopefully, after ~60 seconds, the health check has failed, the cached record has expired, and DNS now answers with the failover server’s address. www.whatsmydns.net is a great tool to check the propagation of the DNS change.
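
If you want to time the switch yourself instead of refreshing the browser, a small polling loop is one way to do it. This is a minimal sketch, not something I ran for this post: the function name is made up, the hostname and IP in the usage lines are placeholders, and the result depends on whatever your local resolver has cached.

```python
import socket
import time


def wait_for_failover(hostname, failover_ip, resolve=socket.gethostbyname,
                      timeout=300, interval=10, sleep=time.sleep):
    """Poll DNS until hostname resolves to failover_ip.

    Returns the approximate seconds slept before the switch was seen
    (lookup time itself is not counted), or None if timeout is exceeded.
    """
    waited = 0
    while waited <= timeout:
        try:
            if resolve(hostname) == failover_ip:
                return waited
        except OSError:
            # socket.gaierror is an OSError; treat a failed lookup
            # as "not switched yet" and keep polling.
            pass
        sleep(interval)
        waited += interval
    return None


if __name__ == "__main__":
    # Placeholder hostname and address; substitute your own.
    seconds = wait_for_failover("example.com", "198.51.100.20", timeout=30)
    if seconds is None:
        print("No switch seen within the timeout")
    else:
        print(f"Failover address seen after about {seconds} seconds")
```

The `resolve` and `sleep` parameters exist only so the loop can be exercised with fake values instead of real lookups and real waiting.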

10 Comments

Thanks for the article David, very informative… however, I have not been successful in setting up the failover between Primary and Secondary following a configuration almost identical to yours. I am attempting to have Primary and Secondary in different regions, but from a Route 53 perspective this shouldn’t have any impact on the configuration AFAIK. What is happening is that the Secondary is always the active server, even if the Primary is actually healthy. My health check appears to be correct… the only differences between yours and mine are the IPs, and that I set the path to / instead of /index.html. I can take the Primary URL that is composed by the health check and plug it into my browser, and the Primary site comes up fine, but if I try to browse by the real host name, whose IP is now determined by the Route 53 health check, the Secondary site always comes up. My best guess is that the health check is failing, but I have no idea why. I’ve tried Googling for an answer, but I guess the failover feature is still new enough that there aren’t many posts about people’s experiences using it yet. Anyway, if anybody has any suggestions, I’d appreciate you posting them!

Hmm… I’d guess there’s something wrong with your health check (but it’s hard to tell without seeing it). Does /index.php work? Doubt that will change anything, but it’s worth a shot. I can take a look if you’d like. Shoot me an email at (my first name which starts and ends in the letter “D”)@kryzaniak.com

Hello David, thanks a lot for the information you shared in this article. If you don’t mind, I would like to ask for your help. I’m planning to set up failover from one AWS account to another AWS account using Route 53. I followed your explanation, entering one Elastic IP as the primary and an Elastic IP from the other AWS account as the secondary, so that when the health check reports a failure the site goes to my secondary AWS account. But I still can’t get the failover to work properly. Could you please tell me how to configure it so that the failover works between the two accounts? Thank you very much, David.

Going to try this… sounds good, ty. One question: our site is a very, very busy social BuddyPress site running on WordPress multisite. How would we keep the two servers, as in the content in the DB, synced? And I don’t mean replicating. If you have a social site, e.g. FB, how would you create failsafe failover in AWS on EC2 then?

The other issue I have is that our EC2 instance already costs us a fair bit because we are running a larger server, so creating another instance would double our costs? Would really need some advice here.

That’s a tricky problem… Before I get into the technical option, have you considered CloudFlare?

First, I’m not sure Route53 is going to help much. Instead, you’re going to need an Elastic Load Balancer http://aws.amazon.com/elasticloadbalancing. This truly splits traffic between instances: 50% of the traffic goes to ‘EC2-One’ and 50% of the traffic goes to ‘EC2-Two’. Route53 Failover is more for “When EC2-One goes offline, switch all traffic to EC2-Two”.

Hi, excuse me, about the health checks: I don’t understand what counts as a “health check”. For example, if I need to check the HTTP status of a single server every 5 minutes, then there are 8640 health checks in a month. Do I need to pay 0.50 per health check per month, i.e. 4320 USD? Or is it only 1 health check because it’s the same server all the time? Thanks in advance.

The title is misleading: DNS updates alone will not result in zero downtime. Browsers, operating systems and other DNS servers cache DNS information all the time. Truly having zero downtime will involve some kind of high availability proxy, like HAProxy or a Cisco load balancer.

Seconding Dan Esparza’s comment.
DNS is not a high availability solution, it’s a fault tolerance solution. You will have downtime, and many applications (e.g. Java ones) that don’t refresh their DNS cache easily or don’t respect the TTL will probably crash. If you aim for zero downtime, you need a proxy load balancer. However, that LB may fail too.
If HA is a real concern, then you want to use Anycast (CloudFlare, Akamai, Google Global Load Balancer are options).

Agreed with the 2 comments above. Not every company respects the TTL on your DNS, and it will be some time before the original equipment stops receiving requests. One decent option is to fail over DNS and keep both systems live for 48 hours. After 48 hours, check how much traffic the old system is still receiving; that way you can possibly spare a large percentage of your customers from experiencing downtime. I dunno, what do you guys think of that solution?