From reading, it seems like DNS failover is not recommended just because DNS wasn't designed for it. But if you have two webservers on different subnets hosting redundant content, what other methods are there to ensure that all traffic gets routed to the live server if one server goes down?

To me it seems like DNS failover is the only failover option here, but the consensus is it's not a good option. Yet services like DNSmadeeasy.com provide it, so there must be merit to it. Any comments?

16 Answers

By 'DNS failover' I take it you mean DNS Round Robin combined with some monitoring, i.e. publishing multiple IP addresses for a DNS hostname, and removing a dead address when monitoring detects that a server is down. This can be workable for small, less trafficked websites.
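As a rough illustration of the monitoring half of this setup, here is a minimal Python sketch. The function names, the plain TCP-connect probe, and the addresses in the usage note are assumptions made for the example; pushing the resulting list to your DNS server is left out because it depends entirely on the provider or server software you use.

```python
import socket

def tcp_alive(ip, port=80, timeout=2.0):
    """Return True if a TCP connection to ip:port succeeds within timeout."""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        return False

def records_to_publish(pool, probe=tcp_alive):
    """Keep only the addresses that pass the probe.

    Assumption for this sketch: if every probe fails, the full pool is
    returned unchanged, since an empty record set would take the site
    down even when the monitor itself is the thing that is broken.
    """
    alive = [ip for ip in pool if probe(ip)]
    return alive or list(pool)
```

Calling `records_to_publish(["192.0.2.10", "198.51.100.20"])` on a schedule and publishing the result with a short TTL is the basic loop; the two addresses are documentation-range placeholders.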

By design, when you answer a DNS request you also provide a Time To Live (TTL) for the response you hand out. In other words, you're telling other DNS servers and caches "you may store this answer and use it for x minutes before checking back with me". The drawbacks come from this:

With DNS failover, an unknown percentage of your users will have your DNS data cached with varying amounts of TTL left. Until the TTL expires these may connect to the dead server. There are faster ways of completing failover than this.

Because of the above, you're inclined to set the TTL quite low, say 5-10 minutes. But setting it higher gives a (very small) performance benefit, and may help your DNS propagation work reliably even if there is a short glitch in network traffic. So using DNS based failover goes against high TTLs, but high TTLs are a part of DNS and can be useful.
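To put a rough number on the first drawback, here is a small back-of-the-envelope sketch in Python. It assumes that every cache honours the TTL and that, at the moment you change the record, the remaining lifetimes of cached copies are spread evenly between 0 and the full TTL. Both are simplifications (resolvers that ignore TTLs and the time your monitoring needs to notice the outage are not modelled), and the function name is made up for the example.

```python
def stale_fraction(ttl_seconds, elapsed_seconds):
    """Estimate the share of caches still holding the old answer.

    Assumes remaining cache lifetimes were uniformly distributed over
    [0, ttl_seconds] at the moment the record was changed.
    """
    if ttl_seconds <= 0:
        return 0.0
    return max(0.0, 1.0 - elapsed_seconds / ttl_seconds)
```

Under those assumptions, a 300 second TTL leaves about half of the caches with the old answer 150 seconds after the change, and none after 300 seconds.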

The more common methods of getting good uptime involve:

Placing servers together on the same LAN.

Placing the LAN in a datacenter with highly available power and network planes.

Using an HTTP load balancer to spread load and fail over on individual server failures.

Getting the level of redundancy / expected uptime you require for your firewalls, load balancers and switches.

Having a communication strategy in place for full-datacenter failures, and the occasional failure of a switch / database server / other resource that cannot easily be mirrored.

A very small minority of web sites use multi-datacenter setups, with 'geo-balancing' between datacenters.

I think he's specifically trying to manage failover between two different data centres (note the comments about different subnets), so placing the servers together/using load balancers/extra redundancy isn't going to help him (apart from redundant data centres, but you still need to tell the internet to go to the one that's still up).
–
Cian, Aug 30 '09 at 23:22

There are lots of different stories here being used as justification for casting RR DNS in a bad light. Regarding shuffling - the objective here is to support clients which don't properly implement the resolver with no net impact on those which do. Short TTLs don't work. RR DNS does work for browsers as clients, failover occurs in seconds not minutes or hours.
–
symcbean, Sep 23 '14 at 15:55

The issue with DNS failover is that it is, in many cases, unreliable. Some ISPs will ignore your TTLs, it doesn't happen immediately even if they do respect your TTLs, and when your site comes back up, it can lead to some weirdness with sessions when a user's DNS cache times out, and they end up heading over to the other server.

Unfortunately, it is pretty much the only option, unless you're large enough to do your own (external) routing.

DNS failover definitely works great. I have been using it for many years to manually shift traffic between datacenters, or automatically when monitoring systems detected outages, connectivity issues, or overloaded servers. When you see the speed at which it works, and the volumes of real world traffic that can be shifted with ease - you'll never look back. I use Zabbix for monitoring all of my systems and the visual graphs that show what happens during a DNS failover situation put all my doubts to an end. There may be a few ISPs out there that ignore TTLs, and there are some users still out there with old browsers - but when you are looking at traffic from millions of page views a day across 2 datacenter locations and you do a DNS traffic shift - the residual traffic coming in that ignores TTLs is laughable. DNS failover is a solid technique.

DNS was not designed for failover - but it was designed with TTLs that work amazingly for failover needs when combined with a solid monitoring system. TTLs can be set very short. I have effectively used TTLs of 5 seconds in production for lightning fast DNS failover based solutions. You have to have DNS servers capable of handling the extra load - and named won't cut it. However, powerdns fits the bill when backed with replicated mysql databases on redundant name servers. You also need a solid distributed monitoring system that you can trust for the automated failover integration. Zabbix works for me - I can verify outages from multiple distributed Zabbix systems almost instantly - update mysql records used by powerdns on the fly - and provide nearly instant failover during outages and traffic spikes.

But hey - I built a company that provides DNS failover services after years of making it work for large companies. So take my opinion with a grain of salt. If you want to see some zabbix traffic graphs of high volume sites during an outage - to see for yourself exactly how well DNS failover works - email me, I'm more than happy to share.

It's my experience that short TTLs DO NOT WORK across the internet. You might be running DNS servers that respect the RFCs - but there are a lot of servers out there which don't. Please don't assume this is an argument against Round Robin DNS - see also vmiazzo's answer below - I've run busy sites using RR DNS and tested it - it works. The only problems I encountered were with some Java based clients (not browsers) which didn't even try to reconnect on failure, let alone cycle the list of hosts on an RST.
–
symcbean, Sep 23 '14 at 15:50

The prevalent opinion is that with DNS RR, when an IP goes down, some clients will continue to use the broken IP for minutes. This was stated in some of the previous answers to the question and it is also written on Wikipedia.

The use of multiple A records is not a trick of the trade, or a feature conceived by load balancing equipment vendors. The DNS protocol was designed with support for multiple A records for this very reason. Applications such as browsers and proxies and mail servers make use of that part of the DNS protocol.

Maybe some expert can comment and give a more clear explanation of why DNS RR is not good for high availability.

Thanks,

Valentino

PS: sorry for the broken link but, as new user, I cannot post more than 1

Multiple A records are designed in, but for load balancing, rather than for fail over. Clients will cache the results, and continue using the full pool (including the broken IP) for a few minutes after you change the record.
–
Cian, Sep 29 '09 at 10:10


So, is what is written on crypto.stanford.edu/dns/dns-rebinding.pdf chapter 3.1 false? <<Internet Explorer 7 pins DNS bindings for 30 minutes. Unfortunately, if the attacker’s domain has multiple A records and the current server becomes unavailable, the browser will try a different IP address within one second.>>
–
vmiazzo, Sep 29 '09 at 14:08

There are a bunch of people that use us (Dyn) for failover. It's the same reason sites can either do a status page when they have downtime (think of things like Twitter's Fail Whale)...or simply just reroute the traffic based on the TTLs. Some people may think that DNS Failover is ghetto...but we seriously designed our network with failover from the beginning...so that it would work as well as hardware. I'm not sure how DME does it, but we have 3 of our 17 anycasted PoPs, the ones closest to your server, monitor it. When two of the three detect that it's down, we simply reroute the traffic to the other IP. The only downtime is for those who had already requested the record, for the remainder of that TTL interval.
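The "two of the three" rule described above is a simple quorum vote. A minimal Python sketch of that idea follows; it is not Dyn's actual implementation, and the function names, the monitor location labels, and the addresses are all hypothetical.

```python
def should_fail_over(probe_results, required_down=2):
    """Decide whether enough monitors agree that the server is down.

    probe_results maps a monitor location to True if the server looked
    up from there, False if it looked down.
    """
    down_votes = sum(1 for is_up in probe_results.values() if not is_up)
    return down_votes >= required_down

def active_address(primary, secondary, probe_results, required_down=2):
    """Return the address to publish given the current monitor votes."""
    if should_fail_over(probe_results, required_down):
        return secondary
    return primary
```

Requiring agreement from more than one location means a single monitor with a broken network path cannot trigger a failover on its own.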

Some people like to use both servers at once...and in that case can do something like a round robin load balancing...or geo based load balancing. For those that actually care about the performance... our real time traffic manager will monitor each server...and if one is slower...reroute the traffic to the fastest one based on what IPs you link in your hostnames. Again...this works based on the values you put in place in our UI/API/Portal.

I guess my point is...we engineered dns failover on purpose. While DNS wasn't made for failover when it originally was created...our DNS network was designed to implement it from the get go. It usually can be just as effective as hardware..without depreciation or the cost of hardware. Hope that doesn't make me sound lame for plugging Dyn...there are plenty of other companies that do it...I'm just speaking from our team's perspective. Hope this helps...

I ran DNS RR failover on a production moderate-trafficked but business-critical website (across two geographies) for many years.

It works fine, but there are at least three subtleties I learned the hard way.

1) Browsers will failover from a non-working IP to a working IP after 30 seconds (last time I checked) if both are considered active in whatever cached DNS is available to your clients. This is basically a good thing.

But having "half" your users wait 30 seconds is unacceptable, so you will probably want to update your TTL records to be a few minutes, not a few days or weeks so that in case of an outage, you can rapidly remove the down server from your DNS. Others have alluded to this in their responses.

2) If one of the nameservers serving your round-robin domain goes down (or one of your two geographies goes down entirely), and it happens to be the primary, I vaguely recall you can run into other issues trying to remove that downed nameserver from DNS if you have not also set the SOA TTL/expiration for the nameserver to a sufficiently low value. I could have the technical details wrong here, but there is more than just one TTL setting that you need to get right to really defend against single points of failure.

3) If you publish web APIs, REST services, etc, those are typically not called by browsers, and thus in my opinion DNS failover starts to show real flaws. This may be why some say, as you put it, "it is not recommended". Here's why I say that. First, the apps that consume those URLs typically are not browsers, so they lack the 30-second failover properties/logic of common browsers. Second, whether or not the second DNS entry is called or even DNS is re-polled depends very much on the low-level programming details of networking libraries in the programming languages used by these API/REST clients, plus exactly how they are called by the API/REST client app. (Under the covers, does the library call get_addr, and when? If sockets hang or close, does the app re-open new sockets? Is there some sort of timeout logic? etc etc)
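For a non-browser client, the behaviour in question has to be written (or at least chosen) explicitly. A minimal Python sketch of a client that walks every address returned for a hostname could look like the following. The function name and the injectable `resolve`/`connect` parameters are choices made for this example; `socket.getaddrinfo` and `socket.create_connection` are the standard library calls they default to.

```python
import socket

def connect_with_failover(host, port, timeout=3.0,
                          resolve=socket.getaddrinfo,
                          connect=socket.create_connection):
    """Try each resolved address in order and return the first socket that connects."""
    last_error = None
    # A fresh lookup on every call, so a changed record is picked up
    # as soon as the local resolver's cache allows it.
    for _family, _type, _proto, _canonname, sockaddr in resolve(
            host, port, type=socket.SOCK_STREAM):
        try:
            # sockaddr[:2] is (ip, port) for both IPv4 and IPv6 results.
            return connect(sockaddr[:2], timeout=timeout)
        except OSError as exc:
            last_error = exc  # remember the failure, move on to the next address
    raise ConnectionError(
        f"no address for {host!r} accepted a connection") from last_error
```

The short per-address timeout is what keeps the wait on a dead server bounded. Python's own `socket.create_connection` performs a similar walk when given a hostname; the loop is spelled out here to show what a client library needs to do. A client that resolves once at startup and keeps reusing a single address gets none of this.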

It's cheap, well-tested, and "mostly works". So as with most things, your mileage may vary.

One option for multi data-center failover is to train your users. We advertise to our customers that we provide multiple servers in multiple cities and in our signup emails and such include links directly to each "server" so that users know if one server is down they can use the link to the other server.

This totally bypasses the issue of DNS failover by just maintaining multiple domain names. Users who go to www.company.com or company.com and login get directed to server1.company.com or server2.company.com and have the choice of bookmarking either of those if they notice they get better performance using one or the other. If one goes down users are trained to go to the other server.

The alternative is a BGP based failover system. It's not simple to set up, but it should be bullet proof. Set up site A in one location and site B in a second, both with local IP addresses, then get a class C or other block of IPs that are portable and set up redirection from the portable IPs to the local IPs.

There are pitfalls, but it's better than DNS based solutions if you need that level of control.

I've been using DNS failover to protect our company website for a few years now. I've never had any issues with TTL and from my tests the 30 second TTL works great. As soon as the TZOHA monitors detect the server down, it instantly switches the DNS record to the live server. I was leery about moving our DNS but after speaking with the TZO sales rep for their failover service and seeing some reviews about this technology, I had faith and it hasn't let me down.

DNS wasn't designed to do this but integrating ideas and technology together often solves many problems. I'm a happy customer of DNS failover using TZO and won't be spending thousands of dollars on hardware devices and training!

Advertise much? That aside, others have pretty well covered why this might work for some, and why you're taking your chances using it for most production environments (though it's better than nothing).
–
Chris S, Jun 23 '10 at 15:46

I've been using DNS based site-balancing and failover for the last ten years, and there are some issues, but those can be mitigated. BGP, while superior in some ways is not a 100% solution either with increased complexity, probably additional hardware costs, convergence times, etc...

I've found combining local (LAN based) load balancing, GSLB, and cloud based zone hosting is working quite well to close up some of the issues normally associated with DNS load balancing.

"and why you're taking your chances using it for most production environments (though it's better than nothing)."

Actually, "better than nothing" is better expressed as "the only option" when the presences are geographically diverse. Hardware load balancers are great for a single point of presence, but a single point of presence is also a single point of failure.

There are plenty of big-dollar sites that use DNS based traffic manipulation to good effect. They are the type of sites that know on an hourly basis if sales are off. It would seem that they are the last to be up for "taking your chances using it for most production environments". Indeed, they have reviewed their options carefully, selected the technology, and pay well for it. If they thought something was better they would leave in a heartbeat. The fact that they still choose to stay speaks volumes about real world usage.

DNS based failover does suffer from a certain amount of latency. There is no way around it. But it is still the only viable approach to failover management in a multi-pop scenario. As the only option, it is far more than "better than nothing".

Another option would be to set up name server 1 in location A and name server 2 in location B, but set each one up so all A records on NS1 point traffic to IPs for location A, and on NS2 all A records point to IPs for location B. Then set your TTLs to a very low number, and make sure your domain record at the registrar has been set up for NS1 and NS2. That way, it will automatically load balance, and fail over should one server or one link to a location go down.

I've used this approach in a slightly different way. I have one location with two ISPs and use this method to direct traffic over each link. Now, it may be a bit more maintenance than you're willing to do... but I was able to create a simple piece of software that automatically pulls NS1 records, updates A record IP addresses for select zones, and pushes those zones to NS2.

Don't the nameservers take too long to propagate? If you change a DNS record with low TTL it will work instantly, but when you change nameserver it will take 24 hours or more to propagate, hence I don't see how this could be a failover solution.
–
Marco Demaio, Jan 27 '14 at 16:59

All of these answers have some validity to them, but I think it really depends on what you are doing and what your budget is. Here at CloudfloorDNS, a large percentage of our business is DNS, and offering not only fast DNS, but low TTL options and DNS failover. We wouldn't be in business if this didn't work and work well.

If you are a multinational corporation with unlimited budget on uptime, yeah, the hardware GSLB load balancers and tier 1 datacenters are great, but your DNS still needs to be fast and rock solid. As many of you know, DNS is a critical aspect of any infrastructure. Other than the domain name itself, it's the lowest level service that every other part of your online presence rides on. Starting with a solid domain registrar, DNS is just as critical as not letting your domain expire. If DNS goes down, it means the whole online aspect of your organization is also down!

When using DNS Failover, the other critical aspects are server monitoring (always check from multiple geo locations, and always have at least 3 of them checking, to avoid false positives) and managing the DNS records properly once a failure is detected. Low TTLs and some options with the failover can make this a seamless process, and beats the heck out of waking up to a pager in the middle of the night if you are a sys admin.

Overall, DNS Failover really does work and can be very affordable. In most cases from us or most of the managed DNS providers you'll get Anycast DNS along with Server monitoring and failover for a fraction of the cost of hardware options.

So the real answer is yes, it works, but is it for everyone and every budget? Maybe not, but until you try it and do the tests for yourself, it's tough to ignore if you are a small to medium business with a limited IT budget that wants the best uptime possible.

They cover: failover, global load balancing, and a host of related matters.

If your backend architecture permits it, the better option is global load balancing with the failover option. That way, all of the servers and bandwidth are in play as much as possible. Rather than inserting an additional available server on failure, this setup withdraws a failed server from service until it is recovered.

The short answer: it works, but you have to understand the limitations.

I would recommend that you either A, select a datacenter that is multihomed on its own AS, or B, host your name servers in a public cloud. It is REALLY unlikely that EC2, or HP, or IBM will go down. Just a thought. While DNS works as a fix, it is simply a fix to a poor design in the network foundation in this case.

Another option, depending on your environment, is to use a combination of IP SLA, PBR and FHRP to accomplish your redundancy needs.