Imminent Death of the Net Predicted

Warning: the following is excessively technical, and is intended more for the sake of the next poor sod who types "vista dns round robin resolution" into Google than it is for my actual friends list. (Except for a few of you. And you know who you are.) Also, since I want this to be searchable on Google, I can't friends-lock it, so I'm not going to mention who I work for; please don't do so in comments, which are screened for that reason.

So, our website is run out of three separate hosting centres, each of which has its own statically routed IP block (a couple of /27s and a /26). Site 1's block is under 213.x.x.x, site 2 under 146.x.x.x and site 3 under 80.x.x.x. We use F5's Global Traffic Manager (GTM) on BigIP hardware to spread the user load (up to 22 million pages per day) across the three sites (and also Local Traffic Manager to load balance across individual web servers within each site). We cache some user data in the web application, so we use cookies (with an eight-hour lifetime) to identify which server you're currently on, and we send you back to that server where possible, even if it's at a different site (we have high-bandwidth private links between the sites).

To improve our resilience, GTM is configured to return two A records for each DNS request for hostname www.<domain>.co.uk, with equal weighting between the three sites. This is to enable some web browsers to fail over more quickly to another site if one site fails for some reason. We don't try to do anything clever like sending the user to the site that will give them the quickest response, since most of our users are in the UK and geography's not much of an issue.

For some time now, we've been noticing higher traffic at site 1 and lower traffic at site 3, and the difference has been slowly increasing. This is odd, since the load should be equally balanced between the three sites.

After several days of analysing log files, it looked like a performance problem at site 3, and to a lesser extent at site 2. We use JavaScript to report the total rendering time of 10% of pages served, and those times were longer at site 3 than at site 2, which in turn were longer than at site 1. However, site 3 was slow both for users coming to an IP address at site 3 but being sent to a web server at another site, and also for users coming to an IP address at another site but being sent to a web server at site 3, which didn't seem to make any sense. Users sent between sites 1 and 2 had better performance, even though (as it happened) the private network link between sites 1 and 2 also went through site 3. The obvious conclusion was that users at site 3 were having pages load slowly enough that they loaded fewer of them, which would be bad news -- it could potentially be reducing our total traffic by several million pages per day.

Further analysis of log files then showed that the problem was only affecting Windows Vista users (30% of our total traffic). Other users showed the same performance and traffic at all three sites.

Googling for Vista network performance issues turned up a big red herring about TCP window scaling, which Vista is the first version of Windows to enable by default, and which can cause performance issues with some routers. This was still hard to use as an explanation, given that users on site 1 had good performance, but users coming to site 3 web servers through site 1, using the same router, had poor performance.

So as an experiment, we took site 3 out of the DNS pool altogether for a day. All DNS lookups now returned the addresses for sites 1 & 2. Suddenly, site 2 was just as bad as site 3 had been -- its total number of pages to Vista users went down rather than up, even though its total traffic was up by nearly 50%.

This suggested strongly that for some reason Vista was preferring site 1 to site 2 or 3, and site 2 to site 3, when choosing an IP address from the round-robin A records presented to it. Some more Googling eventually found RFC 3484, which defines default address selection for IPv6, but part of which is back-ported to IPv4. Vista is apparently the first major client OS to implement it, specifically section 6 rule 9. That specifies that the selection of an address from multiple A records is no longer random: instead, the destination address which shares the most prefix bits with the source address is selected, presumably on the basis that it's in some sense "closer" in the network.
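Rule 9 is simple enough to sketch in a few lines of Python. This is only an illustration of the rule as described above, not Microsoft's actual implementation, and the candidate addresses are made-up examples inside each site's range:

```python
import ipaddress

def common_prefix_len(a: str, b: str) -> int:
    """Number of leading bits two IPv4 addresses have in common."""
    xor = int(ipaddress.IPv4Address(a)) ^ int(ipaddress.IPv4Address(b))
    return 32 - xor.bit_length()

def rule9_select(source: str, candidates: list[str]) -> str:
    """RFC 3484 section 6 rule 9: prefer the destination sharing the
    longest prefix with the source address."""
    return max(candidates, key=lambda c: common_prefix_len(source, c))

# A NATted home user at 192.168.1.10 choosing between the three sites:
print(rule9_select("192.168.1.10", ["213.0.0.1", "146.0.0.1", "80.0.0.1"]))
# → 213.0.0.1, every single time; the "round robin" never rotates
```

The key point is that `max` is deterministic: given the same source address, every lookup lands on the same site, no matter how the A records are ordered or weighted.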

Now, this may well make sense in IPv6 (I don't know enough about it to comment), but it's an insane algorithm to use in IPv4. First, the Internet is not laid out that way. As a certain webcomic's map of the Internet shows, Europe does have a nice block from 80.0.0.0 to 91.255.255.255, but it also has chunks from 193-195 and 212-213, plus there's lots of geographically random stuff between 128 and 172.

But second, and more important, very few Windows client PCs actually have public IP addresses. If you're behind a NAT gateway, the DNS client in your Windows PC doesn't know the IP address you're using on the Internet, just the local network address you're using in one of the ranges specified by RFC 1918. Now, in theory, that could be in 10.0.0.0/8, 172.16.0.0/12 or 192.168.0.0/16, but in practice nearly all home routers allocate addresses in the 192.168 range. As it happens, that shares three prefix bits with our site 1 address, one bit with our site 2 address and no bits with our site 3 address, so any Vista PC on a home network will always prefer site 1 over sites 2 or 3, and site 2 over site 3. This explains the difference in traffic volumes.

A user with a slow and dodgy connection may have pages time out, at which point their browser sends them to another IP address, so those users who have inherently worse performance are much more likely to find their way to site 3. Also, the few remaining dialup users actually have public IP addresses, which may well be in the European range from 80.0.0.0 to 91.255.255.255, which shares the most prefix bits with site 3 and is thus more likely to go to site 3. These factors explain the poor performance we saw at site 3.
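You can check the bit-counting for yourself. Only the first octet matters here, because each pair of addresses already differs within it (213, 146 and 80 are the real site prefixes from above; 192 is the usual home-router octet):

```python
# Compare 192 with each site's first octet, bit by bit.
for site, octet in [("site 1", 213), ("site 2", 146), ("site 3", 80)]:
    xor = 192 ^ octet
    shared = 8 - xor.bit_length()  # leading bits in common
    print(f"192 = {192:08b} vs {octet:3d} = {octet:08b} -> {site}: {shared} shared bits")
# → site 1: 3 shared bits, site 2: 1 shared bit, site 3: 0 shared bits
```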

So we're going to have to take a slight hit to our resilience and reduce the number of A records we return for a DNS lookup from two to one. This will affect other large multi-site websites as well -- for example, www.google.com returns three IP addresses in different ranges. And Microsoft have broken the Internet. Again. Although, to be fair, they did have some help this time from the IETF.

(I found this from a discussion on the Debian mailing list about the implementation of RFC 3484 in glibc in Debian Etch. They eventually backed it out and only applied section 6 rule 9 to destination addresses on the same subnet, which seems like a much better way to do it.)
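As I understand that compromise, it looks something like the sketch below. This is my own reading of the idea, not glibc's actual code, and the /24 subnet size and example addresses are assumptions for illustration:

```python
import ipaddress
import random

def same_subnet_select(source: str, candidates: list[str], prefixlen: int = 24) -> str:
    """Apply the rule 9 preference only to destinations on the source's
    own subnet (assumed /24 here); otherwise keep the random choice
    from the resolver's round robin."""
    net = ipaddress.ip_network(f"{source}/{prefixlen}", strict=False)
    local = [c for c in candidates if ipaddress.ip_address(c) in net]
    if local:
        return local[0]  # a genuinely "close" destination: prefer it
    return random.choice(candidates)  # otherwise, preserve the load spread

# A NATted client shares a subnet with none of the three sites,
# so it still picks among them at random:
print(same_subnet_select("192.168.1.10", ["213.0.0.1", "146.0.0.1", "80.0.0.1"]))
```

Under this rule, the longest-prefix heuristic only fires when it actually means something (the destination really is on your local network), and a round-robin pool of public addresses stays evenly balanced.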
