
It has recently come to my attention that setting up multiple A records for a hostname can be used not only for round-robin load-balancing but also for automatic failover.

So I tried testing it:

  1. I loaded a page from our domain
  2. Noted which of our servers had served the page
  3. Turned off the web server on that host
  4. Reloaded the page

And indeed the browser automatically tried a different server to load the page. This worked in Opera, Safari, IE, and Firefox. Only Chrome failed to try a different server.
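What the failing-over browsers are doing can be sketched in a few lines of Python (a toy illustration of the client-side logic, not any browser's actual code; the function name is mine): resolve the hostname, then walk the returned address list until a TCP connection succeeds.

```python
import socket

def connect_first_alive(hostname, port, timeout=3.0):
    """Resolve hostname to all of its A/AAAA records, then try each
    address in turn until one accepts a TCP connection -- roughly what
    the browsers that failed over are doing internally."""
    last_err = OSError("no addresses resolved")
    for family, socktype, proto, _, addr in socket.getaddrinfo(
            hostname, port, type=socket.SOCK_STREAM):
        sock = socket.socket(family, socktype, proto)
        sock.settimeout(timeout)
        try:
            sock.connect(addr)      # first reachable server wins
            return sock
        except OSError as err:
            sock.close()            # dead server: fall through to the next record
            last_err = err
    raise last_err
```

A browser that behaves this way will silently land on server B or C when server A refuses the connection, which matches what I saw in Opera, Safari, IE, and Firefox.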

But after leaving that server offline for a few minutes and looking at the access logs, I found that the number of requests to the other servers had not significantly increased. With 1 out of 3 servers offline, I had expected accesses to each of the remaining 2 servers to increase by roughly 50%, but instead I only saw 7-10%. That can only mean that in-browser DNS failover does not work for the majority of browsers/visitors, which directly contradicts what I had just tested.
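The expected jump is simple arithmetic (a quick sanity check of my own, not from the logs): traffic that was split n ways gets split n-1 ways, so each survivor's share grows by n/(n-1) - 1.

```python
# Fractional traffic increase per surviving server when one of
# n evenly loaded servers goes offline.
def expected_increase(n_servers):
    return n_servers / (n_servers - 1) - 1

print(expected_increase(3))  # 0.5, i.e. the +50% expected above
```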

Does anyone have an idea what is up with browsers' DNS failover behavior? What possible reason could there be why automatic failover works for me but not the majority of our visitors?

edit: To make myself clear, I made absolutely no change to our DNS settings; there's no TTL or propagation issue here, it's all about how the client handles the multiple A records.

Daniel

3 Answers


OK, I am going to start by saying DNS is not a good failover system in any way; you need a reverse proxy or load balancer. There are several reasons why the experience is not the same for everyone. First of all, Chrome uses the OS to look up DNS, so the IPs it gets depend on the OS, and in this case the OS might only give it one IP.

As for the other browsers, how well it works is highly dependent on how each one does DNS. The browser itself might decide not to try the other IPs, or even retry the same one several times, depending on the response from the DNS server.
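Whether a browser even has alternate IPs to try depends on what the resolver hands back. You can inspect that list yourself with the same OS call applications use, `getaddrinfo` (a small sketch; `resolve_all` is my own helper name):

```python
import socket

def resolve_all(hostname, port=80):
    """Return every distinct address the OS resolver gives for a name,
    in the order returned -- the pool a failing-over client can draw on."""
    seen, addrs = set(), []
    for *_info, sockaddr in socket.getaddrinfo(
            hostname, port, type=socket.SOCK_STREAM):
        ip = sockaddr[0]
        if ip not in seen:
            seen.add(ip)
            addrs.append(ip)
    return addrs
```

A name with three A records should show three entries here; if a client only ever sees one, it never had anything to fail over to.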

This brings us to the DNS servers themselves: many do not respect your TTLs and keep records cached for however long they feel like, meaning users could get your old IP for quite a while...

Fourthly, user experience: do you want users to have to refresh 3 or 4 times to get your website? Do you have any session- or login-based features on your site? What happens if the browser gets another IP in the middle of a session? If you really need HA and uptime, you really need to consider doing it right, honestly, or it will end up more fragile than using just one server.

Jacob
  • This answer is not related to my question. I've clarified the question; this is not about removing from DNS the IP address of a malfunctioning server. – Daniel Mar 16 '11 at 02:17
  • Yeah, I think it's relevant; I am explaining why DNS failover doesn't work, and TTLs are a part of that. – Jacob Mar 16 '11 at 02:43
  • TTL is irrelevant if the DNS records don't change, which is the case here. – Daniel Mar 16 '11 at 02:46
  • 2
    I don't see how you expect DNS to know not to hand out that IP with out changing it? Your DNS server has no clue your server is out and will happily keep handing out that IP... – Jacob Mar 16 '11 at 02:48
  • 3
    Yes, exactly. The DNS server keeps handing out the same 3 IP addresses (A,B,C), including the bad one (A), and the **browser** fails over (to B/C) when it cannot connect (to A). At least that's what my browsers do, apart from Chrome. Please re-read my question, slowly. – Daniel Mar 16 '11 at 02:54
  • But not **ALL** browsers do this, and you need to include the largest user base possible. And do you want your users to see that connection error? And assuming you have sessions/cookies (this is 2011), what happens if the cache expires and your user switches servers? A general rule of thumb for HA is that the user **should** never notice it... – Jacob Mar 16 '11 at 02:58
  • 1
    My question was not "I am a complete newbie so please advise me what to do", my question was "why does this behave the way it does?" Mostly I am interested in knowing why my tests point to two contradicting answers. – Daniel Mar 16 '11 at 03:04
  • I didn't mean to imply that you were a noob; we like to explain things thoroughly so people who might not understand will learn something... As for your question, it's simple: DNS is **NOT** designed to be a failover system. It depends on so many factors that behavior is not really the same even for the same browser on different OSes. In the case of Chrome it depends entirely on the host OS: Chrome throws the IP away as soon as it loads the page and expects the OS to keep it, and in some cases the OS might only give it the first IP due to its stack, so Chrome never gets a chance to try another IP... – Jacob Mar 16 '11 at 03:14
  • So that might explain why Chrome doesn't properly fail over, but Chrome only accounts for 20% of our visitors. This still doesn't explain the contradictory data. – Daniel Mar 16 '11 at 03:30
  • I think you are missing the point about OSes having different network stacks... – Jacob Mar 16 '11 at 03:43
  • Don't forget the very good, but incredibly expensive (comparatively) BGP Anycast for higher availability. – Tom O'Connor Mar 16 '11 at 09:15
  • 1
    @Jacob I wouldn't say "missing the point" but I was rather doubtful that different OS/net stacks could account for the discrepancy. Nonetheless, in addition to my original tests on WinXP I also did tests on Win7 and Ubuntu. And in all cases I got the same browser DNS failover behavior. So it's not an OS issue either. – Daniel Mar 16 '11 at 09:30

To me it's a great deal if you don't want to pay for expensive load balancers. See my reply here about how it's handled by browsers: https://serverfault.com/a/868535/114520

Now, for your concern, how did you monitor accesses? Was it the size of some access_log? Was it the requests per second on your webserver?

Maybe you have some caching solution on the webserver which won't hit your dynamic backend (PHP, Java...) if the response is already in cache. The more servers you have, the more requests it takes before everything is cached (if they don't share a cache).

Before assuming it's a DNS issue, add real monitoring: for example a live analytics tracker, or something like that. Then shut down one server and see if the live tracker shows a decrease in current users on the website.

For many years I've used, and still use, this setup with real pleasure. I only added some more failover layers:

  • Round-Robin on 2 or 3 nodes
  • each node has:
    • Varnish with director/probes to all backends
    • lighttpd (Apache or nginx will do!) on another port with fastcgi
    • PHP-FPM pool

If one PHP-FPM goes down, the Varnish probe will fail and remove that backend until the probe is good again. If Varnish itself fails, then round-robin DNS plus the browser will handle the switch to another node.
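The Varnish piece of that setup looks roughly like this in VCL (a sketch in Varnish 4+ syntax with made-up backend addresses and a hypothetical /health endpoint; adapt to your own backends and Varnish version):

```vcl
vcl 4.0;

import directors;

# Health probe: a backend is marked sick when checks start failing.
probe app_health {
    .url = "/health";      # hypothetical health-check endpoint
    .interval = 5s;
    .timeout = 1s;
    .window = 5;
    .threshold = 3;        # 3 of the last 5 checks must pass
}

backend app1 {
    .host = "10.0.0.1";    # example addresses
    .port = "8080";
    .probe = app_health;
}

backend app2 {
    .host = "10.0.0.2";
    .port = "8080";
    .probe = app_health;
}

sub vcl_init {
    new app = directors.round_robin();
    app.add_backend(app1); # sick backends are skipped automatically
    app.add_backend(app2);
}

sub vcl_recv {
    set req.backend_hint = app.backend();
}
```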

Yvan

Browsers are usually pretty aggressive about trying the alternate records when one is not responding.

A couple of things:

  1. Your issue with Chrome may be related to how it caches DNS - it does its own caching, and is pretty aggressive about it; could it have potentially still had the entry cached from before you had the multiple A records in place?
  2. Similarly, did you wait for at least the TTL of the DNS zone after the extra records were added to test the users coming in from the outside?
  3. Also, make sure that the load was fairly even between the servers in the first place; if one server only had 10% of the traffic, then you'd only expect a modest increase on the other node when it dies.

All that aside, DNS round robin is great for geographic redundancy and load balancing, but keep in mind that there are other good solutions out there for local failover.

Shane Madden
  • 1. AFAIK all browsers have DNS caches. I guess Chrome's issue is that it caches only one IP address even when a DNS query returns more than one. – Daniel Mar 16 '11 at 02:23
  • 2. I just edited my question to clarify that I made no change at all to DNS. So there's no TTL/propagation issue here; the records for the 3 servers were already there for a long time. – Daniel Mar 16 '11 at 02:23
  • 3. Yes, the load was fairly evenly split between the 3 servers to start with. – Daniel Mar 16 '11 at 02:24
  • Chrome will fetch an IP and delete it and expect the OS to cache it, BTW. – Jacob Mar 16 '11 at 02:59
  • 1
    @Daniel Non-Chrome browsers cache for a short time period, and the OS does most of the heavy lifting. Chrome's a different animal, it does aggressive prefetching of DNS for pages you visit most when you first start the browser (enter `about:dns` in the address bar). However, all this is irrelevant to your question; Chrome's DNS cache has your multiple A records present.. Maybe grab a packet capture of Chrome's behavior? I'd be really surprised if it wasn't sending any requests at all to the other address. – Shane Madden Mar 16 '11 at 05:05