
I am creating a Docker swarm with 3 managers and 2 workers. A service runs in the swarm and exposes port 80, so it can be reached via any node's IP. But what if that node goes down? Expecting users to always try another node's IP would be very cumbersome.

So what's the common practice for managing this external access point? I can think of setting up a DNS record that returns multiple nodes' IPs. Setting up another load balancer in front seems like overkill.

wei

1 Answer


I see a few options here:

1) an external load balancer.

If you are running on AWS, GCE, or another cloud provider, you can use the load-balancer-as-a-service those companies offer. Your DNS name points to the load balancer's IP, and the load balancer forwards traffic to your nodes.

PROS: you always have high availability (the load balancer itself is redundant; with at least 2 nodes you're good to go). You also get automatic failover: if a node fails, requests are forwarded to the remaining nodes of your cluster.

CONS: load balancers cost money.

2) a "DIY" load balancer.

You can run another server with HAProxy, nginx, or any other proxy that acts as a load balancer for you. The DNS record points to the proxy server (a single machine at this point), which forwards traffic to your nodes.

PROS: limited additional cost (the proxy could even be one of your cluster's nodes).

CONS: you have to set up the whole infrastructure yourself (failover and node discovery, just to name two things you need to care about). You also lose high availability unless you make the proxy itself redundant (I'm keeping things simple here).
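To make option 2 concrete, here is a minimal HAProxy sketch. The node IPs are placeholders (not from the question); the `check` keyword makes HAProxy health-check each node and stop routing to dead ones, which covers the failover concern mentioned above:

```
frontend http_in
    bind *:80
    default_backend swarm_nodes

backend swarm_nodes
    balance roundrobin
    # 'check' enables health checks; failed nodes are taken
    # out of rotation automatically.
    server node1 10.0.0.11:80 check
    server node2 10.0.0.12:80 check
    server node3 10.0.0.13:80 check
    server node4 10.0.0.14:80 check
    server node5 10.0.0.15:80 check
```

Since swarm's routing mesh publishes the port on every node, the backend can list all five nodes regardless of where the service's tasks actually run.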

3) multiple DNS records.

You can set, as you suggested, multiple IP addresses in your DNS records (round-robin DNS). The client will then connect to one of the nodes in your cluster, picked more or less at random.

PROS: free of charge

CONS: if a node goes down, clients will keep trying to connect to it until you remove it from your DNS records, and even then the stale entry lives on until the TTL expires.
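For illustration, option 3 in a BIND-style zone file would look something like this (hypothetical name and placeholder IPs). A low TTL limits how long clients keep a dead node's address, at the cost of more DNS queries:

```
$TTL 60
swarm.example.com.  IN  A  203.0.113.11
swarm.example.com.  IN  A  203.0.113.12
swarm.example.com.  IN  A  203.0.113.13
```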

If somebody has other ideas, I'm glad to hear them.

whites11
  • This seems about right. We use option two. We have a pair of VMs running haproxy with identical configs. We use Pacemaker HA to keep them alive and place the service IPs on living nodes. I'm thinking of moving from Pacemaker to keepalived, though. We have a couple of five-node swarms "behind" the LBs - also VMs. Lastly, we have anti-affinity rules in vCenter to keep the VMs on different ESXi hosts. – Mike Diehn Dec 11 '19 at 06:19
  • The problem with a DNS RR is that when a node fails in an N-node swarm, 1/N of requests time out until the RR is updated or the node is returned to service. I've read that browsers will just auto-retry, which is nice for humans driving browsers. It didn't work out for us, though. – Mike Diehn Dec 11 '19 at 06:21
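The keepalived approach mentioned in the first comment could be sketched roughly like this: VRRP keeps a virtual IP on whichever proxy VM is alive, so DNS points at one address that fails over automatically. All names, IPs, and priorities below are placeholder assumptions, not the commenter's actual config:

```
vrrp_script chk_haproxy {
    # consider this node unhealthy if haproxy isn't running
    script "pidof haproxy"
    interval 2
}

vrrp_instance VI_1 {
    state MASTER              # the peer VM uses state BACKUP
    interface eth0
    virtual_router_id 51
    priority 101              # peer uses a lower value, e.g. 100
    virtual_ipaddress {
        203.0.113.10          # the address your DNS record points to
    }
    track_script {
        chk_haproxy
    }
}
```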