
I am trying to set up Traefik on a production site, and I'm struggling with some high-availability issues. I think we still need a reverse proxy in front of the Traefik cluster. Here are the potential setups that I've considered, and why the reverse proxy seems to be needed:

  1. Set up DNS A records to point to each of the Traefik nodes for load balancing and failover.

    • This practice is discouraged according to multiple sites including this SO question and this SF question.

    • Even using a service like DNSMadeEasy seems to be discouraged due to DNS caching and TTL issues.

  2. Point one DNS record to one of the nodes running Traefik.

    • That node becomes a SPOF. My nodes are running on CoreOS, which reboots after every update, so we would be guaranteed to have a few minutes of downtime each week.

    • We could move the DNS record to an alternate node whenever downtime is expected. This would be a pain to manage manually. I can envision a solution paired with locksmithd that handles this automatically, but I don't really want to build it and it wouldn't handle unexpected downtime.

    • Part of the rationale for using Docker Swarm (or Kubernetes) is to make nodes interchangeable.

  3. Put a load-balancer/reverse-proxy in front of the Traefik cluster. The reverse-proxy can provide failover between all the Traefik nodes, and DNS would point to the reverse-proxy.

    • Yes, this is a SPOF, but in my experience, it is pretty easy to get good uptime with this setup. If occasional maintenance is required, the DNS record can be pointed to a new proxy.

Am I missing something or overthinking this?

Mark Grimes
  • Hmm, you can have a subset of your edge nodes in DNS, with very low TTL (30-60 seconds) and monitoring-dns automation. For many use cases this is fine. The other option not listed here, from the old days, is to run heartbeatd and have healthy nodes take over the IPs of failed nodes. That software is still around. Running a non-HA SPOF reverse proxy in 2018 seems undesirable to me; table stakes is *some* node failure/recovery automation. What you want to quantify to pick a solution is what degree of reliability, and thus what level of automation, the biz demands. – Jonah Benton Jul 04 '18 at 18:05
  • I'm more familiar with OpenShift myself (which builds on top of Kubernetes), but isn't part of Kubernetes the concept of a service node, where this becomes essentially a reverse proxy that is typically highly available? The service node then looks after scaling the pods as required. A good thing about this is that the concept of "highly available" can now also be "just have one instance that gets auto-started on failure"... or even "just start it when needed". – Cameron Kerr Nov 20 '18 at 04:22
  • This might not have been an option at the time this was first answered, but Traefik has an HA config option... but it is marked 'beta' in 1.7, and 'enterprise' in 2.x. https://levelup.gitconnected.com/traefik-2-high-available-mode-d09c9ec36295 https://doc.traefik.io/traefik/v1.7/user-guide/cluster/ – Art Hill Apr 12 '21 at 16:14

2 Answers


There are several kinds of solutions.

1) Build your own HA load balancer in front of your Swarm/Kubernetes cluster to distribute the traffic and perform failover.

There are a lot of different appliances out there:

  1. Netscaler
  2. Kemp
  3. F5

While this approach is HA, it is usually not cheap.

A cheaper alternative could be an Nginx/HAProxy + Keepalived setup.

However, you of course need a floating IP and have to take care of the ARP caches.
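To make the "distribute the traffic and perform failover" part concrete, here is a minimal Python sketch of the kind of TCP health check a front-end load balancer runs against its backends. It is only an illustration and not how Nginx, HAProxy, or Keepalived implement their checks; the node addresses in the usage comment are made up.

```python
import socket
from typing import Callable, Iterable, List, Tuple

Backend = Tuple[str, int]  # (host, port) of one Traefik node


def is_tcp_port_open(backend: Backend, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to the backend succeeds within the timeout."""
    try:
        with socket.create_connection(backend, timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False


def healthy_backends(
    backends: Iterable[Backend],
    check: Callable[[Backend], bool] = is_tcp_port_open,
) -> List[Backend]:
    """Keep only the backends that pass the health check, preserving order."""
    return [backend for backend in backends if check(backend)]


# Usage (hypothetical addresses): traffic would only be sent to the nodes returned here.
# healthy_backends([("10.0.0.11", 80), ("10.0.0.12", 80)])
```

A real load balancer repeats such checks on an interval and usually requires several consecutive failures before removing a node, which this sketch leaves out.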

2) Use a "cloud load balancer". DigitalOcean, AWS, GKE, and OpenStack all provide such a feature. It is easier to set up (most of the time); whether it is cheaper is something you have to calculate.

On DigitalOcean the LB is just $20, and there is a beta of a managed Kubernetes cluster. You may want to have a look at it. All the components plug together well: https://www.digitalocean.com/products/kubernetes/

3) If your apps are not 100% critical, I can suggest a special solution I've used so far:

Cloudflare + low TTL + https://github.com/Berndinox/cloudflare-ddns

It is that simple: https://github.com/Berndinox/compose-v3-collection/blob/master/wordpress/www.yml

How it works: it spins up WordPress and all its requirements, including the DNS container. The DNS container updates the domain's DNS record on Cloudflare (the IP differs depending on which host the container starts on). If a host is rebooted or the container healthcheck fails, the container is rescheduled. If the host it originally ran on is offline when it is rescheduled, the container starts on another host and then pushes the new IP to Cloudflare. That all happens automatically without doing anything. :)

The Cloudflare TTLs are really low, so there may be just a few seconds of downtime.
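The update step can be sketched in Python roughly as follows. This is a hedged illustration of the idea, not the code of the linked cloudflare-ddns container: `sync_dns_record`, `worst_case_downtime_seconds`, and the `update_record` callback are hypothetical names, and the actual Cloudflare API call is left to whatever the callback does.

```python
from typing import Callable, Optional


def sync_dns_record(
    current_ip: str,
    published_ip: Optional[str],
    update_record: Callable[[str], None],
) -> bool:
    """Publish current_ip only if it differs from the record already in DNS.

    update_record is a caller-supplied function (hypothetical here) that
    performs the real DNS update. Returns True if an update was sent.
    """
    if current_ip == published_ip:
        return False
    update_record(current_ip)
    return True


def worst_case_downtime_seconds(reschedule_seconds: float, ttl_seconds: float) -> float:
    """Rough upper bound: time to restart elsewhere plus one full TTL of stale caches."""
    return reschedule_seconds + ttl_seconds
```

The second function is just the arithmetic behind the downtime estimate: clients can keep the old address for up to one TTL after the container has come up on a new host, and resolvers that ignore the TTL can make it longer.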

Martijn Heemels
Berndinox

If you want to 'roll your own' HA layer on top of Traefik, might I suggest a slightly different angle. I use NetScalers (rebranded as 'ADC' by Citrix) in my day job, and my suggestion is to make Traefik act like an ADC, if you can pull this off. In the ADC world, this would be a 'single-arm HA pair', and it *should* operate as active-passive (not active-active).

Set up more than one instance of Traefik, with different IPs. For my example, I use 10.0.1.11 and 10.0.1.12. These IPs should be used for OS patching or anything else *other* than the reverse proxy traffic. In an ADC, these are the NSIP entries.

Configure a second network interface (IF) on each instance with a third, 'floating' IP. For my example, I use 10.0.1.10. In an ADC, this would be a SNIP. *Ensure* that this IF remains down during setup, or you will have IP address conflicts. Also configure this IF to *not* start automatically at boot. Configure Traefik to use only this IP for reverse proxy traffic.

Next, you have to figure out how to keep the config for the instances 'in sync'. I'm a bit new to Traefik, so I am unsure about this, but if Traefik behaves well with it, use an NFS share to store the config in one place, mounted on all nodes. Mount the NFS share in the correct place, or use soft links. If Traefik does not behave well with this (or you don't have a good NFS), then maybe store the config in a git repo and use scripts to sync it, or use rsync, or Ansible, Puppet, Salt, etc. Clearly more work is needed here. You might need to script restarting services when the config is updated; I'm not sure. This would need to be done carefully so that the nodes don't all restart their services at the same time.

Now configure the Corosync stack to manage which instance is 'up'. The Corosync stack can be configured to keep the IF with the 'floating IP' 10.0.1.10 available on only one instance, and to manage the starting and stopping of services on all the instances. The 'normal' state of all services needs to be 'up', except for the IF. That way, there is little interruption if the Corosync stack needs to 'down' the IF on one instance and 'up' the IF on a different one. Adapt these instructions: https://www.digitalocean.com/community/tutorials/how-to-create-a-high-availability-setup-with-corosync-pacemaker-and-floating-ips-on-ubuntu-14-04

Lastly, put in your DNS entry to point to the 'Floating IP' 10.0.1.10.

The result is that you *should* be able to manage the instances, sync the configs, and patch the OS on the IPs 10.0.1.11 and 10.0.1.12, while exactly one instance at a time holds the IP 10.0.1.10 to handle the reverse proxy traffic.
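The active-passive decision itself can be illustrated with a small Python sketch. This is only a toy model of the rule "keep the floating IP where it is unless that instance is unhealthy"; it is not how Corosync/Pacemaker decide (they also deal with quorum, fencing, and resource scores), and `choose_floating_ip_holder` is a made-up name.

```python
from typing import Dict, Optional


def choose_floating_ip_holder(
    health: Dict[str, bool],
    current_holder: Optional[str],
) -> Optional[str]:
    """Pick the single instance that should bring up the floating-IP interface.

    health maps each instance's management IP to whether it is healthy.
    The current holder keeps the floating IP while it stays healthy, so a
    recovered node does not trigger a second, unnecessary failover.
    """
    if current_holder is not None and health.get(current_holder, False):
        return current_holder
    # Fail over to the first healthy instance in a fixed (sorted) order.
    for node in sorted(health):
        if health[node]:
            return node
    return None  # No healthy instance: nobody should hold the floating IP.


# Example with the addresses used above: 10.0.1.11 fails, 10.0.1.12 takes over 10.0.1.10.
holder = choose_floating_ip_holder({"10.0.1.11": False, "10.0.1.12": True}, "10.0.1.11")
```

Keeping the IP on the surviving node after the failed one recovers is a design choice; it trades an even distribution of roles for fewer interruptions.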

I am considering setting up something like this, and I will provide updates if I do.

Or do as Berndinox suggested, and pay for something you don't have to engineer, something that you know will work. The ADCs work great in my experience.

Art Hill