
We are using an NLB in AWS connected to our EKS cluster via an nginx ingress controller. Some of our requests get a random 504 Gateway Timeout.

We think we have traced the problem to our nginx ingress. Based on some Stack Overflow recommendations we experimented with the Connection header:

  1. We set Connection "close"; this had no effect.
  2. We set Connection "keep-alive"; again, no effect.

We also noticed odd behavior around proxy_read_timeout: when it was 60 seconds, the request from the browser would complete at 60.xx seconds; when we reduced it to 30, that became 30.xx, and 20 became 20.xx. We went down to 1 but still get random 504 Gateway Timeouts, and we do not understand why proxy_read_timeout behaves this way in our environment.

We want to understand what the effect of proxy_read_timeout is and why we see the behavior above. Also, is there a way to set Connection to "" on our nginx ingress? We are not able to do this via nginx.ingress.kubernetes.io/connection-proxy-header: "".
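
For reference, this is roughly how we are setting these values today, via annotations on the Ingress (the exact values are placeholders that vary as we experiment):

metadata:
  annotations:
    # how long nginx waits for a response from the upstream pod, in seconds
    nginx.ingress.kubernetes.io/proxy-read-timeout: "60"
    # the Connection header values we have been testing ("close", "keep-alive")
    nginx.ingress.kubernetes.io/connection-proxy-header: "keep-alive"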

Thanks in advance!

Daniele Santi
  • Somebody in the Slack channel for nginx-ingress said he faced the same issue: 'I have the same problem, and the reason is you need to specify `spec.externalTrafficPolicy=Local`, which preserves the client/source IP. `spec.externalTrafficPolicy=Cluster` is the default. The problem is that once you set `spec.externalTrafficPolicy=Local`, the health check changes to HTTP /healthz and things stop working.' We will try the same and post back if this solves the problem – Siva Vg Apr 16 '19 at 06:59

3 Answers


We think our issue was related to this:

https://docs.aws.amazon.com/elasticloadbalancing/latest/network/load-balancer-troubleshooting.html#loopback-timeout

We're using an internal NLB with our nginx ingress controller, with targets registered by instance ID. We found that the 504 timeouts and the X-second waits were only occurring for applications that were sharing a node with one of our ingress controller replicas. We used a combination of nodeSelectors, labels, taints, and tolerations to force the ingress controllers onto their own node (sketched below), and it appears to have eliminated the timeouts.
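
Roughly, the setup looked like this (the dedicated label/taint key is just an example, not our exact values):

# reserve a node for the ingress controller (example key/values)
#   kubectl label nodes <node-name> dedicated=ingress
#   kubectl taint nodes <node-name> dedicated=ingress:NoSchedule
# then, in the nginx-ingress-controller Deployment's pod template spec:
nodeSelector:
  dedicated: ingress            # only schedule onto the labelled node
tolerations:
- key: dedicated
  operator: Equal
  value: ingress
  effect: NoSchedule            # tolerate the taint that keeps everything else off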

We also changed our externalTrafficPolicy setting to Local.
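
On the ingress controller's Service that change is just (the service name is whatever yours is called):

spec:
  type: LoadBalancer
  externalTrafficPolicy: Local   # route only to nodes with a local ingress pod; preserves the client IP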

J. Koncel

I had the same issue as J. Koncel: only the applications sharing nodes with the nginx ingress controller got the 504 timeouts.

Instead of using nodeSelectors and taints/tolerations, I used Pod anti-affinity: https://kubernetes.io/docs/concepts/configuration/assign-pod-node/#inter-pod-affinity-and-anti-affinity.

I added a label to the pod spec of my nginx-ingress-controller:

podType: ingress
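
In the controller Deployment that label goes on the pod template, roughly:

spec:
  template:
    metadata:
      labels:
        podType: ingress   # matched by the anti-affinity rule below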

Then I updated the YAML for the applications that should not be scheduled on the same instance as the nginx-ingress-controller to include this:

affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
        - key: podType
          operator: In
          values:
          - ingress
      topologyKey: "kubernetes.io/hostname"
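
This block goes under spec.template.spec.affinity in each application's Deployment. Because it uses requiredDuringSchedulingIgnoredDuringExecution, it is a hard rule: the scheduler will never place those pods on a node (topologyKey kubernetes.io/hostname) that already runs a pod labelled podType: ingress.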
nedstark179

At the moment I am not able to comment, but the following command should help you set externalTrafficPolicy:

kubectl patch svc nodeport -p '{"spec":{"externalTrafficPolicy":"Local"}}'
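
After the patch, kubectl describe svc nodeport should show External Traffic Policy: Local. Keep in mind that with Local the NLB health checks move to the service's healthCheckNodePort (an HTTP /healthz endpoint served by kube-proxy), so only nodes that actually run an ingress controller pod will pass the health check, which is the behavior mentioned in the comments above.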