
I've been scratching my head for the past few days, trying to come up with a solution for the following problem:

In our data center we have an F5 running on BigIP hardware that acts as a single ingress point for HTTPS requests from client machines in various office locations across the country. The F5 terminates TLS and then forwards all requests to two Traefik load balancers, which distribute the requests to the various service instances (the Traefik nodes run in Docker on Red Hat Enterprise Linux, but I believe that is irrelevant to my problem). From a throughput, CPU and memory point of view, those three network components are more than capable of handling the amount of requests and traffic, with plenty of capacity to spare.

However, we noticed frequent 1000ms delays in HTTP(S) requests that clients make, particularly during high-load times. We tracked the problem to the following root cause:

  • During high-load times, the F5 "client" initiates new TCP connections to the Traefik "server" nodes at a high frequency (possibly 100+ per second).
  • Those connections are terminated on the Traefik "server" side when the HTTP responses have been returned.
  • Each closed connection remains in a TIME_WAIT state for 60 seconds on the Traefik host.
  • When the F5 initiates a new connection, it randomly chooses an available port from its ephemeral port range.
  • Sometimes (often during high load), there is already a connection on the Traefik host in TIME_WAIT state with the same source IP + port, destination IP + port combination. When this happens, the TCP stack (?) on the Traefik host ignores the first SYN packet. Note: RFC 6056 calls this a collision of instance-ids. (A rough way to observe these lingering TIME_WAIT sockets is sketched after this list.)
  • After 1000ms the retransmission timeout (RTO) mechanism kicks in on the F5 and resends the SYN packet. This time the Traefik host accepts the connection and completes the request correctly.
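
For reference, here is a minimal sketch of how such TIME_WAIT sockets can be counted on the Traefik host (assumptions: Linux, IPv4, and TRAEFIK_PORT as a placeholder for whatever port the F5 actually targets):

    import socket
    import struct

    TRAEFIK_PORT = 8080  # placeholder: the port the F5 connects to

    def parse_hex_addr(hex_addr):
        """Convert '0100007F:1F90' from /proc/net/tcp into ('127.0.0.1', 8080)."""
        ip_hex, port_hex = hex_addr.split(':')
        ip = socket.inet_ntoa(struct.pack('<I', int(ip_hex, 16)))
        return ip, int(port_hex, 16)

    count = 0
    with open('/proc/net/tcp') as f:
        next(f)  # skip the header line
        for line in f:
            fields = line.split()
            local, remote, state = fields[1], fields[2], fields[3]
            if state == '06':  # 06 = TIME_WAIT
                lip, lport = parse_hex_addr(local)
                rip, rport = parse_hex_addr(remote)
                if lport == TRAEFIK_PORT:
                    count += 1  # tuple still occupied; a new SYN reusing it would be ignored

    print(f"{count} sockets in TIME_WAIT towards port {TRAEFIK_PORT}")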

Obviously, those 1000ms delays are absolutely unacceptable. So we have considered the following solutions so far:

  1. Reduce the RTO in F5 to retransmit faster, e.g. to 200ms.
  2. Reduce net.ipv4.tcp_fin_timeout to close abandoned TIME_WAIT connections faster.
    Update: This only applies to connections abandoned by the other side, when no FIN is returned. It does not have any effect on connections in TIME_WAIT state.
  3. Enable net.ipv4.tcp_tw_reuse: Useless for incoming connections.
  4. Enable net.ipv4.tcp_tw_recycle: AFAIK contraindicated if the client sends randomized TCP timestamps. There is contradictory information (incl. empirical evidence) on whether this feature has been removed from Linux or not. Also, it is generally recommended NOT to mess with it. (A quick check of the current values of these sysctls is sketched after this list.)
  5. Add more source IPs and/or make Traefik listen on multiple ports to increase # of permutations in IP/port tuples.
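
Not a fix, but a small sketch (Linux only; note that tcp_tw_recycle may be absent on newer kernels) to dump the current values of those sysctls before experimenting:

    from pathlib import Path

    # Sysctls mentioned above, read straight from /proc/sys.
    SYSCTLS = [
        "net/ipv4/tcp_fin_timeout",
        "net/ipv4/tcp_tw_reuse",
        "net/ipv4/tcp_tw_recycle",      # removed in recent kernels, hence the existence check
        "net/ipv4/ip_local_port_range",
    ]

    for name in SYSCTLS:
        path = Path("/proc/sys") / name
        value = path.read_text().strip() if path.exists() else "<not present>"
        print(f"{name.replace('/', '.')} = {value}")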

I'll discard #1 because that's just a band-aid: delays would still occur, just a little less noticeably. #3 wouldn't have any effect anyway, and #4 would most likely render the system non-functional. That leaves #2 and #5.

But based on what I learned after reading through dozens of posts and technical articles, both of them will ultimately only reduce the chance of those "collisions". After all, what ultimately prevents the sending side, the F5, from (pseudo)randomly choosing a combination of ephemeral port, source IP and target port that still exists in TIME_WAIT state on the targeted Traefik host, regardless of how short the fin_timeout setting is (which should stay in the many-seconds range anyway)? We would only reduce the probability of collisions, not eliminate it.

After all my research, and in times of gigantic web applications, it really surprises me that this problem is not discussed more on the web (and that solutions are not readily available). I'd really appreciate your thoughts and ideas on whether there is a better, more systematic solution in TCP land that will drive the occurrence of collisions near zero. I'm thinking along the lines of a TCP configuration that would allow the Traefik host to immediately accept a new connection despite an old connection being in TIME_WAIT state. But so far, no luck in finding one.

Random thoughts and points:

  • At this point it is not feasible to change our various in-house applications to use longer-running HTTP(S) connections to reduce the number of requests/connections per second.
  • The network architecture of F5 and Traefik is not up for discussion, cannot be changed.
  • I recently investigated the ephemeral port selection on Windows clients. That algorithm seems to be sequential, not random: it maximizes the time until a port is reused, but reduces security.
  • During load tests on an otherwise idle system, we generated ~100 HTTP requests/connections per second. The first collisions occurred after only a few seconds (say, before 2000 requests in total), even though the F5 is configured to use more than 60k ephemeral ports. I assume this is due to the pseudo-random nature of the port selection algorithm, which seems to do a fairly poor job of avoiding instance-id collisions. (A back-of-the-envelope estimate of the collision odds is sketched after this list.)
  • The fact that the Traefik host accepts the TCP connection on SYN packet retransmission is probably a feature of the TCP implementation. RFC 6056 speaks of TIME_WAIT assassination, which might be related to this.
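
For what it's worth, here is the back-of-the-envelope estimate mentioned above. It assumes the F5 picks ephemeral ports uniformly at random (a simplification of whatever RFC 6056 algorithm it really uses) and uses the numbers from our load test; this is essentially the birthday problem:

    import math

    PORTS = 60_000   # usable ephemeral ports on the F5 (per source IP / destination port)
    RATE = 100       # new connections per second during the load test
    TIME_WAIT = 60   # seconds a (src IP, src port, dst IP, dst port) tuple stays occupied

    def p_no_collision(n, ports):
        """Probability that n uniform random picks from `ports` ports are all distinct."""
        return math.exp(-n * (n - 1) / (2 * ports))  # standard birthday-problem approximation

    for seconds in (1, 2, 3, 5, 10):
        n = min(RATE * seconds, RATE * TIME_WAIT)  # picks competing within one TIME_WAIT window
        print(f"after {seconds:2d}s: P(at least one collision) ≈ {1 - p_no_collision(n, PORTS):.2f}")

So even a perfectly uniform selector over 60k ports has roughly even odds of a first collision within ~3 seconds at 100 connections/s, which matches what we observed.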

Update: Per The Star Experiment, the net.ipv4.tcp_fin_timeout setting does NOT affect the TIME_WAIT state, only the FIN_WAIT_2 state. And per Samir Jafferali, on Linux systems (incl. our Red Hat Linux) the TIME_WAIT period is hardcoded in the source code and cannot be configured. On BSD according to the source it is configurable but I haven't verified this.

Christoph

3 Answers


In our data center we have an F5 running on BigIP hardware that acts as single ingress point for HTTPS requests from client machines in our various office locations across the country.

If this single point (the front-end) remains a single point when it passes connections down to the back-end, why are you surprised by the hiccups? Especially when the intensity of connections is "possibly 100+ per second".

Your setup is basically squeezing a set with higher cardinality into one with significantly lower cardinality.

ultimately only reduce the chance of those "collisions"

This is fundamental to how packet-switched networks work. At the Ethernet level, for example, there are collisions too. Randomness is inevitable, and TCP/IP deals with it. The IP protocol itself was actually not built with LANs in mind (but still works great there too).

So yes "Add more source IPs and/or make Traefik listen on multiple ports" is pretty reasonable way to follow.

poige
  • After reading your response, reading some more, sleeping on this and updating my question with the latest findings, I believe you are right: increasing the permutations is the simplest way forward. It will reduce (hopefully sufficiently) but not completely eliminate those collisions. Considering that we are dealing with HTTP requests where, when the connection is closed by the server, it is guaranteed that the client doesn't send any more data, I had just hoped there would be a more "guaranteed" solution. Kind of like net.ipv4.tcp_tw_recycle when TCP timestamps were still steadily increasing. – Christoph Nov 21 '19 at 13:59

Although I also think adding more IP addresses is the simplest way forward, have you considered reusing TCP connections between the F5 and the Traefik nodes instead of creating a new one per external request?

I'm not sure how F5 supports that, but maybe it's as simple as switching to http2 between the F5 and the Traefik nodes. See https://developers.google.com/web/fundamentals/performance/http2#one_connection_per_origin
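
As a purely illustrative sketch (nothing F5-specific; example.com is just a placeholder host), this is the effect connection reuse has: several requests ride on a single TCP connection and therefore on a single ephemeral port, instead of burning one port per request:

    import http.client

    conn = http.client.HTTPConnection("example.com", 80)  # placeholder host

    for path in ("/", "/"):
        conn.request("GET", path)
        resp = conn.getresponse()
        resp.read()  # drain the body so the connection can be reused
        print(resp.status, "via local port", conn.sock.getsockname()[1])  # same port both times

    conn.close()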

Pedro Perez
  • Thanks. Definitely something to ask the vendors. – Christoph Nov 21 '19 at 20:44
  • @Christoph is the F5 forwarding TCP or HTTP? If you don't have a requirement for TLS within your back-end connections, you might look into the `fasthttp` profile on the F5: "Using the profile also ensures that the BIG-IP system pools any open server-side connections. This support for connection persistence can greatly reduce the load on destination servers by removing much of the overhead caused by the opening and closing of connections." Lots of other limitations, however. – Yolo Perdiem Nov 27 '19 at 21:12
  • @YoloPerdiem Thanks for that info. At this point in time the security group insists on TLS but from what I heard, that will change soon (i.e. no TLS). Then we can look at the fasthttp option. – Christoph Dec 01 '19 at 20:36

It turns out there was a very simple solution to this problem after all, which we figured out after working with the Traefik vendor for a while. It also turns out that the fact that we are running Traefik in Docker does matter. The problem and solution are very specific to our setup, but I still want to document it here in case others encounter the same thing. Nevertheless, this does not invalidate the other, more general recommendations, as collisions of instance-ids are a real problem.

Long story short: all Traefik instances are configured as host-constrained containers (i.e. tied to specific hosts) running in a Docker Swarm cluster. The Traefik instances need to expose a port at host level so that they are reachable from the F5, which obviously is not a Docker Swarm participant. Those exposed ports had been configured in ingress mode, which was not only unnecessary (there is no need to route the traffic through the Docker Swarm ingress network) but was also the cause of the dropped/ignored SYN packets. Once we switched the port mode to host, the delays disappeared.

Before:

  ports:
  - target: 8080
    published: 8080
    protocol: tcp
    mode: ingress   # published via the Swarm ingress routing mesh

After:

  ports:
  - target: 8080
    published: 8080
    protocol: tcp
    mode: host      # published directly on the node, bypassing the ingress mesh

Christoph