I've been scratching my head for the past few days, trying to come up with a solution for the following problem:
In our data center we have a F5 running on BigIP hardware that acts as a single ingress point for HTTPS requests from client machines in various office locations across the country. F5 terminates TLS and then forwards all requests to two Traefik load balancers, which route distribute the requests to the various service instances (Traefik nodes are running in Docker on Red Hat Enterprise but I believe that is irrelevant for my problem). From a throughput, CPU and memory point of view, those three network components are more than capable to handle the amount of requests and traffic with plenty of capacity to spare.
However, we noticed frequent 1000ms delays in HTTP(S) requests that clients make, particularly during high-load times. We tracked the problem to the following root cause:
- During high-load times, the F5 "client" initiates new TCP connections to the Traefik "server" nodes at a high frequency (possibly 100+ per second).
- Those connections are terminated on the Traefik "server" side when the HTTP responses have been returned.
- Each closed connection remains in a TIME_WAIT state for 60 seconds on the Traefik host.
- When the F5 initiates a new connection, it randomly chooses an available port from its ephemeral port range.
- Sometimes (often during high load), there is a already a connection in Traefik in TIME_WAIT state with the same source IP + port, destination IP + port combination. When this happens, the TCP stack (?) on the Traefik host ignores the first SYN packet. Note: RFC 6056 calls this collision of instance-ids.
- After 1000ms the retransmission timeout (RTO) mechanism kicks in on the F5 and resends the SYN packet. This time the Traefik host accepts the connection and completes the request correctly.
Obviously, those 1000ms delays are absolutely unacceptable. So we have considered the following solutions so far:
- Reduce the RTO in F5 to retransmit faster, e.g. to 200ms.
- Reduce net.ipv4.tcp_fin_timeout to close abandoned
TIME_WAITconnections faster.
Update: This only applies to connections abandoned by the other side, when no FIN is returned. It does not have any effect on connections in TIME_WAIT state. - Enable net.ipv4.tcp_tw_reuse: Useless for incoming connections.
- Enable net.ipv4.tcp_tw_recycle: AFAIK contra-indicated if client sends randomized TCP timestamps. Contradicting information (incl. empirical evidence) whether this feature was removed from Linux or not. Also, generally recommended NOT to mess with.
- Add more source IPs and/or make Traefik listen on multiple ports to increase # of permutations in IP/port tuples.
I'll discard #1 because that's just a band-aid. Delays still occur, just a little less noticable. #3 wouldn't have any effect anyway, #4 would most likely render the system non-functional. That leaves #2 and #5.
But based on what I learned after reading through dozens of posts and technical articles, both of them will ultimately only reduce the chance of those "collisions". Because, what ultimately prevents the sending side, F5, to (pseudo)randomly choose a combination of ephemeral port, source IP and target port that still exists in TIME_WAIT state on the targeted Traefik host, regardless of how short the fin_timeout setting is (which should stay in the many sec range anyway)? We would only reduce the possibility of collisions, not eliminate it.
After all my research and in times of gigantic web applications, it really surprises me that this problem is not more discussed on the web (and solutions available). I'd really appreciate your thoughts and ideas on whether there is a better, more systematic solution in TCP land that will drive the occurrence of collisions near zero. I'm thinking along the lines of a TCP configuration that will allow the Traefik host to immediately accept a new connection despite an old connection being in TIME_WAIT state. But as of now, no luck in finding that.
Random thoughts and points:
- At this point it is not feasible to change our various in-house applications to use longer-running HTTP(S) connections to reduce the number of requests/connections per second.
- The network architecture of F5 and Traefik is not up for discussion, cannot be changed.
- I recently investigated the ephemeral port selection on Windows clients. That algorithm seems to be sequential, not random. Maximizes time until port is reused, reduces security.
- During load tests on an otherwise idle system, we generated ~100 HTTP requests/connections per second. The first collisions occurred already after a few seconds (say before 2000 requests total), even though the F5 is configured to use more than 60k ephemeral ports. I assume this is due to the pseudo-random nature of the port selection algorithm, which seems to do a fairly poor job of avoiding instance-id collisions.
- The fact that the Traefik host accepts the TCP connection on SYN packet retransmission is probably a feature of the TCP implementation. RFC6056 speaks of TIME_WAIT assassination, which might be related to this.
Update: Per The Star Experiment, the net.ipv4.tcp_fin_timeout setting does NOT affect the TIME_WAIT state, only the FIN_WAIT_2 state. And per Samir Jafferali, on Linux systems (incl. our Red Hat Linux) the TIME_WAIT period is hardcoded in the source code and cannot be configured. On BSD according to the source it is configurable but I haven't verified this.