Need help figuring out a random connection timeout issue on a server

I discovered this issue while load testing a custom Node.js WebSocket server: some sockets fail to connect (they hit the connection timeout). It does not appear to be related to load, as I can also get the failure at random in a test with just a single client thread.

This appears to be unrelated to Node.js, as I can also reproduce the problem by load testing nginx serving a static page on the same server. Overall, roughly 7-10% of inbound connections fail.

This does not appear to be a problem on my local client machine or Internet connection as I can reproduce the problem from another machine at a different location.
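For anyone who wants to reproduce the measurement, here is a minimal sketch of a sequential connection probe. It is not the load-test tool used above; the function names `probe` and `failure_rate` and the default timeout are assumptions made for illustration.

```python
import socket


def probe(host, port, attempts=100, timeout=5.0):
    """Open TCP connections one at a time and count the outcomes."""
    results = {"ok": 0, "timeout": 0, "error": 0}
    for _ in range(attempts):
        try:
            # create_connection performs the full TCP handshake, then we close.
            with socket.create_connection((host, port), timeout=timeout):
                results["ok"] += 1
        except socket.timeout:
            # The handshake did not complete within `timeout` seconds.
            results["timeout"] += 1
        except OSError:
            # Refused, unreachable, DNS failure, and similar errors.
            results["error"] += 1
    return results


def failure_rate(results):
    """Fraction of attempts that did not end in a successful connect."""
    total = sum(results.values())
    return 0.0 if total == 0 else (total - results["ok"]) / total
```

Running `probe` against the server from two different client locations and comparing `failure_rate` gives a rough check that the failures follow the server rather than the client.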

I have checked all of the normal tunables (somaxconn, max open files, etc.), and as far as I can tell I'm nowhere near hitting any limits. I am not seeing any entries in syslog relating to this problem. I also tried disabling iptables completely to rule out any firewall issues.
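One way to back up the "nowhere near the limits" check is to compare the kernel's listen-queue counters before and after a test run. The sketch below parses text in the format of /proc/net/netstat on Linux, where each group is a header line of counter names followed by a line of values. The function names are hypothetical, and it assumes the `ListenOverflows` and `ListenDrops` counters appear under `TcpExt`.

```python
def parse_proc_netstat(text):
    """Parse /proc/net/netstat-style text into {"TcpExt.ListenDrops": 5, ...}."""
    counters = {}
    lines = [ln for ln in text.splitlines() if ln.strip()]
    # Lines come in pairs: a header line of names, then a line of values,
    # both starting with the same "Prefix:" label.
    for header, values in zip(lines[0::2], lines[1::2]):
        h_prefix, names = header.split(":", 1)
        v_prefix, numbers = values.split(":", 1)
        if h_prefix != v_prefix:
            continue  # skip anything that is not a matching pair
        for name, number in zip(names.split(), numbers.split()):
            counters[f"{h_prefix}.{name}"] = int(number)
    return counters


def listen_queue_counters(counters):
    """Pick out the counters that grow when a listen queue overflows."""
    return {
        name: counters.get(f"TcpExt.{name}", 0)
        for name in ("ListenOverflows", "ListenDrops")
    }
```

Reading the file with `parse_proc_netstat(open("/proc/net/netstat").read())` before and after a run and diffing the two results shows whether either counter moved during the failures.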

The server runs Ubuntu 16.04 LTS (i7, 32GB) and is a dedicated machine at a colo facility. Before contacting them, I wanted to see if I could find more data about whether this is a problem at the OS level, machine level, or network level.

I was able to capture a tcpdump of a failed connection, but I'm not really sure what to make of it:

07:19:29.952730 IP localmachine.53949 > server.30312: Flags [S], seq 2408213894, win 64240, options [mss 1460,nop,wscale 8,nop,nop,sackOK], length 0
07:19:29.952879 IP server.30312 > localmachine.53949: Flags [S.], seq 1245200353, ack 2408213895, win 28200, options [mss 1410,nop,nop,sackOK,nop,wscale 7], length 0
07:19:30.951778 IP server.30312 > localmachine.53949: Flags [S.], seq 1245200353, ack 2408213895, win 28200, options [mss 1410,nop,nop,sackOK,nop,wscale 7], length 0
07:19:32.949553 IP localmachine.53949 > server.30312: Flags [S], seq 2408213894, win 64240, options [mss 1460,nop,wscale 8,nop,nop,sackOK], length 0
07:19:32.949650 IP server.30312 > localmachine.53949: Flags [S.], seq 1245200353, ack 2408213895, win 28200, options [mss 1410,nop,nop,sackOK,nop,wscale 7], length 0
07:19:34.947783 IP server.30312 > localmachine.53949: Flags [S.], seq 1245200353, ack 2408213895, win 28200, options [mss 1410,nop,nop,sackOK,nop,wscale 7], length 0
07:19:38.947699 IP server.30312 > localmachine.53949: Flags [S.], seq 1245200353, ack 2408213895, win 28200, options [mss 1410,nop,nop,sackOK,nop,wscale 7], length 0
07:19:38.950399 IP localmachine.53949 > server.30312: Flags [S], seq 2408213894, win 64240, options [mss 1460,nop,wscale 8,nop,nop,sackOK], length 0
07:19:38.950438 IP server.30312 > localmachine.53949: Flags [S.], seq 1245200353, ack 2408213895, win 28200, options [mss 1410,nop,nop,sackOK,nop,wscale 7], length 0
07:19:46.947769 IP server.30312 > localmachine.53949: Flags [S.], seq 1245200353, ack 2408213895, win 28200, options [mss 1410,nop,nop,sackOK,nop,wscale 7], length 0

It looks like the SYN-ACK from the server never reaches the client: the client keeps retransmitting its SYN and the server keeps retransmitting its SYN-ACK until the connection timeout is reached. This is about where my knowledge taps out, and I'm not sure what to do with this information. What could cause this type of issue, and what else should I look at?
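With many connections in one capture, picking out the failed handshakes by eye gets tedious. The sketch below groups tcpdump text lines like the ones above by client/server pair and counts SYNs, SYN-ACKs, and bare ACKs from the client. It assumes the default one-line tcpdump text format shown in the capture; the function names and the wording of the classification labels are made up for illustration.

```python
import re
from collections import defaultdict

# Matches e.g. "07:19:29.952730 IP a.1234 > b.80: Flags [S], ..."
LINE_RE = re.compile(r"^(\S+) IP (\S+) > (\S+): Flags \[([^\]]+)\]")


def summarize_handshakes(lines):
    """Count handshake packets per (client, server) pair."""
    flows = defaultdict(lambda: {"syn": 0, "synack": 0, "ack": 0})
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue
        _timestamp, src, dst, flags = m.groups()
        if flags == "S":            # SYN: the sender is the client
            flows[(src, dst)]["syn"] += 1
        elif flags == "S.":         # SYN-ACK: the sender is the server
            flows[(dst, src)]["synack"] += 1
        elif flags == "." and (src, dst) in flows:
            flows[(src, dst)]["ack"] += 1   # bare ACK from a known client
    return dict(flows)


def classify(counts):
    """Rough label for one flow, based only on what the capture saw."""
    if counts["ack"] > 0:
        return "completed"
    if counts["synack"] > 0 and counts["syn"] > 1:
        # The client retransmitted its SYN even though a SYN-ACK was sent.
        return "syn-ack apparently not reaching client"
    if counts["synack"] > 0:
        return "no final ack seen"
    return "no syn-ack seen"
```

Fed the capture above, this reports 3 SYNs, 7 SYN-ACKs, and no client ACK for the single flow, which matches the reading that the server's replies are being lost somewhere on the way back.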

amnesia

Posted 2018-08-23T12:05:19.530

Looks like an asymmetric routing or duplicate client IP issue. Can you observe a failed connection and a working connection at the same time in your tcpdump? Or are all the failing connections grouped in the same time intervals? Running traceroute on the server may help. tcpdump -e might also help by showing MAC addresses. – Gohu – 2018-08-23T12:26:29.560

@Gohu - Yes, the above capture was taken with 2 connections at the same time, one failed and one didn't. I can spin up 100 connections and about 10 will fail and the other 90 will be fine. – amnesia – 2018-08-23T12:47:59.840

No answers