We have a problem where a number of clients (all running Ubuntu Linux) are sometimes unable to connect to a remote server over SSH. While the problem is occurring, Windows clients are not affected and can connect just fine.
I found this other question with a similar problem: Why would a server not send a SYN/ACK packet in response to a SYN packet
Disabling TCP Timestamping on the server does indeed solve the problem, but I would like to know what the real problem is. I don't really see why this should cause any problems, definitely not when establishing the connection.
In Wireshark, I see that the Windows clients use a window size of 8192, whereas the Linux clients use a window size of 29200. The Windows clients receive a SYN/ACK; the Linux clients don't. Is it possible that this higher initial window size is what keeps the server from sending the SYN/ACK? I can't come up with a sensible explanation for why it would cause this problem, but since it is the only difference visible to me, it does look that way. Am I missing something?
*** EDIT After more searching, thinking and some voodoo magic, I think I might have come up with a plausible explanation. It does take some assumptions and specific conditions to be in place, but I do believe that these might just be possible in this particular situation.
Both users are behind the same NAT device (in our case, a Fortigate firewall). This firewall assigns a local port on its external interface/IP to each NAT'ed connection. If a port is already in use by another user, it is skipped. When a connection is closed, the port is released and returned to the NAT pool. If that port is then assigned to the other user while the server still has some record of the old connection (TIME_WAIT, final FIN/ACK not received), and the timestamp of the new packet is lower than that of the previous connection, the packet will be silently discarded.
OK, there are a lot of ifs in there, but:
- The two users are developing on the same website, so they make a lot of connections to the same remote server.
- The firewall (Fortigate) apparently keeps a sequential counter of the NAT port per source IP/destination IP/destination port. If the counters of both users are close to each other, such a "collision" between two connections to that server is not that unlikely, given that both the destination IP and the destination port are the same.

That would explain why the problem only occurs sporadically.
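To make the theory concrete, here is a minimal Python sketch of the behaviour I am assuming on the server. It is a toy model, not the kernel's actual logic: the `Server` class, the `on_syn` and `close` methods, the address, the port and the timestamp values are all made up for illustration, and timestamp wraparound is ignored.

```python
from dataclasses import dataclass


@dataclass
class TimeWaitEntry:
    """What the server is assumed to remember about a closed connection."""
    last_tsval: int  # last TCP timestamp value seen from that peer


class Server:
    """Toy model of the hypothesised server-side check (an assumption)."""

    def __init__(self):
        # Keyed by (source IP, source port) as the server sees it, i.e. after NAT.
        self.time_wait = {}

    def close(self, src, tsval):
        # Remember the last timestamp seen on the connection that just closed.
        self.time_wait[src] = TimeWaitEntry(last_tsval=tsval)

    def on_syn(self, src, tsval):
        """Return 'SYN/ACK' or 'dropped' for an incoming SYN."""
        entry = self.time_wait.get(src)
        if entry is not None and tsval < entry.last_tsval:
            # Timestamp appears to go backwards: silently discard the SYN.
            return "dropped"
        self.time_wait.pop(src, None)
        return "SYN/ACK"


NAT_IP = "203.0.113.10"  # hypothetical external address of the firewall

server = Server()

# User A connects through NAT port 40001, then closes the connection.
r1 = server.on_syn((NAT_IP, 40001), 900_000)
server.close((NAT_IP, 40001), 900_500)

# User B's timestamp clock is lower, and the firewall reuses port 40001.
r2 = server.on_syn((NAT_IP, 40001), 120_000)

# The same user on a port the server has no record of is accepted.
r3 = server.on_syn((NAT_IP, 40002), 120_000)

print(r1, r2, r3)
```

In this model only the connection that reuses the NAT port of a recently closed connection, with a lower timestamp, is dropped, which matches the sporadic failures we see.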
The only problem with this theory is that I can't find any evidence of this happening on the server side. There are no connections stuck in TIME_WAIT or anything like that, and I assume that once they disappear from the netstat output, the server has forgotten about them.
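For anyone who wants to repeat that check without netstat, the following Python sketch lists IPv4 sockets in TIME_WAIT by parsing the text of `/proc/net/tcp`, where state `06` means TIME_WAIT. The helper names and the `SAMPLE` text are invented for illustration (the sample is not captured from our server), and the address decoding assumes a little-endian host.

```python
import socket

TCP_TIME_WAIT = 0x06  # state code for TIME_WAIT in /proc/net/tcp


def decode_addr(addr):
    """Decode 'HEXIP:HEXPORT' as printed by /proc/net/tcp (little-endian host)."""
    ip_hex, port_hex = addr.split(":")
    ip = socket.inet_ntoa(bytes.fromhex(ip_hex)[::-1])
    return ip, int(port_hex, 16)


def time_wait_entries(proc_net_tcp_text):
    """Return (local, remote) address pairs for sockets in TIME_WAIT."""
    entries = []
    for line in proc_net_tcp_text.splitlines():
        fields = line.split()
        # Data rows start with a slot number such as '0:'; skip everything else.
        if len(fields) < 4 or not fields[0].endswith(":"):
            continue
        if int(fields[3], 16) == TCP_TIME_WAIT:
            entries.append((decode_addr(fields[1]), decode_addr(fields[2])))
    return entries


# Made-up sample in the /proc/net/tcp layout: one listener, one TIME_WAIT socket.
SAMPLE = """
  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode
   0: 00000000:0016 00000000:0000 0A 00000000:00000000 00:00000000 00000000     0        0 1234
   1: 0A00000A:0016 0A7100CB:9C41 06 00000000:00000000 03:00000A00 00000000     0        0 0
"""

found = time_wait_entries(SAMPLE)
print(found)
```

On the server itself, the same function could be fed `open("/proc/net/tcp").read()`. IPv6 sockets are listed separately in `/proc/net/tcp6` and are not handled here.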
I do believe that the initial window size does not play a role in this, so I am striking that one off the list of suspects.