
I have a server with around 100 active SSH tunnel connections from client servers across Canada and the US. Every client server is the same type of device, loaded with the same custom build of Ubuntu. Recently, I have tried to set up some new client servers, and they get a connection timeout when attempting to connect to the main server.

Here are some of the important debug steps I have taken and their results:

  1. The client server is receiving a timeout when attempting to connect to the main server even though it can ping the server.
  2. When trying to telnet to port 22, the connection times out instead of receiving the SSH acknowledgement
  3. I can SSH into any other machine from that client server except the main server
  4. Other machines can SSH into the main server, even on the same IP address as the client servers
  5. Each client server has the exact same OS build as the other client servers
  6. There are around 100 active connections from other client servers currently deployed using the same configuration, but only these new ones are experiencing the problem
  7. I have increased the maximum number of SSH connection attempts (MaxStartups) to 2000 and the maximum number of TCP socket connections (net.core.somaxconn) to 65535, and this has not improved the situation
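
The telnet check in step 2 can be scripted so it is easy to repeat from several client servers. The following is only a minimal sketch, not part of the original troubleshooting: the function name `probe_tcp` and its return strings are made up for illustration. It separates a silent timeout (packets dropped somewhere) from an active refusal (nothing listening) and from a normal SSH banner.

```python
import socket


def probe_tcp(host, port, timeout=5.0):
    """Try one TCP connection and describe what happened.

    Returns "timeout", "refused", "connected, no banner",
    "banner: <text>" or "error: <message>".
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(timeout)
            try:
                # An SSH server sends its version string right after the handshake.
                banner = sock.recv(256)
            except socket.timeout:
                return "connected, no banner"
            return "banner: " + banner.decode("ascii", errors="replace").strip()
    except socket.timeout:
        # No reply at all: the SYN or the reply to it is being dropped.
        return "timeout"
    except ConnectionRefusedError:
        # The host answered with a reset: nothing is listening on that port.
        return "refused"
    except OSError as exc:
        return "error: " + str(exc)


# Example call (hypothetical host name):
# print(probe_tcp("main-server.example.com", 22))
```

In the situation described above, the affected client servers would be expected to report `timeout`, while other machines would report a banner.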

I am stuck and need to figure out why this is happening. Any help would be appreciated. Thanks!

TopDogg25
  • First step should be to perform a packet capture on the server to determine if the problematic client's ssh attempts are actually even making it to the server. I must say, though, that this sounds like something you should consider using IPsec for instead of SSH for secure transit. – EEAA Jul 21 '14 at 17:35
  • Run `mtr` between the clients in question and the host. See if there is a large amount of packet loss. Also check network cable integrity on the client end (if `ifconfig` shows errors, this usually means the cable is bad.) – Michael Martinez Jul 21 '14 at 18:11
  • @MichaelMartinez Ran a 100 count mtr and everything looks normal: 1% packet loss at 1 hop and decent latency. ifconfig looks good from client and server side. – TopDogg25 Jul 21 '14 at 18:58
  • check the dns settings on the clients in question. make sure forward and reverse lookups are working to resolve the client hostnames and the server hostname. – Michael Martinez Jul 21 '14 at 21:37
  • ssh needs to be able to do both a forward and reverse dns/ip lookup in order to function properly. Otherwise you see connection timeouts. – Michael Martinez Jul 21 '14 at 22:25

1 Answer


After a lot of investigation and Google searches, I was able to find the root cause and ultimately a fix. After ruling out networking and DNS issues, I was only left with the protocol. Since ping worked and telnet to port 22 did not, I knew it couldn't be a port issue. After testing traffic with both UDP and TCP, it turned out that TCP was the only protocol that was having the issue.

I ran tcpdump to check the packets that were being exchanged and noticed right away that only the initial SYN packet was being sent from the client to the server; the SYN-ACK was never returned. That still did not reveal the root cause.
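
Spotting a SYN with no reply in a long capture can be automated. The sketch below is an illustration, not what was actually run: it assumes the capture was saved as text from `tcpdump -n` in its usual one-line-per-packet format (`IP src.port > dst.port: Flags [S], ...`), and the function name `unanswered_syns` is hypothetical.

```python
import re

# Matches e.g. "IP 10.0.0.5.51234 > 192.0.2.10.22: Flags [S], seq ..."
_PACKET = re.compile(r"IP6? (\S+) > (\S+): Flags \[([^\]]+)\]")


def unanswered_syns(capture_text):
    """Return sorted (source, destination) pairs whose SYN never got a SYN-ACK."""
    pending = set()
    for line in capture_text.splitlines():
        match = _PACKET.search(line)
        if not match:
            continue
        src, dst, flags = match.groups()
        if flags == "S":
            # Plain SYN: connection attempt from src to dst.
            pending.add((src, dst))
        elif flags == "S.":
            # SYN-ACK travels the other way, so it answers (dst, src).
            pending.discard((dst, src))
    return sorted(pending)
```

Retransmitted SYNs collapse into a single entry, so each pair in the result is one connection attempt that the server never acknowledged in the capture.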

By running netstat -s before and after attempting multiple ssh connects over a few trials, I found that the only counter that stood out was "Passive connection rejected because of time stamp". I found this article (in Japanese) that was related to this issue and suggested a relation with tcp_tw_recycle in a NAT environment. (With tcp_tw_recycle enabled, the server checks TCP timestamps per remote IP address, so when several hosts share one NAT address, SYNs whose timestamps appear to go backwards are silently dropped.) The conclusion was to disable tcp_tw_recycle. The consequence was that the number of open TCP connections doubled, but the issue was resolved. This ServerFault answer discusses its ramifications in detail.
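
Comparing two `netstat -s` snapshots by eye is tedious, so here is a rough sketch of how the before/after comparison could be scripted. It is an assumption-laden illustration: the helper names are hypothetical, it only understands lines of the form `<count> <description>` or `<Name>: <count>`, and counters with identical descriptions in different sections would overwrite each other.

```python
import re

_COUNT_FIRST = re.compile(r"^\s*(\d+)\s+(.+?)\s*$")   # "    3 passive connections ..."
_NAME_FIRST = re.compile(r"^\s*([^:]+):\s*(\d+)\s*$")  # "    TCPSackShiftFallback: 7"


def parse_counters(text):
    """Map counter description -> value for the lines this sketch understands."""
    counters = {}
    for line in text.splitlines():
        match = _COUNT_FIRST.match(line)
        if match:
            counters[match.group(2)] = int(match.group(1))
            continue
        match = _NAME_FIRST.match(line)
        if match:
            counters[match.group(1).strip()] = int(match.group(2))
    return counters


def changed_counters(before_text, after_text):
    """Return {description: increase} for counters that differ between snapshots."""
    before = parse_counters(before_text)
    after = parse_counters(after_text)
    return {
        name: value - before.get(name, 0)
        for name, value in after.items()
        if value != before.get(name, 0)
    }
```

Save the output of `netstat -s` to a file before and after a batch of failed ssh attempts from an affected client, then pass both texts to `changed_counters`. On a busy server many counters move, so repeat over a few trials and look for the one that rises only when the failing clients connect.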

Hopefully this answer will prove useful to someone else who ends up dealing with this edge case. Also, does anybody have any additional suggestions / warnings related to this solution?

TopDogg25
  • I know this is almost 3 years old, but I'm facing what looks like the same issue. The problem only happens with my home internet connection. When I'm at work, I can keep a connection to server A open for hours without it timing out. But when I'm at home, the connection times out after only 5 minutes or so. – Sebastian Jan 22 '17 at 19:35