
There's some tricky behavior I can't explain. I have a virtual machine running Ubuntu 20.04, docker 19.03.6 and a redis container, hosted on a Windows 2019 Hyper-V machine.

There's a second virtual machine (same network but different physical server) running W2k19 and a redis-client connecting to the redis instance.

Due to bad configuration, from time to time redis overwhelms the Ubuntu machine, using too much memory and producing thousands of *connection timed out* exceptions in the redis-client.

When this happens, all connections between the two machines stop working. If I try to connect via ssh from the W2k19 machine to the Ubuntu machine, or to telnet from the same machine to any port, I get a *connection timed out*.

It's as if something on the Linux machine auto-banned the IP address of the W2k19 machine. From any other machine I can connect via ssh, telnet and so on.

  • ufw is turned off
  • We don't have fail2ban installed
  • iptables is configured with all ports open

But we still can't connect. We reproduced the behavior on another machine, a second VM with W2k19 and the same redis-client.

What we found would re-establish the connections between those machines was restarting the ssh service on the Ubuntu machine combined with rebooting the W2k19 machine.

Just a single `sudo service sshd restart` is not enough, and just a reboot of the W2k19 machine is not enough. I can't figure out what's going on, and we cannot accept restarting the ssh service and rebooting the machine as a standard procedure in these cases.

But so far we have not been able to figure out what rule or configuration is blocking the connections. It probably has something to do with the ssh service, since restarting it does help restore the connections, but how?

And why does restarting the ssh service (and rebooting the W2k19 machine) actually unblock the connection to the redis port 6379?

UPDATE: I tried tcpdump on the Ubuntu machine and see no traffic from the other VM. I configured network mirroring for the Ubuntu machine and analyzed the traffic with Wireshark: no traffic from the other VM there either. I disabled firewalls everywhere (Ubuntu VM, client VM, Hyper-V hosts) while analyzing the traffic.
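
Roughly, I was filtering for anything at all from the client VM, something like this (the interface name and the client address are placeholders):

```
# On the Ubuntu VM: show anything arriving from the client VM
# (eth0 and 192.0.2.50 stand in for the real interface and client IP)
sudo tcpdump -ni eth0 host 192.0.2.50

# Narrowed down to the redis and ssh ports
sudo tcpdump -ni eth0 host 192.0.2.50 and '(port 6379 or port 22)'
```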

Something is blocking the traffic before it reaches the VM, but I can't figure out what.

Max Favilli
  • Please keep a professional tone in ServerFault and avoid swear words in the question title. – Tilman Schmidt Jun 10 '20 at 22:57
  • I suggest that you try restarting the Windows services using SSH connectivity, and determine if your SSH connectivity resumes on the Ubuntu box. You may simply have left too many connections "open" and the Ubuntu can't open more connections until the 900 second timer expires on each session. – Bee Kay Jun 17 '20 at 21:32
  • Check the file `/etc/security/limits.conf` on your server. – paladin Apr 13 '22 at 12:25

1 Answer


Connection timed out means that the initial TCP SYN got no response whatsoever within the connection timeout: the client has received neither a SYN/ACK, nor an RST, nor an ICMP error -- nothing.

This can happen for many reasons. Let's break them down broadly, by the stages of the TCP handshake.

Malfunction 1: the initial SYN wasn't delivered to the server machine.

Malfunction 2: the server machine has received the SYN, but took too long to accept() the connection request.

Malfunction 3: the SYN/ACK response wasn't delivered to the client machine.

Malfunction 4: the last ACK, and all resends of it, have been lost. (This might give a different error, but I'm unsure.)
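
A quick way to tell these apart is to watch the handshake on both ends at once; a rough sketch (the interface name is a placeholder):

```
# On the Ubuntu server: do SYNs from the client arrive, and do SYN/ACKs leave?
sudo tcpdump -ni eth0 'tcp[tcpflags] & (tcp-syn|tcp-ack) != 0 and (port 6379 or port 22)'

# On the W2k19 client: run Wireshark with the same port filter and check
# whether the SYN goes out and whether a SYN/ACK ever comes back.
#
# Interpretation:
#   SYN never shows up on the server                  -> Malfunction 1
#   SYN arrives, SYN/ACK leaves, client never sees it -> Malfunction 3
#   SYN arrives, no SYN/ACK ever leaves               -> points at Malfunction 2
```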


This part gives me a hunch:

… from time to time redis overwhelms the Ubuntu machine, using too much memory …

The Linux OOM killer is a delicate topic; unless you configure the crap out of it, it'll usually prefer to just hang userspace instead of killing anything. (Don't ask me why; I still don't know. It's easier to configure it than to get at the ultimate reasons why.)
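
A minimal sketch of the kind of configuration meant here (the exact values are assumptions, not recommendations):

```
# Did the OOM killer actually fire during the incident?
dmesg -T | grep -iE 'out of memory|oom-killer'

# Redis itself recommends allowing memory overcommit:
sudo sysctl vm.overcommit_memory=1

# Make sshd essentially immune to the OOM killer, so you can still log in:
echo -1000 | sudo tee /proc/"$(pgrep -o sshd)"/oom_score_adj
```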

So let me offer you an OOM test: when the issue reproduces, can you ping the server machine? Can you also ssh into it? The likely outcome, yes to ping but no to ssh, would indicate Malfunction 2.

This is typical of OOM'd machines: the kernel is still live and happy, and responds to pings as if nothing happened. But note: unlike ping, establishing a TCP connection requires the userspace server program (e.g. redis, or sshd) to actively call accept() on the about-to-open connection. Under OOM conditions this takes ages, as the programs sit there waiting for their memory allocation requests to be fulfilled.
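
If you still have any console on the box (e.g. the Hyper-V console), this state is visible directly in the listen queues; a sketch:

```
# For LISTEN sockets, Recv-Q is the number of completed connections
# waiting for the application to accept(); Send-Q is the backlog limit.
ss -ltn '( sport = :6379 or sport = :22 )'

# Listen-queue overflows and dropped SYNs also show up in the kernel counters:
nstat -az | grep -i listen
```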

An OOM test outcome of "no to ping, no to ssh" would indicate that it's not Malfunction 2; I'd guess some sort of bridging/virtualization stuff going haywire.

Running redis in docker complicates this further. Docker has its own memory accounting logic (see `--memory` and friends). It also has to tweak iptables rules for container networking to function.
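
For illustration, a sketch of capping both the container and redis itself (the image tag, container name and limits are assumptions):

```
# Cap the container so it can't starve the host:
docker run -d --name redis \
  --memory 2g --memory-swap 2g \
  redis:6 redis-server --maxmemory 1536mb --maxmemory-policy allkeys-lru

# And have a look at the iptables chains docker manages:
sudo iptables -L DOCKER -n -v
sudo iptables -t nat -L DOCKER -n -v
```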


If this doesn't help: please give more details on the networking setup here, including virtualized networks. I feel like I already need a diagram to correctly count your VMs.

ulidtko
  • Thanks for your answer. The memory problem was due to a misconfiguration of redis. Once fixed, we didn't have the memory problem anymore. But I could still reproduce the no-connection issue just by rebooting the Linux VM. Ping works, ssh does not. But from other machines everything works. It seems something in between is blocking the connections from the client machine, which tried too many times to connect to redis, failing to do so while the server was rebooting. And see my update to the question: I now believe it's outside Linux; my guess is the issue has to do with Hyper-V/Windows. – Max Favilli Jun 17 '20 at 01:24
  • Alright; do any IP addresses change after reboot? Do you have NAT somewhere in the client-server path? NAT is almost always stateful (in e.g. hyper-v, or really anywhere), and stale state of it could maybe cause such symptoms. I'd try reconfiguring all the interim networks to bridging, just to eliminate the NAT hypothesis. – ulidtko Jun 17 '20 at 09:24
  • No IP change after reboot. The funny thing is that if we actually change the IP address of either the redis client VM or the Linux VM after the reboot, it works. So it's network related. But all VMs and all hosts are on the same network, and the virtual switch is not configured with NAT. – Max Favilli Jun 18 '20 at 08:41
  • Definitely a network virtualization issue. Perhaps try employing an encrypted tunnel through the virtnet (using e.g. `socat` TLS or somesuch): this might prevent the hypervisor from peeping into the connections too much. – ulidtko Jun 18 '20 at 10:03
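
For reference, such a tunnel might look roughly like this with `socat` (the certificate, hostname and extra port are placeholders; on the Windows side this assumes socat is available, e.g. via Cygwin or WSL):

```
# On the Ubuntu side: accept TLS on 6390 and forward to the local redis
socat openssl-listen:6390,reuseaddr,fork,cert=server.pem,verify=0 tcp:127.0.0.1:6379

# On the client side: expose a plain local 6379 that tunnels over TLS
socat tcp-listen:6379,bind=127.0.0.1,reuseaddr,fork openssl:ubuntu-vm:6390,verify=0
```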