Debugging "clogged" TCP connections

I'm having trouble with an internet connection that seems to randomly "freeze" arbitrary TCP connections after they have been idle for a while. The connections stay established, but no data gets through.

When this happens, netstat still shows the connection status as ESTABLISHED on both the local computer:

Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name Timer
tcp        0     53 192.168.0.10:41129      173.255.235.238:143     ESTABLISHED 8219/gnutls-cli  on (79.31/13/0)

...and on the remote server:

Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name Timer
tcp        0      0 173.255.235.238:143     68.5.174.98:41129       ESTABLISHED 5303/imapd       off (0.00/0/0)

However, it seems that no data at all is transferred. If I run strace on the local and remote process, both just show a repeating sequence of select calls (with different fds of course), e.g.

select(6, [0 5], NULL, NULL, {0, 50000}) = 0 (Timeout)
select(6, [0 5], NULL, NULL, {0, 50000}) = 0 (Timeout)
select(6, [0 5], NULL, NULL, {0, 50000}) = 0 (Timeout)

The internet connection overall does not seem to be affected: I can still establish new connections to the same service on the same server without any problems. However, the affected local applications seem to be unaware of the problem and just hang.

About 10 minutes after the attempted transmission on the local end, the connection on the remote end disappears from the netstat output (I wasn't able to catch any intermediate state), but it still stays ESTABLISHED on the local end.

Finally, after some more minutes, the local application aborts with a timeout and disappears from the local netstat output as well.

When I look at a packet capture of this connection on the client side, there is a long (expected) period of inactivity that seems to trigger the problem. The local end then tries to transmit some data again but never receives an ACK. Instead, 15 TCP retransmissions go out, with intervals increasing from 0.3 seconds to 120 seconds. No activity is captured after that.
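To put the retransmission pattern into numbers, here is a rough sketch of the schedule. It assumes that the interval simply doubles after each attempt and is capped at 120 seconds, which matches what the capture appears to show but is my assumption rather than something I measured exactly. The function name and its parameters are made up for illustration.

```python
def retransmission_intervals(initial_rto=0.3, max_rto=120.0, retries=15):
    """Return the assumed wait before each retransmission, in seconds.

    Assumption: the wait doubles after every attempt and is capped at max_rto.
    """
    intervals = []
    rto = initial_rto
    for _ in range(retries):
        intervals.append(min(rto, max_rto))
        rto *= 2
    return intervals


intervals = retransmission_intervals()
total = sum(intervals)

for attempt, wait in enumerate(intervals, start=1):
    print(f"retransmission {attempt:2d} after {wait:6.1f} s")
print(f"sum of intervals: {total:.1f} s")
```

Under these assumptions the intervals add up to roughly 873 seconds, i.e. about 14.5 minutes, which is in the same range as the delay I see before the local application finally gives up.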

Does anyone have a suggestion of how I could debug this further to find out where the problem lies and how to fix it?

Additionally, and/or as a temporary workaround: is there some way to globally reduce the timeout on the client and/or server, to shorten the time before the local application aborts?
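For reference, this is a minimal sketch of how the current global values could be read, assuming both machines run Linux and expose the usual TCP settings under /proc/sys/net/ipv4. The helper name `read_tcp_settings` is made up, and I don't know yet whether changing any of these values is the right fix.

```python
from pathlib import Path

# Assumption: Linux sysctl files; on other systems these will not exist.
SYSCTL_DIR = Path("/proc/sys/net/ipv4")
SETTINGS = (
    "tcp_retries2",          # retransmissions before an established connection is dropped
    "tcp_keepalive_time",    # idle seconds before the first keepalive probe
    "tcp_keepalive_intvl",   # seconds between keepalive probes
    "tcp_keepalive_probes",  # unanswered probes before the connection is dropped
)


def read_tcp_settings(base=SYSCTL_DIR, names=SETTINGS):
    """Return a dict of setting name -> int value, or None if unreadable."""
    values = {}
    for name in names:
        try:
            values[name] = int((base / name).read_text().strip())
        except (OSError, ValueError):
            values[name] = None
    return values


if __name__ == "__main__":
    for name, value in read_tcp_settings().items():
        print(f"{name} = {value}")
```

As far as I understand, tcp_retries2 is the setting that limits the retransmissions seen in the capture, while the keepalive values only apply to sockets that have SO_KEEPALIVE enabled.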

Nikratio

Posted 2012-12-14T03:42:26.967

Answers