
I have a Linux server (Ubuntu 16.04) where every service seems to be fine, except that it sometimes answers TCP connections very slowly (e.g. 10-20 seconds) or not at all.

The server is not under load, and this happens across all TCP services (HTTP, SMTP, VoIP). Most connections are answered quickly, and the slowdowns seem to occur at random.

My guess is that this is inside the network stack.

Any idea how to debug this?

I captured two TCP dumps.

Not working:

15:10:07.281993 IP p4FC4B365.dip0.t-ipconnect.de.3237 > hetzner3.1740: Flags [S], seq 3664811831, win 8192, options [mss 1452,nop,wscale 8,nop,nop,sackOK], length 0
15:10:10.281742 IP p4FC4B365.dip0.t-ipconnect.de.3237 > hetzner3.1740: Flags [S], seq 3664811831, win 8192, options [mss 1452,nop,wscale 8,nop,nop,sackOK], length 0
15:10:16.282033 IP p4FC4B365.dip0.t-ipconnect.de.3237 > hetzner3.1740: Flags [S], seq 3664811831, win 8192, options [mss 1452,nop,nop,sackOK], length 0

Working:

15:11:01.929326 IP p4FC4B365.dip0.t-ipconnect.de.3238 > hetzner3.1740: Flags [S], seq 513688945, win 8192, options [mss 1452,nop,wscale 8,nop,nop,sackOK], length 0
15:11:04.931110 IP p4FC4B365.dip0.t-ipconnect.de.3238 > hetzner3.1740: Flags [S], seq 513688945, win 8192, options [mss 1452,nop,wscale 8,nop,nop,sackOK], length 0
15:11:10.930925 IP p4FC4B365.dip0.t-ipconnect.de.3238 > hetzner3.1740: Flags [S], seq 513688945, win 8192, options [mss 1452,nop,nop,sackOK], length 0
15:11:10.930964 IP hetzner3.1740 > p4FC4B365.dip0.t-ipconnect.de.3238: Flags [S.], seq 4087654018, ack 513688946, win 29200, options [mss 1460,nop,nop,sackOK], length 0
15:11:10.960346 IP p4FC4B365.dip0.t-ipconnect.de.3238 > hetzner3.1740: Flags [.], ack 1, win 65340, length 0
15:11:10.971341 IP p4FC4B365.dip0.t-ipconnect.de.3238 > hetzner3.1740: Flags [P.], seq 1:513, ack 1, win 65340, length 512
15:11:10.971371 IP hetzner3.1740 > p4FC4B365.dip0.t-ipconnect.de.3238: Flags [.], ack 513, win 30016, length 0
15:11:10.971377 IP p4FC4B365.dip0.t-ipconnect.de.3238 > hetzner3.1740: Flags [P.], seq 513:627, ack 1, win 65340, length 114
15:11:10.971388 IP hetzner3.1740 > p4FC4B365.dip0.t-ipconnect.de.3238: Flags [.], ack 627, win 30016, length 0
15:11:10.974736 IP hetzner3.1740 > p4FC4B365.dip0.t-ipconnect.de.3238: Flags [P.], seq 1:129, ack 627, win 30016, length 128
15:11:10.975473 IP hetzner3.1740 > p4FC4B365.dip0.t-ipconnect.de.3238: Flags [P.], seq 129:281, ack 627, win 30016, length 152
15:11:11.006089 IP p4FC4B365.dip0.t-ipconnect.de.3238 > hetzner3.1740: Flags [.], ack 281, win 65060, length 0
Gene Vincent
  • I bet these services are trying to do a reverse DNS lookup on the calling IP address. Check what's happening on UDP port 53 at the same time. Perhaps your DNS access goes away? – Alastair McCormack Apr 22 '17 at 13:17
  • And to avoid your tcpdump provoking the DNS lookups, run it with `-n`. – meuh Apr 22 '17 at 18:18
  • I thought of DNS lookups, too, but that's not the reason. It also happens with services that don't use DNS at all. – Gene Vincent Apr 22 '17 at 19:10

1 Answer

I don't have enough reputation to ask a question, so this is more of a comment/question than an answer.

The successful trace shows that the server responds after the window scale option is removed from the SYN.
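To make that observation easier to check against other captures, here is a minimal sketch that reads tcpdump text lines, keeps only the pure SYNs, and reports how long after the first SYN each one arrived and whether it carried the wscale option. The function name `summarize_syns` and the regular expression are my own and assume the exact tcpdump output format shown in the question; the sample lines are copied from the working trace.

```python
import re
from datetime import datetime

# Matches timestamp, flags and TCP options of a tcpdump text line
# (assumes the output format shown in the question).
LINE_RE = re.compile(
    r"^(?P<ts>\d{2}:\d{2}:\d{2}\.\d+) IP (?P<src>\S+) > (?P<dst>\S+): "
    r"Flags \[(?P<flags>[^\]]+)\].*?options \[(?P<opts>[^\]]*)\]"
)

def summarize_syns(lines):
    """Return (seconds_since_first_syn, has_wscale) for every pure SYN line."""
    result = []
    first = None
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m or m.group("flags") != "S":
            continue  # skip SYN/ACK, ACK, data and unparsable lines
        ts = datetime.strptime(m.group("ts"), "%H:%M:%S.%f")
        if first is None:
            first = ts
        has_wscale = "wscale" in m.group("opts")
        result.append((round((ts - first).total_seconds(), 3), has_wscale))
    return result

# First four lines of the working trace from the question.
trace = [
    "15:11:01.929326 IP p4FC4B365.dip0.t-ipconnect.de.3238 > hetzner3.1740: Flags [S], seq 513688945, win 8192, options [mss 1452,nop,wscale 8,nop,nop,sackOK], length 0",
    "15:11:04.931110 IP p4FC4B365.dip0.t-ipconnect.de.3238 > hetzner3.1740: Flags [S], seq 513688945, win 8192, options [mss 1452,nop,wscale 8,nop,nop,sackOK], length 0",
    "15:11:10.930925 IP p4FC4B365.dip0.t-ipconnect.de.3238 > hetzner3.1740: Flags [S], seq 513688945, win 8192, options [mss 1452,nop,nop,sackOK], length 0",
    "15:11:10.930964 IP hetzner3.1740 > p4FC4B365.dip0.t-ipconnect.de.3238: Flags [S.], seq 4087654018, ack 513688946, win 29200, options [mss 1460,nop,nop,sackOK], length 0",
]

for offset, has_wscale in summarize_syns(trace):
    print(f"+{offset:.3f}s  wscale={'yes' if has_wscale else 'no'}")
```

Running the same function over a longer capture would show whether the SYN that finally gets answered is always the one without wscale, or whether that is a coincidence of this one trace.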

Do you have examples where the server responds after the first SYN? Do they have the window scale option?
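One way to collect such examples is to time many connection attempts from the client side. The sketch below is only an illustration: `time_connects` is a name I made up, and the host and port in the usage comment are placeholders. In your traces the client retries the SYN roughly 3 and 9 seconds after the first one, so connect times clustering near 0, 3 and 9 seconds would suggest that individual SYNs are being dropped rather than the service being slow.

```python
import socket
import time

def time_connects(host, port, attempts=20, timeout=30.0, pause=1.0):
    """Return a list of TCP connect durations in seconds (None on failure)."""
    durations = []
    for i in range(attempts):
        start = time.monotonic()
        try:
            # Only the three-way handshake is timed; no data is sent.
            with socket.create_connection((host, port), timeout=timeout):
                durations.append(time.monotonic() - start)
        except OSError:
            durations.append(None)  # refused, timed out, or unreachable
        if i + 1 < attempts:
            time.sleep(pause)
    return durations

# Example with a placeholder host name (replace with the real server):
#   for d in time_connects("server.example", 1740):
#       print("failed" if d is None else f"{d:.3f}s")
```

Running tcpdump with `-n` on the server at the same time lets you match each slow attempt to the SYNs that were actually received.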

Does the server have window scaling enabled? What does `sysctl -a | grep scal` show?
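If you want to check just the relevant settings rather than grep through everything, a small sketch like the following reads them directly from /proc/sys. The helper names `read_sysctl` and `read_all` are my own, and mapping a key to a path by replacing dots with slashes is an assumption that holds for these three keys on Linux.

```python
from pathlib import Path

# TCP settings that are relevant to this question.
KEYS = [
    "net.ipv4.tcp_window_scaling",
    "net.ipv4.tcp_timestamps",
    "net.ipv4.tcp_adv_win_scale",
]

def read_sysctl(key, root="/proc/sys"):
    """Return the value of a sysctl key as a string, or None if unreadable."""
    path = Path(root) / key.replace(".", "/")
    try:
        return path.read_text().strip()
    except OSError:
        return None  # key missing, not Linux, or no permission

def read_all(keys=KEYS, root="/proc/sys"):
    """Return a dict mapping each key to its current value (or None)."""
    return {key: read_sysctl(key, root) for key in keys}

for key, value in read_all().items():
    print(f"{key} = {value if value is not None else '(not available)'}")
```

A value of 1 for `net.ipv4.tcp_window_scaling` means scaling is enabled, 0 means it is disabled.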

EDIT: This problem/solution sounds similar to yours: Why would a server not send a SYN/ACK packet in response to a SYN packet

Jeff S.
  • I don't have a different working example. I get "net.ipv4.tcp_adv_win_scale = 1". What does that tell you? – Gene Vincent Apr 22 '17 at 22:13
  • Sorry, I meant `sysctl -a | grep scal` (without the e) to see if the server has scaling set to 1 or 0. That said, I just found this, which sounds similar: https://serverfault.com/questions/235965/why-would-a-server-not-send-a-syn-ack-packet-in-response-to-a-syn-packet – Jeff S. Apr 22 '17 at 22:22
  • I think scaling is on: kernel.sched_tunable_scaling = 1, net.ipv4.tcp_adv_win_scale = 1, net.ipv4.tcp_window_scaling = 1 – Gene Vincent Apr 22 '17 at 22:31
  • Timestamps are also on: net.ipv4.tcp_timestamps = 1 – Gene Vincent Apr 22 '17 at 22:34
  • Try disabling timestamps first to see if that solves your problem: `sysctl -w net.ipv4.tcp_timestamps=0`. If it doesn't, it sounds like you can disable scaling, but that can reduce throughput by limiting the TCP window to 64 KB. – Jeff S. Apr 22 '17 at 22:49