Troubleshooting High TCP Retransmit Rate

Question

I've been attempting to troubleshoot a network issue that presents with a very high rate of TCP retransmits. 36 samples (taken with Wireshark 1.10.8 running on 32-bit Windows 7), totaling a little over seven hours and ranging between 2 and 53 minutes each, show retransmits occupying between 43 and 61 percent of the total ingress bandwidth.

What's confusing me is that as far as I know, there are only two reasons for this sort of issue: a flaky link that drops packets, and congestion. I believe that I have ruled these out. Let me lay out our situation, and I would love to hear from people more knowledgeable than myself on other directions of inquiry to resolve the problem.

The network in question is aboard a ship at sea. It uses a satellite link to communicate with the Internet. Unfortunately, the bandwidth costs for this type of link are prodigious, so we're stuck with a 1Mbps down / 512kbps up connection. Being a satellite link, it runs about 650ms ping times. At the moment, we have about 300 people aboard, all sharing that pipe.

The network consists of two VLANs (one for ship's computers, and the other for guests). Both VLANs are piped into a SonicWall TZ 215 (running SonicOS Enhanced 5.8.1.2-6o) which controls the pipe to the Internet. Both VLANs have wired and wireless clients. The wired network is run by a series of Cisco 2900 gigabit switches. The wireless network is provided by numerous Cisco APs (signal propagation in a steel ship at sea is terrible).

My first thought was that it was a congestion issue, so I pursued various solutions to this (blocking high-bandwidth services like video chat and streaming, bugging the corporate office to pay for a bigger pipe, etc.). Sadly, we didn't get a bigger pipe. The other things helped a little bit, but not enough to make a real difference.

But this weekend I was put back to square one. The captain asked me to disable guest access to the Internet during a drill. I took that opportunity to take a Wireshark capture of the network when it wasn't congested. To my surprise, that 10-minute sample showed a TCP retransmit rate nearly identical to all the other captures: 58%. Over the duration of that sample, the average bandwidth usage was 98kbps, so it was definitely not congested.

This leaves just packet loss as a likely cause. To test this, I ran twelve hours of pings. At the end, the program reported less than 1% packet loss.

Which leaves... what? I don't know. Any additional ideas would be most appreciated.

What version of tcpip.sys do you have on the client (and target host if Windows)? – Greg Askew Aug 06 '14 at 13:08
The ping test was performed from a Windows 7 (32-bit) machine with tcpip.sys version 6.1.7601.22525 (11/26/13). The target of the ping was Google's DNS server, 8.8.8.8. I don't know what OS it runs, but I'm guessing it's not Windows. – Erick Brown Aug 06 '14 at 13:15
Forgetting the Google ping test, what is the MTU of the link to the hosts where you are experiencing retransmits? Also, can you provide the output of the command: netsh interface tcp show global – Greg Askew Aug 06 '14 at 13:22
There are hundreds of clients on the network, all of them suffering from the same network issues. The mix includes Windows XP/7/8/Server 2k8, MacOS, iOS, and Android. That being said, the typical MTU that I've seen from the Wireshark captures is 1514. – Erick Brown Aug 06 '14 at 13:27
A standard MTU on a LAN is 1500. If you subtract 28 from that, you should be able to ping a remote host with ping -f -l 1472 n.n.n.n with no timeouts or DF warnings. I have to say, though, that with "hundreds of clients" sharing a 1Mbit down/512 kbps up connection I would expect to see a lot of retransmits. – Greg Askew Aug 06 '14 at 13:42
I completely agree regarding the number of clients and the link size. But since we see the same retransmit rate when the link is running at 10% utilization, we know that there's something else going on too. I like your idea of beefing up the ping size to help it catch drops. – Erick Brown Aug 06 '14 at 13:45
Actually, the point of using the largest payload size is to determine the actual MTU, not what the client OS thinks the MTU is. If you get a timeout or a DF warning, drop the payload size until it works. That is the actual MTU. – Greg Askew Aug 06 '14 at 13:51
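
The "drop the payload size until it works" procedure from the comments above can be automated. The following is only a rough sketch, assuming the stock ping on Windows (-f -l) or Linux (-M do -s); the helper names ping_with_df, largest_payload and path_mtu are made up for this illustration, and a reply is detected by looking for "ttl=" in the output, which is an assumption about the output format. On a lossy link a single dropped ping will make the result too low, so repeat the run a few times before trusting it.

```python
import platform
import subprocess

IP_ICMP_OVERHEAD = 28  # 20-byte IPv4 header + 8-byte ICMP header


def ping_with_df(host, payload):
    """Send one ping with Don't Fragment set; True if an echo reply arrived.

    Assumption: stock ping on Windows (-f -l) or Linux (-M do -s), and an
    echo reply line that contains "TTL=" / "ttl=".
    """
    if platform.system() == "Windows":
        cmd = ["ping", "-n", "1", "-f", "-l", str(payload), host]
    else:
        cmd = ["ping", "-c", "1", "-M", "do", "-s", str(payload), host]
    result = subprocess.run(cmd, capture_output=True, text=True)
    return "ttl=" in result.stdout.lower()


def largest_payload(probe, low=0, high=1472):
    """Binary-search the largest payload for which probe(payload) is True."""
    if not probe(low):
        return None  # even the smallest ping fails: host unreachable
    while low < high:
        mid = (low + high + 1) // 2
        if probe(mid):
            low = mid       # mid fits, look for something larger
        else:
            high = mid - 1  # mid is too big
    return low


def path_mtu(probe, high=1472):
    """Largest working payload plus the 28 bytes of IP and ICMP headers."""
    payload = largest_payload(probe, 0, high)
    return None if payload is None else payload + IP_ICMP_OVERHEAD
```

Usage would look like path_mtu(lambda size: ping_with_df("8.8.8.8", size)). The binary search needs about a dozen pings instead of stepping down one byte at a time, which matters on a 650ms link.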

score 1 · Answer 1 · answered Sep 12 '20 at 00:22

"To test this, I ran twelve hours of pings. At the end, the program reported less than 1% packet loss."

Ping uses ICMP (Internet Control Message Protocol) packets. ICMP is intended to ensure traffic flows (i.e. telling other machines how to route traffic), so devices MUST prioritize ICMP over other packet types.

In other words, this is the worst possible way to detect congestion.

symcbean


+1 for helpful info that "it's the worst way to detect congestion". The answer would be even more helpful, if you could suggest a few good ways to detect congestion in this situation. – LMSingh Oct 19 '21 at 00:11

score 1 · Answer 2 · answered Aug 06 '14 at 12:59

Check everything before your network. That is, the satellite link is flaky. It could be anything at the physical level on that side: bad calibration, whatever.

As per the Sherlock Holmes approach, that is the only thing left. Packets are lost because they are LOST.

TomTom


If the satellite link was flaky, wouldn't 12 hours of pings have shown more than 1% packet loss? Or am I thinking about that the wrong way? – Erick Brown Aug 06 '14 at 13:04
Yes. Ping packets are very small, for example. Ping replies may also be faked. I'm just saying: if you eliminate everything that is not it, then what remains has to be it. Everything before your first Ethernet cable is all that is left, and that is the satellite link. – TomTom Aug 06 '14 at 13:11
That's a good point. Pings are small. An intermittent problem would be hundreds of times more likely to manifest in packets at the full MTU than in a ping. – Erick Brown Aug 06 '14 at 13:16
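
How much more exposed a full-size frame is than a ping can be estimated with a simple model. The sketch below assumes independent random bit errors, which real satellite links do not strictly follow (errors tend to come in bursts), and the bit error rate of 1e-6 is a made-up figure for illustration only. The 74-byte frame corresponds to a default Windows ping (32-byte payload plus ICMP, IP and Ethernet headers); the 1514-byte frame is the full-size frame seen in the captures.

```python
def frame_loss_probability(bit_error_rate, frame_bytes):
    """Chance that at least one bit of the frame is corrupted.

    Assumes every bit fails independently with the same probability.
    """
    bits = 8 * frame_bytes
    return 1.0 - (1.0 - bit_error_rate) ** bits


HYPOTHETICAL_BER = 1e-6  # illustrative value, not measured on this link

ping_loss = frame_loss_probability(HYPOTHETICAL_BER, 74)
full_loss = frame_loss_probability(HYPOTHETICAL_BER, 1514)

print(f"74-byte ping frame:   {ping_loss:.4%}")
print(f"1514-byte full frame: {full_loss:.4%}")
print(f"ratio: {full_loss / ping_loss:.1f}x")
```

Under this model the ratio tracks the ratio of frame sizes, roughly 20 to 1 rather than hundreds to 1, but the direction of the argument holds: small pings understate the loss that full-size TCP segments experience.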

score 1 · Answer 3 · answered Aug 05 '15 at 18:20

One good way to detect loss is by using a UDP stream of packets (there are various tools that do this, mainly for QoS testing). You can vary size, frequency, delay. It should show you if you have actual loss.
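
As a rough illustration of the idea (not a replacement for a dedicated QoS tool), a numbered UDP stream can be built with the standard socket module. Everything here is an assumption made for the sketch: the function names, the 4-byte sequence header, and the default size and interval, which work out to about 56kbps so the test itself does not saturate the uplink. The receiver must run on a host you control on the far side of the satellite link, and it must be started shortly before the sender because it gives up after idle_timeout seconds of silence.

```python
import socket
import struct
import time

HEADER = struct.Struct("!I")  # 4-byte big-endian sequence number


def send_probes(host, port, count=1000, size=1400, interval=0.2):
    """Send `count` numbered UDP datagrams of `size` bytes each."""
    padding = b"\x00" * (size - HEADER.size)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        for seq in range(count):
            sock.sendto(HEADER.pack(seq) + padding, (host, port))
            time.sleep(interval)


def receive_probes(port, idle_timeout=5.0):
    """Collect sequence numbers until nothing arrives for idle_timeout s."""
    seen = set()
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.bind(("", port))
        sock.settimeout(idle_timeout)
        while True:
            try:
                data, _ = sock.recvfrom(65535)
            except socket.timeout:
                break
            if len(data) >= HEADER.size:
                seen.add(HEADER.unpack_from(data)[0])
    return seen


def loss_percent(seen, sent_count):
    """Percentage of the sent sequence numbers that never arrived."""
    arrived = len(seen & set(range(sent_count)))
    return 100.0 * (sent_count - arrived) / sent_count
```

Running it at several values of size (for example 64, 512 and 1400 bytes) shows whether loss grows with packet size, which a default ping test cannot reveal.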

Tom


score 0 · Answer 4 · answered Jul 29 '20 at 07:08

I experienced something similar with SonicWall in the past. Check that the MTU is the same on both sides. Cisco has an MTU of 1500 while SonicWall has 1492, so every full-size packet is broken into two. See: https://www.sonicwall.com/support/knowledge-base/set-mtu-in-vpn-environment-in-case-of-throughput-issues/170705131319789/
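
To see why a 1500 versus 1492 mismatch turns each full-size packet into two, here is a small sketch of IPv4 fragmentation arithmetic. It assumes a plain 20-byte IPv4 header with no options, and fragment_sizes is a name invented for this example. Fragment payloads other than the last must be a multiple of 8 bytes.

```python
IPV4_HEADER = 20  # bytes, assuming no IP options


def fragment_sizes(packet_len, mtu, header=IPV4_HEADER):
    """Return the on-wire size of each IPv4 fragment for one packet."""
    if packet_len <= mtu:
        return [packet_len]  # fits as-is, no fragmentation
    payload = packet_len - header
    # Every fragment except the last carries a multiple of 8 payload bytes.
    max_chunk = (mtu - header) // 8 * 8
    sizes = []
    while payload > 0:
        chunk = min(payload, max_chunk)
        sizes.append(chunk + header)
        payload -= chunk
    return sizes


print(fragment_sizes(1500, 1492))  # [1492, 28]
print(fragment_sizes(1492, 1492))  # [1492]
```

This only applies when the Don't Fragment bit is clear. With DF set, which is common for TCP, the oversized packet is dropped instead and the sender depends on receiving an ICMP "fragmentation needed" message to shrink its segments.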

score 0 · Answer 5 · answered Sep 11 '20 at 18:07

Agree with the ping test: set the DF bit and see where your MTU caps out. Is the traffic encapsulated? I'd imagine so over the satellite link, which will reduce the MTU further. I shed a tear when I read the number of users for 1Mbps; upload saturation will make it worse. I know there's little you can do, but with the way pages are designed these days you're fighting a losing battle. We tried to restrict a public wireless service to 256kbps per client and the experience wasn't usable. I can't even begin to imagine loading a page with a 56k modem today, and yours is contended 15 to 1.
