I've been attempting to troubleshoot a network issue that presents with a very high rate of TCP retransmits. Across 36 samples (taken with Wireshark 1.10.8 running on 32-bit Windows 7), totaling a little over seven hours and ranging from 2 to 53 minutes each, retransmits occupy between 43 and 61 percent of the total ingress bandwidth.
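To be explicit about what a figure like "percent of ingress bandwidth" means, here is a minimal sketch of the calculation. It assumes each captured packet has been exported as a (frame length in bytes, retransmission flag) pair; the input format and the function name `retransmit_share` are hypothetical, chosen for illustration, and are not something Wireshark produces in this form.

```python
def retransmit_share(packets):
    """Return the percentage of captured bytes carried by retransmitted packets.

    packets: iterable of (frame_length_bytes, is_retransmission) tuples.
    This input format is an assumption made for the example.
    """
    total_bytes = 0
    retransmit_bytes = 0
    for length, is_retransmission in packets:
        total_bytes += length
        if is_retransmission:
            retransmit_bytes += length
    if total_bytes == 0:
        return 0.0  # empty capture: nothing to report
    return 100.0 * retransmit_bytes / total_bytes


# Made-up sample data, only to show the call.
sample = [(1500, False), (1500, True), (60, False), (1500, True)]
print(f"{retransmit_share(sample):.1f}% of bytes were retransmits")
```

Weighting by bytes rather than by packet count matters here, because small ACK-only frames would otherwise dilute the figure.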
What's confusing me is that, as far as I know, there are only two reasons for this sort of issue: a flaky link that drops packets, and congestion. I believe that I have ruled both out. Let me lay out our situation; I would love to hear from people more knowledgeable than I am about other directions of inquiry that might resolve the problem.
The network in question is aboard a ship at sea. It uses a satellite link to communicate with the Internet. Unfortunately, the bandwidth costs for this type of link are prodigious, so we're stuck with a 1Mbps down / 512kbps up connection. Being a satellite link, it has ping times of about 650ms. At the moment, we have about 300 people aboard, all sharing that pipe.
The network consists of two VLANs (one for ship's computers, and the other for guests). Both VLANs are piped into a SonicWall TZ 215 (running SonicOS Enhanced 5.8.1.2-6o) which controls the pipe to the Internet. Both VLANs have wired and wireless clients. The wired network is run by a series of Cisco 2900 gigabit switches. The wireless network is provided by numerous Cisco APs (signal propagation in a steel ship at sea is terrible).
My first thought was that it was a congestion issue, so I pursued various solutions to this (blocking high-bandwidth services like video chatting and streaming, bugging the corporate office to pay for a bigger pipe, etc.). Sadly, we didn't get a bigger pipe. The other things helped a little bit, but not enough to make a real difference.
But this weekend I was put back to square one. The captain asked me to disable guest access to the Internet during a drill. I took that opportunity to take a Wireshark capture of the network when it wasn't congested. To my surprise, that 10-minute sample showed a TCP retransmit rate nearly identical to that of all the other captures: 58%. Over the duration of that sample, the average bandwidth usage was 98kbps, so the link was definitely not congested.
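For scale, the arithmetic behind "not congested" can be sketched as follows. It assumes the 98kbps average is measured against the 1Mbps downlink and that 1Mbps means 1000kbps; the function name `utilization_percent` is just illustrative.

```python
def utilization_percent(observed_kbps, capacity_kbps):
    """Return average link utilization as a percentage of capacity."""
    if capacity_kbps <= 0:
        raise ValueError("capacity must be positive")
    return 100.0 * observed_kbps / capacity_kbps


# Figures from the drill capture: 98 kbps average on a 1 Mbps downlink
# (taking 1 Mbps = 1000 kbps, which is an assumption).
print(f"{utilization_percent(98, 1000):.1f}% average utilization")
```

That works out to roughly a tenth of the downlink. It is an average over the whole 10 minutes, so it says nothing about short bursts within the sample.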
This leaves just packet loss as a likely cause. To test this, I ran twelve hours of pings. At the end, the program reported less than 1% packet loss.
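To show why the ping result seems to rule out plain packet loss, here is a rough sketch under a deliberately simple model. It assumes every transmission is lost independently with the same probability, that TCP segments see the same loss rate as the pings, and that retransmissions happen only to replace lost segments. None of those assumptions has been verified on this link, and the function name is hypothetical.

```python
def expected_retransmit_fraction(loss_rate):
    """Fraction of all transmissions that are retransmissions.

    Simplified model (an assumption, not a measurement): each transmission
    is lost independently with probability loss_rate, and a segment is
    resent until one copy gets through.
    """
    if not 0.0 <= loss_rate < 1.0:
        raise ValueError("loss_rate must be in [0, 1)")
    # Expected number of sends per segment (geometric distribution).
    transmissions_per_segment = 1.0 / (1.0 - loss_rate)
    # Every send after the first one is a retransmission.
    retransmissions_per_segment = transmissions_per_segment - 1.0
    return retransmissions_per_segment / transmissions_per_segment


for p in (0.01, 0.58):
    share = expected_retransmit_fraction(p)
    print(f"loss {p:.0%} -> expected retransmit share {share:.1%}")
```

Under this model the retransmit share simply equals the loss rate, so 1% loss would account for about 1% retransmits, and a 58% share would need a loss rate of about 58%, far beyond what the pings reported.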
Which leaves... what? I don't know. Any additional ideas would be most appreciated.