Finding cause of TCP retransmission within a LAN

Question

Hello denizens of Server Fault

I have an irritating problem with a LAN of about 100 computers, 2 Windows domain servers, and 12 VoIP phones. Since their installation around a year ago, every week or so, we notice a VoIP phone resetting itself, occasionally in the middle of a call. At the same time there are often signs of temporary loss of connection on computers: freezes in Explorer while accessing network shares, and errors in our administration software due to loss of connection to the database server.

I have been doing some Wireshark monitoring on the connection between the VoIP PBX and the rest of the network. Wireshark picks up a clump of retransmitted TCP packets at the times when we record phone restarts. The Wireshark log shows about 2 clusters of retransmissions a day ranging from 5 packets to hundreds. Those in each cluster are mainly between the PBX and some set of the VoIP phones, but not always the same set. Often retransmissions at the same time are to phones connected to the same switch, but sometimes retransmissions occur together to phones at opposite ends of the network. There are usually some coincident retransmissions in passing TCP traffic, for example between client machines and the file servers.

The spikes in retransmissions and phone resets do not correlate well with when the network is heavily loaded. They seem to occur slightly more during the day, but most in the evening, when traffic should be decreasing. They occur reasonably often late at night when most computers are turned off and traffic should be lowest.

Do you have any ideas that might help diagnose the cause of problems like this? One thing I have not yet tried, but should have, is updating the firmware of all the switches.

What model are the switches? How do the processor, memory, etc. stats look? Are you on one broadcast domain? How close to max throughput are you seeing on the network? – Zypher, May 20 '10 at 21:51
All the switches are 3Com: Baseline 2924 - PWR Plus (3CBLSG24PWR) x 2, 4200 (3C17304A) x 3, 4200 (3C17304) x 2, 2824-SPF Plus (3C16487), 2250 plus (3C16476CS). I don't think they give stats on processor or memory, but I'd be very pleased to learn otherwise. Yes we are on one broadcast domain. I don't know about throughput, I will look into measuring it. — Surreal, May 21 '10 at 15:28

score 18 · Accepted Answer · answered May 20 '10 at 23:02

TCP retransmissions are usually due to network congestion. Look for a large number of broadcast packets at the time the issue occurs. If the percentage of broadcast traffic in your capture is above about 3% of the total traffic captured, then you definitely have congestion. Look for both link layer (ARP) and network layer (name resolution) broadcasts on the network. If you find a high volume of broadcast traffic, you can trace it to the source from the capture data.
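
As a rough sketch of how the broadcast percentage could be measured from a saved capture, the Python below reads a classic libpcap file and reports the share of frames addressed to `ff:ff:ff:ff:ff:ff` in each time window. It assumes an Ethernet capture saved in the classic pcap format (not pcapng) with microsecond timestamps. The function names and the 10-minute default window are illustrative choices, not anything taken from Wireshark.

```python
import struct
from collections import defaultdict

BROADCAST_MAC = b"\xff" * 6


def read_pcap_frames(data):
    """Yield (timestamp_seconds, frame_bytes) from the bytes of a classic pcap file."""
    magic = data[:4]
    if magic == b"\xd4\xc3\xb2\xa1":
        endian = "<"  # file written on a little-endian machine
    elif magic == b"\xa1\xb2\xc3\xd4":
        endian = ">"  # file written on a big-endian machine
    else:
        raise ValueError("not a classic pcap file (pcapng is not handled here)")
    offset = 24  # skip the 24-byte global header
    while offset + 16 <= len(data):
        # Per-packet header: seconds, microseconds, captured length, original length
        ts_sec, ts_usec, incl_len, _orig_len = struct.unpack_from(
            endian + "IIII", data, offset
        )
        offset += 16
        frame = data[offset:offset + incl_len]
        offset += incl_len
        yield ts_sec + ts_usec / 1e6, frame


def broadcast_share(frames, window=600):
    """Return {window_start: percent of frames sent to the Ethernet broadcast address}."""
    total = defaultdict(int)
    bcast = defaultdict(int)
    for ts, frame in frames:
        bucket = int(ts // window) * window
        total[bucket] += 1
        if frame[:6] == BROADCAST_MAC:  # destination MAC is the first 6 bytes
            bcast[bucket] += 1
    return {b: 100.0 * bcast[b] / total[b] for b in sorted(total)}
```

To use it, read the capture file in binary mode and pass the bytes through `broadcast_share(read_pcap_frames(data))`. Windows whose share climbs past the 3% figure mentioned above are the ones to compare against the times of the retransmission clusters.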

joeqwerty

In addition, the TCP retransmissions are not the cause of your problem, they're a symptom of the problem. – joeqwerty May 21 '10 at 00:42
I should have mentioned that I had a look at the UDP broadcasts and they did not correlate with the retransmissions. A few of the retransmission events coincide with spikes in UDP broadcasts, but most do not. I have had another look and found that UDP broadcasts do not exceed 1.5% of traffic (about 350 packets) in any 10 minute time segment, and reaching that level is rare. However I had not looked at ethernet broadcasts. I am running a script now to filter all my wireshark logs. Is the 3% rule of thumb for UDP broadcasts and ethernet broadcasts individually or combined? – Surreal May 21 '10 at 14:39
The 3% is not really a rule of thumb. It's what I've been told and what I've seen in my own environment. I've heard numbers ranging from 10 to 20% but I've found that once it exceeds 3 to 5% it's usually causing problems. You need to look at all broadcast traffic: ethernet, network, and multicast broadcasts, as they can all cause congestion. Basically any traffic that is broadcast to all switch ports is traffic that needs to be analyzed and reduced or eliminated. – joeqwerty May 21 '10 at 15:14
I still have not got a pretty graph together to check for a good correlation over a long period, but ethernet broadcasts are looking quite promising. One log where there was retransmission had just above 3% broadcasts, another about 6%. I have found one problem at least: an old server is putting out a constant stream of gratuitous ARP packets. – Surreal May 21 '10 at 21:18
Hmmm.... OK. Sounds promising. Another tool you can use that has a nice analysis feature is Colasoft Capsa. It will show you if there are broadcast storms on the network. Sorry I didn't mention it earlier. You can download a time-limited demo and run it to see what its analysis reports. – joeqwerty May 21 '10 at 21:37
I found the excessive ARP entries using the Wireshark filter of `arp` - and to see the broadcast ones only, using a filter of `eth.addr==ff:ff:ff:ff:ff:ff` – mlhDev Jun 22 '15 at 21:01
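
For tracing a stream of gratuitous ARP packets back to its source, here is a minimal Python sketch that counts them per sender MAC address. It assumes you already have the raw Ethernet frames as `bytes` objects, untagged (no 802.1Q VLAN header) and carrying IPv4 ARP. It treats an ARP packet as gratuitous when the sender and target IP addresses match. The function name is hypothetical.

```python
from collections import Counter


def gratuitous_arp_senders(frames):
    """Return [(sender_mac, count)] for gratuitous ARP frames, busiest sender first."""
    counts = Counter()
    for frame in frames:
        # 14-byte Ethernet header + 28-byte IPv4 ARP payload = 42 bytes minimum;
        # EtherType 0x0806 marks ARP.
        if len(frame) < 42 or frame[12:14] != b"\x08\x06":
            continue
        sender_mac = frame[22:28]  # ARP sender hardware address
        sender_ip = frame[28:32]   # ARP sender protocol address
        target_ip = frame[38:42]   # ARP target protocol address
        if sender_ip == target_ip:
            counts[sender_mac.hex(":")] += 1
    return counts.most_common()
```

A single MAC address with a count far above the rest is the likely culprit, like the old server mentioned above. `bytes.hex(":")` needs Python 3.8 or later.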

score 2 · Answer 2 · answered May 21 '10 at 01:04

Gathering traffic statistics for your switches may show you have periods where you are running at or near capacity. This can lead to retries when responses don't come back within the initial timeout (often 3 seconds). This increases congestion momentarily until congestion mitigation mechanisms kick in.
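
One way to turn switch statistics into a utilization figure, sketched in Python: take two readings of a port's octet counter and divide the bits transferred by what the link could have carried in that interval. This assumes the switches expose per-port octet counters (for example the SNMP `ifInOctets` / `ifOutOctets` values, which are 32-bit and wrap around), and that may not hold for every model in this network. The function name and parameters are illustrative.

```python
def link_utilization(octets_start, octets_end, interval_s, link_bps, counter_bits=32):
    """Return percent utilization of a link between two octet-counter readings.

    The modulo handles a single wrap of the counter between the two readings.
    """
    if interval_s <= 0 or link_bps <= 0:
        raise ValueError("interval and link speed must be positive")
    delta_octets = (octets_end - octets_start) % (2 ** counter_bits)
    return 100.0 * delta_octets * 8 / (interval_s * link_bps)


# Example: 375,000,000 octets in 60 seconds on a 100 Mbit/s port
print(link_utilization(0, 375_000_000, 60, 100_000_000))  # 50.0
```

Keep in mind that this is an average over the polling interval, so short bursts that fill the link for a second or two will be hidden unless the interval is short.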

Look for people using streaming media, as that can soak up bandwidth quickly.

You may be able to mitigate the problem for the phones by traffic shaping. This will just move the problem to other users.

score 2 · Answer 3 · answered May 21 '10 at 01:09

Sounds like a spanning tree loop or a broadcast storm to me, especially if the retransmissions and the issues are localized to the same switch (which differs). When it happens, what are the port states on your L2 device? Probably a bad switch or bad root bridge priorities? Interesting problem.

McJeff

Thank you for prompting me to read up on spanning trees, about which I am embarrassingly ignorant. However I do not think it could be a spanning tree loop, because we do not have any redundant links in our network (possibly a problem in itself). By "port states on your L2 device", am I right that you mean which ports the switches have enabled as a result of the spanning tree algorithm? We have not manually configured a root bridge. Would it be a good idea to do so? – Surreal May 21 '10 at 14:54
Getting familiar with STP is a good idea, but if you are sure that you don't have any redundant links, then STP won't be the issue. – joeqwerty May 21 '10 at 18:01
Yeah, if you don't have redundant links, it wouldn't be a problem. By port states, yes, I mean which are forward/blocked/learning. – McJeff May 21 '10 at 20:22

score 2 · Answer 4 · answered Apr 13 '12 at 03:28

You have probably solved this by now, since it has been so long, but essentially you need to enable "port fast" on the ports that have endpoints (VoIP phones, workstations, servers). A phone can send PDUs, so if that guy reboots it will cause an STP convergence to occur, thus causing the FDB table to be flushed and all devices to go through the 4/5 step STP fun. By putting ports with endpoints in "port fast", they skip the waiting and go right to forwarding mode.
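
To put numbers on the waiting that "port fast" skips, here is a small Python sketch of the state timeline for a port coming up. It assumes classic 802.1D spanning tree with the default forward delay of 15 seconds. Rapid spanning tree behaves differently, and the setting's real name varies by vendor, so `edge_port` here is just a hypothetical flag.

```python
def stp_port_timeline(forward_delay=15, edge_port=False):
    """Return [(state, seconds_spent)] for a port coming up under classic 802.1D STP."""
    if edge_port:
        # An edge ("port fast") port goes straight to forwarding.
        return [("forwarding", 0)]
    return [
        ("listening", forward_delay),
        ("learning", forward_delay),
        ("forwarding", 0),
    ]


def seconds_until_forwarding(timeline):
    """Total time spent in the states before the port forwards traffic."""
    return sum(seconds for _state, seconds in timeline)


print(seconds_until_forwarding(stp_port_timeline()))                # 30
print(seconds_until_forwarding(stp_port_timeline(edge_port=True)))  # 0
```

With the defaults, an endpoint port spends 30 seconds in listening and learning before it passes any traffic, which is long enough for a phone or a file share connection to give up.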

score 1 · Answer 5 · answered May 20 '10 at 23:18

Hopefully your phones are on a different subnet and VLAN from the other computers?

Greg Askew

No they are on the same IP subnet, and I am pretty sure the same VLAN too. Is this a serious problem? It certainly sounds like it would be a good idea. I can see it would separate the broadcast domains for phones and everything else. Would it have any other advantages? – Surreal May 21 '10 at 14:57
Yes I would definitely put the phones on a dedicated VLAN. – Greg Askew May 21 '10 at 17:12

hookenz · Answer 6 · 2010-05-23T23:34:19.243

1

It could also be a faulty piece of equipment, such as a faulty switch. Do the retransmissions correlate with phones/computers on one particular switch or part of the network?

Just to extend my answer a little. Not all switches are created equal, even if they have the same specs. Some are able to cope with much higher load than others because they have faster processors inside. It could be that your switches are not quite up to grade.

I'd start by putting some of your most troublesome VoIP phones onto their own physical switch and see whether the resets on those continue. If they go away then you're on the road to solving it very soon.

edited May 23 '10 at 23:34

answered May 20 '10 at 23:34

hookenz

I wish they did. Most of the problems do seem to be with devices connected to two switches, which are at opposite ends of the network. However there are significant retransmissions to phones in other parts of the network as well. – Surreal May 21 '10 at 15:07
