
I have a really strange one.

I have packet loss with excessive 'TCP Dup ACK' and 'TCP Fast Retransmission' errors when I download files (and only download) from two different Windows 2008 servers. Upload speed is fine.

This ONLY occurs if the client computers (Win7) are connected at 100 Mb/s. At 1 Gb/s there are no errors and I get full speed. If I set the client NIC to 100 Mb/s, I get a lot of 'TCP Dup' errors and the download speed drops to around 2-5 MB/s. Upload speed is 10 MB/s or above.

This only happens with the Windows 2008 Server boxes (both Dell, but different hardware). The problem does not occur when I transfer between the Win7 clients and the Linux servers.

It's as if Server 2008 is unable to scale the TCP window properly, overloads the switch or something, and then pauses traffic for a bit.

Parts of the network run at 100Mb/s due to older equipment, so this is really causing a problem in some buildings.

I have uploaded a pcap file from the client here. https://dl.dropboxusercontent.com/u/24907255/slow.pcap.gz

It shows a 50MB file being written to the server, then read back from the server with the errors.

Thanks for any help. I am stumped.


11/28/13 More Information.

I shut down the entire network so that only one client and one server are on it. No change in the problem.

If I set every interface (server, client, and the Cisco 2960 switch ports) to 100 Mb/s full duplex, the problem goes away. If I set the server and switch interfaces to auto or 1 Gb/s, the problem comes back.

If I bypass the Cisco with a Netgear 10/100 switch and set both client and server to auto, I have no problems.

I did discover this: in the normal setup, with the server-to-switch link at 1 Gb/s, if I plug the Netgear 10/100 switch in between the client and the Cisco switch, my speed problem gets even worse. Speeds drop from 5-7 MB/s to 2-3 MB/s, and yes, I have tried both fixed and auto network speeds. This would explain why buildings that sit two switch hops away from the main Cisco switch have more of a speed problem.

On to pinging. With everything at 1 Gb/s, I can ping with a near-maximum ICMP payload (ping -l 65500) and it works. With the client at 100 Mb/s, the largest payload I can ping is 17752; anything more fails, to the Windows servers only, with no problem on the Linux boxes. With the Netgear 10/100 between the server and client, pinging at 65500 works fine.
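The ping sizes line up with IP fragmentation: a single large echo request is split into a burst of back-to-back full-size frames that arrive at wire speed, which is exactly the kind of burst a 1 Gb/s-to-100 Mb/s step has to buffer. A rough sketch, assuming a standard 1500-byte MTU and option-less 20-byte IP headers:

```python
import math

MTU = 1500                      # standard Ethernet MTU (assumption)
IP_HEADER = 20                  # IPv4 header without options
ICMP_HEADER = 8                 # ICMP echo header
FRAG_PAYLOAD = MTU - IP_HEADER  # 1480 bytes of IP payload per fragment

def fragments_for_ping(payload_bytes):
    """How many IP fragments one `ping -l <payload>` request becomes."""
    total = payload_bytes + ICMP_HEADER  # the IP payload to fragment
    return math.ceil(total / FRAG_PAYLOAD)

print(fragments_for_ping(65500))  # 45 back-to-back frames per request
print(fragments_for_ping(17752))  # 12 frames -- the observed 100 Mb/s limit
```

So a 65500-byte ping is a burst of roughly 45 frames; if the 17752-byte limit is real, something in the path is giving up after buffering about a dozen.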


Update 3

I swapped in a PowerConnect 2748 switch. Same problem with the server at 1 Gb/s and the client at 100 Mb/s. I can ping over 17752 now though, which is strange. So I don't think it's the Cisco switch.


Update 4. I am trying to get some hard numbers using iperf. All systems are connected to the same switch, with the client set to 100 Mb/s and running iperf.exe -c <server> -u -b 10m, i.e. sending UDP to the server. One server is Windows 2008 with no load on it right now; the other is Ubuntu with a load average of 0.20.

At 10m

  • Linux: jitter 0.022 ms, packet loss 0/8505
  • Server 2008: jitter 1.859 ms, packet loss 68/8505

Pushing it to 100m

  • Linux: jitter 0.445 ms, packet loss 0/26634
  • Server 2008: jitter 0.542 ms, packet loss 94/26596

Now for stats sending TO the client at 10m

  • Linux: jitter 0.271 ms, 0/8500 (0%), 1 datagram received out-of-order
  • Server 2008: jitter 0.063 ms, 20/8505 (0.24%)

Pushing it to 100m

  • Linux: jitter 0.230 ms, 4083/85443 (4.8%), 1 datagram received out-of-order, 95.7 Mb/s
  • Server 2008: jitter 0.237 ms, 28174/81718 (47%), 51.1 Mb/s

So Server 2008 is poor in general, but you can see the huge packet loss (47%) when the connection is pushed to the client's 100 Mb/s limit.
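One plausible reading of the big loss figure: at a 1 Gb/s-in, 100 Mb/s-out speed step the switch has to absorb the rate difference in its buffer, and the 2960's shared packet buffer is small (roughly 384 KB, per a comment below). A back-of-the-envelope sketch; all figures here are rough assumptions, not measurements:

```python
LINE_IN = 1_000_000_000       # bits/s, server side
LINE_OUT = 100_000_000        # bits/s, client side
BUFFER_BITS = 384 * 1024 * 8  # ~384 KB shared buffer (rough figure)

fill_rate = LINE_IN - LINE_OUT              # 900 Mb/s of excess arrival
time_to_overflow = BUFFER_BITS / fill_rate  # seconds until drops begin

FRAME_BITS = 1518 * 8                       # max Ethernet frame incl. headers
frames_buffered = BUFFER_BITS // FRAME_BITS

print(f"buffer fills in {time_to_overflow * 1000:.1f} ms")  # ~3.5 ms
print(f"roughly {frames_buffered} full-size frames fit")    # 259
```

A sender that bursts at line rate can exhaust a buffer like that in a few milliseconds, after which every further frame in the burst is dropped, which would look exactly like periodic mass loss.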


Update 5.

When I tested with the PowerConnect 2748 switch, I used different Cat5 cables between the server and switch and between the client and switch. This should rule out cabling or switch issues.

I have two Windows 2008 servers in this environment, installed at different times and on different hardware. The only thing they share is a Broadcom-branded NIC, though the chipsets are different. Both experience the same problem, but I am doing my main testing on one so that, in case something goes wrong, the other will still work.

That server has an onboard BCM5709C with two ports, plus an add-on card (PCI Express, I think) also with the same BCM5709C chipset and two ports. I have tried all of them and the problem still exists. So this should rule out any hardware problems.


Update 6 12/3/13 I installed the Intel NIC. No change. I played around with the CTCP settings and no change there. I even turned off SMB2 and saw no difference.

I did some more testing at 100 Mb/s. Copying a 3GB ISO image TO the server (drag and drop) averages out at 10 MB/s. Copying the same 3GB ISO image FROM the server averages out at 6.3 MB/s.

With all network interfaces set to auto and running at 1 Gb/s: copying the ISO TO the server averages 101 MB/s; copying the ISO FROM the server averages 57 MB/s.

So read speeds from the server are almost half the write speeds.

Porch
  • I will be taking the entire network down over the holiday and I will do some more testing and get back to everyone. Please keep going with the ideas and I will test them all out. – Porch Nov 27 '13 at 03:16
  • What sort of switch do you have? Are the Windows servers the only ones at 100 Mb/s? That's pretty old now... are you not on gigabit yet? – hookenz Nov 27 '13 at 20:34
  • We have Cisco switches, 2900 series I think, as the core switches, all gigabit. Some of the buildings have different brands, from Dell to Summit. Some are gigabit, but many are not. The cabling is old, and long runs are replaced with fibre when budget allows. – Porch Nov 27 '13 at 21:46
  • I know this may seem dumb, but do you have the 'wake on lan' type of settings disabled for the NICs on both ends? One thing I don't see here is whether the switches are the culprit or not. It's possible you have something as simple as a faulty switch. Is it possible to swap out the switch just to check? The other thing I'm noting is that two servers are in the example here, but there are also challenges with the rest of the network. Is it possible you have some rogue hardware mucking things up for you? – Techie Joe Nov 28 '13 at 17:09
  • Wake-on-LAN is disabled on all systems. I do have another Cisco 2960 switch in a different building, but it's not an easy task to swap. If it was the switch, why would traffic be fine on the Linux servers? Also, I unplugged everything but one server and one client; same problem. – Porch Nov 29 '13 at 01:11
  • That's excellent troubleshooting that is. Write your own answer and I can give you another upvote? – ErikE Nov 29 '13 at 06:42
  • Just to be clear you have zero problems with transfers on the Linux systems, correct? If this is the case, can you swap out the nics and cables on the linux systems and put them on the Windows systems then try your transfer tests again? – Techie Joe Nov 29 '13 at 16:31
  • All network cards are embedded on the motherboards. But I am thinking about buying a cheap Intel NIC and installing it in one of the Windows servers. – Porch Nov 29 '13 at 18:05
  • At this point (based on the information in this thread) I would recommend trying this next (installing a NIC) as I'm thinking that the problem may be as simple as a bad nic or a bad cable. – Techie Joe Nov 29 '13 at 18:23
  • Your last published result (47% loss) makes one think deeply. I found out that the Cisco 2960 has a very (VERY!) small packet buffer - 384K shared by several ports. This is a real challenge for 1000-to-100 paths. Linux fortunately handles it correctly. Things may be even worse with aggressive flow control (CTCP), which is turned on by default on Win Server 2008 (http://www.speedguide.net/articles/w...08-tweaks-2574). Try turning it OFF using the tips in the link supplied. – Veniamin Nov 29 '13 at 19:30
  • @Veniamin Not discounting your post at all, but if he hasn't tested whether the NIC is bad before all this, one would think that should be ruled out first (along with swapping out or crimping a new cable). – Techie Joe Nov 29 '13 at 19:32
  • @TechieJoe I completely agree with you that it may help. But at the very beginning it was reported that 1000-to-1000 paths have no problem, which suggests the NIC/cable at least operate correctly. – Veniamin Nov 29 '13 at 19:48
  • I will play with the CTCP setting and see if that changes anything. See the addition above about NIC testing. – Porch Nov 29 '13 at 21:39
  • @Porch: I believe you may have found the answer, it just needs some background (see below). – ErikE Nov 29 '13 at 22:07

6 Answers


This sounds like a speed/duplex mismatch causing collisions and retransmits. Misconfiguration between the server and the other side could cause this. Another reason for the mismatch could be failing autonegotiation.

Make sure both ends of the connection are configured identically regarding speed and duplex.

Teun Vink
  • This is exactly right. @Porch is setting the "client NIC" port to 100Mbps but is failing to set the corresponding switch port on the other end. – Skyhawk Nov 23 '13 at 15:40
  • As a test case, I have both Windows and Linux servers plugged into the same Cisco switch with the client. With the client set to "auto" network speed, it runs at 1 Gb/s and all is fine. If I manually set it to 100 Mb/s, I lose packets when receiving data from the Windows 2008 box only. No problems with the Linux servers. This tells me it's not a switch port issue, unless I am wrong somehow? – Porch Nov 25 '13 at 18:10
  • I'm not 100% certain I remember this correctly, but have some very faint recollection from teaching Cisco switching/routing 10+ years ago that there is some obscure protocol negotiating link speed. Both ends are involved and that one nic/driver succeeds is not a guarantee for every other to succeed. In any case, what's the harm in hard-setting all interfaces involved in a test to 100Mbps? At worst you rule out a hypothesis. – ErikE Nov 25 '13 at 22:17
  • I set the server NIC to 100Mbps and everything went down the crapper. Very, very slow transfer rate. Less than 1 MB/s when moving files. More testing is needed. – Porch Nov 27 '13 at 03:17
  • Did you also set the corresponding switch interface to 100Mbps? – ErikE Nov 27 '13 at 04:33
  • I did not, but I logged in and verified that the Cisco auto set it to 100Mbps/full itself. – Porch Nov 27 '13 at 06:24
  • If I set everything, switch and servers to 100Mbs full, then I don't have a problem. But I can't leave it that way and limit our main fileserver to 100Mbs speed. – Porch Nov 29 '13 at 01:13

I believe you should investigate whether any of the NIC driver/Windows NDIS offload settings relate to your problem. I am most suspicious of the LSO (Large Send Offload) function, as I've seen it totally wreck a service (Dell server w. Broadcom NIC) in a manner which defied all troubleshooting book definitions of anything.

The actual effect of LSO, when it disrupts rather than enhances, is that the LSO engine may pass larger data frames than the switch supports. This causes the switch to silently discard those frames. Needless to say, this causes performance degradation and packet loss. The failure can be persistent, but can also be intermittent, making it tremendously difficult to troubleshoot. This is described in detail here: Large Send Offload and Network Performance
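For context on why LSO shows up as giant frames in a capture: the host hands the NIC one large buffer, and the NIC is supposed to cut it into MSS-sized wire segments itself. A minimal sketch; the 1460-byte MSS is a typical value, and the 64240-byte buffer size (which matches the capture in another answer) is used purely for illustration:

```python
MSS = 1460  # typical TCP MSS on a 1500-byte-MTU path (assumption)

def host_segments(send_bytes, mss=MSS):
    """Segments the host stack would emit itself without LSO."""
    return -(-send_bytes // mss)  # ceiling division

# With LSO enabled, the stack instead hands the NIC one oversized buffer;
# a 64240-byte "packet" in a capture is exactly 44 segments' worth.
print(host_segments(64240))  # 44
```

If something in that NIC-side segmentation misbehaves, the switch sees frames it cannot forward and drops them silently.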

Disclaimer: these are just best-effort thoughts on a possible angle on your problem. Implementing any of the changes below will disrupt your network communication. The computer should be restarted after applying any of the settings. I copy/paste the most interesting settings for reference, but the links contain all the hardcore info and caveats. I most strongly recommend using the official docs as the basis for change and treating this post at most as a checklist.

Before proceeding with any of this, back up your registry key of:

HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters

One uncool reason: an official bug, described below, changes some unrelated values when certain settings are set through the command line.

I freely admit that where settings are present in both the Windows NIC driver GUI and in Windows itself, I never really got clarity on whether one has to disable them both in the GUI and through Windows CMD/Registry, or whether one suffices. The blogs I've read which presented an answer have been inconsistent with regard to some minor detail or other, so I was never sure. Nowadays I attempt the change everywhere I find the option for whichever setting I'm focusing on. The GUI options are not presented here, but are described in the official docs.

Also, different NIC drivers for the same card may present varying granularity in the advanced settings in the GUI.

Disabling Task Offloading

This registry setting disables task offloading as defined in Using Registry Values to Enable and Disable Task Offloading.

HKLM\System\CurrentControlSet\Services\TCPIP\Parameters\DisableTaskOffload
Setting this value to one disables all of the task offloads from the TCP/IP
transport. Setting this value to zero enables all of the task offloads.

If the above setting has any effect you could try going granular as specified in the link. There are quite a number of settings governing this so I won't paste them all in.

I'll supply the LSO ones though:

HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\LsoV1IPv4
HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\LsoV2IPv4
HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\LsoV2IPv6

For all three: Enabled = 1(default). Disabled = 0.

Disabling connection offloading

As defined in Using Registry Values to Enable and Disable Connection Offloading.

HKLM\System\CurrentControlSet\Services\TCPIP\Parameters\TCPConnectionOffloadIPv4
Describes whether the device enabled or disabled the offload of TCP connections
over IPv4. Enabled = 1 (Default). Disabled = 0.

HKLM\System\CurrentControlSet\Services\TCPIP\Parameters\TCPConnectionOffloadIPv6
Describes whether the device enabled or disabled the offload of TCP connections
over IPv6. Enabled = 1 (Default). Disabled = 0.

Disabling TCP Chimney, TOE and TSO

As specified in How to Disable TCP Chimney, TCP/IP Offload Engine (TOE) or TCP Segmentation Offload (TSO) (note the Win2008 hotfix)

and in Information about the TCP Chimney Offload, Receive Side Scaling, and Network Direct Memory Access features in Windows Server 2008.

Windows 2008 Server:
If the operating system is Microsoft Windows Server 2008 (any version
including R2), run the following from a Command prompt:

1. netsh int tcp set global chimney=disabled
2. netsh int tcp set global rss=disabled
3. netsh int tcp set global netdma=disabled

Note: To display current global TCP settings, use the net shell command:
netsh int tcp show global

4. Restart the server.

Note: Microsoft has identified an issue running the netsh command to set global
TCP parameters on Windows Server 2008 and Vista machines.  Some global
parameters, such as TCPTimedWaitDelay, can be changed from their default or
manually set values to 0xffffffff.  Before running the above command, Symantec
recommends reviewing Microsoft KB Article 967224 (support.microsoft.com/kb/967224).
Upon completion of the above command's execution, Symantec also recommends
reviewing the TCP Parameters noted in the KB Article and applying the hotfix from
the article if needed.

The hotfix describes the issue thus:

After you run the command, the values of the following unrelated settings are
changed to 0xFFFFFFFF:
KeepAliveInterval
KeepAliveTime
TcpTimedWaitDelay

In addition, the "TcpMaxDataRetransmissions" are changed to 0xFF.

Again, one may therefore wish to back up the entire registry key before doing anything:

HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters

If you google your problem together with the offloading highlights from above, you'll find no end of posts, articles and blogs describing similar issues due to NIC offloading. But if it still doesn't work, then I guess you can move on up the stack and try other things, because it isn't due to a half-broken cable, NIC or switchport, right?

ErikE
  • I don't see how it's a speed/duplex mismatch if the Linux servers plugged into the same Gigabit switch work just fine. I tried turning off all offloading on the client and server and it didn't make a difference. – Porch Nov 25 '13 at 18:11
  • Is it always the same Win2008 box, or all of them? If it's several, do they have NIC hardware or a driver in common? Is it possible to try another driver and/or NIC in the same machine, just to rule that layer out on the server side? – ErikE Nov 25 '13 at 22:18
  • They are two different Dell servers. One has a BCM5720C and the other a BCM5709. Different drivers for both. – Porch Nov 27 '13 at 02:54
  • Before disabling all the performance features, try updating the network drivers. Anytime Broadcom is involved, they tend to be the problem, but updating the drivers is usually the solution. – Mark Sowul Nov 28 '13 at 00:07
  • I would say to do first whichever seems easiest to reverse. Some drivers I find a pain to reverse, depending on if there is extra management software on top of the driver package or not, and if one has the present installer at hand or not. And whilst updating drivers sometimes fixes a problem, sometimes it doesn't. And on occasion it makes things worse. So reversibility would be my guiding star. – ErikE Nov 28 '13 at 04:51
  • The latest drivers for the card on the Broadcom website have an installer that gripes about not being compatible with my system. I am downloading the one for 32-bit 2008, but I don't think it's the right one on the website. – Porch Nov 29 '13 at 04:31
  • Special thanks to ErikE for all the useful information. I just spent a good 2 hours making the changes and testing. And it didn't help at all. Sorry. – Porch Nov 29 '13 at 04:33

Always look at the networking device for clues. If it's Cisco, do a "show interfaces f0/11" or whatever it may be in your case. Retransmits can also be due to a bad Ethernet port/NIC/cable, for example from crosstalk. "show int" on the switch should show you these error stats; if that's the case, they will be obviously way too high.

EDIT: as this is Microsoft, that's most likely your problem, but other than that, in general, start at layer one (make sure physical cables are good) and work your way up the stack: layer 2 (speed/duplex/MAC address filtering), then layer 3 (IP/UDP/TCP, firewalling), etc.

nandoP
  • Thanks for the tips on the network stats from the switch. I will check that out. This problem appears campus-wide. Anytime the speed drops from 1 Gb/s to 100 Mb/s, I have packet loss and slowness. Some buildings have only 100 Mb/s switches, and all those systems have packet loss issues. Even on the systems on a 1 Gb/s switch, setting the network speed in Windows to 100 Mb/s full duplex causes the problem. – Porch Nov 25 '13 at 18:12
  • Wow, sounds like Microsoft is doing whatever it feels like with Ethernet protocols. You are better off running Windows in a virtualized environment, such as VirtualBox, Xen, or KVM. That way, all the lower-level TCP/IP layer 2/3/4 handling is done by Linux, so you are guaranteed much better performance, or at the very least much more verbose error messages to help further diagnosis. Microsoft is useless for people who want to know how computers actually work – nandoP Nov 30 '13 at 00:38

This can also be "advanced" NIC attributes, like the PowerManagement ones or IRQ priority. Assuming you have the same version of drivers, go to:

Device Manager -> Network Interfaces -> Properties for the NIC -> Advanced Tab.

Check and compare all values here.

ibre5041
  • Compare to what? – Porch Nov 27 '13 at 03:13
  • @Porch - compare to a system which works properly. This can be something curious like power-saving policy or IRQ assignment. Also, some NICs support polling - i.e. instead of triggering an IRQ for every packet, they buffer packets in RAM, and that buffer is slurped periodically by the OS. Are all the OSs 64-bit? – ibre5041 Nov 27 '13 at 08:42
  • I have power management for the server and client NICs turned off. No change. Both servers are 2008, but one is Standard 32-bit, one Enterprise 64-bit. The hardware is different on each. – Porch Nov 29 '13 at 01:17

Did you check that jumbo frames are off on your 100/1000 network?

UPD:

If jumbo frames are used, then all networking hardware in the broadcast domain should use them. That is impossible with legacy 100 Mb devices.

I do not know exactly how the Win2008 TCP stack works, but given jumbo frames it may start scaling the transmission window by packet size (not packet count as usual). Then you would observe a situation like the one described.

FYI: http://m.windowsitpro.com/windows/q-how-do-i-enable-jumbo-frames

UPD2:

I looked at the packet dump you supplied and saw a lot of packets with length > 1500 and bad checksums (checksums for lengths < 1500 are OK). It confirms my assumption.

The only thing I cannot understand - they belong to the first session: from client to server (!!!???):

22:25:06.041113 IP (tos 0x0, ttl 128, id 31391, offset 0, flags [DF], proto TCP (6), length 40)  192.168.0.109.49225 > 192.168.0.252.microsoft-ds: Flags [.], cksum 0x9422 (correct), ack 1453, win 1234, length 0

22:25:06.041223 IP (tos 0x0, ttl 128, id 31392, offset 0, flags [DF], proto TCP (6), length 64280, bad cksum 0 (->285)!) 192.168.0.109.49225 > 192.168.0.252.microsoft-ds: Flags [.], cksum 0x82c0 (incorrect -> 0xc9bb), seq 718652:782892, ack 1453, win 1234, length 64240SMB-over-TCP packet:(raw data or continuation?

22:25:06.041254 IP (tos 0x0, ttl 128, id 31437, offset 0, flags [DF], proto TCP (6), length 1452) 192.168.0.109.49225 > 192.168.0.252.microsoft-ds: Flags [P.], cksum 0x0517 (correct), seq 782892:784304, ack 1453, win 1234, length 1412SMB-over-TCP packet:(raw data or continuation?)

22:25:06.041278 IP (tos 0x0, ttl 128, id 31438, offset 0, flags [DF], proto TCP (6), length 2960, bad cksum 0 (->f1df)!) 192.168.0.109.49225 > 192.168.0.252.microsoft-ds: Flags [.], cksum 0x82c0 (incorrect -> 0xfa12), seq 784304:787224, ack 1453, win 1234, length 2920SMB-over-TCP packet:(raw data or continuation?)

22:25:06.042134 IP (tos 0x0, ttl 128, id 31441, offset 0, flags [DF], proto TCP (6), length 2960, bad cksum 0 (->f1dc)!) 192.168.0.109.49225 > 192.168.0.252.microsoft-ds: Flags [.], cksum 0x82c0 (incorrect -> 0x1d7e), seq 787224:790144, ack 1453, win 1234, length 2920SMB-over-TCP packet:(raw data or continuation?)

22:25:06.042492 IP (tos 0x0, ttl 128, id 31444, offset 0, flags [DF], proto TCP (6), length 5880, bad cksum 0 (->e671)!) 192.168.0.109.49225 > 192.168.0.252.microsoft-ds: Flags [.], cksum 0x82c0 (incorrect -> 0xa74e), seq 790144:795984, ack 1453, win 1234, length 5840SMB-over-TCP packet:(raw data or continuation?)
Veniamin
  • This does not provide an answer to the question. To critique or request clarification from an author, leave a comment below their post. – slm Nov 27 '13 at 22:14
  • @slm this is an answer indeed. A simple comment does not allow one to earn a bounty ;). In a case like this any answer is just an assumption. But you may be right and I need to add further clarifications. – Veniamin Nov 28 '13 at 05:05
  • Understand about the bounty. Answer looks a bit better. Would be better still if you included how to determine this and any additional info you can think of. Answers should be able to stand on their own. – slm Nov 28 '13 at 05:37
  • These large packets without a cksum can also come from a TCP offload engine. I do not know how it works on Windows, but on AIX, when you use a NIC's onboard TCP offload engine, tcpdump sees received packets of size `32KB` without any cksum. As some of the TCP processing is performed in the NIC (not by the OS), you must switch this feature off in order to see what really happens on the network. The other option is to mirror the switch port (on Cisco) and then eavesdrop on the communication. – ibre5041 Nov 28 '13 at 13:19
  • @Ivan It definitely makes things clearer. Packets going from the server (? .252) to the client (? .109) fit the 1500 size according to the dump ... unless the offload engine runs the opposite way and joins normal packets into jumbo frames, though I doubt it. – Veniamin Nov 28 '13 at 15:25
  • The server does not have an option to "disable" jumbo frames, but I have it set to 1500, which looks to disable it. The clients have jumbo frames off. The checksum errors are because the checksum offloading is not captured by Wireshark. I have turned offloading on and off and it makes no difference to the problem. – Porch Nov 29 '13 at 01:20

The effects you describe in your later findings are in line with the way IEEE 802.3u operates:

  • If you hard set the speed of one of the interfaces (NIC/Switchport) and set the other to Auto, you will likely suffer a duplex mismatch.

  • If you hard set one of the interfaces to full duplex, the other cannot autonegotiate duplex but must also have it hard set.

  • Even if both interfaces are hard set to 100 Mb/s full duplex, some NICs (or poorly written Windows drivers) still leave autonegotiation operative and default to half duplex.

This is where I got those facts:

Two documents from Cisco relate (amongst others) to the 2900 series switches and troubleshooting NIC-to-switchport connectivity issues. They include concrete troubleshooting steps, especially for the switch side but also for the NICs. As Cisco has a lead in practical network analysis, including in-depth knowledge of fundamental preconditions (such as the autonegotiation electrical protocol), it is quite likely that the PowerConnect operates under similar conditions (developed against the same protocol standards). I will quote freely for completeness and shape it up a bit later, but I would urge you to skim through them:

Troubleshooting Cisco Catalyst Switches to NIC Compatibility Issues

Configuring and Troubleshooting Ethernet 10/100/1000Mb Half/Full Duplex Auto-Negotiation

Here I quote some of the really interesting stuff:

Autonegotiation Valid Configuration Table

Speed determination issues can result in no connectivity. However, issues 
with autonegotiation of duplex generally do not result in link establishment
issues. Instead, autonegotiation issues mainly result in performance-related
issues. The most common problems with NIC issues deal with speed and duplex
configuration.  

Table 1 summarizes all possible settings of speed and duplex for FastEthernet 
NICs and switch ports.

Then follows an extremely useful table which I'll try to port here later without losing formatting. The table also includes 1 Gb/s speed combinations with similarly interesting effects and comments. However, highlights include:

* Configuration NIC (Speed/Duplex): 100Mbps, full duplex
* Configuration Switch (Speed/Duplex): auto
* Resulting NIC Speed/Duplex: 100Mbps
* Resulting Catalyst Speed/Duplex: 100Mbps half duplex
Comments: duplex mismatch (footnote 1)

* Configuration NIC (Speed/Duplex): auto
* Configuration Switch (Speed/Duplex): 100Mbps, full duplex
* Resulting NIC Speed/Duplex: 100Mbps full duplex
* Resulting Catalyst Speed/Duplex: 100Mbps half duplex
Comments: duplex mismatch (footnote 1)

* Configuration NIC (Speed/Duplex): 100Mbps, full duplex
* Configuration Switch (Speed/Duplex): 100Mbps, full duplex
* Resulting NIC Speed/Duplex: 100Mbps, full duplex
* Resulting Catalyst Speed/Duplex: 100Mbps, full duplex
Comments: Correct manual config (footnote 2)

The table footnotes are most interesting:

(1) A duplex mismatch can result in performance issues, intermittent
connectivity, and loss of communication. When you troubleshoot NIC issues,
verify that the NIC and switch use a valid configuration.

(2) Some third-party NIC cards can fall back to half-duplex operation mode,
even though both the switchport and NIC configuration are manually configured
for 100 Mbps, full-duplex. This is because NIC autonegotiation link detection
still operates when the NIC is manually configured. This causes duplex
inconsistency between the switchport and the NIC. Symptoms include poor port  
performance and frame check sequence (FCS) errors that increment on the
switchport. In order to troubleshoot this issue, try to manually configure
the switchport to 100 Mbps, half-duplex. If this action resolves the
connectivity problems, this NIC issue is the possible cause. Try to update
to the latest drivers for your NIC, or contact your NIC card vendor for
additional support.

Why Is It That the Speed and Duplex Cannot Be Hardcoded on Only One Link Partner?

As indicated in Table 1, a manual setup of the speed and duplex for
full-duplex on one link partner results in a duplex mismatch. This happens
when you disable autonegotiation on one link partner while the other link
partner defaults to a half-duplex configuration. A duplex mismatch results
in slow performance, intermittent connectivity, data link errors, and other
issues. If the intent is not to use autonegotiation, both link partners must
be manually configured for speed and duplex for full-duplex settings.

The very last topic of the NIC Compatibility link carries the technical background to the effects described in the passages quoted above. The basis for this background is some key details of the operation of the autonegotiation protocol:

(Table of bits shortened down for relevance)

0.13  Rate Selection (least-significant bit [LSB]). Together with bit 0.6:
          0.6=1, 0.13=1 : reserved
          0.6=1, 0.13=0 : 1000 Mbps
          0.6=0, 0.13=1 : 100 Mbps
          0.6=0, 0.13=0 : 10 Mbps

0.12  Autonegotiation Enable
          1 = autonegotiation enabled
          0 = autonegotiation disabled

0.8   Duplex Mode
          1 = full-duplex
          0 = half-duplex

0.6   Rate Selection (most-significant bit [MSB]). See bit 0.13.

The register bits relevant to this document include 0.13, 0.12, 0.8, and 0.6.
The other register bits are documented in the IEEE 802.3u specification.
Based on IEEE 802.3u, in order to manually set the rate (speed), the
autonegotiation bit, 0.12, must be set to a value of 0. As a result,
autonegotiation must be disabled in order to manually set the speed and
duplex.
If the autonegotiation bit 0.12 is set to a value of 1, bits 0.13 and 0.8
have no significance, and the link uses autonegotiation to determine the
speed and duplex. When autonegotiation is disabled, the default value for
duplex is half-duplex, unless bit 0.8 is programmed to 1, which represents
full-duplex.

Based on IEEE 802.3u, it is not possible to manually configure one link
partner for 100 Mbps, full-duplex and still autonegotiate to full-duplex
with the other link partner. If you attempt to configure one link partner
for 100 Mbps, full-duplex and the other link partner for autonegotiation,
it results in a duplex mismatch. This is because one link partner
autonegotiates and does not see any autonegotiation parameters from the
other link partner and defaults to half-duplex.
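The valid-configuration rules quoted above can be condensed into a toy model. This is a sketch of the behaviour the quote describes, not the real negotiation state machine; the function name and string encodings are mine:

```python
def resolve_duplex(nic, switch):
    """
    Toy model of the 802.3u outcomes quoted above. Each side is either
    "auto" or "100full" (hard set to 100 Mb/s full duplex). A side left
    on auto that sees no negotiation frames falls back to half duplex.
    Returns (nic_duplex, switch_duplex).
    """
    nic_auto, sw_auto = nic == "auto", switch == "auto"
    if nic_auto and sw_auto:
        return ("full", "full")  # both negotiate the best common mode
    if nic_auto != sw_auto:
        # The hard-set side stops negotiating; the auto side defaults
        # to half duplex -> the classic duplex mismatch.
        return ("half", "full") if nic_auto else ("full", "half")
    return ("full", "full")      # both hard set: correct manual config

assert resolve_duplex("100full", "auto") == ("full", "half")    # mismatch
assert resolve_duplex("auto", "100full") == ("half", "full")    # mismatch
assert resolve_duplex("100full", "100full") == ("full", "full")
```

Footnote 2 above is the caveat: some NICs keep their link detection running even when hard set, so the last case can still end up mismatched in practice.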

In addition I found bug reports to similar effect from Cisco, but they are very specific with regards to combinations of switch hardware/software, os version, nics and drivers. Without knowing exact details it gets too speculative.

I believe this may just be a confirmation of your findings, by way of the protocol's definition and modus operandi.


Solutions

So assuming this was not a wild (but fun) goose chase, I quote you:

1) "If I set every interface, server, client and Cisco 2960 switch to 100Mbs full, then the problem goes away. If I set the server and switch interface auto or 1Gbs, the problem is back."

2) "If I bypass the switch with a Netgear 10/100 switch and set both client and server to auto, I have no problems."

3) Try to find NIC/driver combinations compatible with the old switches. Purchase as necessary.

4) Use solid technical references and reasoning to motivate budget for upgrading switches where necessary.

ErikE
  • Half/full duplex is fun. I noticed the problem you pointed out if I set the switchport to 100/full and left the client at auto: I got lots of errors and the Cisco would flash the switch port LED orange/green. If the client was just set to 100/full, the Cisco switch would figure it out and be done. Anyway, setting both the switchport and the client to 100/full, 100/half, or whatever did not solve the main problem. – Porch Nov 29 '13 at 22:33
  • Keeping the server at 100 Mb/s is not a solution, as it would overload the link. I need the servers at 1 Gb/s because all of our high-load clients are on 1 Gb/s. It's the 20% on 100 Mb/s that are having problems. And it's not worth the money to pull fibre to old buildings and replace the switch for 4 workstations. Some of the buildings are across a public street and connected by wireless. We also have some IP phones that the computers pass through, and they only support 100 Mb/s. – Porch Nov 29 '13 at 22:37
  • I have a cheap Intel desktop 1Gbs nic on order. It should arrive early next week. I will report back then. – Porch Nov 29 '13 at 22:38
  • :-) You have an interesting environment for sure! It sounds like the Netgear has its merits too, at least if it was one of the really cheap ones? – ErikE Nov 29 '13 at 22:52