
We bought some Dell PowerEdge R730 servers with QLogic/Broadcom BCM57810 PCI Express cards and connected them to Cisco 4900M switches, and the 10Gb links don't work reliably. They will sometimes not connect, sometimes connect after a few minutes, and when they do connect they drop several times a day. The disconnects can last anywhere from 4 minutes to 2 hours.

The Cisco switches have existing 10Gb copper links to Dell PowerVault SANs, which have been stable and working for many months.

I see the disconnects in the VMware logs as messages like:

bnx2x 0000:82:00.1: vmnic5: NIC Link is Down

and

Lost network connectivity on virtual switch "vSwitch2". Physical NIC vmnic5 is down.

I can't see any helpful error codes or preceding messages, only the messages caused by the link drops themselves. On Windows it shows as a disconnected card, and on the switch it shows as a disconnected switch port.
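
For what it's worth, the only way I've found to pull the link-flap history out is to search the host logs directly; a minimal sketch (vmnic5 is our affected uplink, substitute your own):

    # On the ESXi host: list link-state transitions for the affected uplink
    grep "vmnic5" /var/log/vmkernel.log | grep "Link is"

    # Driver, firmware and current link state as ESXi sees them
    esxcli network nic list
    esxcli network nic get -n vmnic5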

When the links connect, they work: jumbo-frame pings succeed, iSCSI sessions establish, and datastores appear with all paths found. But the connections are intermittent.
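
For completeness, this is roughly how we test a link once it's up (the address is a placeholder for one of our iSCSI targets):

    # 8972-byte payload + 28 bytes of headers = a 9000-byte frame;
    # -d sets "don't fragment", so this fails if jumbo frames break anywhere on the path
    vmkping -d -s 8972 192.0.2.50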

We've checked:

  • The cables:
    • Originally a single Cat5e cable, now Cat6 structured cabling. The overall cable length is <7m.
    • Also connected with a brand-new cable, host to switch, with no patches/joints and no other cables nearby.
  • The drivers/OS:
    • Originally VMware ESXi 5.5 U2 Dell build ("ESXi 5.5.0, 2068190") with the bnx2x driver version 2.710.39.v55.2
    • Then the updated driver from vmware.com, bnx2x version 2.710.70.v50.7
    • Then ESXi 6.0, Dell build ("ESXi 6.0.0 2494585") which has bnx2x version 2.712...
    • Then Windows Server 2012 R2 with the latest driver from Dell's site.
  • The QLogic/Broadcom network card firmware; it's the latest from Dell, FFv7.12.17.
  • The switch port configuration: it is simply mtu 9000 and switchport access vlan NNN (a full interface example is sketched after this list).
  • The switch ports
    • These are 8-port 10Gb RJ45 modules (WS-X4908-10G-RJ45), one per switch. The SANs occupy the first four ports in each module and the new servers the remaining four. The problem affects all the ports we're using for the new servers, so it's not one failing port or one failing module.
    • I haven't tried disrupting the SAN connections to test those ports; without some specific reason to think ports 1-4 are more reliable than 5-8, that would be a last resort.
  • The switch interface counters: no errors apart from the link drops (the checks we ran are sketched after this list).
  • Disabling various offload capabilities in the Windows QLogic/Broadcom driver, enabling EnergyEfficientEthernet, and forcing the cards to 10Gb instead of auto-negotiation.
  • Connecting the same hosts to the same switches on 1Gb ports, which works fine; they repeatedly connect very quickly.
  • Cross-connecting two hosts: they connect quickly at 10Gb and hold a stable connection for days.
  • We bought an Intel X540-T2 card and tried that. It behaves the same.
  • Since then, we've bought Cat 6a patch cables and tested those; no change.
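
As mentioned in the list above, here is roughly what the switch-side configuration and checks look like (the interface number and VLAN are placeholders for our real values):

    ! Port configuration on the 4900M, as described above
    interface TenGigabitEthernet1/5
     mtu 9000
     switchport access vlan NNN

    ! Checks we ran: link status, error counters, logged link transitions
    show interfaces TenGigabitEthernet1/5 status
    show interfaces TenGigabitEthernet1/5 counters errors
    show logging | include TenGigabitEthernet1/5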

We raised a call with Dell support; they've found nothing wrong and suggest the switches are at fault. But the switches already run stable 10Gb copper connections to the Dell PowerVault storage, and as far as I can tell from our switch monitoring logs and the SAN event logs those links never drop, so I'm unwilling to think the Cisco switches are the problem.

They are running IOS 15.1(1)SG2, which is not the latest, but the switches are live and stable and I don't want to casually change the firmware "just in case".

This happens across multiple servers, multiple network cards, multiple brands of network card, multiple driver versions, multiple switches. It can't be a single faulty piece of hardware. It's all in an air-conditioned, power-conditioned rack.

This is the first time we've tried VMware host to switch connections at 10Gb, so we have no other configuration we can compare with or hardware we can connect to.

What else can we check?

-- Edit: We were looking to upgrade the switch firmware, but I've just found a related link. This appears to be a known issue between the Cisco WS-X4908-10G-RJ45 module and the Broadcom BCM57810 cards, dependent on the IOS version: https://supportforums.cisco.com/discussion/11755141/4900m-ws-x4908-10g-rj45-port-startup-delay has a lot of relevant discussion, and leads to:

https://tools.cisco.com/bugsearch/bug/CSCug68370

WS-X4908-10G-RJ45 and Broadcom 57810S 10Gb BASE-T interoperability issue (CSCug68370)

Symptom: 10Gbps BaseT ports (on WS-X4908-10G-RJ45) connected to Dell 820 servers with Broadcom 57810S DP 10Gb BASE-T. On a reload of the switch or removal/re-install of the cable, ports come up after a long time (up to 1 hour) or not at all.

Conditions: 1) Module WS-X4908-10G-RJ45 2) Versions 15.0(2)SG through 15.0(2)SG7, 15.1(2)SG through 15.1(2)SG3

Workaround: Downgrade to 12.2(54)SG

That's not exactly the same server model, and it doesn't mention Intel cards, but the problem is a pretty spot-on match.
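
If anyone wants to check their own switches against this bug, the module and running IOS train can be read straight off the switch:

    ! Confirm the WS-X4908-10G-RJ45 module is present
    show module

    ! Compare the running version against the affected ranges in the bug
    show version | include IOS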

TitanBar
    Can't help, sorry; but +1 for a well written and researched question. – Massimo Aug 03 '15 at 18:41
  • The only thing I don't see in your troubleshooting is moving the server connections to different switch ports (up to and including the ones which you know to work correctly, i.e. those used by the storage array). But I assume you tried this, too... – Massimo Aug 03 '15 at 18:41
  • Thanks. I'm a bit lost - this is about to turn into a vendors chasing diagnostic messages for months and pointing fingers at each other, and I'm really hoping to avoid that. Good point; I've updated with a bit about the switch modules; I haven't disrupted the existing SAN connections but with eight ports over two modules affected, it feels unlikely to be port hardware. – TitanBar Aug 03 '15 at 18:47
  • I only care about the build number of your ESXi install. Can you provide it? – ewwhite Aug 03 '15 at 18:48
  • @ewwhite the Dell builds are ESXi 5.5.0, 2068190 and ESXi 6.0.0 2494585 – TitanBar Aug 03 '15 at 18:50
  • Do you have the proper/supported GBIC on both sides of the cables? – Alex Aug 03 '15 at 18:53
  • @Alex the switch modules and cards are RJ45 sockets connected with RJ45 patch cables, there are no miniGBIC convertors or SFP+ converters involved. ( The modules are WS-X4908-10G-RJ45 ) – TitanBar Aug 03 '15 at 18:56
  • Do you have SmartNet on the switches? If so the TAC should be able to help you dump troubleshooting info that may reveal what's really going on. – Todd Wilcox Aug 03 '15 at 19:13
  • @TitanBar Please post screenshots of your vSphere networking config. [Something like this](http://serverfault.com/questions/584303/network-configuration-for-vmware-w-vmotion-and-a-single-switch/584308#584308). Also show me the port config from the Cisco 4900M for the affected ports. – ewwhite Aug 05 '15 at 07:29
  • @ewwhite I held off for a bit while we had higher spec cables on the way, but with the new information in my edit it might no longer be relevant for me to post networking config. If it is, I will in a bit. – TitanBar Aug 06 '15 at 15:07
  • 1
    What a terrible bug... Good information, though. – ewwhite Aug 06 '15 at 15:10

2 Answers


Please update your ESXi hosts. This is the one thing you've really missed in the troubleshooting steps.

Your 5.5 installation is almost a year old!

As of this writing, the current version of ESXi 5.5 is 2718055. The current ESXi 6.0 build number is 2809209.

Dell, HP, it doesn't matter... you're still supposed to update your ESXi installations. Many people overlook this, and it's the second most frequent cause of unintended downtime in the environments I see.
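
If you want to do it from the command line rather than through Update Manager, a minimal sketch (the image-profile name below is an example only; list the depot and pick the current one):

    # Check the current version/build on the host
    vmware -vl

    # List image profiles in VMware's online depot (host needs outbound HTTPS)
    esxcli software sources profile list \
        -d https://hostupdate.vmware.com/software/VUM/PRODUCTION/main/vmw-depot-index.xml

    # Enter maintenance mode, apply a chosen profile, then reboot
    esxcli system maintenanceMode set --enable true
    esxcli software profile update \
        -d https://hostupdate.vmware.com/software/VUM/PRODUCTION/main/vmw-depot-index.xml \
        -p ESXi-5.5.0-20150504001-standard
    reboot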

ewwhite
  • It's a good point; I will run them through VMware Update Manager tonight; do you have any particular suggestion why that might make a difference (presumably ESXi 6 is already different from 5.5, and Windows Server 2012 R2 different again) or is it just something else to update and try? – TitanBar Aug 03 '15 at 19:04
  • Because your SAN is unaffected, you're using RTM builds of ESXi, you've already addressed the host firmware, and the switch is known-good. – ewwhite Aug 03 '15 at 19:08
  • I have upgraded three ESXi 5.5 hosts to version 2718055, and sadly it has not fixed it; there are still link drops (e.g. one lasting 2.5 minutes), and disconnecting links manually still has long and uneven link reconnect times, sometimes into the tens of minutes. – TitanBar Aug 05 '15 at 04:45
  • It's a good first step. – ewwhite Aug 05 '15 at 07:26

Well, it looks like it was the Cisco bug https://tools.cisco.com/bugsearch/bug/CSCug68370, and upgrading to one of the "known fixed" IOS versions (15.1(2)SG4) seems to have fixed it.
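
For anyone following the same path, the upgrade itself was the standard Catalyst 4500/4900 procedure, roughly like this (the TFTP address and image filename are illustrative; use the image Cisco lists for the 4900M):

    ! Copy the new image to the supervisor and point the boot variable at it
    copy tftp://192.0.2.10/cat4500e-entservicesk9-mz.151-2.SG4.bin bootflash:
    configure terminal
     no boot system
     boot system flash bootflash:cat4500e-entservicesk9-mz.151-2.SG4.bin
    end
    write memory
    reload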

TitanBar