We're lucky: every server we have has multiple NICs/HBAs/CNAs connected to multiple switches, and this approach has kept our platform up on numerous occasions. That said, we ran into a problem last week that I'm not sure how to fix.
A switch that was carrying a good chunk of our traffic crashed (the details aren't important, but it was a Cisco 6509 that had a hard CPU crash and didn't come back up automatically). Unfortunately, it left its line cards working (i.e. L1 and L2 stayed up) but lost all of its uplinks, so the attached hosts still saw a live link. The connected servers were the following:
- Windows Server 2003 32-bit EE SP2 with Veritas Storage Foundation
- Oracle Enterprise Linux 5.3 64-bit
- VMware ESXi 4.0
- NetApp 3040 running Data ONTAP 7.3.2
All of these machines failed to detect the crashed switch and kept sending traffic its way rather than moving that traffic to another switch.
I need help looking at my options for better multipathing. This can't be the first time this has happened, so there must be other ways of detecting this kind of failure (polling the HSRP interfaces, for instance). Can you help?
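To make the polling idea concrete, here is a rough sketch of what I have in mind, not something we run today. It assumes a Linux host with the iputils `ping` (`-c`, `-W`, `-I` flags); the function names, interface names, and addresses are all hypothetical. Each path is checked by pinging an upstream address (for example the HSRP virtual IP) out of a specific interface, so a switch whose ports are up but whose uplinks are dead would still fail the check.

```python
import subprocess
from typing import Callable, Dict, Optional


def ping_via(iface: str, target: str, timeout_s: int = 1) -> bool:
    """Send one ICMP echo to `target` out of `iface` (Linux iputils ping flags)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), "-I", iface, target],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def reachable(iface: str, target: str,
              probe: Callable[[str, str], bool], retries: int) -> bool:
    """Treat a path as healthy if any one of `retries` probes succeeds."""
    return any(probe(iface, target) for _ in range(retries))


def pick_path(paths: Dict[str, str], current: str,
              probe: Callable[[str, str], bool] = ping_via,
              retries: int = 3) -> Optional[str]:
    """Return the interface to use, or None if no path reaches its target.

    `paths` maps an interface name to the upstream IP probed through it
    (hypothetical layout, e.g. the HSRP virtual IP for that VLAN).
    """
    # Stay on the current path while its upstream target still answers.
    if current in paths and reachable(current, paths[current], probe, retries):
        return current
    # Otherwise fall over to the first other path that passes the probe.
    for iface, target in paths.items():
        if iface != current and reachable(iface, target, probe, retries):
            return iface
    return None
```

Something would call `pick_path({"eth0": "192.0.2.1", "eth1": "192.0.2.1"}, current="eth0")` on a timer and act on a change; the actual failover step (bonding, MPIO, etc.) is platform-specific and deliberately left out.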
Thanks in advance.