I've been chasing a packet-loss and network stability issue for a handful of end-users on an internal network for the past few days. These issues surfaced last week; however, the location was struck by lightning six weeks ago.

I was seeing 5-10% packet loss between a stack of four Cisco 2960s and several PCs and phones on the other side of a 77-meter run. The PCs were run inline with the phones over a trunked link (switchport configuration pastebin). We were seeing dropped calls and interruptions in client-server applications and Microsoft Exchange connectivity.

I tried the usual troubleshooting steps remotely, having a local technician do the following during breaks in user and production activity:

  • change cables between the wall jack and device.
  • change patch cables between the patch panel and switch port(s).
  • try different switch ports within the 2960 stack.
  • change end-user devices with known-good equipment (new phones, different PCs).
  • clear switch port interface counters and monitor incrementing errors closely. (Pastebin output of sh int)
  • pore over the device logs and Observium RRD graphs. (No link up/down issues from the switch side.)
  • change power strips on the end-user side.
  • test cable runs from the Cisco 2960 using test cable-diagnostics tdr int Gi4/0/9 (clean)*
  • test cable runs with a Tripp-Lite cable tester. (clean)
  • run diagnostics on the switch stack members. (clean)

In the end, it took three changes of switch ports to find a stable solution. The only logical conclusion is that a few Cisco 2960 switch ports are bad or flaky... Not dead, but not consistent in behavior either. I'm not used to seeing individual ports die in this manner.

What else can I test or check to determine if these devices are bad?

What is the best-practices approach to verifying this?

Is it common for single ports to have problems, rather than a contiguous bank of ports?


BTW - show cable-diagnostics tdr int Gi4/0/14 is very cool...

Interface Speed Local pair Pair length        Remote pair Pair status
--------- ----- ---------- ------------------ ----------- --------------------
Gi4/0/14  1000M Pair A     79   +/- 0  meters Pair B      Normal              
                Pair B     75   +/- 0  meters Pair A      Normal              
                Pair C     77   +/- 0  meters Pair D      Normal              
                Pair D     79   +/- 0  meters Pair C      Normal              
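
For anyone who wants to check many ports at once, the table above is regular enough to parse. The following is a minimal Python sketch, not a Cisco tool: the function names (`parse_tdr`, `not_normal`, `length_spread`) are hypothetical, and the regular expression assumes every pair row looks like the ones shown here, with a numeric length and a remote pair letter. Rows in any other shape are skipped.

```python
import re

# Sample text copied from the TDR output shown above (trailing spaces trimmed).
TDR_OUTPUT = """\
Interface Speed Local pair Pair length        Remote pair Pair status
--------- ----- ---------- ------------------ ----------- --------------------
Gi4/0/14  1000M Pair A     79   +/- 0  meters Pair B      Normal
                Pair B     75   +/- 0  meters Pair A      Normal
                Pair C     77   +/- 0  meters Pair D      Normal
                Pair D     79   +/- 0  meters Pair C      Normal
"""

# One pair row: local pair, length, tolerance, remote pair, status.
# Assumption: numeric lengths only; rows in any other shape are ignored.
PAIR_ROW = re.compile(
    r"Pair\s+([A-D])\s+(\d+)\s+\+/-\s+(\d+)\s+meters\s+Pair\s+([A-D])\s+(\S.*?)\s*$"
)


def parse_tdr(text):
    """Return one dict per wire pair found in the TDR table."""
    pairs = []
    for line in text.splitlines():
        match = PAIR_ROW.search(line)
        if match is None:
            continue  # header, separator, or a row format this sketch does not handle
        local, length, tolerance, remote, status = match.groups()
        pairs.append({
            "local": local,
            "length_m": int(length),
            "tolerance_m": int(tolerance),
            "remote": remote,
            "status": status,
        })
    return pairs


def not_normal(pairs):
    """Letters of the local pairs whose status is anything other than 'Normal'."""
    return [p["local"] for p in pairs if p["status"] != "Normal"]


def length_spread(pairs):
    """Difference in meters between the longest and shortest reported pair."""
    lengths = [p["length_m"] for p in pairs]
    return max(lengths) - min(lengths) if lengths else 0


if __name__ == "__main__":
    pairs = parse_tdr(TDR_OUTPUT)
    print("pairs not Normal:", not_normal(pairs))
    print("length spread (m):", length_spread(pairs))
```

On the output above, every pair reports Normal and the lengths differ by only a few meters, which is consistent with the "clean" TDR results in the list of tests.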
ewwhite

2 Answers

While banks of ports often share an ASIC, each port has to have its own separate PHY. If the PHY has been damaged, it could very well have a problem while its neighbors don't.

That said, output drops are an odd symptom for a physical problem: not impossible, but not typical. Half-duplex links aside, output drops usually have more to do with buffer exhaustion than with physical problems.

You may get more information by setting up a packet capture on the other side of the wire. A bad PHY would be expected to manifest with some number of physical layer errors (bad CRC, runt/giant, etc) on one or both sides of the link.
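
As a rough illustration of that distinction, here is a small Python sketch that compares two snapshots of a port's counters and reports whether the growth is in physical-layer errors, in output drops, or in both. The counter names and the sample numbers are hypothetical placeholders, not real output from this switch; in practice the values would be read by hand or by script from `sh int` on each side of the link.

```python
# Counters that point at the physical layer (cable, magnetics, PHY).
PHYSICAL_COUNTERS = ("crc", "runts", "giants")
# Counters that more often point at buffer exhaustion or congestion.
BUFFER_COUNTERS = ("output_drops",)


def counter_deltas(before, after):
    """Per-counter increase between two snapshots taken some time apart."""
    return {name: after[name] - before[name] for name in before}


def classify(deltas):
    """Label which family of counters grew between the two snapshots."""
    physical = any(deltas.get(name, 0) > 0 for name in PHYSICAL_COUNTERS)
    buffering = any(deltas.get(name, 0) > 0 for name in BUFFER_COUNTERS)
    if physical and buffering:
        return "both"
    if physical:
        return "physical"
    if buffering:
        return "buffer"
    return "clean"


if __name__ == "__main__":
    # Hypothetical snapshots, invented purely for illustration.
    before = {"crc": 0, "runts": 0, "giants": 0, "output_drops": 120}
    after = {"crc": 0, "runts": 0, "giants": 0, "output_drops": 450}
    print(classify(counter_deltas(before, after)))
```

A port that only ever lands in the "buffer" category would fit the buffer-exhaustion explanation better than a damaged PHY, while growing CRC or runt counts on either end would support the hardware theory.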

All in all it sounds like you've eliminated enough that it may be past the point of diminishing returns. I'd recommend an RMA if you have a contract.

rnxrx
  • Since this is occurring on multiple ports on multiple (2) switches, but only for a tiny subset of users, is this a case where I'd need to replace all four switches? I just have a hard time lobbying for the replacement without knowing the core issue, since replacement will require considerable downtime, recabling, etc.. – ewwhite Sep 10 '12 at 05:43
  • Lightning is a very strange animal and damage from it can manifest much later and in unpredictable ways. The downtime sucks, of course, but could be ameliorated somewhat by looping the replacement switch in, moving the patches and then pulling the old ones out. I wish there were an easier answer, but if you have isolated the issue to a few ports then there isn't much else to be done. – rnxrx Sep 10 '12 at 12:53
  • The PHY is almost always integrated into the ASIC these days. It's plain cheaper. The magnetics are about the only part they really can't integrate into the ASIC, and those could be damaged, but that's not the PHY. Also, it's pretty common to use quad-set magnetics, so if the problem is on 4 ports, that lends weight to this theory. – Chris S Sep 13 '12 at 19:15
  • Not really - if you go through the architecture of most of the Cisco switches (including the one in question), the same ASICs are often used for one or two fiber or copper GEs or some grouping of 100TX. A lot more of the functionality is moved onto the ASIC in switch-on-chip architectures, but in those cases the physical layer is still being handled by a pluggable optic or some sort of copper media. Given that the same ASIC complex can often handle a number of different speed and power requirements, it doesn't make a lot of sense to integrate this function into the same spin? – rnxrx Sep 13 '12 at 21:49
  • Finally replaced all of the switches after too many ports degraded to the point of being unusable. Finally, a good use for SmartNet! – ewwhite May 10 '13 at 12:12
Yes, a single port can be bad, but as I recall, you have to replace the entire module. (Caveat: it's been a long time since I've done significant Cisco work...)

I'm not sure if it can help, but check out FITB, by Laurie Denness, one of the Ops engineers at Etsy.

gWaldo