I'm trying to determine whether there is intermittent inter-DC latency on the FC links, but I only have access to the OID counters for the DCX 8510. Since it is an L1 link over DWDM, there are no stats from the service provider to measure any possible issues, other than connecting a test kit, which always comes up clean because the issue is intermittent.

I see values spike for this OID when the issue occurs, but finding proper information on it is really tough.

swFCPortRxBadOs

Any help with a better explanation of this OID, and pointers to information that would help me understand the SNMP outputs, would be greatly appreciated.

Citizen
bern
  • Have you done an errdump? If so, do you see any timeout errors? 2012/12/24-06:34:16, [C2-1012], 592988, SLOT 6 | CHASSIS, WARNING, Brocade_DCX, S3,P-1(51): Link Timeout on internal port ftx=1722774272 tov=2000 (>1000) vc_no=31 crd(s)lost=3 complete_loss:1. 2012/12/24-06:40:12, [C2-1012], 592989, SLOT 6 | CHASSIS, WARNING, Brocade_DCX, S3,P-1(51): Link Timeout on internal port ftx=1723089858 tov=2000 (>1000) vc_no=31 crd(s)lost=3 complete_loss:1. – Citizen Jan 24 '15 at 11:49
  • I don't have CLI access - only SNMP polling. The guys managing the devices don't seem to have the best knowledge on the SNMP info, so trying to best understand how to determine where the problem is coming in so we can set alarms for this going forward. – bern Jan 26 '15 at 10:14
  • That's par for the course. "The guys managing the devices don't really know SNMP", so there is a lack of knowledge among the infrastructure management team regarding the management protocol. It's a dystopian world we live in, my friend. – Citizen Jan 30 '15 at 19:38

1 Answer

Background

swFCPortRxBadOs tracks the number of invalid ordered sets received on a port. Most of the time it is an error against a physical or virtual interface, but it can also apply to a backplane.

Invalid ordered sets for DWDM or straight FC, whether it's Cisco or Brocade, will often be the result of a poorly performing host or node. A RAID array with its disk queue length above 6 or so on the other side of the DWDM could result in a virtual channel timeout. This will typically mean that you have virtual channels getting 'stuck'. When a switch port exhausts all available credits, the switch port connected to the device has to hold additional outbound frames until the device returns a credit. When a device doesn't respond within the timeout, the transmitting switch holds frames longer, resulting in high buffer occupancy. The switch then lowers the rate at which it returns buffer credits to the other transmitting switches. This back pressure propagates through the fabric (potentially through multiple switches with devices attempting to send frames to hosts or switches attached to the switch with the high-latency host or switch) and affects fabric performance.
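To make the credit starvation described above concrete, here is a minimal Python sketch of credit-based flow control. It is a toy model, not how FOS implements buffer-to-buffer credits: the function name, the tick-based timing, and the "one credit per frame, returned after a fixed delay" behaviour are all simplifying assumptions chosen for illustration.

```python
from collections import deque


def simulate_credit_flow(total_credits, frames_to_send, credit_return_delay, ticks):
    """Toy model: the sender needs one credit per frame.

    Each sent frame consumes a credit, and the receiver hands that credit
    back `credit_return_delay` ticks later (hypothetical fixed delay).
    Returns (frames_sent, stalled_ticks).
    """
    credits = total_credits
    pending = deque()  # ticks at which outstanding credits come back
    sent = 0
    stalled = 0
    for tick in range(ticks):
        # Credits returned by the receiver become usable again.
        while pending and pending[0] <= tick:
            pending.popleft()
            credits += 1
        if sent >= frames_to_send:
            break
        if credits > 0:
            credits -= 1
            sent += 1
            pending.append(tick + credit_return_delay)
        else:
            stalled += 1  # frame is held in the buffer, waiting for a credit
    return sent, stalled


# A responsive device returns credits quickly, so the sender never stalls.
responsive = simulate_credit_flow(4, 100, 1, 100)
# A slow device holds credits for 20 ticks, so the sender spends most ticks waiting.
slow = simulate_credit_flow(4, 100, 20, 100)
print(responsive, slow)
```

With the same four credits, the slow receiver leaves the sender stalled for most of the run, which is the "holding frames longer" effect that then backs up into upstream switches.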

Next Steps

Possible Culprits

  1. Physical Layer Badness - An SFP that has failed or is failing, either on the far side or on the switch you're looking at.

  2. Virtual Channel 'stuck' - the explanation above. If the virtual channel is stuck, it's not passing traffic or signals and you'll see the er_bad_os counter increasing.

Brocade recommends enabling bottleneckmon in FOS. It will reset the VC (virtual channel) when there is a two-second window without any traffic.

bottleneckmon --cfgcredittools -intport -recover onLrOnly

When one or more credits are lost, it will begin to look for its window to reset the VC.

This is a great PDF on Fabric Resiliency Best Practices: http://www.brocade.com/downloads/documents/html_product_manuals/NOS_MIB_301/wwhelp/wwhimpl/common/html/wwhelp.htm#context=NOS_MIB_v301_HTML&file=5_sw-mib.06.4.html

Use portstatsshow for your port and see whether you get a line such as: er_bad_os 591691 Invalid ordered set

It might give you assurance that what you're experiencing is an invalid ordered set, so you can begin troubleshooting your credits and buffers, which is where these types of issues frequently lie.
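If only SNMP polling is available, the same check can be approximated by watching how fast the polled swFCPortRxBadOs value grows between polls. The sketch below is a rough illustration, not a tested monitoring tool: it assumes the counter behaves like a 32-bit wrapping SNMP counter, the function names are hypothetical, and the sample values and threshold are made up. It works on values you have already collected, so it makes no SNMP calls itself.

```python
COUNTER32_MODULUS = 2 ** 32  # assumption: the counter wraps like an SNMP Counter32


def counter_deltas(samples, modulus=COUNTER32_MODULUS):
    """Return the increase between consecutive polls, tolerating one wrap per interval."""
    return [(curr - prev) % modulus for prev, curr in zip(samples, samples[1:])]


def spike_indexes(deltas, threshold):
    """Return the positions of polling intervals whose increase exceeds the threshold."""
    return [i for i, delta in enumerate(deltas) if delta > threshold]


# Made-up polled values for one port, one per polling interval.
samples = [100, 105, 110, 5000, 5003]
deltas = counter_deltas(samples)
print(deltas)                       # increase per interval
print(spike_indexes(deltas, 1000))  # intervals worth alarming on
```

Alarming on the per-interval increase rather than the raw value avoids false alerts from a counter that is large but no longer moving. Note that a counter reset (for example after a reboot or a stats clear) would show up here as one very large delta, so a real alarm would need to account for that.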

Great article on buffer credits. http://community.brocade.com/t5/Mainframe-Solutions/Buffer-Credits-and-Frame-Size-calculation-in-FOS-7-1/ba-p/455

Citizen
  • Not at all, this helps me immensely - I wasn't sure if this linked to L1 only or if it could also be from something else which you have confirmed. Will get the client techies to enable bottleneckmon and to look at the virtual channel side. – bern Jan 26 '15 at 10:16
  • The issue manifests as a "flap" on the AIS servers. – bern Jan 26 '15 at 10:20
  • Glad this was helpful. The first time I saw this it made my eyes hurt. Took a bit to nail it down. Once you fix it you never forget. TY for the points – Citizen Jan 28 '15 at 17:52