
So here's the scenario, and I'm hoping @Chopper3 can chime in here. For our SAN fabric, we have a pair of Cisco MDS 9513 FC switches with three EMC frames and four Cisco UCS domains directly attached.

The behavior we are seeing is that the CNAs on the blades are sending FC aborts as a result of the fabric interconnect transmitting FCoE pause frames. Cisco TAC explains this behavior as a result of upstream congestion or latency. We do see a corresponding spike in our data from the 200 or so ESXi servers in the environment, which report latency spikes from 100 ms to 2000 ms. Some frames and paths seem to be hit a little harder than others, which leads me to believe we're hot-spotting one or more of the links.
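For anyone chasing a similar problem, here is roughly how I'd tally the host-side latency to see which adapters/paths are getting hit hardest. This is a minimal sketch assuming esxtop batch-mode CSV output (`esxtop -b -d 5 -n 120 > perf.csv`); the counter-name matching is my guess at the column headers, not the exact names your ESXi build emits, so adjust as needed.

```python
import csv
import re
import sys
from collections import defaultdict

# Minimal sketch: find the worst-case per-adapter latency in an esxtop
# batch-mode CSV. The header pattern below is an assumption -- verify the
# exact counter names in your capture.
LATENCY_COL = re.compile(r"Physical Disk Adapter\((vmhba\d+)\).*MilliSec/Command", re.I)

def worst_latency_per_adapter(csv_path):
    worst = defaultdict(float)
    with open(csv_path, newline="") as fh:
        reader = csv.reader(fh)
        headers = next(reader)
        # Map column index -> adapter name for every latency counter we recognize.
        cols = {i: m.group(1) for i, h in enumerate(headers)
                if (m := LATENCY_COL.search(h))}
        for row in reader:
            for i, hba in cols.items():
                try:
                    worst[hba] = max(worst[hba], float(row[i]))
                except (ValueError, IndexError):
                    continue
    return dict(worst)

if __name__ == "__main__":
    for hba, ms in sorted(worst_latency_per_adapter(sys.argv[1]).items()):
        print(f"{hba}: worst latency {ms:.0f} ms")
```

Running that against captures from a sample of the 200 hosts makes it fairly obvious whether the spikes cluster on one fabric or one adapter, which is what hot-spotting would look like.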

The blades are B200 M2, B200 M3, and B420 M3 servers. The M2-series blades use the "Palo" adapter (the M81KR), and the M3-series blades use the VIC 1240 adapter.

Since my FC knowledge doesn't run very deep, I'd appreciate some suggestions on how to hunt this down.

SpacemanSpiff
  • You didn't happen to purchase this all at once as a vBlock, did you? I've found that vBlock support is a bit better than TAC with these things. – MDMarra Jan 14 '14 at 17:51
  • No, despite it consisting of most of the vBlock components. Also, the SANs in use are a VMAX-E, a VNX7500, and a CX4. – SpacemanSpiff Jan 14 '14 at 17:54

1 Answer


So here's the story on this:

I was looking at it from the wrong perspective. Adapter aborts are a normal symptom indicating that some component somewhere is not keeping up. In this case, the adapter aborts were a symptom of the SAN front-end ports being too busy to service requests. This was compounded by a few different conditions.

1) Bad Drivers - Our UCS firmware level dictates a matching ESXi driver that has known issues recovering from aborts, sending the driver into a loop that can only be cleared by a reboot.

2) Too Many Variables - Three SANs, each with distinct issues, all of which present as adapter aborts.

3) SAN Bugs - We had to disable VAAI because of bugs in our EMC VNX code that were causing issues (the settings involved are sketched below).
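For reference, disabling VAAI comes down to three ESXi advanced settings. Here's a minimal sketch that wraps the stock esxcli commands; run it on each host (or push the same values through host profiles), and set them back to 1 to re-enable later.

```python
import subprocess

# Sketch: turn off the three VAAI primitives on an ESXi host by setting the
# corresponding advanced options to 0 via esxcli.
VAAI_OPTIONS = [
    "/DataMover/HardwareAcceleratedMove",   # XCOPY (full copy offload)
    "/DataMover/HardwareAcceleratedInit",   # WRITE SAME (block zeroing)
    "/VMFS3/HardwareAcceleratedLocking",    # ATS (hardware-assisted locking)
]

for opt in VAAI_OPTIONS:
    subprocess.run(
        ["esxcli", "system", "settings", "advanced", "set", "-o", opt, "-i", "0"],
        check=True,
    )
    print(f"disabled {opt}")
```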

2015 EDIT:

I wanted to update this thread, because a lot of new information has come to light, and detecting these problems is, well, hard. I hope this post will steer some folks in the right direction.

1) All of the above is still relevant; get all of it squared away and inside a support matrix as soon as possible.

2) Some UCS 2.1 versions accidentally turn off priority flow control (despite NX-OS still showing it as configured), which causes some FCoE traffic to be treated like the rest of the traffic, so you can end up with out-of-order FC frames. (See the sketch after the bug links below for one way to spot-check the PFC pause counters.)

3) Somewhere in the middle of the UCS 2.1 code train, an IO throttling setting went from being a cosmetic field to an active one. The old "burned in" firmware setting was an IO throttle count of 256, which pretty much all hosts used, though the Windows driver did allow you to tune it. Somewhere in that code, the original default value of "16" (which used to install "256" into hardware) became an invalid setting, and UCSM began interpreting it as "2048", the maximum. The result: a single UCS VIC adapter configured to absolutely MURDER our storage arrays. A rough worked example of the impact follows.
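To put some back-of-the-envelope numbers on why that throttle change hurts: the per-vHBA outstanding-IO ceiling multiplies across hosts and funnels into a handful of array front-end ports. The host count below comes from the question; the vHBA and front-end port counts are made up purely for illustration.

```python
# Rough arithmetic: worst-case outstanding IOs the fabric could push at the
# array, before vs. after the IO throttle value changed meaning.
hosts = 200              # ESXi servers in the environment (from the question)
vhbas_per_host = 2       # illustrative: one vHBA per fabric
frontend_ports = 8       # illustrative: array front-end ports serving this load

for label, throttle in [("old burned-in default", 256), ("post-bug interpretation", 2048)]:
    worst_case = hosts * vhbas_per_host * throttle
    per_port = worst_case / frontend_ports
    print(f"{label}: throttle={throttle} -> up to {worst_case:,} outstanding IOs "
          f"fabric-wide (~{per_port:,.0f} per front-end port)")
```

Even as a crude upper bound, the jump from 256 to 2048 per adapter makes it clear why the front-end ports started pushing back with pause frames.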

So, read your release notes. Lessons learned, we've finally got this fixed.

IO Throttle Bug: https://tools.cisco.com/quickview/bug/CSCum10869

PFC Bug: https://tools.cisco.com/quickview/bug/CSCus61659
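For the PFC issue, the low-effort check is to watch the pause counters on the fabric interconnect / NX-OS side (for example, the output of "show interface priority-flow-control"). Here's a rough sketch that pulls the per-port Rx/Tx pause counters out of captured output; the column layout I'm matching is from memory, so adjust the parsing to whatever your code level actually prints.

```python
import re
import sys

# Sketch: extract per-port PFC pause counters from captured
# "show interface priority-flow-control" output (NX-OS / UCS FI).
# Assumes lines end with two numeric columns (RxPPP TxPPP) -- verify against
# your output before trusting the results.
LINE = re.compile(r"^(?P<port>\S+)\s+.*?\s+(?P<rx>\d+)\s+(?P<tx>\d+)\s*$")

def noisy_ports(text, threshold=0):
    """Return ports whose Rx/Tx PFC pause counters exceed the threshold."""
    hits = []
    for line in text.splitlines():
        m = LINE.match(line.strip())
        if not m:
            continue
        rx, tx = int(m.group("rx")), int(m.group("tx"))
        if rx > threshold or tx > threshold:
            hits.append((m.group("port"), rx, tx))
    return hits

if __name__ == "__main__":
    for port, rx, tx in noisy_ports(sys.stdin.read()):
        print(f"{port}: RxPPP={rx} TxPPP={tx}")
```

Counters that stay at zero on ports that should be negotiating PFC are a hint you may be hitting the bug above; counters climbing rapidly point back at the congestion story in the original question.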

SpacemanSpiff