
I need a fresh pair of eyes.

We're using a 15km fibre optic line across which Fibre Channel and 10GbE are multiplexed (passive optical CWDM). For FC we have long distance lasers suitable for up to 40km (Skylane SFCxx0404F0D). The multiplexer is limited by the SFPs, which can do max. 4Gb Fibre Channel. The FC switch is a Brocade 5000 series. The respective wavelengths are 1550, 1570, 1590 and 1610nm for FC and 1530nm for 10GbE.

The problem is that the 4GbFC fabrics are almost never clean. Sometimes they are for a while, even with a lot of traffic on them. Then they may suddenly start producing errors (RX CRC, RX encoding, RX disparity, ...) even with only marginal traffic on them. I am attaching some error and traffic graphs. Errors are currently in the order of 50-100 per 5 minutes with 1Gb/s of traffic.
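To put "50-100 errors per 5 minutes" into perspective, here is a rough sketch that turns the counter deltas into an errors-per-bit ratio. It is only an approximation: the counters count events, not individual bit errors, and it is an assumption here that the commonly quoted Fibre Channel objective of 1e-12 is the right yardstick. The function name is made up for illustration.

```python
# Rough error-ratio estimate from port counter deltas (a sketch, not a measurement).
SAMPLE_SECONDS = 5 * 60   # the graphs use 5-minute samples

PAYLOAD_RATE = 1e9        # ~1 Gb/s of traffic, as quoted in the question
LINE_RATE_4GFC = 4.25e9   # 4GFC signalling rate in baud (fill words included)
TARGET_RATIO = 1e-12      # assumption: commonly quoted FC bit error rate objective


def errors_per_bit(errors, bits_per_second, seconds=SAMPLE_SECONDS):
    """Counted errors divided by the number of bits seen in the interval."""
    return errors / (bits_per_second * seconds)


for errors in (50, 100):
    for label, rate in (("payload", PAYLOAD_RATE), ("line rate", LINE_RATE_4GFC)):
        ratio = errors_per_bit(errors, rate)
        print(f"{errors:3d} errors vs {label:9s}: {ratio:.1e} "
              f"({ratio / TARGET_RATIO:.0f}x the assumed target)")
```

Whichever denominator you pick, the ratio comes out well above 1e-12, so under these assumptions "a few errors here and there" is not within what the link should deliver.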


Optics

Here is the optical power of one port per fabric, summarized (collected using sfpshow on the different switches)

units=uW (microwatt)

        SITE-A                      SITE-B
******************************************************
FAB1    1550nm (ko)
  SW1   TX 1234.3    ---->    RX   49.1   SW3
        RX   95.2    <----    TX 1175.6
FAB2    1610nm (ok)
  SW2   TX 1422.0    ---->    RX  104.6   SW4
        RX   54.3    <----    TX 1468.4

What I find curious at this point is the asymmetry in the power levels. SW2 transmits with 1422uW, which SW4 receives with 104uW, yet SW2 receives the SW4 signal, sent with similar power, with only 54uW.

Vice versa for SW1-3.

Anyway the SFPs have RX sensitivity down to -18dBm (ca. 16uW) so in any case it should be fine... But nothing is.
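For anyone who wants to redo the arithmetic, a minimal sketch that converts the sfpshow readings above to dBm and works out the loss per direction and the margin against the -18dBm sensitivity. The helper names are made up; the numbers are the ones from the table.

```python
import math

RX_SENSITIVITY_DBM = -18.0  # RX sensitivity quoted for the SFPs


def uw_to_dbm(microwatts):
    """Optical power in microwatt -> dBm (0 dBm = 1 mW = 1000 uW)."""
    return 10 * math.log10(microwatts / 1000.0)


def dbm_to_uw(dbm):
    """dBm -> optical power in microwatt."""
    return 1000.0 * 10 ** (dbm / 10)


# (direction, wavelength in nm, TX uW at the sender, RX uW at the receiver)
LINKS = [
    ("SW1 -> SW3", 1550, 1234.3, 49.1),
    ("SW3 -> SW1", 1550, 1175.6, 95.2),
    ("SW2 -> SW4", 1610, 1422.0, 104.6),
    ("SW4 -> SW2", 1610, 1468.4, 54.3),
]


def link_report(tx_uw, rx_uw):
    """Return (path loss in dB, margin above RX sensitivity in dB)."""
    tx_dbm = uw_to_dbm(tx_uw)
    rx_dbm = uw_to_dbm(rx_uw)
    return tx_dbm - rx_dbm, rx_dbm - RX_SENSITIVITY_DBM


print(f"-18 dBm is about {dbm_to_uw(RX_SENSITIVITY_DBM):.1f} uW")
for name, nm, tx_uw, rx_uw in LINKS:
    loss_db, margin_db = link_report(tx_uw, rx_uw)
    print(f"{name} ({nm}nm): loss {loss_db:5.2f} dB, margin {margin_db:5.2f} dB")
```

In dB terms the two directions of each fabric differ by about 3 dB in loss, while every receiver still sits roughly 5 to 8 dB above the quoted sensitivity.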

Some SFPs have been diagnosed as malfunctioning by the manufacturer (the 1550nm ones shown above with "ko"). The 1610nm ones apparently are ok, they have been tested using a traffic generator. The leased line has also been tested more than once. All is within tolerances. I'm awaiting the replacements but for some reason I don't believe it will make things better as the apparently good ones don't produce ZERO errors either.

Earlier there was active equipment involved (some kind of 4GFC retimer) before putting the signal on the line. No idea why. That equipment was eliminated because of the problems so we now only have:

  • the long distance laser in the switch,
  • (new) 10m LC-SC monomode cable to the mux (for each fabric),
  • the leased line,
  • the same thing but reversed on the other side of the link.


FC switches

Here is a port config from the Brocade portcfgshow (it's like that on both sides, obviously)

Area Number:              0
Speed Level:              4G
Fill Word(On Active)      0(Idle-Idle)
Fill Word(Current)        0(Idle-Idle)
AL_PA Offset 13:          OFF
Trunk Port                ON
Long Distance             LS
VC Link Init              OFF
Desired Distance          32 Km
Reserved Buffers          70
Locked L_Port             OFF
Locked G_Port             OFF
Disabled E_Port           OFF
Locked E_Port             OFF
ISL R_RDY Mode            OFF
RSCN Suppressed           OFF
Persistent Disable        OFF
LOS TOV enable            OFF
NPIV capability           ON
QOS E_Port                OFF
Port Auto Disable:        OFF
Rate Limit                OFF
EX Port                   OFF
Mirror Port               OFF
Credit Recovery           ON
F_Port Buffers            OFF
Fault Delay:              0(R_A_TOV)
NPIV PP Limit:            126
CSCTL mode:               OFF
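As a side note on the "Desired Distance 32 Km" / "Reserved Buffers 70" pair: a rule of thumb commonly quoted for Brocade long distance ports is (distance in km x speed in Gbps / 2) + 6 buffers. That formula is an assumption here, not something taken from the output above, so check it against the Fabric OS documentation for your release. A small sketch with a made-up function name:

```python
import math


def reserved_buffers(distance_km, speed_gbps, overhead=6):
    """Rule-of-thumb buffer credits for a long distance ISL.

    Assumption: credits = distance_km * speed_gbps / 2 + overhead,
    with overhead = 6 as usually quoted for Brocade switches.
    """
    return math.ceil(distance_km * speed_gbps / 2) + overhead


print(reserved_buffers(32, 4))  # desired distance configured on the port
print(reserved_buffers(15, 4))  # actual length of the leased line
```

With 32 km at 4G this gives 70, which matches the Reserved Buffers value shown, so the port at least has more credits than the 15 km line should need under this rule.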

Forcing the links to 2GbFC produces no errors, but we bought 4GbFC and we want 4GbFC.

[Image: error and traffic graphs]

I don't know where to look anymore. Any ideas what to try next or how to proceed?

If we can't make 4GbFC work reliably I wonder what the people working with 8 or 16 do... I don't assume that "a few errors here and there" are acceptable.

Oh and BTW we are in contact with every one of the manufacturers (FC switch, MUX, SFPs, ...). Except for the SFPs to be changed (some have been changed before) nobody has a clue. Brocade SAN Health says the fabric is ok. MUX, well, it's passive, it's only a prism, nature at its best.

Any shots in the dark?


APPENDIX: Answers to your questions

@Chopper3: This is the second generation of Brocades exhibiting the problem. Before we had 5000s, now we have 5100s. In the beginning when we still had the active MUX we rented a longdistance laser once to put it into the switch directly in order to make tests for a day, during that day of course it was clean. But as I said, sometimes it's clean just like that. And sometimes it's not. Alternative switches would mean to rebuild the entire SAN with those only to test. Alternative SFPs, well they're hard to come by just like that.

@longneck: The line is rented. It's a dark fibre (9um monomode) so there's no one else on it. Sure there are splices. I can't go and look, so I have to trust they have been done correctly. As I said the line has been checked and rechecked (using an optical time-domain reflectometer). Obviously you don't have all this equipment yourself because it's way too expensive.

@mdpc: What would be the "wrong" type of cable according to you? Up to the switch everything is monomode, yes. The connectors are the correct ones too. Yeah I know there are the green ones where the fibre is cut off at a certain angle etc. But we have the correct ones as far as I know.


Progress Report #1

We have had two fabrics (=2x2 switches) with Brocade 5100s with FabricOS 6.4.1 and two fabrics (another 2x4 switches) on FabricOS 7.0.2.

On the long distance ISLs (one in each fabric) it turned out that FOS 6.4.1 only issues warnings about the VC Init setting, and consequently the fill word, when you set a port to long distance. But those are only warnings. FOS 7.0.2 requires you to modify VCI and the fill word for long distance links.

Setting FOS 6.4.1 to the LS (long-distance static distance) setting with the wrong VCI and fillword settings made the whole fabric inoperable (stuck in an SCN loop; use fabriclog -s to see it, you don't see it anywhere else, no port error counters or anything increasing).

Currently I'm giving the one fabric with the IMHO more correct settings a beating and it seems to do fine, whereas the other one without much traffic still has errors here and there.

[Image: progress1]

In short:

  • We have eliminated the active part of the MUX (the FC retimer).
  • We are putting the long distance SFPs into the end equipment themselves.
  • Just to be sure we bought new monomode cables to connect the end equipment to the remaining passive part of the MUX.
  • We are now trying out several long distance configs.

It's almost black magic. Everything that happens is mostly empirical, no one seems to have a clue what the exact reasons are to do something. ("We have tried this, and it didn't work, then we tried that and it worked, so we stuck with that." But no one really seems to know why.)

I'll keep you updated.


Progress Report #2

We got the new lasers for one of the fabrics on warranty. It's ultra clean even on 4GbFC.

They're transmitting with roughly 2mW (3dBm) whereas the others are only at 1.5mW (ca. 1.8dBm), although that should really be enough.

The other fabric (where the lasers are apparently ok) still produces one or two CRCs infrequently.

Using sfpshow the SFP producing the actual RX errors shows

Status/Ctrl: 0x82
Alarm flags[0,1] = 0x5, 0x40
Warn Flags[0,1] = 0x5, 0x40

Now I'll have to find out what that means. Not sure if it was there before.
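One way to read those values is against the alarm/warning flag bytes defined in SFF-8472. This is a sketch under the assumption that sfpshow's "flags[0,1]" are those two bytes in order, which the output itself does not confirm; the table and function names are made up for illustration.

```python
# Bit names from the SFF-8472 alarm/warning flag bytes.
# Assumption: sfpshow's "flags[0,1]" correspond to these two bytes, in order.
FLAG_BYTE_0 = {
    7: "Temp High", 6: "Temp Low",
    5: "Vcc High", 4: "Vcc Low",
    3: "TX Bias High", 2: "TX Bias Low",
    1: "TX Power High", 0: "TX Power Low",
}
FLAG_BYTE_1 = {7: "RX Power High", 6: "RX Power Low"}


def decode_flags(byte0, byte1):
    """Return the names of all flag bits set in the two flag bytes."""
    names = []
    for value, table in ((byte0, FLAG_BYTE_0), (byte1, FLAG_BYTE_1)):
        for bit in sorted(table, reverse=True):
            if value & (1 << bit):
                names.append(table[bit])
    return names


print(decode_flags(0x05, 0x40))
# ['TX Bias Low', 'TX Power Low', 'RX Power Low']
```

If that mapping holds, both the alarm and the warning flags point at low TX bias, low TX power and low RX power, which is something to confirm with the SFP vendor rather than take from this sketch.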

Well I'll first clear my head with a week of vacation. 8-)

HopelessN00b
Marki
  • First of all, great question, exactly what this site is for, well done. Secondly have you access to alternative switches/SFP's - ideally another make/model that you could swap in to test? – Chopper3 Aug 26 '13 at 23:10
  • Is your line leased, or is it yours? Are there any splices mid-span? Have they been re-cleaved and resealed against moisture intrusion? – longneck Aug 27 '13 at 00:04
  • Are you sure you have the right type of cable? – mdpc Aug 27 '13 at 03:03
  • I'm replying with edits to the original question. (see above) – Marki Aug 27 '13 at 09:38
  • Great update, keep up the good work, wish I had some suggestions or advice but you're on the right track, nice to find a new user on SF who knows their stuff :) – Chopper3 Aug 29 '13 at 15:38
  • Are there any consistencies in the time or duration of the errors? Do they always occur at N hour? Do they always last X minutes? Can you correlate them with weather, nearby sporting events, or other phenomena? Intermittent issues are the hardest bugs to squash, and I usually start attacking them by graphing the times and durations that they occur on a whiteboard. Hopefully patterns emerge which could be correlated with [other phenomena](http://thedailywtf.com/Comments/CSI-Server-Room.aspx). – dotancohen Aug 29 '13 at 15:39
  • I believe we have already thought about almost anything, even and odd days, new moon, full moon, full moon hidden by clouds, ...... ;-) Could be as simple as the cleaning person opening and closing the door making the connector buckle. I don't think they clean that often, but I also don't know what happens at these exact moments... – Marki Aug 29 '13 at 15:43
  • Are you tracking them on a whiteboard, visible to _everybody_? I won't press, but I highly recommend it. Like you said, you need a fresh pair of eyes and maybe someone in your organization will see the pattern emerge from the times/durations, and not necessarily from the symptoms. – dotancohen Aug 29 '13 at 15:46
  • Hi Marki. I'm not entirely familiar with what you're talking about, but by your last update it seems as though the problem was fixed by the replacement SFP's? If so, probably a good idea to post this as an answer and ask a new question if you have further issues. – Mark Henderson Aug 31 '13 at 21:58
  • It may be, but I wouldn't completely understand why yet, since most of the stuff seems to be within the specs. Except for the alerts I now saw. Now I'm on vacation but I'll keep this updated when I have conclusions, don't worry. – Marki Sep 02 '13 at 16:33
  • I am curious about your CWDM units. CWDM is an evil beast; if you have 40KM lasers on a 15KM link, and there's a reflection, you may find that the SFPs are seeing light from themselves. Is it possible to add attenuators to try and bring the light levels down somewhat? – Tim Woolford Sep 15 '13 at 20:40

1 Answer

Ok, I guess I need to post an answer. In one word it is: insist.

The problem is not resolved 100% to my liking, as we still have one fabric with 1 (one) CRC error sporadically. The other one is clean. But I can live with that.

In any case we won't continue to use the CWDM units for a very long time, but rather switch to a passive DWDM multiplexer next year as our infrastructure will change a lot. Apparently DWDM lasers are less expensive than the CWDM ones too. Oh we'll see and maybe I'll have lots of problems to ask you then :-)


Update: Nope to the above, we bought CWDM again, and it's really less expensive. AFAICS for certain applications however, you have to go DWDM because there are no CWDM lasers for them. Finally we tried to get as close to the manufacturer as we could and the whole thing came at about 1/5 of the price compared to buying from a distributor or even an integrator.


So I can conclude, if you bought a solution that doesn't work as expected: insist. On the technical side we did two things

  • remove the active part of the MUX (can't say I regret that, but also not sure if that was finally another source of error or not)
  • have the SFPs thoroughly checked

(And of course all the standard diagnostics, change one thing at a time, see what happens etc, don't need to tell you that. So we checked each line and cable etc. too, unfortunately at our expense.)

In this case it took a long time of insisting but finally we got to the level where the manufacturer itself spared a few people and some equipment to perform the checks that helped. And of course we had the integrator pay for that, since our hardware is under maintenance. So this was as much a commercial challenge as it was a technical one.

PS. Oh and, the flags that I mentioned in my last update didn't indicate anything bad, but I don't remember what they exactly meant. When I find the statement I'll update the answer for completeness' sake.


In the end, the flags meant something bad after all. Apparently it is however not certain which side of the link is the cause for the errors. So that pair has to be changed too.

Oh and BTW, 8GbFC DWDM transceivers are only cheaper compared to 8G CWDM ;-) The cheapest way to go is 4GbFC on CWDM and then use ISL trunking (if you have the license)

Marki
  • I didn't see this when it was asked, unfortunately. I can't tell you for sure that this would help, but if you're using idle-idle fillwords, you're sending a lot of light. This means that each unused frame is pulling a lot of power and generating a lot of heat on the SFP, I think. Changing the fillword to some other mode (I use mode 3, but I have a different switch and SFP) might allow you to push more throughput with fewer errors. – Basil Aug 03 '14 at 15:01
  • @Basil I knew using the correct fillword was a problem for word synchronization at 8GFC but I've never thought about it this way... – Marki Aug 11 '14 at 15:50
  • It's recommended any time you can use it- as far as I can tell, it's a question of how much interference an idle frame causes its SFP to create. – Basil Aug 11 '14 at 19:02