There are 3 LUNs on an FC SAN that I want to access using 2 HBAs (with two paths each). When the system boots, everything seems fine, but after a while the sd* devices from the second HBA disappeared, and I have no idea why, or how to get them back without rebooting. Scanning the SCSI bus still finds all devices, but the kernel does not create the block devices. It's Red Hat 6.6 with the latest updates.

The same LUNs are available over 4 paths on another system, but only over 2 on this one.

Does anyone have a clue what I could be missing?

# lspci|grep Fibre
08:00.0 Fibre Channel: QLogic Corp. ISP2532-based 8Gb Fibre Channel to PCI Express HBA (rev 02)
08:00.1 Fibre Channel: QLogic Corp. ISP2532-based 8Gb Fibre Channel to PCI Express HBA (rev 02)


# lsscsi
...
[1:0:0:1]    disk    DataCore Virtual Disk     DCS   /dev/sdb
[1:0:0:2]    disk    DataCore Virtual Disk     DCS   /dev/sdc
[1:0:0:3]    disk    DataCore Virtual Disk     DCS   /dev/sdd
[1:0:1:1]    disk    DataCore Virtual Disk     DCS   /dev/sde
[1:0:1:2]    disk    DataCore Virtual Disk     DCS   /dev/sdf
[1:0:1:3]    disk    DataCore Virtual Disk     DCS   /dev/sdg
[2:0:0:1]    disk    DataCore Virtual Disk     DCS   -
[2:0:0:2]    disk    DataCore Virtual Disk     DCS   -
[2:0:0:3]    disk    DataCore Virtual Disk     DCS   -
[2:0:1:1]    disk    DataCore Virtual Disk     DCS   -
[2:0:1:2]    disk    DataCore Virtual Disk     DCS   -
[2:0:1:3]    disk    DataCore Virtual Disk     DCS   -
...
# rescan-scsi-bus.sh
...
0 new or changed device(s) found.
0 remapped or resized device(s) found.
0 device(s) removed.

This was logged when it happened:

May 24 12:08:57 hostname  kernel: sd 1:0:0:1: Parameters changed
May 24 12:08:57 hostname  kernel: sd 1:0:1:3: Parameters changed
May 24 12:09:01 hostname  kernel: sd 1:0:1:2: Parameters changed
May 24 12:09:24 hostname  kernel: sd 1:0:1:1: Parameters changed
May 24 12:09:24 hostname  kernel: sd 2:0:0:1: rejecting I/O to offline device
May 24 12:09:25 hostname  multipathd: checker failed path 8:112 in map lun0
May 24 12:09:25 hostname  multipathd: ora_data2: remaining active paths: 3
May 24 12:09:25 hostname  multipathd: checker failed path 8:128 in map lun1
May 24 12:09:25 hostname  multipathd: ora_acfs1: remaining active paths: 3
May 24 12:09:25 hostname  multipathd: checker failed path 8:144 in map lun2
May 24 12:09:25 hostname  multipathd: ora_acfs2: remaining active paths: 3
May 24 12:09:25 hostname  multipathd: checker failed path 8:160 in map lun0
May 24 12:09:25 hostname  multipathd: ora_data2: remaining active paths: 2
May 24 12:09:25 hostname  multipathd: checker failed path 8:176 in map lun1
May 24 12:09:25 hostname  multipathd: ora_acfs1: remaining active paths: 2
May 24 12:09:25 hostname  multipathd: checker failed path 8:192 in map lun2
May 24 12:09:25 hostname  multipathd: ora_acfs2: remaining active paths: 2
May 24 12:09:25 hostname  kernel: device-mapper: multipath: Failing path 8:112.
May 24 12:09:25 hostname  kernel: device-mapper: multipath: Failing path 8:128.
May 24 12:09:25 hostname  kernel: device-mapper: multipath: Failing path 8:144.
May 24 12:09:25 hostname  kernel: device-mapper: multipath: Failing path 8:160.
May 24 12:09:25 hostname  kernel: device-mapper: multipath: Failing path 8:176.
May 24 12:09:25 hostname  kernel: device-mapper: multipath: Failing path 8:192.
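
The failed paths in the multipathd lines above are given as major:minor numbers. A small sketch of mapping them back to sd names, assuming the standard Linux numbering for SCSI disk major 8 (16 minors per disk, so only sda to sdp are covered). The helper name `devnum_to_sd` is made up for illustration:

```shell
# Map a "major:minor" pair from the multipathd log to an sd device name.
# Assumption: SCSI disk major 8 only (sda..sdp, 16 minors per disk).
devnum_to_sd() {
    major=${1%%:*}
    minor=${1##*:}
    if [ "$major" -ne 8 ]; then
        echo "unsupported major $major" >&2
        return 1
    fi
    index=$((minor / 16))   # which disk (0 = sda)
    part=$((minor % 16))    # partition number, 0 = whole disk
    letter=$(printf '%s' abcdefghijklmnop | cut -c $((index + 1)))
    if [ "$part" -eq 0 ]; then
        echo "sd$letter"
    else
        echo "sd$letter$part"
    fi
}

# The six paths reported as failed in the log above.
for dev in 8:112 8:128 8:144 8:160 8:176 8:192; do
    printf '%s -> %s\n' "$dev" "$(devnum_to_sd "$dev")"
done
```

For the six paths in the log this gives sdh through sdm, which are presumably the host 2 devices, since sdb to sdg are still listed for host 1 in the lsscsi output.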

Unfortunately, I have no access to the SAN device, but I'm told nothing was touched.

I've just seen that the devices were in fact gone but came back 2 hours later:

May 24 14:06:35 hostname kernel: scsi 2:0:1:1: Attached scsi generic sg9 type 0
May 24 14:06:35 hostname kernel: scsi 2:0:1:2: Attached scsi generic sg10 type 0
May 24 14:06:35 hostname kernel: scsi 2:0:1:3: Attached scsi generic sg11 type 0
May 24 14:06:37 hostname kernel: scsi 2:0:0:1: Attached scsi generic sg12 type 0
May 24 14:06:37 hostname kernel: scsi 2:0:0:2: Attached scsi generic sg13 type 0
May 24 14:06:37 hostname kernel: scsi 2:0:0:3: Attached scsi generic sg14 type 0

It is possible that the FC switch in between was switched off during that time. When the system booted previously and the sd devices were created as usual, the log line differed slightly:

May 24 11:14:15 hostname kernel: sd 2:0:1:3: Attached scsi generic sg14 type 0

vs.

May 24 14:06:35 hostname kernel: scsi 2:0:1:1: Attached scsi generic sg9 type 0

It says "scsi" instead of "sd".
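
The `scsi` prefix (rather than `sd`) and the `-` in the lsscsi output both suggest that the SCSI devices on host 2 exist without a block device attached. A read-only sketch that lists such devices together with their state from sysfs. It assumes the `class/scsi_device/<H:C:T:L>/device/` layout with `state` and `block` entries as found on recent 2.6 kernels, and the function name `list_unbound_scsi` is hypothetical:

```shell
# List SCSI devices (H:C:T:L) that have no block device under sysfs.
# The sysfs root is a parameter so the function can be tried on a copy;
# it defaults to the real /sys. Nothing is written or changed.
list_unbound_scsi() {
    root=${1:-/sys}
    for dev in "$root"/class/scsi_device/*; do
        [ -d "$dev/device" ] || continue
        hctl=${dev##*/}
        state=$(cat "$dev/device/state" 2>/dev/null || echo unknown)
        # Note: non-disk devices (tapes, enclosures) never have a block entry.
        if [ ! -d "$dev/device/block" ]; then
            echo "$hctl state=$state"
        fi
    done
}

list_unbound_scsi "${SYSFS_ROOT:-/sys}"
```

If a device shows `state=offline`, writing `running` to that `state` file is the usual way to bring it back online. Whether that is enough to recreate the sd devices in this case is not confirmed here.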

Christian
  • Looks like a failure on one of your QLogic HBAs or the path connected to it. It could be anything from a bad SFP on the HBA or switch, a bad switch port, a bad cable, or someone changing the LUN mappings. – Andrew Henle May 25 '16 at 10:39
  • That's what I was suspecting, but shouldn't the sg-devices disappear in this case too? I can even query the devices with smartctl and they report correctly. – Christian May 25 '16 at 11:02
  • A bit more detail, please: what is the exact OS version? Also, are your FC firmware and drivers up to date? – Chopper3 May 25 '16 at 11:36
  • Have you rebooted? Without digging into lots of source code or hoping to stumble upon the answer online somewhere, I can't say *why* the devices still show. But I strongly doubt the kernel structures for devices are going to be freed when hardware fails. If you have rebooted, maybe the hardware failure (assuming that's what it is...) is such that the device can be seen but no data can be transferred. – Andrew Henle May 25 '16 at 11:38
  • Sorry, I accidentally edited it out: it's Red Hat 6.6 with the latest updates, and the firmware is up to date. Yes, after a reboot everything is fine, but the point is that the paths to the devices should come back without rebooting after a failure on the way to the storage :) – Christian May 25 '16 at 11:46
