
I have an Apple Xserve RAID connected via Fibre Channel to a Dell PowerEdge R610. This server is mainly used for hosting Subversion repositories and storing disk images. Over the past six months or so we have run into problems with this setup where the RAID ends up remounted read-only after some errors. It seems fine when the load is minimal, but a few days ago, while we were copying some large disk images to it, it logged a series of errors and was remounted read-only.

The actual error messages begin with a series of task aborts:

May 17 15:20:09 sub0 kernel: [4661904.506886] mptscsih: ioc1: attempting task abort! (sc=ffff88011d2aea00)
May 17 15:20:09 sub0 kernel: [4661904.506890] sd 2:0:0:0: [sdb] CDB: Write(10): 2a 00 a8 17 2c ea 00 04 00 00
May 17 15:20:09 sub0 kernel: [4661904.507219] mptscsih: ioc1: task abort: SUCCESS (sc=ffff88011d2aea00)
...
May 17 15:21:42 sub0 kernel: [4661997.476282] mptscsih: ioc1: attempting target reset! (sc=ffff88011e632c00)
May 17 15:21:42 sub0 kernel: [4661997.476284] sd 2:0:0:0: [sdb] CDB: Write(10): 2a 00 a8 18 14 52 00 04 00 00
May 17 15:21:42 sub0 kernel: [4661997.494532] mptscsih: ioc1: target reset: SUCCESS (sc=ffff88011e632c00)
May 17 15:21:42 sub0 kernel: [4661997.494589] mptscsih: ioc1: attempting bus reset! (sc=ffff88011e632c00)
May 17 15:21:42 sub0 kernel: [4661997.494592] sd 2:0:0:0: [sdb] CDB: Write(10): 2a 00 a8 18 14 52 00 04 00 00
May 17 15:21:42 sub0 kernel: [4661997.495403] mptscsih: ioc1: bus reset: SUCCESS (sc=ffff88011e632c00)
May 17 15:21:52 sub0 kernel: [4662007.498403] mptscsih: ioc1: attempting host reset! (sc=ffff88011e632c00)
May 17 15:21:52 sub0 kernel: [4662007.498411] mptbase: ioc1: Initiating recovery
May 17 15:22:02 sub0 kernel: [4662016.680666] mptscsih: ioc1: host reset: SUCCESS (sc=ffff88011e632c00)
May 17 15:22:12 sub0 kernel: [4662026.686900] sd 2:0:0:0: Device offlined - not ready after error recovery
...
May 17 15:22:12 sub0 kernel: [4662026.687032] sd 2:0:0:0: [sdb] Unhandled error code
May 17 15:22:12 sub0 kernel: [4662026.687034] sd 2:0:0:0: [sdb] Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK
May 17 15:22:12 sub0 kernel: [4662026.687037] sd 2:0:0:0: [sdb] CDB: Write(10): 2a 00 a8 18 14 52 00 04 00 00
May 17 15:22:12 sub0 kernel: [4662026.720494] lost page write due to I/O error on sdb1
...
May 17 15:22:12 sub0 kernel: [4662027.117326] sd 2:0:0:0: [sdb] Unhandled error code
May 17 15:22:12 sub0 kernel: [4662027.117328] sd 2:0:0:0: [sdb] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
May 17 15:22:12 sub0 kernel: [4662027.117331] sd 2:0:0:0: [sdb] CDB: Write(10): 2a 00 a8 17 2c ea
May 17 15:22:12 sub0 kernel: [4662027.117339] 00 04 00 00
May 17 15:22:12 sub0 kernel: [4662027.122264] sd 2:0:0:0: [sdb] Unhandled error code
May 17 15:22:12 sub0 kernel: [4662027.122266] sd 2:0:0:0: [sdb] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
May 17 15:22:12 sub0 kernel: [4662027.122268] sd 2:0:0:0: [sdb] CDB: Write(10): 2a 00 a8 17 30 ea 00 04 00 00
May 17 15:22:12 sub0 kernel: [4662027.125053] sd 2:0:0:0: [sdb] Unhandled error code
May 17 15:22:12 sub0 kernel: [4662027.125055] sd 2:0:0:0: [sdb] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
May 17 15:22:12 sub0 kernel: [4662027.125058] sd 2:0:0:0: [sdb] CDB: Write(10): 2a 00 a8 18 18 52 00 04 00 00
May 17 15:22:12 sub0 kernel: [4662027.127869] sd 2:0:0:0: [sdb] Unhandled error code
May 17 15:22:12 sub0 kernel: [4662027.127871] sd 2:0:0:0: [sdb] Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK
May 17 15:22:12 sub0 kernel: [4662027.127874] sd 2:0:0:0: [sdb] CDB: Write(10): 2a 00 a8 18 10 62 00 03 e8 00
...
May 17 15:22:12 sub0 kernel: [4662027.130737] sd 2:0:0:0: [sdb] Unhandled error code
May 17 15:22:12 sub0 kernel: [4662027.405150] sd 2:0:0:0: [sdb] Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK
May 17 15:22:12 sub0 kernel: [4662027.405152] sd 2:0:0:0: [sdb] CDB: Write(10): 2a 00 a8 17 34 ea 00 04 00 00
May 17 15:22:12 sub0 kernel: [4662027.410575] JBD: Detected IO errors while flushing file data on sdb1
May 17 15:22:13 sub0 kernel: [4662028.182860] JBD: Detected IO errors while flushing file data on sdb1

At this point the array is remounted read-only. I am at a loss as to what the problem could be (I'm relatively new to dealing with Fibre Channel/RAID arrays of this type).
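As an aside for reading the log above, the `CDB: Write(10)` lines can be decoded by hand, since a WRITE(10) command block is ten bytes: opcode `2a`, a flags byte, a four-byte big-endian logical block address, a group byte, a two-byte transfer length in blocks, and a control byte. The following is only a minimal sketch: the helper name `decode_write10` is made up for illustration, and the 512-byte block size is an assumption, because the array's logical block size is not shown anywhere in this post.

```python
import struct


def decode_write10(cdb_hex, block_size=512):
    """Decode a WRITE(10) CDB given as space-separated hex bytes.

    block_size is an ASSUMPTION (512 bytes); the real logical block
    size of the array is not stated in the log.
    """
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] != 0x2A:
        raise ValueError("not a 10-byte WRITE(10) CDB")
    # Big-endian layout: opcode, flags, 4-byte LBA, group number,
    # 2-byte transfer length (in blocks), control byte.
    _opcode, _flags, lba, _group, blocks, _control = struct.unpack(">BBIBHB", cdb)
    return {"lba": lba, "blocks": blocks, "bytes": blocks * block_size}


# CDBs copied from the kernel log above.
for cdb in ("2a 00 a8 17 2c ea 00 04 00 00",
            "2a 00 a8 18 10 62 00 03 e8 00"):
    info = decode_write10(cdb)
    print(cdb, "->", "lba", hex(info["lba"]), "blocks", info["blocks"])
```

Read this way, the failing commands are ordinary large sequential writes to neighbouring block addresses, which matches the disk image copy that was running at the time.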

System information (let me know if I can provide anything else that may be helpful):

sysadmin@sub0:~$ lspci   # snipped to what I presume is the relevant output
03:00.0 SCSI storage controller: LSI Logic / Symbios Logic SAS1068E PCI-Express Fusion-MPT SAS (rev 08)
05:00.0 Fibre Channel: LSI Logic / Symbios Logic FC949ES Fibre Channel Adapter (rev 02)
05:00.1 Fibre Channel: LSI Logic / Symbios Logic FC949ES Fibre Channel Adapter (rev 02)

sysadmin@sub0:~$ cat /proc/mpt/summary
ioc0: LSIFC949E, FwRev=01031700h, Ports=1, MaxQ=1023, LanAddr=00:06:2B:1B:89:14, IRQ=40
ioc1: LSISAS1068E B3, FwRev=00192f00h, Ports=1, MaxQ=266, IRQ=16
ioc2: LSIFC949E, FwRev=01031700h, Ports=1, MaxQ=1023, LanAddr=00:06:2B:1B:89:15, IRQ=50

sysadmin@sub0:~$ cat /proc/mpt/version
mptlinux-3.04.12
  Fusion MPT base driver
  Fusion MPT FC host driver
  Fusion MPT SAS host driver

sysadmin@sub0:~$ cat /etc/issue
Ubuntu 10.04.2 LTS \n \l

Full /var/log/messages: https://gist.github.com/96df4b5b9ac7ec46f74c#file_messages

Full /var/log/kern.log: https://gist.github.com/96df4b5b9ac7ec46f74c#file_kern.log

Thanks for taking the time to read and for any help you can provide.

1 Answer


It would be helpful to know more about how the RAID is actually configured, e.g. volume, size, RAID level, stripe and chunk sizes, and whether you're using multipath.

You're getting an error-handling escalation because the aborted command isn't being handled to the satisfaction of the low-level driver and the SCSI mid-layer; that's why the recovery severity keeps climbing (task abort, then target reset, bus reset and host reset in your log). How it got there to begin with is going to require a lot of analysis, such as recording a blktrace. All I can recommend with this very limited information is to try upgrading your driver using one of the LTS backport kernels (Oneiric, for example) and attempt to recreate the problem; the mptsas driver you're using is very old. If you look hard enough, you might be able to update just that driver using a DKMS package.
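To make the escalation easier to see in a long kern.log, something like the following sketch could group the `mptscsih` recovery messages by command pointer (`sc=...`) and list the steps each command went through. It is not part of any driver tooling: the function name `escalation_by_command` and the regular expressions are assumptions based only on the message formats visible in the log excerpt in the question, and other driver versions may word these lines differently.

```python
import re

# Patterns modelled on the lines in the question, for example:
#   mptscsih: ioc1: attempting task abort! (sc=ffff88011d2aea00)
#   mptscsih: ioc1: task abort: SUCCESS (sc=ffff88011d2aea00)
ATTEMPT = re.compile(
    r"mptscsih: ioc\d+: attempting (?P<step>[a-z ]+)! \(sc=(?P<sc>[0-9a-f]+)\)")
RESULT = re.compile(
    r"mptscsih: ioc\d+: (?P<step>[a-z ]+): (?P<status>[A-Z]+) \(sc=(?P<sc>[0-9a-f]+)\)")


def escalation_by_command(lines):
    """Return {sc: [(step, status), ...]} in order of first appearance."""
    history = {}
    for line in lines:
        m = ATTEMPT.search(line)
        if m:
            # Record the attempt; status stays None until a result is seen.
            history.setdefault(m["sc"], []).append([m["step"], None])
            continue
        m = RESULT.search(line)
        if m:
            steps = history.setdefault(m["sc"], [])
            # Attach the result to the latest pending attempt of that step.
            for entry in reversed(steps):
                if entry[0] == m["step"] and entry[1] is None:
                    entry[1] = m["status"]
                    break
            else:
                steps.append([m["step"], m["status"]])
    return {sc: [tuple(e) for e in steps] for sc, steps in history.items()}


# A few lines copied from the log in the question.
SAMPLE = [
    "May 17 15:20:09 sub0 kernel: [4661904.506886] mptscsih: ioc1: attempting task abort! (sc=ffff88011d2aea00)",
    "May 17 15:20:09 sub0 kernel: [4661904.507219] mptscsih: ioc1: task abort: SUCCESS (sc=ffff88011d2aea00)",
    "May 17 15:21:42 sub0 kernel: [4661997.476282] mptscsih: ioc1: attempting target reset! (sc=ffff88011e632c00)",
    "May 17 15:21:42 sub0 kernel: [4661997.494532] mptscsih: ioc1: target reset: SUCCESS (sc=ffff88011e632c00)",
    "May 17 15:21:42 sub0 kernel: [4661997.494589] mptscsih: ioc1: attempting bus reset! (sc=ffff88011e632c00)",
    "May 17 15:21:42 sub0 kernel: [4661997.495403] mptscsih: ioc1: bus reset: SUCCESS (sc=ffff88011e632c00)",
    "May 17 15:21:52 sub0 kernel: [4662007.498403] mptscsih: ioc1: attempting host reset! (sc=ffff88011e632c00)",
    "May 17 15:22:02 sub0 kernel: [4662016.680666] mptscsih: ioc1: host reset: SUCCESS (sc=ffff88011e632c00)",
]

for sc, steps in escalation_by_command(SAMPLE).items():
    print(sc, " -> ".join(f"{step} ({status})" for step, status in steps))
```

To run it over a whole log, pass an open file object instead of `SAMPLE`. Note that each reset in your excerpt reports SUCCESS and the device is still offlined afterwards, so a summary like this only shows how far recovery escalated, not why the original command stalled.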

If you still have the problem, then you'll have to weigh your capacity for digging in and running this problem down against seeking additional support from your OS vendor. These are the sort of problems support contracts are made to resolve. Whichever way you go, be prepared to commit weeks, not days, to determining the root cause. Good luck.

ppetraki