
I have an HP DL380p Gen8 running Ubuntu 14.04, and it has apparently been having trouble with its RAID10 filesystem for almost a month, even though everything else seems fine. I'm seeing a lot of these messages in dmesg, syslog, etc.; the hex values in the Read lines vary somewhat from one occurrence to the next.

Nov 18 08:09:25 server03 kernel: sd 2:0:0:1: [sdb]  
Nov 18 08:09:25 server03 kernel: Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Nov 18 08:09:25 server03 kernel: sd 2:0:0:1: [sdb]  
Nov 18 08:09:25 server03 kernel: Sense Key : Medium Error [current] 
Nov 18 08:09:25 server03 kernel: sd 2:0:0:1: [sdb]  
Nov 18 08:09:25 server03 kernel: Add. Sense: Unrecovered read error
Nov 18 08:09:25 server03 kernel: sd 2:0:0:1: [sdb] CDB: 
Nov 18 08:09:25 server03 kernel: Read(16): 88 00 00 00 00 03 f8 48 f5 38 00 00 00 80 00 00
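
The Read(16) line is the raw SCSI command that failed, so it can be decoded to see which block the kernel was trying to read. Below is a minimal sketch, assuming the standard READ(16) layout (opcode 0x88, bytes 2 to 9 holding the big-endian logical block address, bytes 10 to 13 holding the transfer length in blocks). The function name `decode_read16` is made up for this example. The address is relative to the logical drive presented as /dev/sdb, not to any single physical disk.

```python
def decode_read16(cdb_hex):
    """Decode a SCSI READ(16) CDB given as space-separated hex bytes.

    Returns (lba, blocks): the starting logical block address and the
    number of blocks requested. Assumes the standard READ(16) layout.
    """
    cdb = bytes.fromhex(cdb_hex.replace(" ", ""))
    if len(cdb) != 16 or cdb[0] != 0x88:
        raise ValueError("not a 16-byte READ(16) CDB")
    lba = int.from_bytes(cdb[2:10], "big")      # bytes 2..9
    blocks = int.from_bytes(cdb[10:14], "big")  # bytes 10..13
    return lba, blocks


# CDB copied from the log excerpt above
lba, blocks = decode_read16("88 00 00 00 00 03 f8 48 f5 38 00 00 00 80 00 00")
print(hex(lba), blocks)
```

Decoding several of these lines and comparing the addresses may show whether the errors cluster in one region of the logical drive or are spread across it.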

The iLO and hpssacli both report all disks are fine and the filesystem is not read-only. The /dev/sdb device is a RAID10 using the server's RAID controller, consisting of 20 x 900 GB disks.

This is a production server. I've rebooted it once to try to clear this up, but I'm reluctant to run an fsck before working out what these messages mean, given that there are no other apparent issues.

So, any thoughts on what might be wrong here?

  • That disk is plainly _not_ fine. – Michael Hampton Dec 14 '15 at 21:51
  • Something is wrong, certainly, but so far everything I've looked at indicates nothing is wrong from a hardware perspective. I'll update the post with the RAID disk layout. Looking for other suggestions to narrow down the problem, whether it may be in the RAID itself or one of the disks and the controller just can't detect it for some reason. –  Dec 14 '15 at 22:01

2 Answers


Okay, I'll answer with the normal troubleshooting techniques, but here's my disclaimer:

  • I really don't advocate running Ubuntu on bare-metal hardware, especially HP ProLiant systems.
  • The support ecosystem is just not there for Ubuntu when it comes to HP systems, drivers, monitoring and value-add software.
  • The HP firmware packages are not built for Ubuntu, so god knows what firmware revisions you're running on.
  • Ubuntu tends to introduce some quirky bugs that I never see with more commercial Linux distributions.

Please provide the following in your question or a separate pastebin.

  • I'd like the output of hpssacli ctrl all show config.
  • I'd like the output of hpssacli ctrl all show config detail.
  • Please give the output of df -h and fdisk -l.
  • Please post the output of lsscsi.
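
If it is easier, the outputs requested above can be gathered into one report for pasting. This is only a rough sketch, assuming Python 3 is available on the server and that it is run as root (hpssacli and fdisk generally need that). The names `COMMANDS` and `collect` are made up for this example; the commands themselves are the ones listed above.

```python
import subprocess

# The commands requested in the list above.
COMMANDS = [
    ["hpssacli", "ctrl", "all", "show", "config"],
    ["hpssacli", "ctrl", "all", "show", "config", "detail"],
    ["df", "-h"],
    ["fdisk", "-l"],
    ["lsscsi"],
]


def collect(commands):
    """Run each command and return a single text report of all outputs."""
    sections = []
    for cmd in commands:
        header = "$ " + " ".join(cmd)
        try:
            # Popen + communicate also works on older Python 3 releases
            proc = subprocess.Popen(
                cmd,
                stdout=subprocess.PIPE,
                stderr=subprocess.STDOUT,  # keep error text with the output
                universal_newlines=True,
            )
            body = proc.communicate()[0]
        except OSError as exc:
            # e.g. the tool is not installed on this host
            body = "(could not run: %s)\n" % exc
        sections.append(header + "\n" + body)
    return "\n".join(sections)
```

Calling `print(collect(COMMANDS))` and redirecting the result to a file gives one document to paste. A command that is not installed is noted in the report rather than stopping the run.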

Since you're on Ubuntu, you probably don't have the HP Management Agents installed. While hpssacli can provide a spot check of the array health, the hp-snmp-agents package is what provides actual continuous monitoring.

If you do have some of the HP Health Agents installed, please run hplog -v to extract the IML log.


My guess is that you're running an HP ProLiant DL380p Gen8 25-bay SFF server. Unpatched, many of those units suffered from Smart Array controller and controller cache failures. There are also some critical expander backplane updates that need to be run on that platform.

ewwhite
  • I have pasted all requested output at http://pastebin.com/xv37rawG. I do need to review the firmware on this server and the others in the database cluster. I am curious, though, as to your further thoughts on Ubuntu and HP. We could discuss that through a private channel if you're so inclined. –  Dec 15 '15 at 19:20
  • Did you know you lost disks in July and November? – ewwhite Dec 15 '15 at 19:56
  • Yes, via email alerts from the iLO, and we had our on-site people replace them with new ones from HP. –  Dec 15 '15 at 20:01
  • The RAID controller should be on [firmware 6.68](http://h20566.www2.hpe.com/hpsc/swd/public/detail?sp4ts.oid=5194969&swItemId=MTX_96d4648b5e214cbb8c904dfda7&swEnvOid=4103#tab-history)... Everything else looks good. This may be an Ubuntu or `hpsa` driver problem. Are you on a current kernel? – ewwhite Dec 15 '15 at 20:07
  • No, it's on kernel 3.13.0-32, while the latest 14.04 kernel in the repo is 3.16.0-55. It's a bit of an ordeal to schedule an upgrade of the server since it's in a live 10-node database cluster. –  Dec 15 '15 at 20:37

I ended up fixing this by unmounting and recreating the filesystem. I haven't seen any error messages since re-enabling the database application on the server, even after it recreated nearly 4 TB of data from other cluster nodes. (I wonder whether the past disk replacements in this server somehow contributed to the filesystem getting corrupted.)