We have an HP MSA70 with 25 x 600GB HP SAS 10k DP drives, connected to an HP P800 controller. The drives are configured in RAID 6.

Yesterday, an unidentified event occurred and the array dropped offline. We rebooted the server (running CentOS 6.2) and, on startup, the array controller reported that 13 of the drives were "missing". When we look at the volume in the array management utility, there are two entries for each of slots 1-12: one shows a 600GB drive and one shows a 0GB drive. There are no entries after slot 12.

We contacted HP support, who escalated us to Tier 2 support; after many hours, they gave up. They said they had never seen this before (my favorite thing to hear from a vendor).

Has anybody seen this before, and have we lost all of the data?

Thank you.

NXTVoipguy
  • You didn't mention it but...backups? – Bart Silverstrim Aug 26 '15 at 01:23
  • @BartSilverstrim Maybe, but people don't expect their array to fail in this manner. In its day, the MSA70 was meant to be an expansion unit for HP MSA2000 SAN solutions. Failure is rare, but there are other components in the chain that may have failed. – ewwhite Aug 26 '15 at 01:44
  • What did HP try? – ewwhite Aug 26 '15 at 04:13
  • @ewwhite What I was thinking was that if there were backups, given the apparent age of the equipment, the OP could try finding spare parts and replacing the controller/backplane/etc. until something might be salvaged. Otherwise, the OP might want to use something like R-Tools or, at a minimum, dd to pull images, because troubleshooting failing equipment like that might damage data on the drives if they aren't damaged already. – Bart Silverstrim Aug 26 '15 at 11:49
  • Naw, the server won't POST or modify the drives in this state. His data is still there, but the cause of the SAS or backplane issues needs to be identified. – ewwhite Aug 26 '15 at 12:15

1 Answer

Old, old, old, old...

  • CentOS 6.2 is old. (December 2011, kernel 2.6.32-220)
  • HP StorageWorks MSA70 is old. (End of Life - October 2010)
  • HP Smart Array P800 is old. (End of Life - 2010)

So this makes me think that firmware and drivers are also old. E.g. there's no reason to run CentOS 6.2 in 2015... And I'm assuming no effort was made to keep anything current.

This also makes me think that the systems are not being monitored. Assuming HP server hardware, what did the system IML logs say? Are you running HP management agents? If not, important messages about the server and storage health could have been missed.

Did you check information from the HP Array Configuration Utility (or HP SSA)?

But in the end, you've probably suffered a port failure or expander/backplane failure:

  • How many SAS cables are connected to the enclosure? If 1 cable is connected, then you likely have a backplane issue because of the SAS expander in the enclosure.
  • If two cables are connected, you may have a SAS cable, MSA70 controller or P800 port failure.

Your data is likely intact, but you need to isolate the issue and determine which one of the above issues is the culprit. Replacing a SAS cable is a lot easier than swapping the MSA70 controller or RAID controller card... but I guess you can get another MSA70 for $40 on eBay...
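
The cable-count reasoning above can be written down as a small lookup. This is only a sketch of the triage logic in this answer, not an HP diagnostic; the function name `likely_failure_points` is made up for illustration, and the suspect lists simply restate the two bullets.

```python
from typing import List


def likely_failure_points(sas_cables_connected: int) -> List[str]:
    """Return the suspects named in the answer for a given cable count.

    Hypothetical helper: it only encodes the two bullet points above.
    """
    if sas_cables_connected < 1:
        raise ValueError("the enclosure needs at least one SAS cable")
    if sas_cables_connected == 1:
        # Single path: everything goes through the enclosure's SAS expander.
        return ["backplane / SAS expander in the enclosure"]
    # Two cables: any one element of a single path may have failed.
    return ["SAS cable", "MSA70 controller", "P800 port"]


for cables in (1, 2):
    print(cables, "cable(s):", ", ".join(likely_failure_points(cables)))
```

Whatever the list returns, the cheapest item to swap (the cable) is the sensible first test.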

ewwhite
  • Ok, I made an error in one item. The server (an HP DL360 G7) is running FreeNAS (FreeBSD). We replaced the IO module, the P800, the chassis, and the cable (there was one). No luck. HP downloaded the ADU log. Controller firmware is 7.22. – NXTVoipguy Aug 26 '15 at 16:02
  • @NXTVoipguy So if you were using FreeNAS on this setup, were you presenting a single logical drive to the FreeNAS/ZFS, or did you have a series of RAID-0 arrays? The HP Smart Array controllers aren't really equipped to run ZFS where you expose raw disks to the OS. Can you please clarify this? – ewwhite Aug 26 '15 at 19:02
  • Hello. We were presenting one single logical drive to the OS, using FreeNAS to present iSCSI volumes to a set of servers. Thanks again for responding. – NXTVoipguy Aug 26 '15 at 22:04
  • If one cable, that means you have a potential controller/backplane problem. Did you try the cable in the other P800 ports? – ewwhite Aug 26 '15 at 22:07
  • I tried the following: 1. Replaced cable, 2. Replaced P800 card, 3. Moved drives to another MSA70 chassis, 4. Built a new server with a new P800 card (updated firmware on card), 5. Moved MSA70 Controller from one slot to the other, 6. Pulled Drive in position 1, 7. Pulled Drives 1 and 2. – NXTVoipguy Aug 27 '15 at 14:55
  • And what is the result? What are you currently seeing? There should have been a variety of POST messages on the system, based on what you've described. – ewwhite Aug 27 '15 at 15:08
  • Always the same result. I have some screen shots, but I am new to this forum and I don't know if I can upload them. The POST message for the P800 says there is 1 logical drive, but drives 1,2,3,4,5,6,7,8,9,10,11,12 are missing. Then, when I go into the controller config, it shows something really odd. Port 2E, Box 1, Bay 1, 600.1GB SAS HDD OK. Next line reads: Port 2E, Box 1, Bay 1, 0.0GB SAS HDD Missing. – NXTVoipguy Aug 27 '15 at 15:30
  • I also just learned that when this event started, meaning, when the array became unavailable, one of my techs noticed that the LED on the MSA70 controller was amber. – NXTVoipguy Aug 27 '15 at 15:41
  • Sounds like a controller failure. Because FreeBSD (or FreeNAS) was in place, I'm thinking there were no alerts of anything leading up to this (because the OS isn't supported). If you only had one cable connected to the MSA70 enclosure, versus two for dual-pathing, then yes... this could destroy your array. – ewwhite Aug 27 '15 at 16:42
  • That does not sound promising. Any idea on how to recover it? – NXTVoipguy Aug 27 '15 at 17:01
  • No, I don't have any additional recommendations. – ewwhite Aug 27 '15 at 17:04
  • Ok. Thanks for taking a look. I appreciate your time. – NXTVoipguy Aug 27 '15 at 17:37
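
The duplicate entries described in the comments (a 600.1GB "OK" line and a 0.0GB "Missing" line for the same bay) are easier to see across 25 slots if the drive list is grouped by position. The sketch below assumes the lines look exactly like the two quoted in the comment above; real ACU/ADU output may be formatted differently, and `find_duplicate_bays` is a hypothetical helper, not part of any HP tool.

```python
import re
from collections import defaultdict

# Assumed line format, copied from the comment: "Port 2E, Box 1, Bay 1, 600.1GB SAS HDD OK"
LINE_RE = re.compile(
    r"Port (?P<port>\w+), Box (?P<box>\d+), Bay (?P<bay>\d+), "
    r"(?P<size_gb>[\d.]+)GB SAS HDD (?P<status>\w+)"
)


def find_duplicate_bays(lines):
    """Group entries by (port, box, bay) and keep positions listed more than once."""
    seen = defaultdict(list)
    for line in lines:
        m = LINE_RE.search(line)
        if m:
            key = (m["port"], int(m["box"]), int(m["bay"]))
            seen[key].append((float(m["size_gb"]), m["status"]))
    return {key: entries for key, entries in seen.items() if len(entries) > 1}


sample = [
    "Port 2E, Box 1, Bay 1, 600.1GB SAS HDD OK",
    "Port 2E, Box 1, Bay 1, 0.0GB SAS HDD Missing",
    "Port 2E, Box 1, Bay 2, 600.1GB SAS HDD OK",  # made-up line for contrast
]
for position, entries in find_duplicate_bays(sample).items():
    print(position, entries)
```

Only bay 1 is reported here, because it is the only position that appears twice in the sample.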