
Due to Hurricane Matthew, our company shut down all servers for two days. One of the servers was an ESXi host with an attached HP StorageWorks MSA60.

When we powered things back up today and logged into the vSphere client, we noticed that none of our guest VMs were available (they're all listed as "inaccessible"). When I look at the hardware status in vSphere, the array controller and all attached drives appear as "Normal", but the drives all show up as "unconfigured disk".

We rebooted the server and tried going into the RAID config utility to see what things look like from there, but we received the following message:

An invalid drive movement was reported during POST. Modifications to the array configuration following an invalid drive movement will result in loss of old configuration information and contents of the original logical drives

(screenshot of the POST warning shown above)

Needless to say, we're very confused by this because nothing was "moved"; nothing changed. We simply powered up the MSA and the server, and have been having this issue ever since.

The MSA is attached via a single SAS cable, and the drives are labeled with stickers, so I know the drives weren't moved or switched around:

---------------------
| 01 | 04 | 07 | 10 |
---------------------
| 02 | 05 | 08 | 11 |
---------------------
| 03 | 06 | 09 | 12 |
---------------------

At the moment, I don't know what make and model the drives are, but they are all 1TB SAS drives.

I have two main questions/concerns:

  1. Since we did nothing more than power the devices off and back on, what could've caused this to happen? I of course have the option to rebuild the array and start over, but I'm leery about the possibility of this happening again (especially since I have no idea what caused it).

  2. Is there a snowball's chance in hell that I can recover our array and guest VMs, instead of having to rebuild everything and restore our VM backups?

  • First of all, call HP right now. You may not have a contract with them, but anything they might have to charge you will be money well spent. In the meantime, unplug and reseat all disks, cables, shelf controllers and disk controllers (you never know), but don't do anything to the array until HP have had a look. – Chopper3 Oct 08 '16 at 18:09
  • Can you give us the layout of the 12 disks bays in the MSA60? Also, how is the JBOD enclosure cabled to the server? One SAS cable? Two SAS cables (dual-domain)? – ewwhite Oct 08 '16 at 18:09
  • Also, what are the make/model/capacity of the disks installed? – ewwhite Oct 08 '16 at 18:17
  • I tried adding the info about the MSA and the drives, but it's appearing as "bold" or strong text (even though I didn't format it that way)... maybe a mod can edit it for me. – John 'Shuey' Schuepbach Oct 08 '16 at 19:04
  • It's the messages that appeared _before_ this one that you'll want to get pictures of. – Michael Hampton Oct 08 '16 at 19:04
  • @John'Shuey'Schuepbach They're not HP disks? – ewwhite Oct 08 '16 at 19:05
  • @ewwhite I'm not sure... this is a very old server and MSA that were both "hand-me-downs" from our PACS vendor. The server is an HP ProLiant DL360 G8. I'm no longer at work, so I can't check any of this info - I'll be back there tomorrow morning. – John 'Shuey' Schuepbach Oct 08 '16 at 19:12

2 Answers


Right, this is a very precarious situation...

So the HP Smart Array controller can handle a certain number of physical drive movements before it breaks the array configuration. Remember that HP RAID metadata lives on the physical drives and not the controller...

The MSA60 is a 12-bay 3.5" first-generation SAS JBOD enclosure. It went end-of-life in 2008/2009. It's old enough that it shouldn't be in the critical path of any vSphere deployment today.

In this case, the P411 controller is trying to protect you. You may have sustained a multiple-drive failure, hit a firmware bug, lost one of the two controller interfaces in the rear of the MSA60, or run into some other odd error.

This sounds like an older server setup as well. So I'd like to know the server involved and the Smart Array P411 firmware revision.


I'd suggest removing power to all of the components, waiting a few minutes, powering on... and watching the POST prompts very closely.

See the details in my answer here:
logical drives on HP Smart Array P800 not recognized after rebooting

There may be a prompt to re-enable a previously failed logical drive, with an option to press F1 or F2. If presented, try F2.
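
If the HP utilities bundle is installed on the ESXi host, you can also check the controller's view of things from the ESXi shell before touching anything. This is only a sketch, assuming the hpssacli offline bundle is present; the binary path and "all" controller selection are assumptions and may differ on your install:

    # Overall controller status (read-only; does not touch the array config)
    /opt/hp/hpssacli/bin/hpssacli ctrl all show status

    # The controller's view of arrays, logical drives and physical drives
    /opt/hp/hpssacli/bin/hpssacli ctrl all show config detail

Both are read-only "show" commands, so they won't modify the configuration the controller is trying to protect.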

ewwhite

You guys are not going to believe this...

First I attempted a fresh cold boot of the existing MSA, waited a couple of minutes, then powered up the ESXi host, but the issue remained. I then shut down the host and the MSA, moved the drives into our spare MSA, powered it up, waited a couple of minutes, then powered up the ESXi host; the issue still remained.

At that point, I figured I was pretty much screwed, and there was nothing during the initialization of the RAID controller where I had an option to re-enable a failed logical drive. So I booted into the RAID config utility, verified again that there were no logical drives present, and created a new logical drive (RAID 1+0 with two spare drives; the same as we did about two years ago when we first set up this host and storage).
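
For reference, the equivalent logical drive could also be created from the command line with hpssacli; this is only a sketch, and the controller slot, drive IDs and spare IDs below are placeholders rather than the actual values from this MSA:

    # Create a RAID 1+0 logical drive from ten specific drives (slot and drive IDs are placeholders)
    hpssacli ctrl slot=1 create type=ld \
      drives=1E:1:1,1E:1:2,1E:1:3,1E:1:4,1E:1:5,1E:1:6,1E:1:7,1E:1:8,1E:1:9,1E:1:10 raid=1+0

    # Assign the remaining two drives as spares (array letter is a placeholder)
    hpssacli ctrl slot=1 array A add spares=1E:1:11,1E:1:12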

Then I let the server boot back into vSphere and I accessed it via vCenter. The first thing I did was remove the host from inventory, then re-add it (I was hoping to clear all the inaccessible guest VMs this way, but it didn't clear them from the inventory). Once the host was back in my inventory, I removed each of the guest VMs one at a time.

Once the inventory was cleared, I verified that no datastore existed and that the disks were basically ready and waiting as "data disks". So I went ahead and created a new datastore (again, the same as we did a couple of years ago, using VMFS). I was eventually prompted to specify a mount option and had the choice of "keep the existing signature". At this point, I figured it'd be worth a shot to keep the signature - if things didn't work out, I could always blow it away and re-create the datastore again.

After I finished building the datastore with the keep-signature option, I tried navigating to the datastore to see if anything was in it - it appeared empty. Just out of curiosity, I SSH'd to the host and checked from there, and to my surprise, I could see all my old data and all my old guest VMs! I went back into vCenter, re-scanned storage, refreshed the console, and all of our old guest VMs were there! I re-registered each VM and was able to recover everything! All of our guest VMs are back up and successfully communicating on the network.
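
For anyone who ends up in the same spot, here is a rough sketch of the equivalent steps from the ESXi shell; the datastore label and VM path below are placeholders, not the actual names from this host:

    # List VMFS volumes that ESXi currently treats as snapshot/unresolved copies
    esxcli storage vmfs snapshot list

    # Mount an unresolved volume while keeping its existing signature
    esxcli storage vmfs snapshot mount -l "datastore1"

    # Re-register a recovered VM by the path to its .vmx file
    vim-cmd solo/registervm /vmfs/volumes/datastore1/MyVM/MyVM.vmx

    # Confirm the VMs are registered
    vim-cmd vmsvc/getallvms

Keeping the existing signature is what mounts the old VMFS volume as-is instead of formatting a new datastore over it.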

I think most people in the IT community would agree that the chances of having something like this happen are extremely low to impossible.

As far as I'm concerned, this was a miracle of God...

  • Yeah, I think you're right, that was close to being a miracle, count yourself VERY lucky indeed. Now back the lot up and restore to something supportable please. – Chopper3 Oct 09 '16 at 17:30
  • It’s actually not a miracle… but that was a lot of experimentation and effort without identifying the root cause, which could _still_ be the Smart Array P411 RAID controller in your host server. E.g. this could happen again. Did you respond with the firmware version of the controller? – ewwhite Oct 09 '16 at 17:30
  • @ewwhite 6.64 (Oct 2015). I would've liked to have figured out the root cause, but these servers needed to be up asap. I'm certain my boss will be looking into a replacement server/storage asap as well. – John 'Shuey' Schuepbach Oct 09 '16 at 22:34
  • If nothing else... now sounds like a great time to run some backups. – Journeyman Geek Oct 09 '16 at 23:48