
Got a problem here on a 2016 Windows Server Failover Cluster (WSFC) hosting a SQL Failover Cluster Instance (FCI) using Storage Spaces Direct (S2D). On each server, after successful initial creation, S2D automatically added an otherwise unused RAID volume to the storage pool (even though S2D cannot be created on RAID volumes and absolutely insists on unraided disks). Now the setup is broken, as far as I could figure out due to exactly that. As a consequence, the virtual disk is offline, taking the whole cluster down with it. It won't come back online, due to a missing cluster network resource. The disks in question can be retired but not removed. Repair-VirtualDisk does nothing, and the cluster validation test claims an invalid configuration.

This is a new setup, so I could simply delete the virtual disk, the cluster, or even the servers and start over. But before we go productive, I need to make sure this does not ever happen again. A system that shoots itself in the virtual knee to a crashing halt just by needlessly and wrongly adding an unsupported disk is no platform we can deploy. So primarily I need a way to prevent this from happening, rather than to repair it now. My guess is that preventing an S2D setup from grabbing more disks than it was created on would do the trick. The cost of potentially more manual interaction during a real disk replacement is negligible compared to the clusterf... we have here. Much as I have browsed the documentation so far, however, I cannot find any way to control that. Unless I'm missing something, neither Set-StoragePool, Set-VirtualDisk nor Set-Volume offers any parameter to that effect.

Any help or hint would be greatly appreciated.

Following are just more details on the above: We have 2 HPE DL380 Gen9 servers, doubly connected to each other via RDMA-capable 10 GbE and to the client net via 1 GbE. Each features a HP ??? RAID controller and a simple HP ??? HBA (since S2D absolutely requires and works only with directly attached, unraided disks). The storage configuration comprises an OS RAID on the RAID controller, a Files RAID on the RAID controller, and the set of directly attached disks on the HBA intended for S2D.

I set up two Windows Server 2016 Datacenter installations on the OS RAIDs, installed the WSFC feature, ran and passed the cluster validation test including the S2D option, created the cluster without storage, added a file share witness (on a separate machine), and enabled S2D, whose storage pool automatically comprised all of the unraided disks. On top of that pool I created a virtual disk of the mirror type, formatted with NTFS, since this is supposed to be the file system of choice for a SQL FCI installation.
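Roughly, the steps above in PowerShell (node, witness share, and volume names are placeholders, not the actual ones):

```powershell
# Create the cluster without storage, add a file share witness, enable S2D,
# and create a mirrored NTFS volume on the auto-created pool
New-Cluster -Name SQLCLU -Node NODE1,NODE2 -NoStorage
Set-ClusterQuorum -FileShareWitness \\WITNESS\Quorum
Enable-ClusterStorageSpacesDirect          # pools all eligible (unraided) disks
New-Volume -StoragePoolFriendlyName "S2D*" -FriendlyName "SQLData" `
           -FileSystem NTFS -ResiliencySettingName Mirror -Size 2TB
```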

I then installed SQL 2016 standard edition as an FCI on that cluster, imported a database and tested it all. Everything was fine. Database was right there and faster than ever. Forced as well as automatic failover was a breeze. Everything looked good.

Next day we tried to make use of the remaining Files RAID. First thing was to change the RAID level, as we didn't like the pre-configuration. Shortly after deleting the pre-configured RAID volume and building a new one (on each server), we detected that the cluster was down. From what I could figure out so far, the pre-configured Files-RAID volume had in the meantime been automatically added to the pool, and as we had just deleted it, it was now missing from the pool. While I checked, I found the new Files RAID, while still being created, already shown as a physical drive of the pool as well. So the pool now included two RAID volumes on each server, one of which didn't even exist. These volumes (but not their disks) are listed by Get-PhysicalDisk along with the actually physical disks on the HBA; not sure if that's regular. The pool itself is still online and doesn't complain; the virtual disk, however, is not simply degraded for missing disks, but completely offline (and so is, in consequence, the whole cluster).

I was able to retire those physical disks (i.e. those which are actually the RAID volumes), and they are now marked as retired. But they are still in the pool, and trying to remove them fails. A Repair-VirtualDisk should rebuild the virtual disk to a proper state on just the remaining disks (I went by this: https://social.technet.microsoft.com/Forums/windows/en-US/dbbf317b-80d2-4992-b5a9-20b83526a9c2/storage-spaces-remove-physical-disk?forum=winserver8gen), but this job finishes immediately, "successful" of course, with no effect whatsoever.
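For reference, the recovery attempt described above looked roughly like this (the disk filter is illustrative; the actual friendly names of the phantom RAID volumes will differ):

```powershell
# Retire the phantom RAID-volume "disks", then ask Storage Spaces to
# rebuild the virtual disk onto the remaining physical disks
Get-PhysicalDisk | Where-Object FriendlyName -like "HP LOGICAL VOLUME*" |
    Set-PhysicalDisk -Usage Retired
Get-VirtualDisk | Repair-VirtualDisk
# Remove-PhysicalDisk would be the final cleanup step, but it fails here
# while the virtual disk still references the retired disks
```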

Trying to bring the virtual disk back online fails, stating that a networked cluster resource is unavailable. As far as I understand, this could only refer to the (available) storage pool, since the missing disks are not cluster resources. The pool shows no errors to fix. Running the cluster validation test claims a configuration not suited for a cluster.

I cannot find any part left that would budge another inch, the whole thing looks deadlocked for good. Any ideas on how to prevent a running WSFC from f...ing itself up that way?

I did not encounter any error message I found particularly enlightening, and I didn't want to bomb the page even more by posting all of them. If anyone wants to have any specific detail, just let me know.

Thanks a lot for your time, guys!

Karsten

Update as requested by Mr. Raspberry: [screenshot of the Get-PhysicalDisk output listing the drives and their bus types]

  • Could you please share a list of your drives and their bus types? PowerShell command: `Get-PhysicalDisk -CanPool $true | Sort Model | ft FriendlyName, BusType, CanPool, OperationalStatus, HealthStatus, Usage, Size` Also, is there any chance you made a mistake when reconfiguring the Files-RAID, assigning an S2D drive to the new RAID? – Mr. Raspberry May 17 '17 at 13:03
  • What's the point in S2D + SQL Server? Why spend money on unlimited licensed VMs if you don't plan on (actually can't...) running any? SQL Server 2016 can do Basic AlwaysOn AGs even with the Standard edition, and you can save a HUGE amount of money just using Windows Server 2016 Standard. https://docs.microsoft.com/en-us/sql/database-engine/availability-groups/windows/basic-availability-groups-always-on-availability-groups – BaronSamedi1958 May 18 '17 at 17:09
  • @Mr. Raspberry: I updated the entry with the list of physical disks. Please note that I left out "-CanPool $true" as none is poolable. – Karsten Köpnick May 18 '17 at 17:52
  • @Mr. Raspberry: Regarding a potential mistake: Quite sure I did not. S2D ignored anything from the RAID. When S2D was created, it took only the 8 disks per server on the HBA controllers, as visible above. I checked that. And the RAID volume definitely consists only of drives on the RAID controller, so I did not use any of the physical S2D disks for it. But by the time I deleted the RAID volume, S2D had also grabbed it for the pool, which caused the problem. – Karsten Köpnick May 18 '17 at 17:58
  • @BaronSamedi1958: Thanks for the suggestion. The use of WSFC was however already decided for other reasons, and as there is no enclosure, S2D was the way to go it seemed. – Karsten Köpnick May 18 '17 at 18:00
  • @KarstenKöpnick: Well, I would suggest you consider SQL Server AlwaysOn FCI + StarWind Virtual SAN Free. This configuration would do the job better in your 2-node case for less cost, and is much easier to deploy and manage, with no such issues. https://www.starwindsoftware.com/installing-and-configuring-a-sql-server-2012-failover-cluster – Mr. Raspberry May 19 '17 at 14:47
  • "S2D was the way to go it seemed" Well... Good luck with that :) – BaronSamedi1958 May 19 '17 at 22:06

2 Answers


Yes, you can disable the auto-pooling behavior. The experience is not great, but it's certainly doable and supported. The setting name and example cmdlet syntax are in the Settings section of this public doc:

https://technet.microsoft.com/en-us/windows-server-docs/failover-clustering/health-service-overview

Essentially, run this as Administrator:

Get-StorageSubSystem Cluster* | Set-StorageHealthSetting -Name "System.Storage.PhysicalDisk.AutoPool.Enabled" -Value False
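To confirm the change took effect, the Health Service setting can presumably be read back with the companion cmdlet:

```powershell
# Verify the auto-pool setting now reports False
Get-StorageSubSystem Cluster* |
    Get-StorageHealthSetting -Name "System.Storage.PhysicalDisk.AutoPool.Enabled"
```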

Hope this helps! - Cosmos (@cosmosdarwin), Microsoft PM

  • @CosmosDarwin: Thanks! Looks like that could do the trick. I need to read a bit more into the depths of it and understand the implications, then I'll give it a try and report. – Karsten Köpnick May 18 '17 at 18:03
  • @CosmosDarwin: Thanks a lot. I finally had the chance to delve deeper into the topic to find out about potential repercussions. As far as I can tell, with that option disabled, the only consequence is that disks have to be added to the pool manually with an Add-PhysicalDisk command. Which is a fine trade-off. I could not find any indications of other complications or disadvantages, so I will give this a try. - Just need to document the necessity of manually adding disks in case of a replacement. - I will report the results. – Karsten Köpnick Jun 12 '17 at 14:12
  • Reporting the results: I'd like to add that I could not gather any real-life experience with this approach. It was decided to add a disk enclosure and use that instead of S2D. Disk replacements in a RAID that size are a frequent task, and the requirement of having someone with sufficient expertise around at all times to perform a PowerShell intervention, even a documented one, for a simple disk swap was seen as a show stopper. Looking at it that way, I totally agree. So we re-installed using the enclosure and have had no problems since. - Thank you all for your kind and expert help. – Karsten Köpnick Nov 28 '17 at 15:36
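For reference, the manual pooling step mentioned in the comments above would presumably look like this (the pool name pattern is illustrative; S2D pools are typically named "S2D on <cluster>"):

```powershell
# With auto-pooling disabled, a replacement disk must be added by hand
Add-PhysicalDisk -StoragePoolFriendlyName "S2D*" `
                 -PhysicalDisks (Get-PhysicalDisk -CanPool $true)
```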

The workaround I've found for this problem is to change the bus type of the RAID volumes or disks from a supported type to an unsupported one.

You will have to identify the controller driver in Device Manager, then open the registry and find the driver name at the location below.

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\SmartPqi\Parameters

In my case, I changed the registry value so that the bus type reported as SAS becomes RAID:

"BusType"=dword:00000008 (RAID) instead of 0x0000000a (SAS)

Then reboot the machine.

After this change, the storage pool appears under the Windows Storage subsystem instead of Clustered Storage Spaces.
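As a sketch, the same registry change could be scripted as follows (service name and values are taken from this answer and apply to the SmartPqi driver; adjust for your own controller driver):

```powershell
# Change the reported bus type for the SmartPqi controller from SAS (0xA)
# to RAID (0x8) so Storage Spaces treats its disks as unsupported
Set-ItemProperty -Path "HKLM:\SYSTEM\CurrentControlSet\Services\SmartPqi\Parameters" `
                 -Name BusType -Value 0x00000008
Restart-Computer   # the change takes effect after a reboot
```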

Please be careful if you want to apply this type of workaround as it's not a validated solution and might expose your production environment to a high risk.

DragosT