
We have a Microsoft Failover Cluster with dynamic disks managed by Veritas Storage Foundation. Today the sysadmins added a new disk for SQL Server, but the allocation unit size (NTFS cluster size) on the volume was wrong, so I ran a quick format to change it.

The disk volume failed, the SQL Server group failed as well and the cluster became unresponsive. After some minutes I managed to fail over to a passive node.

The SAN admins say it's my fault: I shouldn't have formatted the disk from the Windows format applet and should have used Veritas Enterprise Administrator instead.

Can a format operation bring offline a whole cluster group this way?

Relevant error messages:

From the eventlog:

The cluster resource host subsystem (RHS) stopped unexpectedly.
An attempt will be made to restart it. This is usually due to a 
problem in a resource DLL. Please determine which resource DLL is 
causing the issue and report the problem to the resource vendor.

From the cluster.log:

ERR   [RCM] rcm::RcmResControl::DoResourceControl: 
ERROR_RESOURCE_CALL_TIMED_OUT(5910)' because of 'Control(STORAGE_GET_DISK_INFO_EX) 
to resource 'NameOfTheDiskGroup' timed out.'

Veritas Documentation:

Excerpt from Symantec's documentation:

Note: Before manually creating the resource, you must format the cluster-shared volume with NTFS using the VEA GUI and mount it on the node where you are trying to create the resource.

Does this mean the disk cannot be formatted from Windows? I don't read it that way.

For the record, I have formatted many disks using the Windows applet in the past and nothing bad happened.

spaghettidba
  • `1.` `Can a format operation bring offline a whole cluster group this way?` Apparently it can. `2.` Haven't you answered your own question and doesn't the documentation confirm that? – joeqwerty Jun 11 '14 at 14:34
  • 1. I have no evidence of that. Where do you see it? – spaghettidba Jun 11 '14 at 14:35
  • 2. No, it just says you have to format the volume before creating the resource. It doesn't say you shouldn't format the disk from the Windows applet. – spaghettidba Jun 11 '14 at 14:36
  • Your excerpt clearly states `you must format the cluster-shared volume with NTFS using the VEA GUI`. I take that to mean that you need to perform this action from the VEA GUI. I don't know how it could be any clearer or more evident. – joeqwerty Jun 11 '14 at 14:42
  • @joeqwerty can you please point me to the paragraph that says I shouldn't format the disk from the Windows applet? My eyes are failing me and I don't seem to be able to find it. – spaghettidba Jun 11 '14 at 14:47
  • The fact that it doesn't say that you shouldn't doesn't mean that you should. What it DOES say is pretty clear to me. `you must format the cluster-shared volume with NTFS using the VEA GUI`. The word `must` is a pretty strong indication that you shouldn't perform this operation via other methods. – joeqwerty Jun 11 '14 at 15:17
  • Let us [continue this discussion in chat](http://chat.stackexchange.com/rooms/15035/discussion-between-spaghettidba-and-joeqwerty). – spaghettidba Jun 11 '14 at 15:29

2 Answers


Seeing as it's a shared volume, it appears the cluster nodes were already trying to use it, so the VEA GUI would have been the best way to go. Their documentation doesn't say so, but VEA most likely does something the Windows GUI doesn't (even if it's just taking a temporary write lock on the CSV from the machine running VEA so that it can actually format the volume, telling the nodes to use a different disk, etc.).

Also, I suspect the bigger problem was:

Note: You must ensure that the selected drive letter for the new cluster-shared volume is available and not in use on any of the cluster nodes.

It sounds like your disk was in use when you formatted it. Formatting a disk as NTFS from Windows is normally trivial, but here the disk was in use and you didn't go through the VEA GUI, which arguably could have prevented some of these problems. That combination is what caused this.

MDMoore313

Yes. If the disk was already configured as a dependency of SQL Server (and to be used, a disk must be a dependency of the SQL Server resource), then because of the way a WSFC works you may have caused a 'failure', so to speak: the disk resource goes offline, and that escalates to bringing the entire Role offline. This may not be it, but that's the cluster perspective. I've never formatted a disk after the fact to see what it does.

It could also be the fact that Symantec/Veritas is NOT NTFS, so given the way it was configured, the format broke things and the disk resource went offline during formatting. Again, if the disk is configured as a resource dependency of SQL Server, that would escalate.
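
To make the escalation argument concrete, here is a toy Python model of dependency propagation. It is only a sketch, not WSFC code: the `Resource` class, the `fail_resource` and `build_example` helpers, and the resource names are all made up for illustration, and real clusters also apply restart and failover policies that are ignored here. The only thing it shows is that a failed resource drags down whatever depends on it, and nothing else.

```python
from dataclasses import dataclass, field


@dataclass
class Resource:
    name: str
    depends_on: list = field(default_factory=list)  # names this resource needs
    online: bool = True


def fail_resource(resources, failed_name):
    """Take `failed_name` offline, then every resource that depends on it,
    directly or indirectly. Returns names in the order they went offline."""
    offline_order = []
    pending = [failed_name]
    while pending:
        name = pending.pop()
        res = resources[name]
        if not res.online:
            continue  # already handled
        res.online = False
        offline_order.append(name)
        # Anything that depends on this resource cannot stay online.
        for other in resources.values():
            if name in other.depends_on and other.online:
                pending.append(other.name)
    return offline_order


def build_example():
    """Hypothetical group: one disk SQL Server depends on, one it does not."""
    items = [
        Resource("NewDisk"),                            # not a dependency yet
        Resource("DataDisk"),
        Resource("SQL Server", depends_on=["DataDisk"]),
        Resource("SQL Agent", depends_on=["SQL Server"]),
    ]
    return {r.name: r for r in items}


if __name__ == "__main__":
    print(fail_resource(build_example(), "NewDisk"))
    print(fail_resource(build_example(), "DataDisk"))
```

In this model, failing `DataDisk` takes `SQL Server` and `SQL Agent` down with it, while failing `NewDisk` affects nothing else. So if the formatted disk was not yet a dependency, the dependency chain alone would not explain the whole group going offline.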

  • Thanks for your answer, Allan. The disk was not a dependency for the SQL Server resource yet and I also set the policy "If resource fails, do not restart". – spaghettidba Jun 11 '14 at 15:46