
Scenario:

Three-node (shared-nothing) cluster on Windows Server 2012. Two nodes are in the primary data center, both with votes (node weight of 1), along with a file share witness. The third node is in a remote data center and has no vote (node weight of 0).

Problem: One cluster node (which owned the cluster name) went down for automatic updates. The cluster name failed over to the remote data center node, and the remote node was able to get a lock on the file share witness file. At that point, our VPN tunnel dropped. The one node that was still up in the primary data center (and had services running) noticed that the remote cluster node was down and attempted to bring the cluster name online. The file share witness file was still locked by the remote node, so the one running cluster node in the primary data center was unable to bring the cluster name online, and it shut down its own cluster service.
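
To make the vote arithmetic behind that failure concrete, here is a minimal sketch of a static majority check. It is a simplified model, not how the cluster service is implemented: `has_quorum` is a hypothetical helper, and treating the locked witness as a vote the surviving node cannot count is an assumption based on the description above.

```python
def has_quorum(votes_reachable: int, total_votes: int) -> bool:
    """Static majority rule: strictly more than half of all configured votes."""
    return 2 * votes_reachable > total_votes


# Configured votes in this scenario: two primary nodes plus the file share witness.
# The remote node has a weight of 0, so it adds nothing to the total.
TOTAL_VOTES = 2 + 1

# Routine patching: one primary node is down, the survivor can still claim the witness.
survivor_with_witness = has_quorum(votes_reachable=1 + 1, total_votes=TOTAL_VOTES)

# The incident: one primary node is down and the witness is held by the remote node
# (assumption: the survivor then counts only its own vote).
survivor_without_witness = has_quorum(votes_reachable=1, total_votes=TOTAL_VOTES)

print(survivor_with_witness)     # True
print(survivor_without_witness)  # False
```

Under this model, the surviving primary node holds one of three votes once it cannot claim the witness, which matches the cluster service shutting itself down.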

Caveats: Firewalling the file share from the remote node is not an option due to other processes that use it.

I've considered removing the remote cluster node from the possible owners of the cluster name, but I've never done or tested that before and I don't want to blow up my production cluster. Is it possible to remove a cluster node from the possible owners of the cluster name? If we have to fail our services over to the remote data center, there are a number of moving pieces that need to be coordinated, so I don't want "automated" failover of services to the remote data center. The only reason the remote node is in the cluster at all is SQL Server Availability Groups, which manage the replication to the remote node.

I've also considered removing the file share witness and giving the remote node a vote. The new dynamic quorum "should" keep the cluster online if one node goes down for a reboot and network connectivity is lost to the remote data center.

Given my scenario, which option (or other alternative) will give me the highest availability?

Steven Murawski

2 Answers


I actually like giving the remote node a vote because it'd make planned failovers that much easier. You could migrate databases & resources over to the remote datacenter, then gradually shut down the nodes in the primary datacenter, and you wouldn't have to monkey with the voting in order to get it to work. Plus, you're not worried about high availability on the file share.

Brent Ozar

So I'm with Brent here. I've never been a fan of removing a node as a voter unless you are absolutely 100% sure that you don't care about it. The one thing you should strive to do is keep the WSFC cluster group where your primary replica is, which should help avoid split brain.

Removing the cluster node as a possible owner from the WSFC is a bad idea. If you need to do that, evict the node from the cluster. Bad, bad mojo.

With Windows Server 2012 you also have dynamic quorum, so unless your failures are all simultaneous, you can pretty much get down to last man standing (with warnings, of course).
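
The difference between sequential and simultaneous failures can be sketched with a toy model. Everything here is an illustration under stated assumptions: `survives_static` and `survives_dynamic` are hypothetical helpers, the five-voter cluster is made up for the example (it is not the cluster in the question), and the model simply recomputes the vote total after each failure the cluster survives. It does not model how the real cluster breaks ties when only two voters remain.

```python
from typing import Sequence


def survives_static(total_votes: int, losses: Sequence[int]) -> bool:
    """Fixed vote total: survivors need a strict majority of the original votes."""
    remaining = total_votes - sum(losses)
    return 2 * remaining > total_votes


def survives_dynamic(total_votes: int, losses: Sequence[int]) -> bool:
    """Toy dynamic quorum: after each survivable failure, the vote total
    shrinks to the number of voters still running."""
    current = total_votes
    for lost in losses:
        remaining = current - lost
        # Survivors need a strict majority of the votes counted at this moment.
        if 2 * remaining <= current:
            return False
        current = remaining
    return True


# Hypothetical five-voter cluster, used only for illustration.
print(survives_static(5, [1, 1, 1]))   # False: 2 of 5 votes left
print(survives_dynamic(5, [1, 1, 1]))  # True: 5 -> 4 -> 3 -> 2, one loss at a time
print(survives_dynamic(5, [3]))        # False: three voters lost at once
```

In this sketch the step from two voters to one always fails, because one of two is not a strict majority. Treat that last step as outside what the sketch covers.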

Also, I would solve any network issues. They will be killer in a geographically dispersed situation as you can tell.

Allan Hirt