7

I've been reading a lot about high-availability virtualization, whether via Hyper-V or VMware. In that context, high availability essentially means that the VM is hosted by a cluster of physical servers (nodes), so if one of the physical servers goes down, the VM can still be served by the other physical servers. So far so good: the physical cluster and the VM itself are highly available.

However, the service being provided, let's say SQL Server, MSDTC, or any other service, is actually delivered by the VM image and the virtualized operating system. So I imagine there is still a point of failure at the virtual layer that isn't accounted for. Something could happen within the virtual machine itself that the physical cluster cannot detect, correct? In that instance the physical failover cluster (Hyper-V) or VMware host cannot fail over, because the issue is not with one of the servers in the physical cluster; failing over a physical node would not do any good.

Does this necessitate building a virtual failover cluster on top of the physical one, or is this not necessary?

Alternatively, I suppose you could skip the physical clustering and just cluster at the virtual layer (child-based failover clustering), because that should still survive a physical failure.

See the image below showing parent-based (left), child-based (right), and a combination (center). Is parent-based clustering as far as you need to go, or is child-based more appropriate?

[Image: clustering example showing parent-based, combined, and child-based failover clustering]

Atilla Ozgur
Jeremy

7 Answers

10

The physical cluster makes your virtual hardware highly available, i.e. failures of a physical server don't affect any given virtual machine. However, the virtual machine itself can still fail (e.g. the OS crashing, someone shutting down the virtual server, etc.), so the service running on top of the virtual machine may still fail at some point (although it's less likely than it would be for the same service running on standalone physical hardware). To mitigate this risk, you create the clustered service, so that the service remains unaffected even if a virtual server fails. Of course, you could achieve more or less the same results if you built the clustered service directly on physical servers.
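To make the distinction concrete, here is a minimal sketch (Python, standard library only; the host name and ports are hypothetical) of the kind of service-level probe a guest cluster or monitoring agent relies on. It can report the SQL Server service as failed even while the hypervisor cluster still sees a perfectly healthy VM:

```python
import socket

# Hypothetical names/ports, for illustration only.
VM_PROBE = ("sqlvm01.example.local", 3389)       # "is the VM reachable at all?"
SERVICE_PROBE = ("sqlvm01.example.local", 1433)  # the SQL Server service itself

def tcp_probe(address, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection(address, timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    vm_up = tcp_probe(VM_PROBE)
    service_up = tcp_probe(SERVICE_PROBE)
    if vm_up and not service_up:
        # The failure mode host-level HA never acts on: the VM answers,
        # so no failover happens, yet the service is down.
        print("VM reachable, but SQL Server is down -> host HA won't help")
    elif not vm_up:
        print("VM unreachable -> host HA (or a guest cluster) should take over")
    else:
        print("VM and service both look healthy")
```

A guest-level cluster automates exactly that second decision; the hypervisor cluster only ever acts on the first.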

Whether you run your clustered service on physical servers or on top of a clustered virtualization platform depends on your requirements. If you don't need a virtualization platform for anything else, or the clustered service needs lots of system resources, then I'd recommend building the cluster on physical hardware. But if your physical hardware has resources to spare, or you already have a virtualization cluster, I'd run the clustered service on virtual machines, because that makes managing the (virtual) hardware a lot easier.

Ansgar Wiechers
6

Don't forget to take a reality pill along the way, though.

You need to understand the required uptime for your application, and more importantly, the maximum amount of time your application can be unavailable when it does fail. And it will.

This second point is critical; I've seen a "five nines" application managed by a large systems integrator go offline for nearly a day because of the complexity of the technology being used to keep it highly available. For day-to-day operational availability, the technology ticked the boxes, but when something went wrong with the config, the folks at the aforementioned company were properly stuck.

Don't get me wrong: clustering, SAN snapshots, VM snapshots, off-site replication, HA lock-step virtualisation, etc., all have their place, but make sure you choose what's required, not what looks nice and shiny.

I'll step down from my soap box now ;-)

Simon Catlin
  • Totally get your point on taking a pill, but is clustering at the virtual level (child-based), or even the combination of parent AND child-based, really going over the top? I mean, if you don't at least cluster at the virtual level you can't even apply patches without taking the whole system down, never mind accounting for those crashes. – Jeremy Sep 02 '12 at 01:49
  • O/S-level failover clustering definitely gives you the ability to perform systematic maintenance, so yes, there's a benefit here... if your uptime requirements don't allow periodic maintenance windows. If you're just looking for operational availability (protection from hardware/environmental failures), consider looking at VMware Fault Tolerance (FT). As it's another extension of the hypervisor abstraction, it gives you O/S- and application-agnostic HA. – Simon Catlin Sep 03 '12 at 19:27
  • Can VMware Fault Tolerance "fail over" based on the same criteria that a failover cluster can? I.e., are there situations where one approach will fail over and the other will not? – Jeremy Sep 11 '12 at 04:10
  • It looks like it still doesn't give you failover protection from failures within the VM. – Jeremy Sep 11 '12 at 04:29
5

Does this necessitate building a virtual failover cluster on top of the physical one, or is this not necessary?

It does, yes.

Chopper3
4

You first have to build a high-availability system (for SQL, for the OS, etc.). This means you must have more than one physical or virtual computer, and you must use software able to support high availability.

Once this is done, you can use a high-availability virtualisation system, which "only" protects you against hardware failure.

The second level of high availability requires two physical computers (or more).
So let's say your first level of high availability is done with two computers: you then don't need to bother with the second level, because it won't give you anything better.

Gregory MOUSSAT
3

I think you've got the essence of the ideas about availability down. Both Hyper-V and VMware HA functionality do not provide HA to the guests, just HA of the virtualization service. Based on the availability requirements of the guest services, you also require HA at the guest level (and depending on the technology involved, that may mean clustering). You need to evaluate each service for the particulars of how to provide the required uptime. SQL Server, for instance, could use either database mirroring or failover clustering. Many times the additional overhead and challenges of clustering on the virtual servers outweigh the benefits provided, and the service may end up being provided on dedicated hardware instead. (Picking on SQL Server a bit more:) SQL Server is usually a candidate for remaining physical due to the potential for high network, I/O, CPU, and memory utilization, as well as the need for redundancy.

Jim B
2

The answer is it depends.

Clustering solutions usually cover more than the application layer. Traditionally, a cluster dependency graph will include things like:

  1. Network / IP availability check
  2. Storage / shared volume availability.

Running some of these checks inside a VM is horrendously difficult. For example, in Windows 2003 clusters, it requires a quorum drive on which it takes a SCSI lock to ensure that it is the owner of the resources. On failures it also sends out 'poison packets' to acquire that lock. All of these features are near impossible to implement without an RDM to a LUN.

All of these 'hardware detection' components incur a large overhead within a VM (VM performance is fine for user applications, but anything kernel-based will always incur varying degrees of overhead).
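To illustrate the gap, here is a rough sketch (Python, with hypothetical addresses and paths) of how a guest-level agent might approximate the first two dependency checks from the list above. Note what it deliberately leaves out: the quorum disk, SCSI reservations, and 'poison packet' arbitration, which is precisely the part that is hard to reproduce inside a VM without an RDM-backed LUN.

```python
import os
import socket

# Hypothetical targets, for illustration only.
NETWORK_TARGET = ("192.0.2.1", 53)   # 1. network / IP availability check
SHARED_VOLUME = "S:\\"               # 2. storage / shared volume availability

def network_available(target, timeout=2.0):
    """Crude network check: can we open a TCP connection towards the target?"""
    try:
        with socket.create_connection(target, timeout=timeout):
            return True
    except OSError:
        return False

def volume_available(path):
    """Crude storage check: does the shared volume exist and is it writable?"""
    return os.path.exists(path) and os.access(path, os.W_OK)

def node_looks_healthy():
    # A real cluster resource monitor also arbitrates *ownership* of these
    # resources (quorum disk, SCSI reservations, poison packets); none of
    # that arbitration is modelled here.
    return network_available(NETWORK_TARGET) and volume_available(SHARED_VOLUME)

if __name__ == "__main__":
    print("node looks healthy:", node_looks_healthy())
```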

So in the case of Microsoft Windows 2003 clusters, if I had to virtualise, I'd use your 'child' approach.

The ideal setup to strive for is:

  • VMware HA for hardware failure detection.
  • vSphere application monitoring

Followed by:

  • VMware HA
  • An application-only monitor (without the hardware dependency)
  • Make sure anti-affinity rules are on for the paired VMs so DRS and HA never restart the nodes on the same host!

Finally:

  • Child clustering
M Afifi
-2

If you want real HA, you will need to cluster your HA VMs as well, yes.

If you want to avoid EVERY single point of failure (SPOF), you will have a hard time.

  • Use different hardware - not a single piece should be from the same vendor
  • Use different software - including the operating-system
  • Use different programming languages for the same application
  • Use different compilers for the application
  • Use different network vendors for each set of redundant connections
  • Use different power providers
  • Use different locations for your servers
  • ...

I once attended a course on a NAS system where we were told that NASA goes this way: each piece exists in three different flavours, and a result is only accepted if at least two of them agree. Apart from that, everything has to be redundant (within each of the three pieces).

During pre-flight checks, all three have to deliver the same result, of course.
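The scheme described is the classic two-out-of-three (triple modular redundancy) vote. A tiny illustrative sketch of the voting rule itself, nothing NASA-specific:

```python
from collections import Counter

def vote(results):
    """Accept a value only if at least two of the three independently
    built implementations agree; otherwise reject the result."""
    assert len(results) == 3
    value, count = Counter(results).most_common(1)[0]
    return value if count >= 2 else None

# Results from three implementations built on different hardware, in
# different languages, with different compilers (as the list above suggests).
print(vote([42, 42, 42]))  # 42   -- all three agree
print(vote([42, 41, 42]))  # 42   -- majority agrees, the outlier is ignored
print(vote([40, 41, 42]))  # None -- no two agree, the result is rejected
```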

Nils
  • Not for nothing, but aside from this being a supportability and stability nightmare, I'm not sure I'd take reliability tips from the guys that destroyed a space shuttle because they didn't want to replace a cheap O-ring. – Jim B Sep 02 '12 at 03:51
  • @Nils: I'm guessing you need some piece of software to determine that at least two of the three systems yield the same result. How do you have redundancy on that? – Buhb Sep 02 '12 at 07:01
  • @Buhb Possibly two different integrated circuits evaluating this and displaying it as a state on a hardware output line (wild guess on my side). – Nils Sep 02 '12 at 20:42
  • @Nils And who will decide which circuit's output should prevail? :) – Sandman4 Sep 02 '12 at 20:44
  • @Sandman4 I guess even NASA has to take a decision at some point. ;-) – Nils Sep 02 '12 at 21:06