
I'd like to set up a two-node high-availability cluster using corosync/pacemaker/drbd. For that, of course, I need fencing. As far as I understand, all IPMI/iLO/... solutions do the job, but only as long as the chassis has power. If node B loses power, node A has no way to use STONITH against node B.

What kind of hardware solves this problem? Is there a (standard rack) server which provides IPMI/iLO/... hardware running on battery? Should I use a network-connected UPS?

C-Otto

2 Answers


As far as I'm aware there's no standard hardware (or software) solution for this.
You can't Shoot The Other Node In The Head if it's not there.

You can handle this in a number of different ways. One that I can suggest is using a smart PDU: as a last resort, when no other STONITH technique works, command its power outlets "off", and you don't have to worry about the node coming back until someone commands the power "on" again. (This is really only a safeguard against somebody pulling the power cables accidentally...)

A similar solution can be built with managed switches: shut down the ports the machine is attached to, or drop them into a "fixer" VLAN so you can connect to the box there and get it ready to rejoin the cluster.


Both of the ideas above rely on your datacenter being powered and connected (the PDU, switch, etc. all need to be working, and connectivity needs to be present so you can send commands to the infrastructure equipment).

If you can't rely on power, a classic solution is to configure your servers NOT to power on automatically after a power failure (IPMI/iLO/etc. will still come up when the chassis is powered, so you can bring the server up later as a manual step, perhaps after isolating its network ports as described above).
This avoids a "bad" node coming back on-line, but does add a manual (or automanual) step to the process.

If your problem is connectivity and not power, you have a much harder problem: disconnected nodes need to shoot themselves in the head. (That problem is why my cluster configurations do not automatically reactivate a failed member. When a box fails and comes back, it sits in a partially-online state and waits for me to tell it to rejoin. This is a manual step, but it's one that shouldn't have to happen with any frequency.)

voretaq7

You can configure iLO/IPMI-based fencing as the primary method and then add, for example, the fence_apc fencing agent with an APC power switch as a secondary fence device. That way, if the server loses power, the secondary fencing agent can still STONITH the server in a way that makes sense to the cluster.

As described here:

A node can have multiple fence methods and each fence method can have multiple fence devices.

Multiple fence methods are set up for redundancy/insurance. For example, you may be using a baseboard management fencing method for a node in your cluster such as IPMI, or iLO, or RSA, or DRAC. All of these depend on a network connection. If this connection would fail, fencing could not occur, so as a backup fence method you could declare a second method of fencing that used a power switch or somesuch to fence the node. If the first method failed to fence the node, the second fence method would be employed.

You can also consider adding fence_manual as a secondary fencing agent. That way you can always recover your cluster, but manual intervention will of course be required.
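
To make the fallback order concrete, here is a minimal Python sketch of the idea, not of any real cluster API. `FenceDevice`, `fence`, `ipmi_off` and `pdu_off` are hypothetical names invented for illustration; the stand-in functions simulate a node whose chassis has lost power, so the baseboard controller is unreachable while the power switch still answers. It assumes, as Pacemaker's fencing topology is usually described, that levels are tried in order and that a level counts as successful only if every device in it succeeds.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class FenceDevice:
    """Hypothetical stand-in for a fence agent (IPMI, PDU, ...)."""
    name: str
    # Returns True only when the target node is confirmed powered off.
    power_off: Callable[[str], bool]


def fence(node: str, levels: List[List[FenceDevice]]) -> Optional[int]:
    """Try each fence level in order.

    Returns the 1-based number of the first level whose devices all
    succeeded, or None if every level failed (manual intervention needed).
    """
    for number, devices in enumerate(levels, start=1):
        if all(device.power_off(node) for device in devices):
            return number
    return None


# Simulated scenario: node B has lost power entirely.
def ipmi_off(node: str) -> bool:
    return False  # the BMC has no power, so it cannot confirm anything


def pdu_off(node: str) -> bool:
    return True  # the power switch is still reachable and cuts the outlet


levels = [
    [FenceDevice("ipmi", ipmi_off)],  # level 1: baseboard management
    [FenceDevice("pdu", pdu_off)],    # level 2: switched power outlet
]
result = fence("node-b", levels)
print(result)
```

In this simulated run the first level fails and the second succeeds, which is the situation the secondary power-switch device is meant to cover. A return value of `None` corresponds to the case where only a manual step can recover the cluster.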

Petter H