
I'm configuring a 2-node cluster with pacemaker/corosync, and I have some questions about it (and maybe about best practices: I'm far from being a specialist).

**OS:** RHEL 7.6

I configured the cluster with these properties:

 - **stonith-enabled:** true

 - **symmetric-cluster:** true (even if it's the default value, I think)


and added the following in corosync.conf (both pieces are sketched together after this list):

 - **wait_for_all:** 0 (I want a node to be able to start/work even if its twin is down)

 - **two_node:** 1
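
For reference, here is roughly what that looks like in pcs/corosync terms (a sketch only; cluster specifics are placeholders, not taken from my actual setup):

```
# Cluster-wide properties (pcs syntax on RHEL 7)
pcs property set stonith-enabled=true
pcs property set symmetric-cluster=true

# quorum section of /etc/corosync/corosync.conf
quorum {
    provider: corosync_votequorum
    two_node: 1
    # two_node normally turns wait_for_all on by default;
    # it is set to 0 here so a node can boot and run alone
    wait_for_all: 0
}
```

After editing corosync.conf, it needs to be the same on both nodes (e.g. via `pcs cluster sync`) and corosync restarted for the change to take effect.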


Regarding the fencing:

- Using the iLO of the HP blades (ILO1 for Node1, ILO2 for Node2)

I read that it is sometimes good practice to prevent a node from fencing itself ("suicide"), so I added location constraints (see the sketch after this list):

- ILO1-fence can't run on Node1

- ILO2-fence can't run on Node2
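
In pcs terms, the fence devices and the "anti-suicide" constraints would look roughly like this (a sketch; the fence_ilo4 agent, addresses and credentials are placeholders for whatever your iLO generation actually uses):

```
# One fence device per node (pick the fence agent matching your iLO version)
pcs stonith create ILO1-fence fence_ilo4 ipaddr=ilo-node1.example.com \
    login=fenceuser passwd=secret pcmk_host_list=node1
pcs stonith create ILO2-fence fence_ilo4 ipaddr=ilo-node2.example.com \
    login=fenceuser passwd=secret pcmk_host_list=node2

# "Anti-suicide" constraints: keep each fence device off the node it is meant to fence
pcs constraint location ILO1-fence avoids node1
pcs constraint location ILO2-fence avoids node2
```

Note that pcmk_host_list already restricts which node a device is used to fence; the location constraints only control where the fence resource itself runs.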

The problem I have is the following, and it happens when starting Node2 while Node1 is shut down:

- pacemaker/corosync can't start ILO2-fence on Node1 (of course, because Node1 is down), and so it doesn't start the other resources, and so my cluster isn't working at all >:[

I am wondering if I'm missing something in my configuration, or if I don't properly understand how such a cluster should work.

Because I'd expect Node2 to start, the cluster to see that Node1 is down, and then just start the resources so that Node2 works on its own.

But it's true that since ILO2-fence can only be located on Node1 (because of the anti-suicide constraint), this resource will always fail... (when trying without those "anti-suicide" constraints, if Node2 has some service failures, it shuts itself down right after starting, which I don't want)

I would appreciate some feedback and enlightenment :)

Thank you :)

drkmkzs

1 Answer


You have, let's say, 4 votes in your cluster - 2 nodes and 2 ILO-fence devices. The cluster can run if more than 2 (i.e. 3) of them are accessible. ILO2-fence is allowed to run only on node1, so if node1 is down, the quorum is lost. Using ILO fencing is not recommended:

"A common mistake people make when choosing a STONITH device is to use a remote power switch (such as many on-board IPMI controllers) that shares power with the node it controls. If the power fails in such a case, the cluster cannot be sure whether the node is really offline, or active and suffering from a network fault, so the cluster will stop all resources to avoid a possible split-brain situation." link

You have 2 options for a 2-node cluster:

  1. Use one external witness/fencing device (a witness node, voting VSA or SMB2/3 file share); see the sketch after this list.

  2. Use a solution developed for 2-node clusters out of the box (like Microsoft Hyper-V Storage Spaces Direct (S2D), a.k.a. Azure Stack HCI; VMware Virtual SAN (vSAN), esp. the ROBO edition; or StarWind Virtual SAN (VSAN)).
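
If by "witness node" you go the corosync quorum device (qdevice) route on RHEL 7 (an assumption on my part, not something stated above), the setup is roughly as follows (the host name is a placeholder):

```
# On the witness machine (a third box outside the cluster):
yum install -y pcs corosync-qnetd
pcs qdevice setup model net --enable --start

# On one cluster node (corosync-qdevice must be installed on both cluster nodes):
yum install -y corosync-qdevice
pcs quorum device add model net host=witness.example.com algorithm=ffsplit
```

As far as I know, a qdevice is not combined with two_node: 1; the third vote takes over the tie-breaking role instead.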

BaronSamedi1958
batistuta09
  • I understand... And maybe a 3) solution is, as some users advise in comments on a related topic, that a 2-node cluster should avoid STONITH if you are not able to apply 1) or 2)... Thank you – drkmkzs Feb 04 '20 at 17:06
  • The 3) solution is less protected than a standalone server with backups or async replication: in case of a single node failure, the whole cluster will be down. Here are some explanations against 2-node pacemaker clusters - https://www.thegeekdiary.com/most-common-two-node-pacemaker-cluster-issues-and-their-workarounds/ – batistuta09 Feb 05 '20 at 08:35
  • Yes, but at least solution 3) still provides HA, since resources that fail on node1 will be moved to node2. Node1 will remain failed, but the service is still provided. Am I wrong? – drkmkzs Feb 05 '20 at 14:04
  • Just to clarify: do you use shared storage (iSCSI/NFS/etc.)? – batistuta09 Feb 05 '20 at 15:36
  • Yes, there's a shared disk bay – drkmkzs Feb 05 '20 at 16:39
  • I never found any issue using iLO as fencing devices; for every server you need dual power supplies and APC. Without using location constraints for the fencing devices, set pcmk_host_list – c4f4t0r Feb 06 '20 at 10:39
  • c4f4t0r, then in that case you allow "node suicide", right? – drkmkzs Feb 06 '20 at 13:54