0

I am trying to configure a 3-node HA cluster with Pacemaker/CMAN on CentOS 6.5. STONITH is disabled (pcs property set stonith-enabled=false).

When I simulate a network failure (iptables -A INPUT -s $OTHER_NODES_IP -j DROP), the master resource is moved to another node and stopped on the failed one.

When I restore the network (iptables -D INPUT -s $OTHER_NODES_IP -j DROP), the failed node can't automatically rejoin the cluster.

In the log I see: corosync[3323]: cman killed by node 3 because we were killed by cman_tool or other application

How can I make cman restart instead of being killed?

Andrew Schulman
rmillet

1 Answer

0

The idea here is that you want cman to be killed if it loses contact with the cluster. This is referred to as "fencing". During this time, the node drifts from the rest of the cluster. If it were simply to come back into production without review, you could get some rather serious corruption or undefined behaviour among your nodes.

In other words, by fencing the malfunctioning node you help to ensure data integrity. Once you have fixed the malfunction and verified that the node is back up to date, you can simply restart cman.
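On CentOS 6 with the stock init scripts, restarting the cluster stack on the recovered node might look like the following sketch. This assumes the standard cman and pacemaker service names; verify the node's health before running it:

```shell
# Rejoin the cluster after fixing and reviewing the failed node.
# Assumes the standard CentOS 6 init scripts for cman and pacemaker.
service cman start       # rejoin the corosync/cman membership
service pacemaker start  # restart the resource manager
crm_mon -1               # one-shot status check: the node should show as online
```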

Unix-Ninja
  • 246
  • 1
  • 3
  • In my case, when the resource is started it comes up in slave mode and resyncs with the master. Fencing is then not needed, and I just want the resource to be started (it was stopped because quorum was lost). Can't a cluster restart automatically, without manual intervention (with cman or other tools)? – rmillet Sep 24 '14 at 17:28
  • To my knowledge, there is nothing out of the box that will do this. These nodes are getting fenced for a reason, so the ability to bypass the fence is self-defeating. However, if you really do want to do this, you can write a script that checks the status of cman and starts it for you if it's down. I wouldn't recommend doing this, but it is an option. – Unix-Ninja Sep 24 '14 at 18:33
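The check-and-restart script the last comment describes might be sketched like this. It is a hypothetical helper, not anything shipped with cman; the service commands are the stock CentOS 6 init interface, and they are passed in as arguments purely so the logic stays testable:

```shell
# Hypothetical watchdog sketch: run the check command, and if it fails,
# run the start command. Bypassing the fence like this is discouraged,
# as the answer above explains.
restart_if_down() {
    check_cmd="$1"   # e.g. "service cman status"
    start_cmd="$2"   # e.g. "service cman start"
    if ! $check_cmd >/dev/null 2>&1; then
        $start_cmd
    fi
}

# Intended cron usage (an assumption, e.g. once a minute):
#   restart_if_down "service cman status" "service cman start"
```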