
We created an OCFS2 cluster using the SUSE Linux High Availability Extension on SLES SP3. The cluster nodes are two Apache servers that share one disk. We have stonith enabled and the SBD daemon running. It works fine, but...
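
For reference, the SBD device was initialized with the standard commands; the device path below is illustrative, not necessarily our exact one:

    # Initialize the shared partition as an SBD device (path is illustrative)
    sbd -d /dev/sdb1 create

    # Show the on-disk header, including the watchdog and msgwait timeouts
    sbd -d /dev/sdb1 dump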

When one of the nodes is disconnected from the network (network card disconnected in VirtualBox), the two nodes can no longer communicate, and both servers reboot 30 seconds later.
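
To watch the partition happen, I run the usual monitoring commands on both nodes while pulling the virtual NIC (shown for completeness):

    # One-shot cluster status: shows which nodes are online/OFFLINE/UNCLEAN
    crm_mon -1

    # Ring status of the local corosync instance
    corosync-cfgtool -s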

Once the nodes have started again, one of them keeps rebooting the other, and service availability is lost completely. To recover, we reconnect the failed node to the network (network card connected again in VirtualBox), and the problem goes away.
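
The fencing goes through an SBD stonith resource of the usual SLES shape; a sketch of how such a primitive is typically defined (the resource name and device path are illustrative, the real configuration is below):

    # Illustrative SBD stonith primitive definition (crmsh syntax)
    primitive stonith-sbd stonith:external/sbd \
        params sbd_device="/dev/sdb1"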

My questions are:

  1. Why does this happen?
  2. How can I avoid this behaviour?

The expected result is to ensure service availability: if a node temporarily disconnects from the network, the other one should continue serving.
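
If it is relevant: I understand two-node clusters normally need the quorum policy relaxed so the surviving node keeps running resources; something along these lines (shown for illustration only, I am not certain this is the right approach, hence the question):

    # Illustrative property for a two-node cluster: let the surviving
    # partition keep running resources even though quorum is lost
    crm configure property no-quorum-policy=ignore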

If I either kill the corosync daemon (killall -9 corosync) on one node or shut the node down normally, the remaining node keeps working fine. Why doesn't the same happen when the network card is disconnected? :-/
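
For completeness, another way I could simulate the failure is to block only the cluster traffic instead of pulling the whole NIC (assuming corosync is on its default UDP port 5405):

    # Drop corosync traffic in both directions (default mcastport 5405);
    # this cuts cluster communication without touching other services
    iptables -A INPUT  -p udp --dport 5405 -j DROP
    iptables -A OUTPUT -p udp --dport 5405 -j DROP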

I'm providing the Cluster Configuration (crm configure show) here:
