I am trying to build a dynamic cluster of machines that have to communicate with a master node (for reporting, updates and various other tasks that the master node handles).

For convenience I thought of using the Heartbeat project (http://linux-ha.org/wiki/Heartbeat). Heartbeat provides a nice failover and recovery mechanism that I want to leverage. I do not plan to use ldirectord or any virtual IP; I only want to use Heartbeat to designate the master node.

Currently I am just running a simple two-node setup, node1 and node2, whose IP addresses I don't control (they are assigned via DHCP).

Because nodes can be added to the cluster dynamically, I configured ha.cf like this:

keepalive 2
warntime 6
deadtime 12
logfacility local0
bcast eth0 # Linux
mcast eth0 225.0.0.1 694 1 0
auto_failback on
node virtual
node node1
node node2
respawn hacluster /usr/lib/heartbeat/ipfail
apiauth ipfail gid=haclient uid=hacluster

And haresources like this:

virtual \
        nginx

So, I set up the cluster so that the virtual node is the preferred owner of the resource. This node doesn't exist, so I expect all standby nodes to go through an election process to decide which one takes over the resource while virtual is down (that is, always). I do it this way because I want to add and remove nodes dynamically, but I still need to have a preferred node.
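To make the intent of the two files explicit, here is a minimal Python sketch that reads them the way I understand them: every `node` directive in ha.cf declares a cluster member, and the first word of each haresources entry is the preferred owner of the resources that follow. The function names `parse_nodes` and `parse_haresources` are my own, not part of Heartbeat, and the parsing rules (`#` comments, backslash continuation) are my reading of the file format rather than Heartbeat's own parser.

```python
# Sketch only: a simplified reading of ha.cf and haresources,
# not Heartbeat's actual parser.

HA_CF = """\
keepalive 2
warntime 6
deadtime 12
bcast eth0 # Linux
auto_failback on
node virtual
node node1
node node2
"""

# Backslash + newline continues the entry on the next line.
HARESOURCES = "virtual \\\n        nginx\n"


def parse_nodes(ha_cf_text):
    """Return the node names declared by `node` directives, in order."""
    nodes = []
    for line in ha_cf_text.splitlines():
        parts = line.split("#", 1)[0].split()  # drop trailing comments
        if len(parts) >= 2 and parts[0] == "node":
            nodes.extend(parts[1:])
    return nodes


def parse_haresources(text):
    """Map each preferred owner to the list of resources it should run."""
    joined = text.replace("\\\n", " ")  # fold continuation lines
    owners = {}
    for line in joined.splitlines():
        parts = line.split("#", 1)[0].split()
        if parts:
            owners.setdefault(parts[0], []).extend(parts[1:])
    return owners


nodes = parse_nodes(HA_CF)
owners = parse_haresources(HARESOURCES)
for owner, resources in owners.items():
    print(owner, "is the preferred owner of", resources)
    print("declared in ha.cf:", owner in nodes)
```

With my files, the only preferred owner is `virtual`, which is declared in ha.cf but never actually joins the cluster, so node1 and node2 are both only ever standby candidates for nginx.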

It works just fine when I start a single node, since it simply picks up the resource. However, when I start the second node with the exact same configuration (I scp'd the files, so there is no risk of a difference), both nodes release their resources. I can verify this because neither node starts nginx, and the previous master, say node1, shuts its nginx down.

I can post complete logs if needed, but essentially the nodes keep trying to take the resources from each other and eventually both release them, with a lot of

ERROR: Both machines own our resources!

and

WARN: 1 lost packet(s) for [node2] [22:24]

finishing with

Jul 23 15:17:21 node1 heartbeat: [16390]: info: node2 wants to go standby [foreign]
Jul 23 15:17:21 node1 heartbeat: [16390]: info: remote resource transition completed.
Jul 23 15:17:21 node1 heartbeat: [16390]: ERROR: Both machines own our resources!
Jul 23 15:17:21 node1 heartbeat: [16390]: ERROR: Both machines own our resources!
Jul 23 15:17:22 node1 heartbeat: [16390]: info: remote resource transition completed.
Jul 23 15:17:22 node1 heartbeat: [16390]: info: standby: acquire [foreign] resources from node2
Jul 23 15:17:22 node1 heartbeat: [16678]: info: acquire local HA resources (standby).
Jul 23 15:17:22 node1 heartbeat: [16678]: info: local HA resource acquisition completed (standby).
Jul 23 15:17:22 node1 heartbeat: [16390]: info: Standby resource acquisition done [foreign].

If anyone has a suggestion on how to sort this out (fixing or alternative method), I'm all ears.

Cheers.

quanta