I am currently looking into corosync to build a two-node cluster. I've got it working, and it does what I want it to do, which is:

  • Losing connectivity between the two nodes gives the first node, '10node', both Failover Wan IPs (aka resources WanCluster100 and WanCluster101).
  • '11node' does nothing. It "thinks" it still has its Failover Wan IP (aka WanCluster101).

But it doesn't do this:

  • '11node' should restart the WanCluster101 resource when connectivity with the other node comes back.

This is to prevent a condition where 10node simply dies (and thus does not take over 11node's Failover Wan IP), resulting in a situation where neither node has 11node's failover IP: 10node is down, and 11node has "given back" its failover Wan IP.

Here's the current configuration I'm working on.

node 10sch \
    attributes standby="off"
node 11sch \
    attributes standby="off"
primitive LanCluster100 ocf:heartbeat:IPaddr2 \
    params ip="172.25.0.100" cidr_netmask="32" nic="eth3" \
    op monitor interval="10s" \
    meta is-managed="true" target-role="Started"
primitive LanCluster101 ocf:heartbeat:IPaddr2 \
    params ip="172.25.0.101" cidr_netmask="32" nic="eth3" \
    op monitor interval="10s" \
    meta is-managed="true" target-role="Started"
primitive Ping100 ocf:pacemaker:ping \
    params host_list="192.0.2.1" multiplier="500" dampen="15s" \
    op monitor interval="5s" \
    meta target-role="Started"
primitive Ping101 ocf:pacemaker:ping \
    params host_list="192.0.2.1" multiplier="500" dampen="15s" \
    op monitor interval="5s" \
    meta target-role="Started"
primitive WanCluster100 ocf:heartbeat:IPaddr2 \
    params ip="192.0.2.100" cidr_netmask="32" nic="eth2" \
    op monitor interval="10s" \
    meta target-role="Started"
primitive WanCluster101 ocf:heartbeat:IPaddr2 \
    params ip="192.0.2.101" cidr_netmask="32" nic="eth2" \
    op monitor interval="10s" \
    meta target-role="Started"
primitive Website0 ocf:heartbeat:apache \
    params configfile="/etc/apache2/apache2.conf" options="-DSSL" \
    operations $id="Website-one" \
    op start interval="0" timeout="40" \
    op stop interval="0" timeout="60" \
    op monitor interval="10" timeout="120" start-delay="0" statusurl="http://127.0.0.1/server-status/" \
    meta target-role="Started"
primitive Website1 ocf:heartbeat:apache \
    params configfile="/etc/apache2/apache2.conf.1" options="-DSSL" \
    operations $id="Website-two" \
    op start interval="0" timeout="40" \
    op stop interval="0" timeout="60" \
    op monitor interval="10" timeout="120" start-delay="0" statusurl="http://127.0.0.1/server-status/" \
    meta target-role="Started"
group All100 WanCluster100 LanCluster100
group All101 WanCluster101 LanCluster101
location AlwaysPing100WithNode10 Ping100 \
    rule $id="AlWaysPing100WithNode10-rule" inf: #uname eq 10sch
location AlwaysPing101WithNode11 Ping101 \
    rule $id="AlWaysPing101WithNode11-rule" inf: #uname eq 11sch
location NeverLan100WithNode11 LanCluster100 \
    rule $id="RAND1083308" -inf: #uname eq 11sch
location NeverPing100WithNode11 Ping100 \
    rule $id="NeverPing100WithNode11-rule" -inf: #uname eq 11sch
location NeverPing101WithNode10 Ping101 \
    rule $id="NeverPing101WithNode10-rule" -inf: #uname eq 10sch
location Website0NeedsConnectivity Website0 \
    rule $id="Website0NeedsConnectivity-rule" -inf: not_defined pingd or pingd lte 0
location Website1NeedsConnectivity Website1 \
    rule $id="Website1NeedsConnectivity-rule" -inf: not_defined pingd or pingd lte 0
colocation Never -inf: LanCluster101 LanCluster100
colocation Never2 -inf: WanCluster100 LanCluster101
colocation NeverBothWebsitesTogether -inf: Website0 Website1
property $id="cib-bootstrap-options" \
    dc-version="1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff" \
    cluster-infrastructure="openais" \
    expected-quorum-votes="2" \
    no-quorum-policy="ignore" \
    stonith-enabled="false" \
    last-lrm-refresh="1408954702" \
    maintenance-mode="false"
rsc_defaults $id="rsc-options" \
    resource-stickiness="100" \
    migration-threshold="3"

I also have a less important question concerning this line:

colocation NeverBothLans -inf: LanCluster101 LanCluster100

How do I tell it that this colocation only applies to '11node'?

moebius_eye

2 Answers


If I understand correctly what you need, you can do this by adding location constraints:

pcs constraint location WanCluster101 prefers 11sch=10
pcs constraint location WanCluster101 prefers 10sch=5

What I did in the past was to set the constraint for both IPs, both ways, so that no matter which node goes down, the other takes over both IPs. This meant adding the constraints with the priorities reversed for each IP: one IP has a higher score on the first node and a lower score on the second, and the other IP has a higher score on the second node and a lower score on the first (see the sketch below).
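For example, as a sketch using the resource and node names from the question (the exact scores here are illustrative; only their relative order matters):

pcs constraint location WanCluster100 prefers 10sch=10
pcs constraint location WanCluster100 prefers 11sch=5
pcs constraint location WanCluster101 prefers 11sch=10
pcs constraint location WanCluster101 prefers 10sch=5

Keep in mind the question sets resource-stickiness="100": a preference score below the stickiness will not pull an IP back to its preferred node after recovery, so use scores above 100 if automatic fail-back is wanted.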

Florin Asăvoaie
  • I'm almost certain this doesn't answer my question. I'm talking about a split-brain condition in my question. – moebius_eye Aug 29 '14 at 04:48
  • It depends on how the split brain is caused. If BOTH nodes are able to stay on the network during the split brain and only the connection between them is affected, then the only way to solve this problem is by using a fencing mechanism. If one of the nodes gets completely disconnected from the network and is totally offline (which is still split-brain because the node is actually up & running), both nodes will indeed assign the IP to themselves, but only the one that has the network available will work. They will sort it out between themselves when the network is restored. – Florin Asăvoaie Aug 29 '14 at 05:14
  • Yes. All you are saying is true, and that scenario works fine. The scenario where it doesn't is when there's a split brain but both nodes are up and running with WAN connectivity. When this occurs, node 10 (10sch in the config) takes both Wan IPs. The trouble is: when we recover from this situation, node 11 (11sch) doesn't send the ARP packet to the gateway. **SOLUTION**: telling corosync to restart the WanCluster101 resource when recovering from split brain. Now, how do I do that? (A manual workaround is sketched after these comments.) – moebius_eye Aug 29 '14 at 08:52
  • If you do the constraints thing, I assume it will do it because the server with the lower priority will try to pass the resource over and they will renegotiate what happens. – Florin Asăvoaie Aug 29 '14 at 10:59
  • Okay. I think you're right after all. I didn't have the time to test this yet. But that will be done soon. – moebius_eye Sep 02 '14 at 08:03
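As the manual workaround mentioned in the comments above, a minimal sketch, assuming the crm shell that produced the configuration in the question:

crm resource stop WanCluster101
crm resource start WanCluster101

Stopping and starting toggles the resource's target-role, which forces IPaddr2 through a fresh start and new gratuitous ARPs toward the gateway. This is a hand-run recovery step, not the automated restart-on-reconnect the question asks for.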

1: Before you test your cluster connectivity, you need to configure your STONITH device. STONITH is very important in a cluster: it is what resolves the split-brain situation (see the sketch after the location example below). 2: For the less important question, you can try to use location constraints.

You can start from something like this:

location mycol dummy1 \
        rule $id="myrule" -inf: defined dummy2 and #uname eq suse02
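For the STONITH point above, a minimal sketch of what enabling fencing could look like, assuming IPMI-capable management boards; the agent choice (stonith:external/ipmi) and all addresses and credentials below are illustrative placeholders, not values from the question:

primitive fence10 stonith:external/ipmi \
    params hostname="10sch" ipaddr="192.0.2.10" userid="admin" passwd="secret" interface="lan" \
    op monitor interval="60s"
primitive fence11 stonith:external/ipmi \
    params hostname="11sch" ipaddr="192.0.2.11" userid="admin" passwd="secret" interface="lan" \
    op monitor interval="60s"
location fence10-not-on-10sch fence10 -inf: 10sch
location fence11-not-on-11sch fence11 -inf: 11sch
property stonith-enabled="true"

The two location rules keep each fencing device off the node it is meant to fence, and the last line flips the stonith-enabled property that is currently "false" in the question's configuration.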
c4f4t0r
  • I don't want to Shoot The Other Node In The Face. I want both of them to keep running; only, I want the first node to take over the whole cluster. That I already did. What I still need to figure out is a way to restart the resource `WanCluster101` when recovering from split brain. I think what Florin said is the right answer after all, but thank you for taking the time to answer. – moebius_eye Sep 02 '14 at 08:03
  • A cluster needs STONITH for split-brain recovery. – c4f4t0r Sep 02 '14 at 09:50
  • You're right. I went to #linux-ha on freenode and they told me the same thing. – moebius_eye Sep 03 '14 at 09:51