
I'm going to set up a redundant failover Redmine:

  • another instance was installed on the second server without any problems
  • MySQL (running on the same machine as Redmine) was configured for master-master replication (a rough sketch of the relevant my.cnf settings follows this list)
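
For completeness, a minimal sketch of the replication settings, assuming the schema is named redmine and node1/node2 use server-id 1/2 (your values will differ):

# /etc/my.cnf fragment on node1
# (mirror on node2 with server-id = 2 and auto_increment_offset = 2)
[mysqld]
server-id                = 1
log_bin                  = mysql-bin
auto_increment_increment = 2
auto_increment_offset    = 1
replicate-do-db          = redmine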

Because they are in different subnets (192.168.3.x and 192.168.6.x), it seems that VIPArip is the only choice.

/etc/ha.d/ha.cf on node1:

logfacility none
debug 1
debugfile /var/log/ha-debug
logfile /var/log/ha-log
autojoin none
warntime 3
deadtime 6
initdead 60
udpport 694
ucast eth1 node2.ip
keepalive 1
node node1
node node2
crm respawn

/etc/ha.d/ha.cf on node2:

logfacility none
debug 1
debugfile /var/log/ha-debug
logfile /var/log/ha-log
autojoin none
warntime 3
deadtime 6
initdead 60
udpport 694
ucast eth0 node1.ip
keepalive 1
node node1
node node2
crm respawn

crm configure show:

node $id="6c27077e-d718-4c82-b307-7dccaa027a72" node1
node $id="740d0726-e91d-40ed-9dc0-2368214a1f56" node2
primitive VIPArip ocf:heartbeat:VIPArip \
        params ip="192.168.6.8" nic="lo:0" \
        op start interval="0" timeout="20s" \
        op monitor interval="5s" timeout="20s" depth="0" \
        op stop interval="0" timeout="20s" \
        meta is-managed="true"
property $id="cib-bootstrap-options" \
        stonith-enabled="false" \
        dc-version="1.0.12-unknown" \
        cluster-infrastructure="Heartbeat" \
        last-lrm-refresh="1338870303"

crm_mon -1:

============
Last updated: Tue Jun  5 18:36:42 2012
Stack: Heartbeat
Current DC: node2 (740d0726-e91d-40ed-9dc0-2368214a1f56) - partition with quorum
Version: 1.0.12-unknown
2 Nodes configured, unknown expected votes
1 Resources configured.
============

Online: [ node1 node2 ]

 VIPArip    (ocf::heartbeat:VIPArip):   Started node1

ip addr show lo:

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue 
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
    inet 192.168.6.8/32 scope global lo
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever

I can ping 192.168.6.8 from node1 (192.168.3.x):

# ping -c 4 192.168.6.8
PING 192.168.6.8 (192.168.6.8) 56(84) bytes of data.
64 bytes from 192.168.6.8: icmp_seq=1 ttl=64 time=0.062 ms
64 bytes from 192.168.6.8: icmp_seq=2 ttl=64 time=0.046 ms
64 bytes from 192.168.6.8: icmp_seq=3 ttl=64 time=0.059 ms
64 bytes from 192.168.6.8: icmp_seq=4 ttl=64 time=0.071 ms

--- 192.168.6.8 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3000ms
rtt min/avg/max/mdev = 0.046/0.059/0.071/0.011 ms

but I cannot ping the virtual IP from node2 (192.168.6.x) or from outside. Did I miss something?
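
For reference, a quick way to see which path node2 would pick for the VIP (only the command; output not shown):

# on node2: which route would be used to reach 192.168.6.8?
ip route get 192.168.6.8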

PS: you probably want to set IP2UTIL=/sbin/ip in the /usr/lib/ocf/resource.d/heartbeat/VIPArip resource agent script if you get something like this:

Jun 5 11:08:10 node1 lrmd: [19832]: info: RA output: (VIPArip:stop:stderr) 2012/06/05_11:08:10 ERROR: Invalid OCF_RESKEY_ip [192.168.6.8]

http://www.clusterlabs.org/wiki/Debugging_Resource_Failures
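
For reference, the fix is just a one-line assignment near the top of the agent script (the exact path to the ip binary may differ on your distribution):

# in /usr/lib/ocf/resource.d/heartbeat/VIPArip
IP2UTIL=/sbin/ip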


Reply to @DukeLion:

Which router receives RIP updates?

When I start the VIPArip resource, ripd is started with the configuration file below (on node1):

/var/run/resource-agents/VIPArip-ripd.conf:

hostname ripd
password zebra
debug rip events
debug rip packet
debug rip zebra
log file /var/log/quagga/quagga.log
router rip
!nic_tag
 no passive-interface lo:0
 network lo:0
 distribute-list private out lo:0
 distribute-list private in lo:0
!metric_tag
 redistribute connected metric 3
!ip_tag
access-list private permit 192.168.6.8/32
access-list private deny any
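
To confirm that these updates actually leave node1, one simple check (eth1 here is an assumption, based on the routing table below) is to capture RIPv2 traffic, which uses UDP port 520:

# on node1: watch for outgoing RIPv2 updates (multicast 224.0.0.9, UDP 520)
tcpdump -ni eth1 udp port 520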

show ip route:

Codes: K - kernel route, C - connected, S - static, R - RIP,
       O - OSPF, I - IS-IS, B - BGP, A - Babel,
       > - selected route, * - FIB route

K>* 0.0.0.0/0 via 192.168.3.1, eth1
C>* 127.0.0.0/8 is directly connected, lo
K>* 169.254.0.0/16 is directly connected, eth1
C>* 192.168.3.0/24 is directly connected, eth1
C>* 192.168.6.8/32 is directly connected, lo

sh ip rip status:

Routing Protocol is "rip"
  Sending updates every 30 seconds with +/-50%, next due in 7 seconds
  Timeout after 180 seconds, garbage collect after 120 seconds
  Outgoing update filter list for all interface is not set
    lo:0 filtered by private
  Incoming update filter list for all interface is not set
    lo:0 filtered by private
  Default redistribution metric is 1
  Redistributing: connected
  Default version control: send version 2, receive any version 
    Interface        Send  Recv   Key-chain
  Routing for Networks:
    lo:0
  Routing Information Sources:
    Gateway          BadPackets BadRoutes  Distance Last Update
  Distance: (default is 120)
  • Which router receives RIP updates? It looks like the problem is in routing, not the cluster configuration. – DukeLion Jun 05 '12 at 12:00
  • @quanta Is the router that handles the inter-VLAN traffic listening for those RIP updates and successfully adding them to its routing table? – Shane Madden Jun 05 '12 at 17:07
  • @ShaneMadden: Could you please elaborate on the first part of your question? As you can see from above, the routing table doesn't include any RIP routes. – quanta Jun 06 '12 at 05:09
  • Well, this is the configuration of the RIP process sending updates, but you also need a router that receives them and routes traffic destined to the VIP to the active server. – DukeLion Jun 06 '12 at 08:25
  • @DukeLion: I would have to ask another department to configure routing on node2's default gateway (192.168.6.1) for that. Can I do this by setting up Quagga on node2 instead? Can you give me some suggestions? I've been re-reading [this](http://lists.linux-ha.org/pipermail/linux-ha/2006-May/019543.html) guide but am still confused. – quanta Jun 06 '12 at 10:54

1 Answer


I think the problem is not in the cluster configuration, but in your routing architecture.

The VIPArip resource agent manages a local Quagga instance that sends routing updates. But you also need something to consume those updates and change routes so they point to the active server. I'll try to explain how it works.

[RIP HA diagram]

Look at the picture. HA1 and HA2 are Linux-HA cluster members with Quagga running. The blue router listens for RIP updates on both network links.

When the VIP goes up on HA1, Quagga sends a RIP update to the blue router, which adds the VIP prefix to its routing table with 192.168.1.2 as the next hop.

When failover occurs, the VIP goes down on HA1 and Quagga stops completely, so no more updates are sent. The blue router removes the routing table entry after a timeout, even if the VIP does not come up on HA2. When the VIP goes up on HA2, Quagga is started there and begins sending RIP updates, and the blue router adds an entry to its routing table with 192.168.2.2 as the next hop.
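
If the blue router also happens to run Quagga, a minimal ripd.conf that would accept these updates could look like the sketch below (the 192.168.1.0/24 and 192.168.2.0/24 networks follow the picture and are assumptions):

! /etc/quagga/ripd.conf on the blue router (sketch)
hostname ripd
password zebra
router rip
 version 2
 network 192.168.1.0/24
 network 192.168.2.0/24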

It is possible to use VIPArip in a more complex network topology; just make sure your border routers receive the routing updates throughout your network.
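
To verify, check on each border router that the VIP prefix is actually learned via RIP, for example with Quagga:

# on the router: the VIP should show up with the active node's address as next hop
vtysh -c 'show ip route rip'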

DukeLion