keepalived VRRP_script not failing over

Question

I am running keepalived on two servers and I can't get it to fail over to the other one.

Below is my config for one of the servers. The only difference between the two is the priority: the master has 110 and the backup has 109.

But when I stop my process with /etc/init.d/process stop, keepalived doesn't fail over. I just get "VRRP_Script(chk_script) failed" and nothing else. No failover, nothing.

vrrp_script chk_script {
script "/usr/local/bin/failover.sh"
interval 2
weight 2
}

vrrp_instance HAInstance {
state BACKUP
interface eth0
virtual_router_id 8
priority 109
advert_int 1
nopreempt
vrrp_unicast_bind 10.10.10.8
vrrp_unicast_peer 10.10.10.9
virtual_ipaddress {
  10.10.10.10/16 dev eth0
}
notify /usr/local/bin/keepalivednotify.sh
track_script {
  chk_script weight 20
}
}

This is my check script. The same problem also happens when I use killall -0 process as the script.

#!/bin/bash
# Exit 0 if the monitored process is running, 1 otherwise.
SERVICE='process'
STATUS=$(ps ax | grep -v grep | grep "$SERVICE")

if [ -n "$STATUS" ]
then
    exit 0
else
    exit 1
fi

Does anyone know a fix for this? Thanks.

Does your backup instance notice the priority change or log anything? Logs from both would be helpful. — Jim G., Aug 31 '15 at 21:51
No it does not. The only time it notices a change is when the master goes away. Such as when I stop keepalived. Stopping the process i am monitoring only shows VRRP_Script(chk_script) failed on the master. With nothing on the slave. — Nvasion, Aug 31 '15 at 22:06

giomanda · Answer 1 · 2016-09-19T11:23:10.983

I had exactly the same issue. However, my problem was neither the firewall nor my Ethernet adapter but the "weight" setting of the check script.

This was my configuration:

MASTER:

vrrp_instance haproxy {
state MASTER
interface eth0
virtual_router_id 51
priority 150
advert_int 1

BACKUP:

vrrp_instance haproxy {
state BACKUP
interface eth0
virtual_router_id 51
priority 100
advert_int 1

Check_script:

vrrp_script chk_haproxy {
   script "python /root/ha_check.py"
   interval 2     # check every 2 seconds
   weight 2
   rise 2
   fall 2

}

The master refused to release the VIP because, even though the script had failed, the master still had a higher priority than the BACKUP server. The "weight" setting of the check script was too small to cover the gap between the two priorities, so a failed check could never leave the BACKUP server with a higher priority than the MASTER. To explain further:

According to the manual of keepalived, a positive number on the "weight" setting will add that number to the priority if the check succeeds.
A negative number will subtract that number from priority number if the check fails.

So, according to my configuration (and assuming the check keeps succeeding on the BACKUP server):

Server priorities BEFORE the script fails:
MASTER: 152 (150 + 2)
BACKUP: 102 (100 + 2)
Failover_IP: MASTER

The failover IP is correctly "grabbed" by the MASTER server, since it has a higher priority than the BACKUP server (152 > 102).

Server priorities AFTER the script fails on the MASTER:
MASTER: 150 (the +2 bonus is lost)
BACKUP: 102
Failover_IP: STILL ON MASTER

The failover IP stays on the MASTER server because it still has a higher priority than the BACKUP (150 > 102). The MASTER refused to release the IP, and rightly so, since its priority was still higher than the other server's.

The solution in my situation was:

Solution 1: Change the priority numbers of both servers so that the gap between them is smaller than the weight.
For example:
Master priority: 150
Backup priority: 149
Check_script weight: as it is (2).

With the above configuration, when the script succeeds on both servers (meaning all is OK), the priorities are:
Master: 152
Backup: 151
IP_Location: On Master (152 > 151)

When the script fails on the Master:
Master: 150
Backup: 151
IP_Location: On Backup (151 > 150)

Solution 2: Change the weight of the script from 2 to -60. A failed check on the Master then lowers its priority to 90 (150 - 60), which is below the Backup's 100.
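
The arithmetic behind both solutions can be sketched as a small shell function. This is only an illustration of the weight rule quoted from the manual above, not keepalived code: the function name effective_priority and the "ok"/"fail" labels are made up for this example, the check is assumed to keep succeeding on the backup, and weight 0 (which behaves differently) is not modelled.

```shell
#!/bin/bash
# effective_priority BASE WEIGHT STATUS
# STATUS is "ok" or "fail" (hypothetical labels for the check result).
# Positive weight: added to the priority while the check succeeds.
# Negative weight: subtracted from the priority while the check fails.
effective_priority() {
    local base=$1 weight=$2 status=$3
    if [ "$weight" -gt 0 ] && [ "$status" = "ok" ]; then
        echo $((base + weight))
    elif [ "$weight" -lt 0 ] && [ "$status" = "fail" ]; then
        echo $((base + weight))
    else
        echo "$base"
    fi
}

# Check failing on the master only, in each of the three setups:
echo "original:   master=$(effective_priority 150 2 fail) backup=$(effective_priority 100 2 ok)"
echo "solution 1: master=$(effective_priority 150 2 fail) backup=$(effective_priority 149 2 ok)"
echo "solution 2: master=$(effective_priority 150 -60 fail) backup=$(effective_priority 100 -60 ok)"
```

Failover only happens in the setups where the backup's number ends up higher than the master's.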

It also seems like not specifying a weight at all means that a failed track_script will trigger the fault state directly — Oscar, Jun 21 '17 at 07:13
@Nvasion : Kindly accept this answer as I too got my issue resolved. — Ankur Soni, Jun 13 '18 at 13:24

Patrick Wagner · Answer 2 · 2015-10-10T17:16:28.447

I've had the same issue - two CentOS 7.1 servers with track_script, and failing the vrrp_script on the MASTER would only result in the lone log message "VRRP_Script(chk_script) failed", not in a failover. On the BACKUP server, however, I got a lot of messages of keepalived trying to take over the virtual IP for as long as I had the track_script on the MASTER server fail.

Solution in my case: The firewall (iptables) on the MASTER server wasn't configured correctly to allow VRRP packets / multicast packets, while at the same time the firewall on the other server, the BACKUP, was configured correctly.

I had entered the same iptables rules into both servers as follows:

iptables -A INPUT -i eth0 -d 224.0.0.0/8 -j ACCEPT
iptables -A INPUT -p vrrp -i eth0 -j ACCEPT

This worked on one of the servers (the BACKUP VRRP server) but not the MASTER one because I'd forgotten that the interface wasn't named 'eth0' on the MASTER server, thus the two rules had no effect at all.

This explained the behavior I'd observed:

If keepalived cannot see any other VRRP speaker for a certain virtual_router_id, it still believes itself to be the one with the highest priority (thus rightful MASTER) even after a negative weight modification as it never receives VRRP messages with a priority higher than its own (because advertisements of other speakers are blocked by the firewall and can never reach the keepalived process to make it aware of them). That's why you don't see it release the VIP.

The BACKUP server, however, was able to see the adverts of the (now failed) MASTER, found the priority in those packets reduced to a value less than its own, and proceeded to declare itself MASTER and send gratuitous ARPs to claim the VIP. So we ended up in a situation where both servers thought they'd need to serve the VIP as MASTER.

Conclusions:

- Always check the firewall configuration on all VRRP speakers if you experience strange behavior (no failover, several MASTERs). Keepalived logging isn't quite as helpful as it could be (a simple message "VIP not released because I'm still highest prio" after the "VRRP_Script(chk_script) failed" line would've eased troubleshooting immensely).

- A track_script is not an on/off type of switch ("if script OK: eligible for VIP election; if NOT OK: completely ineligible for VIP election") - it merely increases / decreases the likelihood of winning the election, and if keepalived only ever observes itself as the only VRRP speaker and never receives any messages of other speakers, there's not much of an election really - you always win.
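
One way to avoid the hard-coded 'eth0' mistake is to take the interface name from the keepalived configuration itself. The following is a rough sketch under assumptions: the helper names vrrp_interface and vrrp_rules are made up, the helper only looks at the first "interface" line in the file, and the rules are printed for review rather than applied.

```shell
#!/bin/bash
# vrrp_interface CONF
# Print the value of the first "interface" line in a keepalived config file.
vrrp_interface() {
    awk '$1 == "interface" { print $2; exit }' "$1"
}

# vrrp_rules IFACE
# Print (not apply) the two rules from this answer for the given interface.
vrrp_rules() {
    printf 'iptables -A INPUT -i %s -d 224.0.0.0/8 -j ACCEPT\n' "$1"
    printf 'iptables -A INPUT -p vrrp -i %s -j ACCEPT\n' "$1"
}

# Usage sketch (the path below is the usual default and may differ on your system):
#   vrrp_rules "$(vrrp_interface /etc/keepalived/keepalived.conf)"
```

Running something like tcpdump -n -i <interface> vrrp on each node is another way to confirm that advertisements from the peer actually arrive.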

score 0 · Answer 3 · edited Oct 07 '21 at 08:14

I just ran into the same situation as you and did some reading about keepalived. Let's think about what is happening on each server, assuming you want to implement a manual-failback architecture.

On the 1st BACKUP node

Every time the track_script fails "fall" times in a row, the node sends an advertisement to the 2nd BACKUP node. The point here is the Priority set inside the advertisement. In your case,

129 (109 + 20)

is sent to the 2nd BACKUP server.

On the 2nd BACKUP server

According to the VRRP RFC, a node in the Backup state handles a received ADVERTISEMENT as follows:

If the Priority in the ADVERTISEMENT is Zero, then:

  o  Set the Master_Down_Timer to Skew_Time
else:

  If Preempt_Mode is False, or If the Priority in the
  ADVERTISEMENT is greater than or equal to the local
  Priority, then:

    o Reset the Master_Down_Timer to Master_Down_Interval
  else:

    o Discard the ADVERTISEMENT
  endif
endif

Since you have nopreempt enabled and the node is receiving a higher-priority VRRP advertisement, the 2nd BACKUP node does not go through a state transition.
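
The quoted decision logic can be sketched as a shell function to make the effect of nopreempt easier to see. This mirrors the RFC pseudocode only, and keepalived's actual implementation may differ; the function name backup_on_advert and the printed strings are made up for this example, and the priorities other than 129 are illustrative.

```shell
#!/bin/bash
# backup_on_advert ADV_PRIO LOCAL_PRIO PREEMPT
# PREEMPT is "true" or "false" (nopreempt corresponds to "false").
# Prints what a node in Backup state does with a received advertisement.
backup_on_advert() {
    local adv=$1 local_prio=$2 preempt=$3
    if [ "$adv" -eq 0 ]; then
        echo "set Master_Down_Timer to Skew_Time"
    elif [ "$preempt" = "false" ] || [ "$adv" -ge "$local_prio" ]; then
        echo "reset Master_Down_Timer"
    else
        echo "discard advertisement"
    fi
}

backup_on_advert 129 110 false   # higher priority received, nopreempt
backup_on_advert 100 110 false   # lower priority received, nopreempt
backup_on_advert 100 110 true    # lower priority received, preemption allowed
backup_on_advert 0 110 false     # priority 0 received
```

As long as the timer keeps being reset, the node stays in Backup state. With nopreempt that happens for every non-zero advertisement, whatever its priority.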

Solution

So if you want the state transition to happen on the 2nd node, you can do one of the following:

Set weight to 0 on the 1st BACKUP node. This will send a Priority 0 advertisement to the 2nd BACKUP node. The doc describes weight 0 in more detail.
Turn off nopreempt on the 2nd BACKUP node.
Set weight to at least -2 on the 1st BACKUP node.

A RFC (request for comments) is not a good source of information to configure a software, because that software might not follow the recommendations of that RFC. — salvador, Jun 02 '20 at 09:44
