We have been using keepalived in combination with a virtual IP address for two years now. In the rare case that a machine crashes outright, this works very well.
But when there are issues on the box itself, we have seen a couple of cases where no failover took place. For example, we had an issue where the system was swapping all the time. The load was 25 instead of the normal 5, and there was no way to ssh into the machine, although ping still worked. Keepalived kept running, and the virtual IP address was not taken over by the slave.
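To make the swapping case concrete, here is a minimal sketch of the kind of check script I could hook into keepalived, which treats exit code 0 as healthy and anything else as failed. The thresholds, the function names and the use of /proc/meminfo are my own assumptions for illustration, not something we run today, and I am not sure such a script would even get scheduled in time on a box that is thrashing.

```python
#!/usr/bin/env python3
"""Hypothetical keepalived check: healthy only if load and swap use are low."""
import os

MAX_LOAD = 15.0          # assumed threshold; our normal load is about 5
MAX_SWAP_FRACTION = 0.5  # assumed threshold for the share of swap in use


def read_swap_fraction(meminfo_path="/proc/meminfo"):
    """Return the fraction of swap in use (0.0 if no swap is configured)."""
    values = {}
    with open(meminfo_path) as f:
        for line in f:
            key, _, rest = line.partition(":")
            if key in ("SwapTotal", "SwapFree"):
                values[key] = int(rest.split()[0])  # value is given in kB
    total = values.get("SwapTotal", 0)
    if total == 0:
        return 0.0
    return (total - values.get("SwapFree", 0)) / total


def is_healthy(load1, swap_fraction,
               max_load=MAX_LOAD, max_swap=MAX_SWAP_FRACTION):
    """Pure decision function so the thresholds can be tested in isolation."""
    return load1 <= max_load and swap_fraction <= max_swap


def main():
    """Return the exit code the check script should finish with."""
    load1 = os.getloadavg()[0]  # 1-minute load average
    return 0 if is_healthy(load1, read_swap_fraction()) else 1
```

To use it as a check script, the file would end with `raise SystemExit(main())`. With the assumed thresholds, the load of 25 we saw would have marked the node as failed, while the normal load of 5 would not.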
We also had a situation in a MySQL HA setup where somebody locked the complete database by mistake, by running a backup on the master instead of the slave. That was not picked up either.
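For the locked database, I imagine a check would have to attempt a real write and give up after a timeout, since a process-level check still sees mysqld running. A rough sketch follows, assuming the lock blocks writes; the `monitor.heartbeat` table and the 3 second timeout are hypothetical, and credentials for the `mysql` client are assumed to come from its usual config file.

```python
import subprocess


def probe(cmd, timeout_s=3.0):
    """Return True only if cmd exits with status 0 within timeout_s seconds."""
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,  # a write blocked by a lock ends up here
        )
    except (subprocess.TimeoutExpired, OSError):
        # Timed out, or the command could not be started at all.
        return False
    return result.returncode == 0


# Hypothetical write probe: table name and columns are assumptions.
WRITE_PROBE = [
    "mysql", "-e",
    "REPLACE INTO monitor.heartbeat (id, ts) VALUES (1, NOW())",
]


def main():
    """Exit code for keepalived: 0 if the write went through, 1 otherwise."""
    return 0 if probe(WRITE_PROBE) else 1
```

The idea is that a hung write counts as a failure in the same way as a refused connection, so the check reports what a client would actually experience.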
Is the issue here that I am just using the wrong scripts to check on the machine itself if the master is working fine, or is this typical for a virtual IP setup?
It feels strange to me that you don't use a third system to determine whether the master is available. Of course I understand why: with keepalived, the decision to give up the virtual IP is made on the master itself, by the master.
I noticed lately that for Redis HA setups people are using ZooKeeper (e.g. https://github.com/ryanlecompte/redis_failover). Is that because of the limitations I ran into?