I have the following setup on a Linux stack, with the front-ends running an nginx proxy plus static assets and the back-ends running Ruby on Rails and MySQL in master-master replication:
- Primary site: front-end.a, back-end.a
- Secondary site: front-end.b, back-end.b
- A router sitting on a shared network that can route to both primary and secondary sites
The primary site serves requests most of the time. The secondary site is redundant. back-end.b is in master-master replication with back-end.a but is read-only.
When the primary site goes down, requests need to be redirected to the secondary site. The secondary site will serve a 503 Service Unavailable page until manual intervention confirms that the primary site won't come back, at which point an operator hits the big switch that makes the secondary site live and read-write.
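For reference, the 503 behaviour on the standby front-end is just a maintenance toggle in nginx; roughly something like the sketch below. The flag-file path, upstream name, and document root are illustrative assumptions, not our actual config:

```
# sketch: front-end.b serves 503 until an operator removes the flag file
upstream backend_rails {
    server back-end.b:3000;            # assumed Rails app server/port
}

server {
    listen 80;

    # /etc/nginx/maintenance.flag is a hypothetical marker file
    if (-f /etc/nginx/maintenance.flag) {
        return 503;
    }

    error_page 503 @maintenance;
    location @maintenance {
        root /var/www/maintenance;     # assumed path to the static 503 page
        rewrite ^ /503.html break;
    }

    location / {
        proxy_pass http://backend_rails;
    }
}
```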
The primary site can then be brought back in a controlled fashion, with back-end.a becoming a read-only replication slave of back-end.b. When everything on the primary site is ready again, front-end.b will start serving 503 Service Unavailable, back-end.b will switch to read-only, requests will be redirected to the primary site, and finally the primary site will become read-write again.
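To make that fail-back order concrete, here is a rough sketch of the manual steps as I understand them. The hostnames are the ones above; the mysql invocations and the flag file are illustrative assumptions, not our actual scripts:

```
# 1. rejoin back-end.a as a read-only replica of back-end.b and let it catch up
mysql -h back-end.a -e "SET GLOBAL read_only = ON; START SLAVE;"
mysql -h back-end.a -e "SHOW SLAVE STATUS\G"   # wait for Seconds_Behind_Master: 0

# 2. stop taking writes on the secondary site
#    (front-end.b starts serving 503, back-end.b goes read-only)
ssh front-end.b touch /etc/nginx/maintenance.flag   # hypothetical toggle
mysql -h back-end.b -e "SET GLOBAL read_only = ON;"

# 3. redirect traffic back to the primary site (move the virtual IP there),
#    then open up writes on back-end.a
mysql -h back-end.a -e "SET GLOBAL read_only = OFF;"
```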
The priorities:
- The site must not become completely dead and unreachable
- Switchover to a live working site must be fairly fast
- Preventing data loss / inconsistency is more important than absolute reliability
Now, the current approach being used is Linux-HA / Heartbeat / Pacemaker, using a virtual IP shared between front-end.a and front-end.b, with a location preference set to front-end.a.
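For context, the Pacemaker side is essentially a single IPaddr2 resource plus a location constraint, roughly like this (the IP, netmask, constraint name and score are placeholders):

```
# crm configure sketch (placeholder IP, netmask and score)
primitive vip ocf:heartbeat:IPaddr2 \
    params ip=192.0.2.10 cidr_netmask=24 \
    op monitor interval=10s
location prefer-primary vip 100: front-end.a
```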
This works excellently for failing over the IP if the primary site disappears. However, the level of manual control thereafter is rather lacking.
For example, after the primary site has failed and the secondary site needs to be brought up, we need to ensure the primary site doesn't try to steal back the IP address when it comes back up. However, Linux-HA doesn't seem to support this very well. crm resource move is the documented command for moving a resource (it works by adding an infinite-weight location rule), but if the resource has already failed over, the command fails, saying the resource has already been moved. Adding an explicit higher-weight location preference doesn't seem to work reliably either. So far the most reliable approach has been to remove the existing location rule and replace it with a new rule preferring the secondary site, as sketched below. This feels like fighting the tool and trying to make it do something it wasn't designed to do.
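Concretely, the workaround that has proved most reliable so far looks roughly like this (constraint names follow the sketch above and are otherwise arbitrary):

```
# after failover, stop front-end.a from taking the IP back
crm configure delete prefer-primary
crm configure location prefer-secondary vip 100: front-end.b
```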
And there are other oddities with Linux-HA. Frequently the cluster gets stuck in a split-brain state while booting up: both nodes are sending out heartbeat packets (verified with packet sniffing) and both nodes can ping one another, but crm_mon on each reports the other node as offline. The heartbeat service needs to be restarted on one node or the other to get things working, and sometimes it needs a SIGKILL rather than a SIGTERM to bring it down. Also, crm_mon shows that the CIB (the cluster database) is replicated almost instantaneously when the configuration is altered on either front-end.a or front-end.b, but Pacemaker takes its time actually moving the IP resource: it can take several minutes to move across, potentially putting our SLAs at risk.
So I'm starting to look at other options that are more focused on virtual IPs and IP failover rather than general clustered resources. The two other options I see are ucarp and keepalived.
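In case it helps frame the comparison, the keepalived equivalent would be a single VRRP instance on each front-end, something like the sketch below (interface, virtual_router_id, priorities and IP are placeholders). The nopreempt option is the part that interests me, since it is meant to stop a recovered node from grabbing the address back, though as far as I know it only applies when both instances start in state BACKUP:

```
# /etc/keepalived/keepalived.conf on front-end.a (sketch, placeholder values)
vrrp_instance VI_FRONTEND {
    state BACKUP          # both nodes start as BACKUP so nopreempt applies
    nopreempt             # don't steal the VIP back after recovery
    interface eth0
    virtual_router_id 51
    priority 150          # front-end.b would use a lower priority, e.g. 100
    advert_int 1
    virtual_ipaddress {
        192.0.2.10/24
    }
}
```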
However, given the amount of time I've already spent setting up Heartbeat and friends and trying to make it work, I'd like feedback on the best approach for this setup.