5

I have a pair of CentOS Linux servers in each datacenter. They have failover within each datacenter, managed by heartbeat and DRBD (I know these are outdated tools, but they are stable, so there's no desire to change them).

They can also switch between datacenters, making the east datacenter active while west becomes passive. But this is a manual engineering process, and that's okay.

The west datacenter is currently the active one, the east datacenter is passive.

serverA.west <-> serverB.west  <-----------> serverA.east <-> serverB.east
     ACTIVE DATA CENTER                           PASSIVE DATA CENTER

Servers can run mysqld and a Java application.

The Java application should run only on the Primary host in the active datacenter (i.e. serverA.west). If another instance of the Java application starts on the Secondary host (serverB.west), or on either host in the passive datacenter, there's a risk of split-brain problems.

Today serverA.east rebooted, which caused heartbeat to flip over to serverB.east. Heartbeat then dutifully started the Java app on serverB.east, which we don't want to happen.

Heartbeat also started mysqld on serverB.east, which is correct, because MySQL replication should keep going, replicating the changes from the west datacenter continuously so the east DC is ready to take over when needed.

/etc/ha.d/haresources names the /etc/init.d scripts for mysqld and the Java application as the resources to start.

We want to allow heartbeat to manage the A/B pair in the passive datacenter. It should start mysqld on a failover, but not the Java app. But if the east datacenter is the active one, then heartbeat should start the Java app during a heartbeat-automated failover.

What's a good way to implement this?

What I am hoping for is something that takes one step to configure as we switch the active datacenter from west to east. Ideally, it should be mistake-proof, i.e. it should be guaranteed that exactly one of the datacenters is configured as the active one.

Bill Karwin
  • DRBD is still actively developed and is by no means outdated. Heartbeat is still maintained, but yes, Corosync/Pacemaker is stable and actively developed. The way you're using Heartbeat has been deprecated for many years; so many that I would be surprised if you could find any documentation on implementing even the most basic clusters in this way anymore. If you were open to using the current de facto Linux HA cluster stack you could even automate failover between your two datacenters (by using Booth). Location/ordering constraints are what you're after, but I can't help with Heartbeat v1 :( – Matt Kereczman Dec 09 '16 at 17:04
  • @MattKereczman, thanks but we are not intending to rearchitect that part of our infrastructure, we plan to move everything to AWS instead. – Bill Karwin Dec 09 '16 at 17:59

2 Answers

2

I don't think you can do this with (native) heartbeat alone. You could use Pacemaker, which can work with quorums, but you don't have a quorum. Imagine that the link between the datacenters fails: east and west will each think it is the only survivor, and each will start the application, switch MySQL to master mode, and so on. Then you'll have a real split-brain situation.
IMHO, if you really need HA, you need a third datacenter. Then migrate MySQL to MariaDB with Galera Cluster and run your Java app in all three, maybe even in active-active-active mode.

  • We do not need to automate failover between datacenters, only failover between A/B nodes within each datacenter. But thanks for your answer. What you suggest is of course correct if we needed to automate failover between datacenters. – Bill Karwin Dec 09 '16 at 16:16
  • "But if the east datacenter is the active one, then heartbeat should start the Java app during a heartbeat-automated failover." That is automated failover. But the current configuration cannot distinguish between two situations: the west datacenter failing, or only the link between west and east failing. I would set up a third arbitrator machine with a trivial script, something like:
      WEST_FAILED=0
      if [ dc_active(west) ]; then
          [ $WEST_FAILED -eq 1 ] && ssh east service Java stop
          sleep 10
      elif [ dc_active(east) ]; then
          WEST_FAILED=1
          ssh east service Java start
      fi
    – Bob Grandys Dec 10 '16 at 10:52

0

The solution I came up with is to keep two versions of /etc/ha.d/haresources.

root:/etc/ha.d$ ls -l
lrwxrwxrwx 1 root root   16 Dec 22 10:31 haresources -> haresources-dark
-rw-r--r-- 1 root root  151 Dec 22 10:22 haresources-dark
-rw-r--r-- 1 root root  161 Dec 22 10:30 haresources-live

The "haresources-dark" version is used on all servers in the DR datacenter (east). On those servers, haresources is a symlink that points to haresources-dark.

The only difference between the two versions of haresources is the mention of Java applications. In the dark version, Java applications are not started.
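Since the two files differ only in the Java resource, one way to keep them from drifting apart is to generate the dark file from the live one. This is a minimal sketch, not what is actually deployed: `javaapp` is a hypothetical name standing in for the real init script, and the filter assumes the resource appears as a bare name with no `::` arguments.

```shell
#!/bin/sh
# Sketch: derive haresources-dark from haresources-live by dropping one resource.
# "javaapp" is a hypothetical init-script name; substitute the real one.

# make_dark RESOURCE < live-file > dark-file
# Drops every whitespace-separated field equal to RESOURCE and keeps the rest.
# Note: runs of whitespace are collapsed to single spaces in the output.
make_dark() {
    awk -v res="$1" '{
        out = ""
        for (i = 1; i <= NF; i++)
            if ($i != res) out = (out == "" ? $i : out " " $i)
        print out
    }'
}

# Example (not run automatically):
#   make_dark javaapp < /etc/ha.d/haresources-live > /etc/ha.d/haresources-dark
```

Regenerating the dark file whenever the live one changes means there is only one file to edit by hand.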

If/when we ever switch to the DR datacenter, we'll have to update these symlinks manually. But that is acceptable.
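The manual step could be wrapped in a small function so that it is one command per server. This is only a sketch under assumptions: the function name `set_mode` is made up for illustration, and `mv -T` is a GNU coreutils option, so it is not portable to every Unix. Whether heartbeat picks up the new file without a reload should be verified before relying on it.

```shell
#!/bin/sh
# Sketch: repoint the haresources symlink in one step.
# set_mode DIR MODE  ->  DIR/haresources becomes a symlink to haresources-MODE

set_mode() {
    dir=$1
    mode=$2
    case "$mode" in
        live|dark) ;;
        *) echo "mode must be 'live' or 'dark'" >&2; return 2 ;;
    esac
    # Refuse to point the link at a file that does not exist.
    [ -f "$dir/haresources-$mode" ] || {
        echo "missing $dir/haresources-$mode" >&2
        return 1
    }
    # Create the new link under a temporary name, then rename it over the
    # old one, so haresources never disappears in between.
    ln -s "haresources-$mode" "$dir/haresources.tmp.$$" || return 1
    mv -T "$dir/haresources.tmp.$$" "$dir/haresources"
}

# Example (not run automatically):
#   set_mode /etc/ha.d live
```

It still has to be run on every heartbeat-managed server in the datacenter being switched.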

This is not mistake-proof. I have to manually set up the symlinks on all my heartbeat-managed servers in the DR datacenter. And there's nothing to enforce that one datacenter is "dark" and the other is "live." This is going to be a manual solution for now.
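A check that could be run after any switch, to catch the obvious mistakes, might look like the sketch below. It does not enforce anything; it only reports. It assumes the live datacenter uses the same symlink scheme (haresources -> haresources-live), and the function name `check_modes` is hypothetical.

```shell
#!/bin/sh
# Sketch: verify that exactly one datacenter is "live" and the rest are "dark".
# check_modes reads lines of the form "<datacenter> <symlink-target>" on stdin
# and returns 0 only if:
#   - every target is haresources-live or haresources-dark,
#   - no datacenter mixes live and dark hosts, and
#   - exactly one datacenter is live.

check_modes() {
    awk '
        { seen[$1] = 1 }
        $2 == "haresources-live" { live[$1]++ }
        $2 == "haresources-dark" { dark[$1]++ }
        $2 != "haresources-live" && $2 != "haresources-dark" { bad = 1 }
        END {
            n = 0
            for (dc in seen) {
                if (live[dc] && dark[dc]) bad = 1   # mixed pair
                if (live[dc]) n++
            }
            exit ((bad || n != 1) ? 1 : 0)
        }'
}

# Example (not run automatically; assumes ssh access to all four hosts):
#   for h in serverA.west serverB.west serverA.east serverB.east; do
#       echo "${h#*.} $(ssh "$h" readlink /etc/ha.d/haresources)"
#   done | check_modes || echo "datacenter modes are inconsistent" >&2
```

A host that cannot be reached produces an empty target and is counted as a failure, which seems like the safer default.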

Bill Karwin