Heartbeat/DRBD failover didn't work as expected. How do I make the failover more robust?

Question

I had a DRBD/Heartbeat setup in which a node failed but no failover took place. The primary node locked up without going down completely: it was inaccessible via ssh and the NFS mount was unresponsive, but it still answered pings. The desired behavior would have been to detect this and fail over to the secondary node. There is a dedicated network connection from server to server, and it appears that because the primary never went fully down, Heartbeat's detection mechanism did not notice the problem and therefore did not fail over.

Has anyone seen this? Is there something I need to configure to get more robust cluster failover? DRBD otherwise seems to work fine (I had to resync when I rebooted the old primary), but without reliable failover, its use is limited.

heartbeat 3.0.4
drbd84
RHEL 6.1
We are not using Pacemaker

nfs03 is the primary server in this setup, and nfs01 is the secondary.

ha.cf

# Heartbeat logging
logfacility daemon
udpport 694


ucast eth0 192.168.10.47
ucast eth0 192.168.10.42

# Cluster members
node nfs01.openair.com
node nfs03.openair.com

# Heartbeat communication timing.
# Sets the triggers and pulse time for swapping over.
keepalive 1
warntime 10
deadtime 30
initdead 120


#fail back automatically
auto_failback on

and here is the haresources file:

nfs03.openair.com   IPaddr::192.168.10.50/255.255.255.0/eth0      drbddisk::data  Filesystem::/dev/drbd0::/data::ext4 nfs nfslock

score 2 · Answer 1 · answered Jun 19 '12 at 21:04

This is not a perfect solution, but I had this problem some 2-3 years ago with an older DRBD. What I did was add a cron script on both hosts that checked whether the host it ran on was the active master or the slave. On the slave, it checked whether a known file in the NFS directory was available. If not, it assumed that NFS was broken and sent a power-off command over ssh. You can try working along these lines. I'm sure there are better ways, but this one was good enough for me.
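
For illustration, the cron check described above could be sketched roughly like this in Python (3.7+). This is not the script I actually ran. The sentinel path, state file, failure threshold, and passwordless root ssh to the peer are all assumptions made up for the example; the resource name `data` and the host name are taken from the question. Note that it relies on the peer still accepting ssh, which was not the case in the lockup described in the question.

```python
import subprocess

# Hypothetical values for illustration; adjust to the real environment.
SENTINEL = "/mnt/nfscheck/.alive"  # known file on a test mount of the export
PEER = "nfs03.openair.com"         # node to power off when the export is dead
STATE_FILE = "/var/tmp/nfs_check.failures"
MAX_FAILURES = 3                   # consecutive failed runs before acting


def parse_role(drbdadm_output):
    """Extract the local role from `drbdadm role` output, e.g. 'Secondary/Primary'."""
    return drbdadm_output.strip().split("/")[0]


def sentinel_ok(path=SENTINEL, timeout=10):
    """True if the sentinel file can be stat'ed within `timeout` seconds.

    The stat runs in a child process so that a hung NFS mount does not
    block this script directly.
    """
    try:
        done = subprocess.run(["stat", path], timeout=timeout,
                              stdout=subprocess.DEVNULL,
                              stderr=subprocess.DEVNULL)
    except subprocess.TimeoutExpired:
        return False
    return done.returncode == 0


def next_failure_count(previous, probe_ok):
    """Reset the counter on success, otherwise count one more failure."""
    return 0 if probe_ok else previous + 1


def should_power_off(role, failures, max_failures=MAX_FAILURES):
    """Only the standby acts, and only after repeated failures."""
    return role == "Secondary" and failures >= max_failures


def read_failures(path=STATE_FILE):
    """Read the failure counter left by the previous cron run (0 if missing)."""
    try:
        with open(path) as fh:
            return int(fh.read().strip() or 0)
    except (OSError, ValueError):
        return 0


def main():
    role = parse_role(subprocess.run(["drbdadm", "role", "data"],
                                     capture_output=True, text=True).stdout)
    failures = next_failure_count(read_failures(), sentinel_ok())
    with open(STATE_FILE, "w") as fh:
        fh.write(str(failures))
    if should_power_off(role, failures):
        # Assumes passwordless root ssh to the peer is already set up.
        subprocess.run(["ssh", "-o", "ConnectTimeout=10",
                        "root@" + PEER, "poweroff"])
```

A cron entry would call `main()` once a minute or so. Counting consecutive failures before acting is my addition here, so that a single slow NFS response does not power off a healthy primary.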

score 2 · Accepted Answer · answered Jun 19 '12 at 21:08

I guess you will have to implement some monitoring to check whether your primary system behaves as expected. If any check fails, you should switch off the server (through IPMI/iLO or a switched PDU) and let Heartbeat do its job.

I think you will always find a situation in which it doesn't work as you would expect.
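
As a rough sketch of that idea, assuming the check runs on the standby or on a separate monitoring host: probe the services on the floating IP, and if any of them stops answering, power the primary off through its management controller so that Heartbeat sees a dead node. The BMC host name, user, and password file below are hypothetical, and so is the choice of ports. A plain TCP connect is also a shallow check, since a hung daemon may still accept connections, so a real setup would probably want a deeper test.

```python
import socket
import subprocess

# Hypothetical values; only the floating IP comes from the haresources file.
SERVICE_IP = "192.168.10.50"
PORTS = {"ssh": 22, "nfs": 2049}    # services that must answer
BMC_HOST = "nfs03-ipmi.example"     # management controller of the primary
BMC_USER = "admin"
BMC_PASSWORD_FILE = "/etc/ipmi.pass"


def port_open(host, port, timeout=5.0):
    """True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def failed_services(host, ports, probe=port_open):
    """Return the sorted names of all services whose port does not answer."""
    return sorted(name for name, port in ports.items()
                  if not probe(host, port))


def ipmi_power_off_cmd(bmc_host, user, password_file):
    """Build the ipmitool command line that powers the chassis off."""
    return ["ipmitool", "-I", "lanplus", "-H", bmc_host,
            "-U", user, "-f", password_file,
            "chassis", "power", "off"]


def main():
    failed = failed_services(SERVICE_IP, PORTS)
    if failed:
        print("unresponsive services:", ", ".join(failed))
        subprocess.run(
            ipmi_power_off_cmd(BMC_HOST, BMC_USER, BMC_PASSWORD_FILE),
            check=True)
```

Powering off through the BMC rather than over ssh means the fence still works when the operating system on the primary is locked up, which is the case the question describes.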

answered by Oliver

So to an extent this is expected in Heartbeat? I didn't know if I was configuring it wrong or if there were just some kinks with it. – Quinn Murphy Jun 19 '12 at 22:06
Heartbeat cannot possibly guard you against every failure, which is why you need some way to power off a node that doesn't behave as it should (google for `STONITH`). Without that, you could end up in a split-brain situation, something you absolutely want to avoid. – Oliver Jun 20 '12 at 05:56
