
Using DRBD version: 8.2.6 (api:88/proto:86-88)

Here are the contents of /etc/ha.d/haresources

    db1     192.168.100.200/24/eth0 drbddisk::mysql Filesystem::/dev/drbd0::/drbd::ext3::defaults mysql

and /etc/ha.d/ha.cf

    logfile        /var/log/ha-log
    logfacility     local0
    keepalive 1
    deadtime 30
    warntime 10
    initdead 120
    udpport        694
    bcast  eth0, eth4  
    auto_failback off
    node    db1
    node    db2
    respawn hacluster /usr/lib64/heartbeat/ipfail
    apiauth ipfail gid=haclient uid=hacluster
    deadping 5

When testing failover between machines I ran the following commands on db2:

    service heartbeat stop
    service mysqld stop
    drbdadm down mysql
    service drbd stop

/proc/drbd on db1 reported

     0: cs:Connected st:Primary/Unknown ds:UpToDate/DUnknown C r---
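As an aside, the role field (st: in DRBD 8.2) is easy to pull out of /proc/drbd for scripted checks. A minimal sketch, reading from stdin so it can be exercised without a live device (on a real node you would pipe in cat /proc/drbd):

```shell
# Extract the local/peer roles from a DRBD 8.2-style status line.
drbd_roles() {
    grep -o 'st:[^ ]*' | cut -d: -f2
}

echo ' 0: cs:Connected st:Primary/Unknown ds:UpToDate/DUnknown C r---' | drbd_roles
# Prints: Primary/Unknown
```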

What happened next, after:

  • Bringing the services back online on db2
  • Transferring primary to db2 using hb_primary script
  • Taking db1 down as above
  • Bringing the services back online on db1
  • Transferring primary back to db1 using hb_primary script

was that db1 remounted the DRBD disk, assumed the correct IP address, and started MySQL. There was massive MySQL table corruption; it was all fixable (using InnoDB recovery mode 6, mysqlcheck, and the occasional backup), but how did it happen?
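For reference, a sketch of that recovery route (paths here are illustrative; innodb_force_recovery values above 4 put InnoDB into a salvage mode where the goal is to dump and rebuild, not to run live):

```shell
# Salvage sketch: assumes innodb_force_recovery = 6 has already been
# added under [mysqld] in /etc/my.cnf before starting the server.
service mysqld start
mysqldump --all-databases > /root/salvage.sql   # dump whatever is readable
mysqlcheck --all-databases                      # report remaining corruption
# Then remove innodb_force_recovery, restart mysqld, and restore the
# damaged tables from the dump (or from backup).
```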

I speculate:

  1. DRBD disconnected the disk from the filesystem while it was being used by MySQL, as a clean MySQL shutdown would not have resulted in corrupt data
  2. heartbeat controlled DRBD, and stopping the heartbeat service "pulled the plug" on DRBD
  3. this may happen again in the case of an actual failover (due to heartbeat ping timeout)

I do not have access to this setup again for some time, and would like to repeat the test.

Are the configuration settings correct?

Was the corruption the result of my manual testing?

Is there a better way to test failover than to stop the heartbeat service and let it run the haresources commands?

Andy

3 Answers


This probably isn't a big help, but this has been discussed extensively of late over at the Pacemaker and Linux-HA mailing lists.

I'm not very good with heartbeat, but with Pacemaker I would set up a constraint that causes the cluster resource manager to flush and write-lock the disk (or stop MySQL temporarily) before attempting the switchover, and then release the lock once the switch has completed.
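In Pacemaker's crm shell, that could be expressed as ordering and colocation constraints. This is only a sketch with hypothetical resource names, not the question's actual setup; note that an ordering constraint is symmetric by default, so mysqld is also stopped before the filesystem is unmounted on switchover:

```
primitive fs_mysql ocf:heartbeat:Filesystem \
    params device=/dev/drbd0 directory=/drbd fstype=ext3
primitive mysqld lsb:mysqld
# Start the filesystem before mysqld; on stop, the reverse order applies.
order mysqld_after_fs inf: fs_mysql mysqld
# Keep mysqld on the same node as its filesystem.
colocation mysqld_with_fs inf: mysqld fs_mysql
```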

Karl Katzke

From everything I've read, and from my limited experience with heartbeat, all you should have to do to manually fail over from one server to the other is issue the

    service heartbeat stop

command. Everything in your haresources file will be controlled by heartbeat. Case in point: I have a cluster I'm setting up that needs to run the following services:

    snmpd
    mysql

Here is the haresources config:

    localhost00 \
    drbddisk::home \
    Filesystem::/dev/drbd0::/opt/local::ext3::defaults \
    drbddisk::perf \
    Filesystem::/dev/drbd1::/opt/local/perf::ext3::noatime,data=writeback \
    IPaddr::1.1.1.1/24 \
    mysqld \
    snmpd

and here are the results I get (my apologies if it's a mess; I can't get the line breaks in the right spot):

    [root@localhost00 ~]# service snmpd status
    snmpd (pid 18558) is running...
    [root@localhost00 ~]# service mysqld status
    mysqld (pid 18509) is running...
    [root@localhost00 ~]# service drbd status
    drbd driver loaded OK; device status:
    version: 8.2.6 (api:88/proto:86-88)
    GIT-hash: 3e69822d3bb4920a8c1bfdf7d647169eba7d2eb4 build by buildsvn@c5-x8664-build, 2008-10-03 11:30:17
    m:res   cs         st                 ds                 p  mounted          fstype
    0:home  Connected  Primary/Secondary  UpToDate/UpToDate  C  /opt/local       ext3
    1:perf  Connected  Primary/Secondary  UpToDate/UpToDate  C  /opt/local/perf  ext3
    [root@localhost00 ~]# service heartbeat stop
    Stopping High-Availability services:
                                                               [  OK  ]
    [root@localhost00 ~]# service snmpd status
    snmpd is stopped
    [root@localhost00 ~]# service mysqld status
    mysqld is stopped
    [root@localhost00 ~]# service drbd status
    drbd driver loaded OK; device status:
    version: 8.2.6 (api:88/proto:86-88)
    GIT-hash: 3e69822d3bb4920a8c1bfdf7d647169eba7d2eb4 build by buildsvn@c5-x8664-build, 2008-10-03 11:30:17
    m:res   cs         st                   ds                 p  mounted  fstype
    0:home  Connected  Secondary/Secondary  UpToDate/UpToDate  C
    1:perf  Connected  Secondary/Secondary  UpToDate/UpToDate  C
    [root@localhost00 ~]#
    [root@zenoss00 ~]# service heartbeat start
    Starting High-Availability services:
                                                               [  OK  ]
    [root@zenoss00 ~]# service snmpd status
    snmpd is stopped
    [root@zenoss00 ~]# service mysqld status
    mysqld is stopped
    [root@zenoss00 ~]# service drbd status
    drbd driver loaded OK; device status:
    version: 8.2.6 (api:88/proto:86-88)
    GIT-hash: 3e69822d3bb4920a8c1bfdf7d647169eba7d2eb4 build by buildsvn@c5-x8664-bu
    m:res      cs         st                   ds                 p  mounted  fstype
    0:zenhome  Connected  Secondary/Secondary  UpToDate/UpToDate  C
    1:zenperf  Connected  Secondary/Secondary  UpToDate/UpToDate  C
    [root@zenoss00 ~]# service snmpd status
    snmpd (pid 23055) is running...
    [root@zenoss00 ~]# service mysqld status
    mysqld (pid 23006) is running...
    [root@zenoss00 ~]# service drbd status
    drbd driver loaded OK; device status:
    version: 8.2.6 (api:88/proto:86-88)
    GIT-hash: 3e69822d3bb4920a8c1bfdf7d647169eba7d2eb4 build by buildsvn@c5-x8664-build, 2008-10-03 11:30:17
    m:res      cs         st                 ds                 p  mounted           fstype
    0:zenhome  Connected  Primary/Secondary  UpToDate/UpToDate  C  /opt/zenoss       ext3
    1:zenperf  Connected  Primary/Secondary  UpToDate/UpToDate  C  /opt/zenoss/perf  ext3
    [root@zenoss00 ~]#

Notice that stopping heartbeat stopped all of the services assigned to heartbeat (mysqld, snmpd); also notice that DRBD is still running, and heartbeat did NOT stop it. DRBD needs to be running the whole time for failover to work.

Try your failover again, but don't run the drbd commands, and I think you'll avoid your data corruption.
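In other words, a minimal manual failover test, assuming the resource names from the question, would look something like this on the active node (DRBD itself stays up so replication continues across the switch):

```shell
service heartbeat stop    # heartbeat releases the IP, stops mysqld,
                          # unmounts /drbd and demotes DRBD to Secondary
cat /proc/drbd            # local role should show Secondary once released
service heartbeat start   # rejoin the cluster as the standby node
```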

womble

The way to test heartbeat is to issue service heartbeat stop on one machine; the cluster fails over and automatically brings up all the services on the other node. You do not want to turn off the DRBD services.

The other way to test is to do a hard reboot on one machine.
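To simulate a genuine crash rather than a clean shutdown, one option (on a test cluster only) is the kernel's magic SysRq trigger, which reboots the machine immediately without syncing disks or running any shutdown scripts:

```shell
echo 1 > /proc/sys/kernel/sysrq    # make sure SysRq is enabled
echo b > /proc/sysrq-trigger       # immediate reboot: no sync, no clean stop
```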