Recovery strategy for Master-Master replication

Question

I have implemented a HA solution for mysql based on master-master replication. There is a mechanism on the front end part which guarantees that only one db will be read/written to at a given time (i.e. we only use replication for HA).

I have confirmed replication works as expected, but I am wondering about failure scenario and recovery. In particular, I worry about what happens when one master fails in an unrecoverable state, and needs to be recreated from the other master:

As the other master is live and most likely used, I cannot lock it and create dumps from mysqldump (our databases are moderately large, and mysqldump can easily take hours after a few months of usage).
Even assuming I have a dump, it is crucial that the binlog position as shown by SHOW MASTER STATUS correspond to the dump being done after the database has been locked.

The simple solution to the first problem is to use a third database acting as a backup, from which I can do the mysqldump. But then how do I make sure the recreated master can start replication from the running master in a consistent way?

Consider adding a slave to one of the masters so that you can perform your dumps from that. It'll help for backups as well. — John Gardeniers, Apr 11 '11 at 23:55

score 2 · Accepted Answer · answered Apr 11 '11 at 23:51

There are two basic approaches to this problem that I am aware of. First, if you are running InnoDB instead of Myisam, then you can do the backup in a transaction (--single-transaction --lock-tables=FALSE), which combined with --flush-logs (not required but nice) and --master-data will give you a consistent backup with replication position information. Flush logs will reset the logs before the dump is created, which means the position will always be 106, and --master-data will put the logfile name and position right in the dump file. Of course, you have to run this on the master for --master-data to work.

The second way, which you mentioned, is to use a third host to create the backups. In this case, you need to stop replication, make sure the DB is read_only (although really, all your replicas should be read only since this does not effect writes from replication) and then create your backup AND record the replication position. You can not use --master-data in this case. Instead, you might do something like this:

echo 'stop slave' | mysql {options)
mysqldump {your options} > DB.sql
echo 'show slave status\G' > DB.replication
echo 'start slave' | mysql {options)

If you ever need to restore from this backup, you would run the restore and then setup replication where the two parameters master_log_file and master_log_pos come from the DB.replication file:

master_log_file = value of Master_Log_File
master_log_pos = value of Exec_Master_Log_Pos

Note: you can AND SHOULD test this from another replica.

Additional note: if you have a pool of replicas (for example if you have separated reads from writes for a web app) it is possible for the replicas to be out of sync with the new master; this can happen if the failover occurs during a period of heavy write I/O since the replicas stream asynchronously and there is no guarantee that your standby is at the same position as the other replicas when you failover. However, this hasn't happened to me yet...

great, thanks. All this is actually documented in mysqldump. Why it is not there in the main mysql doc is beyond me... — David Cournapeau, Apr 13 '11 at 02:44

Recovery strategy for Master-Master replication

1 Answers1