I have implemented a HA solution for mysql based on master-master replication. There is a mechanism on the front end part which guarantees that only one db will be read/written to at a given time (i.e. we only use replication for HA).
I have confirmed replication works as expected, but I am wondering about failure scenario and recovery. In particular, I worry about what happens when one master fails in an unrecoverable state, and needs to be recreated from the other master:
- As the other master is live and most likely used, I cannot lock it and create dumps from
mysqldump
(our databases are moderately large, andmysqldump
can easily take hours after a few months of usage). - Even assuming I have a dump, it is crucial that the binlog position as shown by SHOW MASTER STATUS correspond to the dump being done after the database has been locked.
The simple solution to the first problem is to use a third database acting as a backup, from which I can do the mysqldump
. But then how do I make sure the recreated master can start replication from the running master in a consistent way?