Setup:
Galera Cluster with 3 nodes running in Docker containers. In front of every node is a MaxScale proxy. A replication slave runs remotely with binlog replication, with the cluster (behind the MaxScale proxy) acting as the master. Every node in the cluster has the binlog enabled and log_slave_updates=ON. GTID is not configured on the slave.
Problem:
Somehow DB2 (see below) is on a different binlog file number: .000016 instead of .000015 like the other two servers. Interestingly, all three report the same position, so I assume the current files have the same content: they have the same creation time and exactly the same file size (though md5sum shows different checksums).
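To compare the actual events rather than raw bytes (the checksum difference could come from header events or timestamps alone), my plan is to list the events in the current file on each node, something like:

-- run on each node; on db-02 the current file is data-master-bin.000016
SHOW BINLOG EVENTS IN 'data-master-bin.000015' FROM 4 LIMIT 10;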
At some point the slave got out of sync with the following logs:
2022-06-08 0:27:20 9 [Note] Slave I/O thread: Failed reading log event, reconnecting to retry, log 'data-master-bin.000015' at position 3374796
2022-06-08 0:28:20 9 [ERROR] Slave I/O: error reconnecting to master 'user@url.to.server:some_port' - retry-time: 60 maximum-retries: 86400 message: Can't connect to MySQL server on 'url.to.server' (110 "Connection timed out"), Internal MariaDB error code: 2003
2022-06-08 0:29:20 9 [Note] Slave: connected to master 'user@url.to.server:some_port',replication resumed in log 'data-master-bin.000015' at position 3374796
2022-06-08 0:29:20 9 [ERROR] Error reading packet from server: binlog truncated in the middle of event; consider out of disk space on master; the first event 'data-master-bin.000015' at 3374796, the last event read from 'data-master-bin.000015' at 3374796, the last byte read from 'data-master-bin.000015' at 3374815. (server_errno=1236)
2022-06-08 0:29:20 9 [ERROR] Slave I/O: Got fatal error 1236 from master when reading data from binary log: 'binlog truncated in the middle of event; consider out of disk space on master; the first event 'data-master-bin.000015' at 3374796, the last event read from 'data-master-bin.000015' at 3374796, the last byte read from 'data-master-bin.000015' at 3374815.', Internal MariaDB error code: 1236
2022-06-08 0:29:20 9 [Note] Slave I/O thread exiting, read up to log 'data-master-bin.000015', position 3374796
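After errors like these the I/O thread stops; to see where the slave stands I check the usual status fields:

-- on the slave
SHOW SLAVE STATUS\G
-- fields of interest: Slave_IO_Running, Master_Log_File, Read_Master_Log_Pos, Last_IO_Error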
I can't really say whether the connection was down at some point. My assumption is that the MaxScale proxy usually redirects all incoming requests to the first server, as there is hardly any load on the DBs. At this specific time, however, the request got redirected to db-02, which caused the error, since its data-master-bin.000015 differs from the file of the same name on the other two servers. I was able to run START SLAVE on the slave and replication is now up and running again, but I fear it will break as soon as MaxScale redirects to db-02 again.
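One stop-gap I am considering is pointing the slave directly at a single node instead of the proxy, so the binlog it reads cannot switch servers underneath it. A sketch (host and port are placeholders; note that changing MASTER_HOST invalidates the old coordinates, so they have to be restated):

STOP SLAVE;
CHANGE MASTER TO
  MASTER_HOST='db-01.example.internal',       -- hypothetical direct address of node 1
  MASTER_PORT=3306,                           -- hypothetical port
  MASTER_LOG_FILE='data-master-bin.000015',   -- coordinates taken from SHOW SLAVE STATUS
  MASTER_LOG_POS=3374796;
START SLAVE;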
So my questions are:
- What triggers an increase of the binlog file index?
- How can I bring the binlog files back in sync on the 3 cluster nodes? (My rough idea so far is sketched after this list.)
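For the second question, my rough idea is to reset the binlogs on all three nodes during a maintenance window and then repoint the slave at the fresh coordinates. An untested sketch (RESET MASTER deletes all binlog files on a node, so writes must be stopped and the slave repointed immediately afterwards):

-- on each cluster node, inside a maintenance window with writes stopped
RESET MASTER;        -- deletes all binlogs; numbering restarts at data-master-bin.000001
SHOW MASTER STATUS;  -- all three nodes should now report data-master-bin.000001

-- on the slave, repoint to the new coordinates
STOP SLAVE;
CHANGE MASTER TO MASTER_LOG_FILE='data-master-bin.000001', MASTER_LOG_POS=4;
START SLAVE;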
Configs:
db-server .cnf on every node:
[mysqld]
log_bin
server_id=1
log_slave_updates=ON
log-basename=data-master
#Configure GTID
wsrep_gtid_mode=ON
wsrep_gtid_domain_id=1
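Since wsrep_gtid_mode=ON is already set on the nodes, I also wonder whether switching the slave to GTID-based replication would sidestep the file-name mismatch entirely, as the slave would then track GTIDs instead of per-node file names. Something like (sketch):

-- on the slave: switch from file/position to GTID-based replication
STOP SLAVE;
CHANGE MASTER TO MASTER_USE_GTID = slave_pos;
START SLAVE;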
proxy .cnf on every server:
[Read-Only-Service]
type=service
router=readconnroute
servers=db-01, db-02, db-03
user=maxscale
password=******
router_options=synced
[Replication-Listener]
type=listener
service=Read-Only-Service
protocol=MariaDBClient
port=/*some-port*/
ssl=required
ssl_cert=/path/to/server-cert.pem
ssl_key=/path/to/server-key.pem
ssl_ca_cert=/path/to/ca-cert.pem
DB1
MariaDB [(none)]> SHOW MASTER STATUS;
+------------------------+----------+--------------+------------------+
| File | Position | Binlog_Do_DB | Binlog_Ignore_DB |
+------------------------+----------+--------------+------------------+
| data-master-bin.000015 | 3534812 | | |
+------------------------+----------+--------------+------------------+
1 row in set (0.000 sec)
ls -alc --fu /var/lib/mysql
...
413 2021-12-02 10:06:08.187023308 +0000 data-master-bin.000009
387 2021-12-02 10:09:23.020291420 +0000 data-master-bin.000010
8883991 2022-03-02 16:53:11.689571461 +0000 data-master-bin.000011
1260112 2022-03-10 13:10:08.730056029 +0000 data-master-bin.000012
466 2022-03-10 13:10:29.106313267 +0000 data-master-bin.000013
5967037 2022-04-27 16:41:46.021414396 +0000 data-master-bin.000014
3672324 2022-06-13 22:03:54.919102767 +0000 data-master-bin.000015
175 2022-04-27 16:42:46.150317967 +0000 data-master-bin.index
...
DB2
MariaDB [(none)]> SHOW MASTER STATUS;
+------------------------+----------+--------------+------------------+
| File | Position | Binlog_Do_DB | Binlog_Ignore_DB |
+------------------------+----------+--------------+------------------+
| data-master-bin.000016 | 3534812 | | |
+------------------------+----------+--------------+------------------+
1 row in set (0.000 sec)
ls -alc --fu /var/lib/mysql
...
413 2022-03-10 13:10:34.036661850 +0000 data-master-bin.000014
5966988 2022-04-27 16:44:37.871084681 +0000 data-master-bin.000015
3672324 2022-06-13 22:03:54.935182267 +0000 data-master-bin.000016
75 2022-04-27 16:45:21.228903025 +0000 data-master-bin.index
...
DB3
MariaDB [(none)]> SHOW MASTER STATUS;
+------------------------+----------+--------------+------------------+
| File | Position | Binlog_Do_DB | Binlog_Ignore_DB |
+------------------------+----------+--------------+------------------+
| data-master-bin.000015 | 3534812 | | |
+------------------------+----------+--------------+------------------+
1 row in set (0.000 sec)
ls -alc --fu /var/lib/mysql
...
413 2022-03-10 13:10:12.637403075 +0000 data-master-bin.000013
5966988 2022-04-27 16:46:33.269169792 +0000 data-master-bin.000014
3672324 2022-06-13 22:03:54.929455002 +0000 data-master-bin.000015
75 2022-04-28 00:37:55.597345308 +0000 data-master-bin.index
...
I'm using MariaDB 10.4.