
tl;dr: Replication is stalled at a specific binlog file and position, and I'm not sure why.


I have a MySQL Replication setup with MySQL 5.5.

This replication setup has no history of ever falling behind and has always been solid.

This morning, I noticed that the Slave was behind the Master by 17 hours.

After doing more research, it looks like the problem is with the SQL thread.

The current master log file, according to the slave (via SHOW SLAVE STATUS), is mysql-bin.001306 @ position 20520499. This is in line with the SHOW MASTER STATUS output on the master.

However, SHOW SLAVE STATUS shows that Relay_Master_Log_File is currently mysql-bin.001302, with an Exec_Master_Log_Pos of 36573336. Neither Relay_Master_Log_File nor Exec_Master_Log_Pos has advanced at all while I've been monitoring them this morning.
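
To check whether the SQL thread is actually doing work rather than sitting idle, the replication threads show up as "system user" connections in the process list on the slave, so their State and Time columns can be watched there:

mysql> SHOW PROCESSLIST;  -- on the slave: one 'system user' row per replication thread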

Looking into the binlogs on the master, this is the statement that ends at mysql-bin.001302 @ 36573336 (the slave's Exec_Master_Log_Pos):

# at 36573053
#170221 14:33:48 server id 1  end_log_pos 36573130      Query   thread_id=96205677      exec_time=0     error_code=0
SET TIMESTAMP=1487716428/*!*/;
BEGIN
/*!*/;
# at 36573130
# at 36573213
#170221 14:33:48 server id 1  end_log_pos 36573213      Table_map: `database-name`.`table-name` mapped to number 5873
#170221 14:33:48 server id 1  end_log_pos 36573309      Write_rows: table id 5873 flags: STMT_END_F
### INSERT INTO `database-name`.`table-name`
### SET
###   @1='xxxxxxxx'
###   @2=6920826
###   @3='xxxxxxxx'
###   @4='GET'
###   @5='address'
###   @6=2017-02-21 14:40:24
###   @7=2017-02-21 14:40:24
# at 36573309
#170221 14:33:48 server id 1  end_log_pos 36573336      Xid = 1668637037
COMMIT/*!*/;
# at 36573336
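
(That dump came from mysqlbinlog on the master; something along these lines reproduces it, with the start position taken from the first "# at" line above. --verbose decodes the row images and --base64-output=DECODE-ROWS hides the raw base64.)

mysqlbinlog --verbose --base64-output=DECODE-ROWS --start-position=36573053 mysql-bin.001302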

Around this time yesterday, I executed a few large queries to migrate data to a new table. The process looked a bit like this:

mysql> insert into tmp_table select <rows> from origin_table;  -- 44 million rows
mysql> insert into dest_table select * from tmp_table;         -- 44 million rows

The two tables in question do not have a primary or unique key on them, which I've read can be a problem with row-based replication. However, while the database and table shown in the binlog entry above are the destination table of this migration, the inserted record shown is not one that was generated by the migration.
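
(For completeness, this master is logging in row-based format, which the Write_rows event above already gives away; it can be confirmed with the statement below.)

mysql> show variables like 'binlog_format';  -- ROW here (MIXED also logs row images for some statements)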

If you've gotten this far, you deserve internet points.

At this point, I'm not sure what else to consider or where else to look to find the reason for the log stall. Any insight is appreciated.

Thanks.


For reference, here is the MASTER STATUS and SLAVE STATUS output as of the time of this post:

MASTER STATUS

mysql> show master status;
+------------------+----------+--------------+------------------+
| File             | Position | Binlog_Do_DB | Binlog_Ignore_DB |
+------------------+----------+--------------+------------------+
| mysql-bin.001306 | 20520499 |              |                  |
+------------------+----------+--------------+------------------+
1 row in set (0.00 sec)

SLAVE STATUS

mysql> show slave status \G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: master-host
                  Master_User: replication-user
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mysql-bin.001306
          Read_Master_Log_Pos: 20520499
               Relay_Log_File: relay-bin.002601
                Relay_Log_Pos: 36573482
        Relay_Master_Log_File: mysql-bin.001302
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 36573336
              Relay_Log_Space: 3565987462
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: 63435
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 1
1 row in set (0.00 sec)

1 Answer


I was on the right track here with the large query transactions from yesterday.

After I migrated the data, I executed a DELETE statement on the original table, to get rid of the rows that I had migrated away.
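
(The statement itself was along these lines; origin_table and the predicate here are placeholders, not the real names.)

mysql> delete from origin_table where <migrated criteria>;  -- the ~44 million migrated rows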

These tables are just full of tracking data and therefore don't have any primary or unique keys on them.

Due to how row-based replication works, the slave, instead of executing the identical DELETE statement that was executed on the master, executes a separate DELETE for each affected row, which ends up looking something like this:

DELETE FROM table WHERE colA=foo AND colB=bar AND colC=baz....etc

And since there is no index that matches that query, the single-threaded replication SQL thread executed (or was trying to execute) 40+ million DELETE statements, each one scanning a table that was about 80 million rows at the time to identify the row to remove. That takes a very long time.

In the end, I dealt with this by stopping the slave threads (STOP SLAVE), skipping a single transaction from the master (SET GLOBAL sql_slave_skip_counter = 1;), and restarting the slave threads (START SLAVE).
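
On the slave, that sequence was:

mysql> STOP SLAVE;
mysql> SET GLOBAL sql_slave_skip_counter = 1;
mysql> START SLAVE;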

This left my Master and Slave out of sync on the table in question, but I was able to leverage the nature of row-based replication to bring it back in sync by executing the following on the Master:

mysql> CREATE TABLE table_tmp LIKE table;  -- same schema as 'table'
mysql> RENAME TABLE table TO table_bak, table_tmp TO table;
mysql> INSERT INTO table SELECT * FROM table_bak;
mysql> DROP TABLE table_bak;

Since the DELETE was executed on the Master, the INSERT here only copies the records I wanted to keep (the deleted ones are gone). And since row-based replication inserts each row individually instead of replicating the INSERT INTO ... SELECT statement itself, the Slave's new table is populated with only the desired data. The subsequent DROP TABLE then removes the old, out-of-sync copy (table_bak) on the slave without having to touch each row individually.

The caveat here is that, because the Master's version of the table was still 30-40 million rows, the INSERT and the resulting replication lock up the slave for a while (duplicating the problem above). But it's a much shorter stall (it ended up being about 20 minutes), since MySQL doesn't have to scan the table to find each row to delete.
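
(As a cheap sanity check afterwards, not a full consistency check, comparing row counts on both sides confirms the table converged; the table name is the same placeholder as above.)

mysql> SELECT COUNT(*) FROM table;  -- run on both master and slave once the slave catches up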

I hope this can be of help to someone in the future. Sorry it's long-winded; I hope it was informative.
