I have a master server and several slave servers replicating a single database. I am using MySQL 5.0 on SLES 11. During fault tolerance testing I found that when the slave's network connection is broken (cable unplugged) and then restored, replication hangs. It shows no errors and the slave appears to be running, but the Read_Master_Log_Pos and Exec_Master_Log_Pos values do not match the log position on the master server.

The Slave_IO_State is "Waiting for master to send event".

The Slave_IO_Running and Slave_SQL_Running values are both "Yes".

The Master_Log_File and Relay_Master_Log_File match.

If I stop and start the slave or restart the mysql daemon, replication starts working again.

Any ideas on what I can do about this?

Ed Manet
  • I think you have mismatched data on your slave , stop mysql slave on any other slave take dump capture slave position and restore dump on the buggy slave and start replication again. – Abhishek Anand Amralkar Oct 16 '13 at 15:46
  • I don't think it's mismatched data. The data on the slave was a recent dump from the master and everything worked fine before the network cable was un-plugged. – Ed Manet Oct 16 '13 at 16:41

1 Answer


When a MySQL Slave connects to the master, it requests a stream of the binary log, and the master autonomously sends binlog events as often as they occur, with no acknowledgement required from the slave unless you're using semisynchronous replication.

The slave doesn't originate any traffic, other than low-level acknowledgements handled by the TCP stack. An interruption in connectivity (at various layers of the stack, not limited to an unplugged cable) can break the connection in several ways: the master's TCP stack can tear down the connection due to timeouts or an ICMP unreachable message, or a stateful firewall between the machines can "forget" about the TCP session and silently drop the subsequent packets. In either case the slave is left quietly sitting and waiting for the next packet to come along from the master.
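As a rough illustration of the symptom this produces, here is a minimal Python sketch that flags the "looks healthy but is stuck" state from the question. It assumes you have already fetched one row of SHOW SLAVE STATUS and one row of SHOW MASTER STATUS as dictionaries; the function name `looks_silently_stalled` is made up for this example and is not part of MySQL or any client library.

```python
def looks_silently_stalled(slave_status, master_status):
    """Return True if the slave reports healthy threads but has not
    read up to the master's current binlog coordinates.

    slave_status:  dict built from one row of SHOW SLAVE STATUS
    master_status: dict built from one row of SHOW MASTER STATUS
    (How the rows are fetched is left out; that part is an assumption.)
    """
    threads_ok = (
        slave_status["Slave_IO_Running"] == "Yes"
        and slave_status["Slave_SQL_Running"] == "Yes"
    )
    waiting = slave_status["Slave_IO_State"] == "Waiting for master to send event"
    in_sync = (
        slave_status["Master_Log_File"] == master_status["File"]
        and int(slave_status["Read_Master_Log_Pos"]) == int(master_status["Position"])
    )
    # Healthy-looking threads, idle I/O state, yet coordinates differ.
    return threads_ok and waiting and not in_sync
```

A slave that is merely a moment behind will also trip this check, so it only means something if it stays true across several samples taken some seconds apart.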

The solution here is the global variable slave_net_timeout.

The number of seconds to wait for more data from the master before the slave considers the connection broken, aborts the read, and tries to reconnect.

This is configured on the slave. When the slave connects to the master, before requesting the binlog stream, it asks the master to send heartbeat events. These are formatted like binlog events and streamed as if they were the next event in the master's binlog, but they don't actually increment the binlog position counters. They are essentially zero overhead in normal operation, because they aren't sent unless the master has generated no new binlog events for a set period: by default one-half of the slave's slave_net_timeout setting, or another value you can configure with CHANGE MASTER TO. Heartbeat events are therefore only generated when traffic is very light, so as far as I can tell there isn't any real harm in setting this value as low as just a few seconds.

If the slave sees the timeout expire, it will close its connection and reconnect to the master.
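To make the timing concrete, here is a small Python simulation of the behaviour described above. It is only a toy model with one-second ticks, not how the server is implemented: the function name `simulate_slave` and its parameters are invented for this sketch, and the heartbeat period is assumed to be the default of half of slave_net_timeout.

```python
import math


def simulate_slave(event_times, link_down_at, slave_net_timeout, horizon):
    """Return the second at which the slave gives up and reconnects,
    or None if it never times out within `horizon` seconds.

    event_times:       seconds at which the master writes real binlog events
    link_down_at:      second from which packets stop reaching the slave
                       (use math.inf for "the link never breaks")
    slave_net_timeout: the slave's timeout, in seconds
    """
    heartbeat_period = slave_net_timeout / 2  # assumed default period
    events = set(event_times)
    last_sent = 0       # last time the master sent anything
    last_received = 0   # last time the slave received anything
    for now in range(1, horizon + 1):
        # The master sends a real event, or a heartbeat once it has been idle
        # for a full heartbeat period.
        if now in events or now - last_sent >= heartbeat_period:
            last_sent = now
            if now < link_down_at:
                last_received = now
        # The slave treats silence lasting slave_net_timeout as a dead link.
        if now - last_received >= slave_net_timeout:
            return now
    return None
```

With a 10 second timeout and an otherwise idle master, `simulate_slave([], math.inf, 10, 100)` never times out because a heartbeat arrives every 5 seconds, while `simulate_slave([], 23, 10, 100)` returns 30: the last heartbeat to get through was at second 20, and the slave reconnects 10 seconds later.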

On the remote chance that the master doesn't realize that the slave has gone away, when the slave reconnects, the master will close the original connection, because a MySQL master, when accepting a new slave connection, checks whether another slave with the same server_id is already connected, and if so, drops the original connection. This is, incidentally, the reason why two slaves configured with the same server_id (an unsupported configuration) can't successfully stay connected to the same master -- as soon as one of them connects, it causes the other one to be bumped, and a cycle ensues with each slave forcing the other one's connection to be dropped.
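The displacement rule can be pictured with a few lines of Python. This is a toy model, not MySQL's actual code: the `Master` class, its `register_slave` method, and the connection labels are all hypothetical names used only for this sketch.

```python
class Master:
    """Toy model of a master that keeps one connection per server_id."""

    def __init__(self):
        # Maps server_id -> label of the connection currently streaming to it.
        self.connections = {}

    def register_slave(self, server_id, connection):
        """Accept a new slave connection.

        Returns the label of the older connection that was dropped because
        it used the same server_id, or None if there was no such connection.
        """
        displaced = self.connections.get(server_id)
        self.connections[server_id] = connection
        return displaced


master = Master()
master.register_slave(2, "slave-2 (stale)")        # original connection
dropped = master.register_slave(2, "slave-2 (new)")  # reconnect after timeout
# `dropped` now holds the stale connection's label; only the new one remains.
```

The same rule explains the duplicate server_id cycle: each reconnecting slave displaces the other's entry, gets displaced in turn, and the two keep bumping each other.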

Setting this variable to a suitably-low value in my.cnf and restarting the slave should remedy this problem.

Michael - sqlbot
  • Thanks for this. It looks like the default value is an hour. I set it down to 5 minutes and that seemed to help. – Ed Manet Nov 13 '13 at 18:28