Heavy write to Galera cluster - table locked, cluster practically unusable

Question

I set up Galera Cluster on 3 nodes. It works perfectly for reading data. I have done simple application to make some test on the cluster. Unfortunately I have to say that the Cluster fails totally when I try to do some writing. Maybe it can be configured differently or I do sth wrong?

I have a simple stored procedure:

CREATE PROCEDURE testproc(IN p_idWorker INTEGER)
BEGIN
  DECLARE t_id INT DEFAULT -1;
  DECLARE t_counter INT ; 
  UPDATE test SET idWorker = p_idWorker WHERE counter = 0 AND idWorker IS NULL limit 1;
  SELECT id FROM test WHERE idWorker = p_idWorker LIMIT 1 INTO t_id;
  SELECT ABS(MAX(counter)/MIN(counter)) FROM TEST INTO t_counter;
  SELECT COUNT(*) FROM test WHERE counter = 0 INTO t_counter;
  IF t_id >= 0 THEN
    UPDATE test SET counter = counter + 1 WHERE id = t_id;
    UPDATE test SET idWorker = NULL WHERE id = t_id;
    SELECT t_counter AS res;
  ELSE
  SELECT 'end' AS res;
  END IF;
END $$

Now my simple C# application creates for example 3 MySQL clients in separate threads and each one executes the procedure every 100ms until there is no record where column 'counter' = 0.

Unfortunately - after about 10 seconds sth is going bad. On servers there is process 'query_end' that never ends. After that - you cannot make update on the test table, MySQL returns:

ERROR 1205 (HY000): Lock wait timeout exceeded; try restarting transaction

. You cant even restart mysql. What you can do is to restart server, sometimes whole cluster. Is Galera Cluster so unreliable when you do massive concucurrent writing/updates? Hard to believe.

bernhardrusch · Answer 1 · 2014-07-09T14:37:15.553

We had pretty much the same issue - when we were updating it failed with the lock timeout. Our current strategy is to use 1 server for the background processes and the massive writes and use the other 2 servers for the live web servers.

This is pretty easy if you're using HAProxy - but we had to make some changes to our code too.

So far it seems to work much better but we'll definitely see if it is working nice in some weeks (when we don't run into the same problems we had).

Some notes from our experience:

Some weeks later I can say that the changes really made a big difference. I think the most important change was to analyze the background processes and schedule them, so they don't overlap (or don't overlap as much).

Changing the server, so just one server is mostly used for writing and the other two servers are used for reading improved our user experiene during the heavy background processes.

The third step we did was to improve our background processes. During one transaction the programmer dropped the table and rebuilt it from the ground up. We changed this to first check if a change is necessary and then update the row. This improved the performance of this process extremely.

Our experience is that reading is very fast in a Galera cluster, but writing can be pretty slow, especially if you are doing tons of write operations. We had some hiccups in the beginning and had to reimport the whole table - this was a big killer for the database. Another thing which crashed our server twice was that the binlog filled up the disks of the server, which crashed the server. Also make sure to change all the databases to Innodb tables, otherwise data loss is possible. One of our programmers set all the log tables to MyISAM - let's say we've lost some of our logs in this process.

But after all I can say that Galera is working now pretty well. It's especially nice if you have to update the DB server or doing some other maintainance, as it isn't a big problem if you're shutting one node down to do some maintenance.

score 1 · Answer 2 · answered Feb 13 '13 at 16:08

I know it is late, but I will leave a little bit of my experience on Galera cluster here. Our application is doing about 160-200 inserts/second and we do as much reads. At night it is much less but during the day this is our average and it can peak to an hire rate. We are probably not in the same context of your application but what helped us at the time we made the transition from standard MySQL to Galera was to set our application to auto commit each transaction and that removed instantly the behavior you are describing in this question.

In python using PySQLPool library we had to add a line like this to our query wrapper:

PySQLPool.getNewQuery(self.connection, commitOnEnd=True)

Then another issue we faced was that it wasn't writing fast enough to the tables. One way we found out to make it fast enough was to use the innodb_flush_log_at_trx_commit option. Since we can afford a 1 second lost of transaction we setted the setting in our my.cnf like this:

innodb_flush_log_at_trx_commit  = 0

With these 2 simple settings we are now in production for more than a year using Galera and the 3 servers are acting well with our application.

Best.

Thanks for your answer - better late then never ;) I have the auto commit turned on but it still wasn't working perfectly. I will test the trx commit = 0 though. BTW - I recently downloaded new version of Galera and my tests shows that the `SELECT ... FOR UPDATE` actually works when you disable auto-commit for the session right before it and return it on after doing the commit. It eliminates the need to use those ugly updates. — Joe, Feb 15 '13 at 13:02
We almost never do update, but a recent chat on the codership google group I had with the codership group where they mention that we should try the percona galera cluster because they have integrated multiple bug fixes. see the 10th comment from Alexey Yurchenko https://groups.google.com/forum/#!topic/codership-team/maqqM26Vyl4 that could also fix some issues that you have or not, but it maybe be worth the try. — drivard, Feb 16 '13 at 15:08

Heavy write to Galera cluster - table locked, cluster practically unusable

2 Answers2