
I have a three node Percona XtraDB Cluster (5.5) setup.

Every night, we shut down MySQL on one randomly selected node in order to take backups of the data directory.

When our traffic is reasonably busy, this causes a couple (2-4) error alerts along the lines of `SQLSTATE[23000]: Integrity constraint violation: 1062 Duplicate entry '140577' for key 'PRIMARY'`. Primary key conflict, obviously, except we're using auto_increment columns as the primary key on these tables. As each node has an offset assigned by the cluster engine, this shouldn't be occurring.

My suspicion is that taking the node out of the cluster causes the other two nodes to change their auto increment offsets, in a way that's causing them to potentially conflict while the change takes place. I'm at a loss as to why this wouldn't be an atomic action as well as how I might fix it.

Has anyone encountered this? Is there a way to temporarily freeze the auto increment settings in the cluster so they don't shuffle around during the backup process or some other solution I'm not thinking of?
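
For reference, this is what I assume is involved; a rough, untested sketch of what pinning the values manually on each node might look like (the increment/offset values below are just illustrative for our three nodes):

```sql
-- Inspect the offsets the cluster engine manages. With auto_increment_increment = 3,
-- node 1 should generate 1, 4, 7..., node 2 should generate 2, 5, 8..., and node 3
-- should generate 3, 6, 9..., so collisions shouldn't normally be possible.
SHOW VARIABLES LIKE 'auto_increment%';
SHOW VARIABLES LIKE 'wsrep_auto_increment_control';

-- Presumably, "freezing" would mean disabling the automatic control and setting
-- the values by hand on every node (offsets 1, 2 and 3 respectively).
SET GLOBAL wsrep_auto_increment_control = OFF;
SET GLOBAL auto_increment_increment = 3;
SET GLOBAL auto_increment_offset = 1;  -- node 1; use 2 and 3 on the other nodes
```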

ceejayoz
  • Is there some specific reason you want to take backups of the data directory files directly? I would either use `mysqldump` with the single transaction flag to take the backup, or use `xtrabackup` for the task instead of your method. – Tero Kilkanen Jul 30 '14 at 20:02
  • @TeroKilkanen Backing up the data directory allows us to provision a new server much more rapidly using Ansible - we load the server with the datadir and then it just needs to do an IST. It's also easier for us to do a streaming backup that doesn't require huge amounts of spare space via Duplicity - with xtrabackup/mysqldump we'd need enough space on the drive for a whole separate copy of the data. – ceejayoz Jul 30 '14 at 20:23
  • We're also having the same problem but with MariaDB with Galera. We're not taking the nodes out for backups in this way, but this same auto increment issues happens when we take a node out for updates/to restart. Did you find a solution to the auto increment problem? – Luke Cousins Apr 08 '15 at 21:04

2 Answers


Off the top of my head, I'd say add a slave to the mix and take backups on the slave, without removing a node from the cluster.

What happens is that when you put the node back in, it doesn't know how far the primary key values advanced on the active nodes before it receives a new write, because it isn't held back until it catches up with the other two replicas. So it writes a new record and gives it a primary key ID that already exists on the two replicas that never left. In essence, the node should not receive writes until it's in sync.
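
As a minimal sketch (assuming the standard Galera status variables), you could verify the node reports itself fully synced before routing any writes back to it:

```sql
-- Run on the rejoining node: it should not receive application writes until
-- it has finished its IST/SST and reports itself synced and ready.
SHOW STATUS LIKE 'wsrep_local_state_comment';  -- expect 'Synced'
SHOW STATUS LIKE 'wsrep_ready';                -- expect 'ON'
```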

Hrvoje Špoljar
  • Slaves aren't a great option for us - they add a lot of management overhead. The Percona cluster is mostly self-managing - we can add/remove nodes (we're on AWS) at will without breaking replication at all. Percona syncs data and adjusts the auto_increment_offset *before* allowing writes to the rejoining server, so there *shouldn't* be primary ID conflicts in the manner you describe. – ceejayoz Aug 15 '14 at 15:38
  • I don't see how slave is management overhead; you add slave which does nothing just replicates data and at any point in time you hit it as hard as you want with xtrabackup and backup your stuff without worrying about performance loss in production. – Hrvoje Špoljar Aug 15 '14 at 17:19
  • In my experience, replication breaks a lot more frequently than Percona cluster's internal Galera setup. In addition, the cloud nature of our system means we frequently replace the underlying servers, which would now entail having the slave switch over to a different server for replication. Pain in the ass compared to our current system, where Percona handles pretty much everything for us. – ceejayoz Aug 15 '14 at 18:00
  • either way; rather than pulling out node completely there should be way to leave node in cluster (actively replicating data) and stopping reads from hitting it while doing backup. this way when you re-allow reads from hitting node where backup was made it won't be missing any keys. – Hrvoje Špoljar Aug 15 '14 at 23:12
  • Stopping reads doesn't do the trick - Percona Cluster is synchronous, so if it gets bogged down in the backup, the entire cluster slows. – ceejayoz Aug 15 '14 at 23:36

I know it's an old post and you may already have resolved it, but for anyone who sees this behaviour with PXC (Percona XtraDB Cluster): it looks like you hit this bug: https://bugs.launchpad.net/percona-xtradb-cluster/+bug/1366997

Update your PXC to the latest release of the series you're currently on. For instance, if you are on 5.5.19, update to the latest 5.5 release; if you're on 5.6.x, upgrade to the most recent 5.6 update.
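
As a quick check (a minimal sketch using standard MySQL/Galera variables), confirm what each node is actually running before and after the upgrade:

```sql
-- Run on every node; both the server version and the Galera provider version
-- should match across the cluster once the upgrade is done.
SHOW VARIABLES LIKE 'version';
SHOW STATUS LIKE 'wsrep_provider_version';
```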

We had this problem in production.

Haohmaru