4

Apparently I started a SELECT query 20 days ago that never finished. It kept running even though the client had disconnected, and ran for so long that one of the undo logs grew to 230 gb in size. (The database in question is 320 gb of data). I've force killed the query, but the undo log just won't shrink back to normal size, causing the disk to be full.

For the record, this is the sizes on disk:

root@the-db:/var/lib/mysql# du -h undo_00* erik_*
11G undo_001
244G    undo_002
1.5G    erik_temporary_undo_004.ibu
22G erik_undo_003.ibu

One interesting thing is that all undo logs are pretty massive?

I've tried to make sense the documentation, but can't get it to work: https://dev.mysql.com/doc/refman/8.0/en/innodb-undo-tablespaces.html

Version running:

root@the-db:/var/lib/mysql# mysql --version
mysql  Ver 8.0.18 for Linux on x86_64 (MySQL Community Server - GPL)

Automatic truncation

If I understand correctly, to have the purge thread truncate an undo log automatically, three condition needs to be met:

Step 1. The undo log needs to be larger than @@innodb_max_undo_log_size, which is:

mysql> SELECT @@innodb_max_undo_log_size;
+----------------------------+
| @@innodb_max_undo_log_size |
+----------------------------+
|                 1073741824 |
+----------------------------+

Step 2. @@innodb_undo_log_truncate needs to be ON

mysql> select @@innodb_undo_log_truncate;
+----------------------------+
| @@innodb_undo_log_truncate |
+----------------------------+
|                          1 |
+----------------------------+

Step 3. There needs to be a total of 2 active undo logs Automated truncation of undo tablespaces requires a minimum of two active undo tablespaces (not sure if these means 2 active before one is taken out of rotation for truncating).

mysql> SELECT NAME, SPACE_TYPE, STATE FROM INFORMATION_SCHEMA.INNODB_TABLESPACES WHERE SPACE_TYPE = 'Undo' ORDER BY NAME;
+-------------------------+------------+----------+
| NAME                    | SPACE_TYPE | STATE    |
+-------------------------+------------+----------+
| erik_temporary_undo_003 | Undo       | active   |
| erik_temporary_undo_004 | Undo       | active   |
| innodb_undo_001         | Undo       | active   |
| innodb_undo_002         | Undo       | inactive |
+-------------------------+------------+----------+

According to above, we have 3 active and 4 total undo logs.

However, nothing seems to be happening.

Manual truncation

There is also a manual process for truncating a log, which has another set of prerequisites.

Step 1 and 2 are the same for automatic truncation.

Step 3 is different in that Manual truncation of undo tablespaces requires a minimum of three active undo tablespaces (still not sure if this means 3 active before one is taken out of rotation). To get around my uncertainty, I created 2 extra undo table spaces.

CREATE UNDO TABLESPACE erik_temporary_undo_003 ADD DATAFILE 'erik_undo_003.ibu';
CREATE UNDO TABLESPACE erik_temporary_undo_004 ADD DATAFILE 'erik_temporary_undo_004.ibu';

Step 4 is to deactivate the table space needing a truncation:

ALTER UNDO TABLESPACE innodb_undo_002 SET INACTIVE;

If I understand correctly this will trigger truncation and after it is completed the STATUS will be empty. Again, here is the current status:

mysql> SELECT NAME, SPACE_TYPE, STATE FROM INFORMATION_SCHEMA.INNODB_TABLESPACES WHERE SPACE_TYPE = 'Undo' ORDER BY NAME;
+-------------------------+------------+----------+
| NAME                    | SPACE_TYPE | STATE    |
+-------------------------+------------+----------+
| erik_temporary_undo_003 | Undo       | active   |
| erik_temporary_undo_004 | Undo       | active   |
| innodb_undo_001         | Undo       | active   |
| innodb_undo_002         | Undo       | inactive |
+-------------------------+------------+----------+

It was set to inactive about 3 days ago, and it hasn't changed since.

One weird thing is that according to the following query, there are only 2 active undo logs, when the previous query shows 3. Am I missing something?

mysql> SHOW STATUS LIKE 'Innodb_undo_tablespaces%';
+----------------------------------+-------+
| Variable_name                    | Value |
+----------------------------------+-------+
| Innodb_undo_tablespaces_total    | 4     |
| Innodb_undo_tablespaces_implicit | 2     |
| Innodb_undo_tablespaces_explicit | 2     |
| Innodb_undo_tablespaces_active   | 2     |
+----------------------------------+-------+

Shouldn't Innodb_undo_tablespaces_active be 3 at least?

Frequency

The documentation mentions innodb_purge_rseg_truncate_frequency as a way to get the purge thread to run more often. The default is 128 and the example shows setting it to 32. What this actually means in terms of time is very unclear. It just mentions "every 32 runs".

To increase that frequency, decrease the innodb_purge_rseg_truncate_frequency setting. For example, to have the purge thread look for undo tabespaces once every 32 timees[sic] that purge is invoked, set innodb_purge_rseg_truncate_frequency to 32.

For good measure, I've set it to 1.

mysql> show variables like "%truncate%";
+--------------------------------------+-------+
| Variable_name                        | Value |
+--------------------------------------+-------+
| innodb_purge_rseg_truncate_frequency | 1     |
| innodb_undo_log_truncate             | ON    |
+--------------------------------------+-------+

Metrics

I just recently understood how to get the InnoDB metrics:

mysql> set global innodb_monitor_enable = all;

And will update with metrics when they become available.

Relevant configs

mysql> show variables like "%undo%";
+--------------------------+------------+
| Variable_name            | Value      |
+--------------------------+------------+
| innodb_max_undo_log_size | 1073741824 |
| innodb_undo_directory    | ./         |
| innodb_undo_log_encrypt  | OFF        |
| innodb_undo_log_truncate | ON         |
| innodb_undo_tablespaces  | 2          |
+--------------------------+------------+

mysql> show variables like "%truncate%";
+--------------------------------------+-------+
| Variable_name                        | Value |
+--------------------------------------+-------+
| innodb_purge_rseg_truncate_frequency | 1     |
| innodb_undo_log_truncate             | ON    |
+--------------------------------------+-------+

What am I missing to get MySQL to truncate the undo log size?

I just assume that truncating means that the actual file on disk will shrink. Maybe it means something else?

ErikPerik
  • 95
  • 1
  • 8
  • I'm experiencing the same problem. Did you manage to figure it out? – vvucetic Nov 18 '20 at 14:24
  • @vvucetic which version of MySQL are you on? I think it might be fixed in a recent version of 8.x (recall reading something about this in the release notes) BUT what I had to end up doing was rebuild the entire DB. It's a pretty simple process and can be done with little downtime using the CLONE plugin https://mysqlserverteam.com/clone-create-mysql-instance-replica/ and switching the read replica to become primary. – ErikPerik Nov 18 '20 at 14:55
  • We recently upgraded to 8.0.21 which is the latest available in RDS. We had some kind of leak with MySQL 5.7, we upgraded to one of the replicas by promoting to master (replica was not affected by the leak) and now we have the leak just like you, "innodb_undo_002" is growing. I'm almost positive it's due to how we use it, maybe big transactions, but it's challenging to pinpoint and change now. I'm also afraid if we make another promotion, the problem would remain. So, if I understand correctly, you just rebuilt the whole db and it stopped growing? Or you couldn't reclaim storage back? – vvucetic Nov 18 '20 at 16:02
  • @vvucetic once i rebuilt the db it stopped happening. I don't want to say too much but I do recall this happening around the time of upgrading one server to a new version of MySQL. So starting fresh and making sure all servers run the same version of MySQL could well fix it. Feel free to e-mail me at erik at feeder dot co if you want me to check more details of our setup (though I can't really remember too many details about this event) – ErikPerik Nov 18 '20 at 16:57
  • Thank you @erikperik. I guess we should try that. – vvucetic Nov 18 '20 at 19:33
  • We solved it so I offered our solution as an answer. It kind of explains why recreating the server helped for you. – vvucetic Nov 19 '20 at 12:52

1 Answers1

2

After spending a few days fighting with an ever-growing undo tablespace, we finally figured it out so I'm gonna share the results:

SELECT NAME, SPACE_TYPE, STATE FROM INFORMATION_SCHEMA.INNODB_TABLESPACES WHERE SPACE_TYPE = 'Undo' ORDER BY NAME;

We had 2/2 active (default) undo tablespaces. 1 was around 1GB and other one was 90GB and growing.

Per docs:

Rollback segments residing in the selected undo tablespace are made inactive so that they are not assigned to new transactions. Existing transactions that are currently using rollback segments are permitted to finish.

That part was the key. To my understanding, all transactions need to finish before cleanup can take place.

We looked for all running transactions:

SELECT trx.trx_id,
       trx.trx_started,
       trx.trx_mysql_thread_id
FROM INFORMATION_SCHEMA.INNODB_TRX trx
JOIN INFORMATION_SCHEMA.PROCESSLIST ps ON trx.trx_mysql_thread_id = ps.id
WHERE trx.trx_started < CURRENT_TIMESTAMP - INTERVAL 1 SECOND
  AND ps.user != 'system_user';

In the list of transaction, there was one that was 2 days old.

If you have performance schema enabled, you can get the process and query that holds it:

SELECT *
FROM performance_schema.threads
WHERE processlist_id = thread_id;

We killed that process and everything recovered within 20 minutes. All storage was reclaimed back.

vvucetic
  • 146
  • 4
  • Nice find! Now that you mention it I recall we also had a very long running transaction. Maybe the rollback was just very slow (rereading the question it's actually in the question :) so it was easier to rebuild – ErikPerik Nov 19 '20 at 16:11