-1

I have a Debian 9 server (12 CPUs, 80GB of RAM), running Percona server for MySQL 8.0, about 1100 clients per second, the load is not very high around 0.3 to 2.30

Latested numbers from show engine innodb status: 183.82 inserts/s, 169.69 updates/s, 5.79 deletes/s, 2179444.29 reads/s

It runs really nice most of the time, but lately time to time it suddenly refuses new connections (when a very busy table gets locked for a long time or some query takes too long).

The logs shows:

[ERROR] [MY-000000] [connection_h] Error log throttle: 1912 'Can't create thread to handle new connection' error(s) suppressed

[ERROR] [MY-010249] [Server] Can't create thread to handle new connection(errno= 11)

[ERROR] [MY-010053] [Server] Event_scheduler::execute_top: Can not create event worker thread (errno=11). Stopping event scheduler


So, basically I have two questions:

1. How to tweak the server to accept more connections (I tried a lot of things which I will list below), it cant handle more than 4821 to be exact

2. Whenever that happens, the event scheduler stops and I have to manually enable it again, its really bad because I have some MEMORY tables that gets full, they handle the heavy traffic and dumps to other tables every 5 seconds


Until now I've tried setting the soft and hard limits in /etc/security/limits.conf for *, mysql and root users:

*         hard    nofile      400000
*         soft    nofile      400000
*         hard    nproc       400000
*         soft    nproc       400000
mysql     hard    nofile      400000
mysql     soft    nofile      400000
root      hard    nofile      400000
root      soft    nofile      400000

I also have made many adjusments to /etc/sysctl.conf listing some below:

kernel.pid_max = 262144
vm.max_map_count = 262144
net.ipv4.tcp_keepalive_time = 1200
net.ipv4.ip_local_port_range = 2000 65500
net.ipv4.tcp_max_syn_backlog = 32768
fs.file-max = 450000
net.core.netdev_max_backlog = 450000
net.core.somaxconn = 32768

I also made changes to systemd conf files: /lib/systemd/system/mysql.service - setting:

LimitNOFILE=220000
TasksMax=32768

mysqld.conf

bind-address                    = 0.0.0.0

# GENERAL #
user                           = mysql
default-storage-engine         = InnoDB
socket                         = /var/run/mysqld/mysqld.sock
pid-file                       = /var/lib/mysql/mysql.pid

# SAFETY #
max-allowed-packet             = 16M
max-connect-errors             = 1000000
skip-name-resolve
sysdate-is-now                 = 1
innodb                         = FORCE

wait-timeout                   = 600

# DATA STORAGE #
datadir                        = /var/lib/mysql/

# BINARY LOGGING #
sync-binlog                    = 0

# CACHES AND LIMITS #
tmp-table-size                 = 32M
max-heap-table-size            = 32M
max-connections                = 20000
thread-cache-size              = 300
open-files-limit               = 65535
table-definition-cache         = 4096
table-open-cache               = 4096

# INNODB #
innodb-flush-method            = O_DIRECT
innodb-log-files-in-group      = 2
innodb-log-file-size           = 512M
innodb-flush-log-at-trx-commit = 0
innodb-file-per-table          = 1
innodb-buffer-pool-size        = 64G
innodb-fast-shutdown           = 0
innodb-buffer-pool-dump-pct    = 75
innodb-buffer-pool-dump-at-shutdown = 1
innodb-buffer-pool-load-at-startup  = 1
innodb-io-capacity             = 400
innodb-io-capacity-max         = 2000

# LOGGING #
log-error                      = /var/log/mysql/mysql-error.log
log-queries-not-using-indexes  = 0
slow-query-log                 = 1
slow-query-log-file            = /var/log/mysql/mysql-slow.log
long-query-time                = 5

event_scheduler = 1
general_log_file               = /var/log/mysql/general.log
general_log                    = 0
local-infile                   = 1

I don't know what else I can do or where else I can look, tried to search for the problems and couldn't find anything that works, for the event scheduler its even worse, there's almost nothing to be found about that (try to search in google MY-010053 or Can not create event worker thread almost no results)

The server does not get unresponsive or slow, if I lock the tables on purpose, I start getting the error in about 10 seconds, and when I unlock it recover immediatly

As Wilson Hauck requested:

ulimit -a

core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 326193
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 400000
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 326193
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

iostat -xm 5 3

Linux 4.9.0-8-amd64 (zelda)     03/07/2019      _x86_64_        (12 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          11.24    0.00    1.56    0.76    0.00   86.44

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.41    44.17    2.41   67.61     0.05     1.61    48.55     0.37    5.23    4.74    5.25   1.37   9.59

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           9.77    0.00    1.91    0.82    0.00   87.51

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00    52.20    0.00   81.60     0.00     1.85    46.51     0.45    5.51    0.00    5.51   1.14   9.28

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           5.83    0.00    1.41    0.77    0.00   91.99

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00   132.00    0.00  161.40     0.00     8.12   103.00     0.88    5.47    0.00    5.47   0.66  10.72

show full processlist there's nothing special:

4087783 event_scheduler localhost       Daemon  1   Waiting for next activation     0   0
275050068   xgh 179.191.66.174:51143    xgh Sleep   3           0   0
275050130   xgh 179.191.66.174:51144    xgh Query   0   starting    show full processlist   0   0
300324788   xgh 179.191.66.174:40708    xgh Sleep   12595           5   61
304505269   xgh 179.191.66.174:51680    xgh Sleep   72          0   0
304505986   xgh 179.191.66.174:51706    xgh Sleep   72          0   0
305818676   xgh 172.30.5.2:57288    xgh Query   0   Sending data    SELECT *
 FROM (`noticias`.`noticia`)
 WHERE `texto` =  'Carlos Ghosn deixa prisão após mais de 100 dias detido em Tóquio'    0   0
305818680   xgh 172.30.5.2:57296    xgh Sleep   57          0   0
305818682   xgh 172.30.5.2:57302    xgh Sleep   57          0   0
305818689   xgh 172.30.5.2:57316    xgh Sleep   57          0   0
305818692   xgh 172.30.5.2:57324    xgh Sleep   57          0   0
305842475   xgh 172.30.5.2:49326    xgh Sleep   1           94  550698
305842479   xgh 172.30.5.2:49334    xgh Sleep   1           0   0
305842481   xgh 172.30.5.2:49340    xgh Sleep   1           0   0
305842486   xgh 172.30.5.2:49350    xgh Sleep   1           0   0
305842489   xgh 172.30.5.2:49358    xgh Sleep   1           0   0
305842492   xgh 172.30.5.2:49364    xgh Sleep   1           0   0
305842496   xgh 172.30.5.2:49372    xgh Sleep   1           0   0
305842498   xgh 172.30.5.2:49376    xgh Sleep   1           0   0
305842501   xgh 172.30.5.2:49382    xgh Sleep   1           0   0
305842505   xgh 172.30.5.2:49392    xgh Sleep   1           0   0
305842508   xgh 172.30.5.2:49398    xgh Sleep   1           0   0

SHOW GLOBAL STATUS: https://pastebin.com/cg1v5bHD

SHOW GLOBAL VARIABLES: https://pastebin.com/8mkTzpr0

  • 1
    Additional information request. A) complete (not edited) my.cnf or my.ini Text results of: B) SHOW GLOBAL STATUS; after minimum 24 hours UPTIME C) SHOW GLOBAL VARIABLES; D) SHOW FULL PROCESSLIST; E) complete MySQLTuner report AND Optional very helpful information, if available includes - htop OR top OR mytop for most active apps, ulimit -a for a linux/unix list of limits, iostat -xm 5 3 for IOPS by device and core/cpu count, df -h for a linux/unix free space list by device, for server tuning analysis. – Wilson Hauck Mar 04 '19 at 14:40
  • This would be better off in dba.stackexchange.com, where more people look at this type of problem. – Rick James Mar 07 '19 at 13:57
  • 1
    Is that actualll 2179444.29 `SELECTs` per second? Or is it rows or blocks? – Rick James Mar 07 '19 at 13:58
  • 1
    What did a line like this say? "Per second averages calculated from the last 26 seconds" – Rick James Mar 07 '19 at 14:02
  • 1
    @RickJames it must be 2179444.29 rows/s – Luciano Serra Mar 07 '19 at 17:16
  • 1
    @WilsonHauck I'll update the post with the complete info, thank you! – Luciano Serra Mar 07 '19 at 17:17
  • @LucianoSerra Since you are CLOSE to the 30,000 character limit of posting, please post to pastebin.com and share the links. Text results of: B) SHOW GLOBAL STATUS; after minimum 24 hours UPTIME C) SHOW GLOBAL VARIABLES; which are critical information we need to analyze what your server has been doing. Thanks for the other posted information that is helpful to understand your use of MySQL. – Wilson Hauck Mar 08 '19 at 13:26
  • Here it is: SHOW GLOBAL STATUS: https://pastebin.com/cg1v5bHD - SHOW GLOBAL VARIABLES: https://pastebin.com/8mkTzpr0 – Luciano Serra Mar 12 '19 at 18:00
  • Checking the SHOW GLOBAL VARIABLES I noticed that back_log was 5000 (the last max_connections value before restarting the service), it was a read only variable, so I had to restart the service again, now its 20000 and I could reach 7200 connections with no erros, I can't do more tests right now, but I will try later – Luciano Serra Mar 12 '19 at 18:26

2 Answers2

1

Please provide SHOW CREATE TABLE noticia. If texto is of type TEXT, then the one running query in the PROCESSLIST requires a table scan. If it is VARCHAR, then an index would probably help.

You claim there are lots of connections, but I see only one doing anything. So SHOW FULL PROCESSLIST again, hoping to get several queries.

Rick James
  • 2,058
  • 5
  • 11
  • 1
    You are right, its because most queries run really fast, so it doesn't appear on processlist, that one (noticia) executed in less than a second too, I guess it was executed at the same time. The real problem happens only eventually when something takes longer to execute - I guess it is when the table "maquina" is locked for several seconds then lots of connections gets queued. – Luciano Serra Mar 07 '19 at 21:15
  • 1
    If I manually lock the that table (maquina), in about 5 seconds I can get almost 5000 connections... but its ok, I just wanted to be able to handle a little bit more, I know the server is capable of handling much more... but it gets stuck when it reaches 4830 or so – Luciano Serra Mar 07 '19 at 21:15
  • 1
    @LucianoSerra - OK. We need to dig into (1) Why are there 1K connections/second, and whether that can be decreased. and (2) Is the query against `maquina` poorly optimized or indexes, and we can 'fix' that. – Rick James Mar 07 '19 at 21:55
  • Rick, as you requested, here is the create table info: https://pastebin.com/6WSHuXsa – Luciano Serra Mar 15 '19 at 13:02
  • Rick, I guess I found out the problem, it does not solve the question about being able to handle more connections but I dont really need it right now if the server runs smoothly. We have a backup (mysqldump) routine that runs at specific times, I guess that script was locking the tables for a litte longer than it should, I added the option --single-transaction and --quick I'm not sure if it was root problem, but we did not had those peaks in the last two days, I will keep watching, thanks for the help anyway – Luciano Serra Mar 15 '19 at 13:19
  • @LucianoSerra - The dump will take more time as the data grows. One solution is to have a Slave and do the dump from the slave. – Rick James Mar 15 '19 at 15:52
0

Rate Per Second=RPS - Suggestions to consider for your my.cnf [mysqld] section

innodb_lru_scan_depth=100  # from 1024 to conserve 90% of CPU cycles used for function
innodb_flushing_avg_loops=5  # from 30 to reduce delay and reduce innodb_buffer_pool_pages_dirty of 90,000 +
innodb_io_capacity=1900  # from 400 to allow higher IOPS to data devices
read_rnd_buffer_size=128K  # from 256K to reduce handler_read_rnd_next RPS of ~ 600,000 RPS

Monitor innodb_buffer_pool_pages_dirty with SHOW GLOBAL STATUS LIKE '%dirty%';

Time consumers that need to be reviewed, A) com_rollback_to_savepoint AVG 1 every 62 seconds, B) handler_rollback AVG 1 every 52 seconds, C) handler_savepoint_rollback AVG 1 every 31 seconds.

Disclaimer: I am the author of mysqlservertuning.com as listed in my profile, Network profile. Good luck with your instance averaging 367 connections per second.

Wilson Hauck
  • 426
  • 4
  • 10
  • Thank you, I will try the changes tonight and check the results – Luciano Serra Mar 15 '19 at 12:56
  • Oh I guess I found out the problem, it does not solve the question about being able to handle more connections but I dont really need it right now if the server runs smoothly. We have a backup (mysqldump) routine that runs at specific times, I guess that script was locking the tables for a litte longer than it should, I added the option --single-transaction and --quick I'm not sure if it was root problem, but we did not had those peaks in the last two days. Any way, good to know about mysqlservertuning.com I will need your services for sure – Luciano Serra Mar 15 '19 at 13:18