Server crashes with "hung task timeout"

Question

First things first: I've read a lot about the "hung task timeout" kernel panic and know that it often happens when the server is out of resources.

These error messages appear only in the VNC console, not in any log file:

[264240.505133] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[264240.505359] INFO: task nginx:2333 blocked for more than 120 seconds.
[264240.505454] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[264240.505658] INFO: task nginx:2334 blocked for more than 120 seconds.
[264240.505752] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[264240.505946] INFO: task nginx:2335 blocked for more than 120 seconds.
[264240.506038] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[264240.506251] INFO: task php5-fpm:2415 blocked for more than 120 seconds.
...
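
To see which processes are affected, the blocked-task lines can be tallied per process name. This is only a minimal sketch, assuming the messages are available as text (for example copied from the console, or from `dmesg` if it still responds); `count_blocked_tasks` is a hypothetical helper name, not an existing tool.

```shell
# Count "blocked for more than N seconds" reports per task name.
# Reads kernel log text on stdin and prints "<task> <count>" lines.
# Usage (assumption: dmesg is readable): dmesg | count_blocked_tasks
count_blocked_tasks() {
    sed -n 's/.*INFO: task \([^:]*\):[0-9]* blocked.*/\1/p' \
        | sort | uniq -c | awk '{ print $2, $1 }'
}
```

If one service dominates the list, that is a hint about where to look first, although every process waiting on the same stalled device will show up here.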

Server specs:

8 Core Intel® Xeon® E5-2660V3
24 GB DDR4
320GB SSD

The machine is KVM virtualized. It runs Debian Wheezy with PHP5-FPM, NGINX, MySQL and a few smaller services. It mainly hosts a website and a large MySQL DB with around 25 GB of data.

Disk usage is around 12%.

I've installed Munin for monitoring, which shows no anomalies. Since the last crash I have also installed sysstat, but I don't really know which of its log files could be useful to you, so please ask for the ones you think are needed.

The crash happened around 10.03.2015 17:37 GMT.

In my opinion this has something to do with MySQL. Here is the my.cnf:

[client]
port        = 3306
socket      = /var/run/mysqld/mysqld.sock

[mysqld_safe]
socket      = /var/run/mysqld/mysqld.sock
nice        = 0

[mysqld]
user        = mysql
pid-file    = /var/run/mysqld/mysqld.pid
socket      = /var/run/mysqld/mysqld.sock
port        = 3306
basedir     = /usr
datadir     = /var/lib/mysql
tmpdir      = /tmp
lc-messages-dir = /usr/share/mysql
skip-external-locking

bind-address        = 127.0.0.1

key_buffer      = 16M
max_allowed_packet  = 16M
thread_stack        = 192K
thread_cache_size       = 8
myisam-recover-options  = BACKUP
max_connections         = 50
query_cache_limit   = 1M
query_cache_size        = 16M

log_error = /var/log/mysql/error.log

slow_query_log = 1
slow_query_log_file = /var/log/mysql/mysql-slow.log

expire_logs_days    = 10
max_binlog_size         = 100M

innodb_buffer_pool_size = 18G
innodb_log_file_size    = 256M

[mysqldump]
quick
quote-names
max_allowed_packet  = 16M

[mysql]

[isamchk]
key_buffer      = 16M

!includedir /etc/mysql/conf.d/

As you can see, I configured MySQL so that it can use around 80% of the total RAM. The MySQL server handles on average 2k queries/second with a 50/50 read/write split.
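
As a quick check of that figure: the InnoDB buffer pool alone (18 GB of 24 GB) comes to 75% of RAM. The sketch below only does that integer arithmetic; `ram_share_pct` is a hypothetical helper, and it ignores per-connection buffers, the query cache, and everything else running on the machine.

```shell
# Integer percentage of total RAM taken by a given allocation.
# $1: allocation in GB, $2: total RAM in GB.
ram_share_pct() {
    echo $(( $1 * 100 / $2 ))
}

ram_share_pct 18 24   # prints 75
```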

Right before the crash I saw in htop that around 21 GB of 24 GB were in use, along with 500 MB of the 1.5 GB swap. CPU usage was normal.

EDIT: sar -u output from directly before the crash:

18:27:01        CPU     %user     %nice   %system   %iowait    %steal     %idle
18:29:01        all      8,28      0,00      1,31      5,61      0,02     84,77
18:31:01        all      7,65      0,41      1,41      5,73      0,03     84,78
18:33:01        all      7,95      0,00      1,25      5,51      0,02     85,27
18:35:01        all      8,87      0,00      1,42      5,53      0,03     84,15
18:37:01        all      8,99      0,42      1,40      5,94      0,03     83,22
Average:        all      8,65      0,16      1,35      5,08      0,03     84,73
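
To pull the peak %iowait out of a longer `sar -u` report, something like the following could be used. It is a sketch under assumptions: it expects the column layout shown above (with %iowait as the 6th field, which changes if the time column includes AM/PM), it accepts a comma as the decimal separator, and `max_iowait` is a hypothetical helper name.

```shell
# Print the highest %iowait value found in `sar -u` output read from stdin.
# Only the per-interval "all" rows are considered; the header and the
# "Average:" row are skipped.
# Usage (assumption: sysstat is collecting data): sar -u | max_iowait
max_iowait() {
    LC_ALL=C awk '$2 == "all" && $1 != "Average:" {
            v = $6
            sub(/,/, ".", v)   # normalise a comma decimal separator
            v += 0             # force numeric comparison
            if (v > max) max = v
        }
        END { printf "%.2f\n", max }'
}
```

For the excerpt above this gives 5.94, the 18:37:01 sample.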

EDIT:

Munin images

http://imgur.com/a/0BZa0

MySQL Query

EDIT:

I contacted my ISP and they said that nothing abnormal happened at the time of the crash, so it has something to do with my setup. Next I will check what happens if I reduce innodb_buffer_pool_size to 14 GB and add innodb_flush_method = O_DIRECT.

Could you please explain the crashing behavior? Is it a core dump, a kernel panic? Resource starvation? OOM Killer? KVM crash? – Mircea Vutcovici Mar 10 '15 at 19:01
The server freezes completely and I'm unable to connect via SSH/HTTP/etc. Only in the VNC console, which is buffered, can I see the last output (see main post). I asked my provider and they said that the KVM is still running, so I think it is a kernel panic. – Christian D. Mar 10 '15 at 19:05
You should probably look at `iowait` in `sar`. Your symptoms lead me to believe it's an I/O issue. – Belmin Fernandez Mar 10 '15 at 19:31
%steal is CPU time requested by your VM but not given to it because another VM was scheduled to run. The higher it is, the slower your VM will be. – Mircea Vutcovici Mar 10 '15 at 19:52

Mircea Vutcovici · Answer 1 · 2015-03-10T20:08:42.250

0

The problem is not a kernel panic. What you see on the console is a process hung in a kernel call.

You should check the console, /var/log/syslog, or /var/log/messages for the full stack trace that the kernel logs. It will give you an idea of which subsystem is slow. It could be disk, as mentioned by @Belmin Fernandez, or it could be network...
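
A rough way to pull those traces out of a log file is sketched below. The kernel prints the task state and call trace right after each "blocked for more than" line, so showing some trailing context is usually enough. `show_hung_task_traces` is a hypothetical helper, and the default of 25 context lines is an arbitrary assumption.

```shell
# Print each hung-task report together with the lines that follow it.
# $1: log file to search, e.g. /var/log/syslog
# $2: number of trailing context lines (default 25)
show_hung_task_traces() {
    grep -A "${2:-25}" 'blocked for more than' "$1"
}
```

If this prints nothing, the messages never reached that file, which can happen when the disk itself is what the tasks are blocked on.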

Now you have to get some stats from the host too. If CPU or disk I/O is overcommitted, you could have resource starvation caused by other VMs running on the same host. It will be difficult to determine whether this is the cause from inside the VM alone.

KVM supports paravirtualized (virtio) drivers. Check with the ISP whether they are installed and up to date on all VMs running on the host machine.
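
From inside your own guest you can at least see whether virtio modules are loaded. A small sketch, assuming the drivers are built as loadable modules (drivers compiled into the kernel will not appear in `lsmod`); `list_virtio_modules` is a hypothetical helper name.

```shell
# List the names of loaded virtio modules.
# Reads `lsmod` output on stdin, prints one module name per line, sorted.
# Usage: lsmod | list_virtio_modules
list_virtio_modules() {
    awk '$1 ~ /^virtio/ { print $1 }' | sort
}
```

This says nothing about the other VMs on the host, which only the provider can check.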

If you run both MySQL and Nginx on the same machine, make sure that both can keep all their active data in RAM. 80% allocated to MySQL could be too high.

Could you please post the Munin graphs of the file system cache? If the FS cache keeps shrinking, your VM is starved of memory.
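
Without the graphs, the current page cache size can be read from /proc/meminfo. A minimal sketch; `fs_cache_mib` is a hypothetical helper, and it looks only at the `Cached:` line (not `Buffers:`), so it may not match Munin's numbers exactly.

```shell
# Print the page cache size in MiB.
# $1: meminfo-style file to read (default /proc/meminfo, values in kB)
fs_cache_mib() {
    awk '$1 == "Cached:" { printf "%d\n", $2 / 1024 }' "${1:-/proc/meminfo}"
}
```

Sampling this every minute or so while the load builds would show the same trend as the graph.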

If you have access to the console, you can use the kernel's magic SysRq key to trigger a task list. Enable it with `echo 1 > /proc/sys/kernel/sysrq`, then list the running tasks from the console with ALT + SysRq + t.
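
If the key combination cannot be sent (some remote consoles do not pass it through), the same task dump can be requested by writing `t` to /proc/sysrq-trigger from a root shell that still responds. A sketch only: `dump_tasks_via_sysrq` is a hypothetical wrapper, and its optional argument exists purely so it can be tried against a scratch directory instead of the real /proc.

```shell
# Enable SysRq and ask the kernel to dump the task list to the kernel log.
# Must be run as root against the real /proc.
# $1: proc mount point (default /proc)
dump_tasks_via_sysrq() {
    proc_root="${1:-/proc}"
    echo 1 > "$proc_root/sys/kernel/sysrq" &&
    echo t > "$proc_root/sysrq-trigger"
}
```

The dump goes to the kernel log, so it shows up on the console and in `dmesg`. It does not help once the machine is frozen so hard that no shell can run.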

edited Mar 10 '15 at 20:08

answered Mar 10 '15 at 19:37

`/var/log/syslog` and `/var/log/messages` contain **no** stack trace. Both stop right before 18:37. The last entry in syslog before the crash is the start of the sysstat cron job, while messages resumes with the kernel boot output. – Christian D. Mar 10 '15 at 20:07
The memory graphs from Munin would be interesting then. If you have disk I/O graphs, those would be great too. – Mircea Vutcovici Mar 10 '15 at 20:09
FS Cache is ok in the Munin graphs – Mircea Vutcovici Mar 10 '15 at 20:18
The disk I/O is very low after the reboot. I wonder why it was so high before the reboot. – Mircea Vutcovici Mar 10 '15 at 20:20
19 GB active memory on average. I think you will soon need to add more memory to the VM and to MySQL too. – Mircea Vutcovici Mar 10 '15 at 20:23
I've added an image of the MySQL queries/second, which shows why there was such heavy I/O before the crash. The DB will be optimized soon, and I will also upgrade to a much bigger instance, but for now I have this one. What do you mean by "access to console"? I only have SSH access and VNC. – Christian D. Mar 10 '15 at 20:35
Access to the console means the terminal where you can configure the BIOS and see the kernel boot messages. – Mircea Vutcovici Mar 11 '15 at 12:46
Canonically, the console is the primary terminal where the kernel messages are printed. It could be a serial tty. – Mircea Vutcovici Mar 11 '15 at 12:48
In my VNC console I can see the kernel boot messages, but it does not support the `SysRq` key. – Christian D. Mar 11 '15 at 14:27
