I have a multiuser ERP application running on a CentOS 5.5 platform. The hardware is an HP ProLiant DL380 G6. This has been a stable system for the past year, but there have been problems over the past week. The issue is a gradual rise in system load over the course of several hours, to levels of 60 or more. The system remains responsive, but at some point the HP ASR watchdog timer kicks in and reboots the server.
I have plenty of these systems in the field, but haven't had this particular issue before. The crash has occurred four times over the past week, but I was able to catch it this morning before the system became completely unresponsive.
This time, I found that the system load was about 75 and there were 14 zombie processes that I could not kill. Their PPIDs were 1, so I knew a reboot would be in order. Interestingly, the following excerpt of the dmesg output contains messages that I didn't see in previous crashes. What do these entries mean? The PIDs correspond to the zombie processes that I could not kill. in.telnetd is the Telnet daemon for a particular user session, and dbc is the per-user ERP application instance.
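For anyone wanting to check for the same condition, here is a minimal sketch of how unkillable processes can be listed by reading /proc/<pid>/stat. It reports processes in state D (uninterruptible sleep, which counts toward the Linux load average) or Z (zombie), along with their parent PID. The function names are my own for illustration, and the script assumes a Python 3 interpreter, which is not what CentOS 5.5 ships by default.

```python
import os


def parse_stat_line(text):
    """Split one /proc/<pid>/stat line into (pid, comm, state, ppid)."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so locate the first '(' and the last ')'.
    start = text.index("(")
    end = text.rindex(")")
    pid = int(text[:start].strip())
    comm = text[start + 1:end]
    fields = text[end + 1:].split()
    # After comm, the first field is the state and the second is the PPID.
    return pid, comm, fields[0], int(fields[1])


def list_stuck_processes(proc_root="/proc"):
    """Return (pid, comm, state, ppid) for processes in state D or Z."""
    stuck = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as handle:
                info = parse_stat_line(handle.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited or entry was unreadable during the scan
        if info[2] in ("D", "Z"):
            stuck.append(info)
    return stuck


if __name__ == "__main__":
    for pid, comm, state, ppid in list_stuck_processes():
        print(f"{pid:>7} {state} ppid={ppid:<7} {comm}")
```

Running this periodically (for example from cron) while the load climbs should show whether the count of D-state processes grows along with it.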
INFO: task in.telnetd:12210 blocked for more than 600 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
in.telnetd D ffff81000102e4a0 0 12210 8899 16297 12762 (L-TLB)
ffff8103848d7d38 0000000000000046 ffff8108272c1738 ffff81082328ae80
ffff81082644d9c0 0000000000000009 ffff8102d8d7a080 ffff81011c9df100
00011ef007c2d74b 0000000000002358 ffff8102d8d7a268 0000000500000001
Call Trace:
[<ffffffff80063c6f>] __mutex_lock_slowpath+0x60/0x9b
[<ffffffff8003db0d>] lock_timer_base+0x1b/0x3c
[<ffffffff80063cb9>] .text.lock.mutex+0xf/0x14
[<ffffffff8009dc66>] flush_workqueue+0x3f/0x87
[<ffffffff801a96ee>] release_dev+0x503/0x67b
[<ffffffff800646f9>] __down_failed+0x35/0x3a
[<ffffffff80225b40>] sock_destroy_inode+0x0/0x10
[<ffffffff800537af>] tty_release+0x11/0x1a
[<ffffffff80012ad9>] __fput+0xd3/0x1bd
[<ffffffff80023c39>] filp_close+0x5c/0x64
[<ffffffff80038f19>] put_files_struct+0x63/0xae
[<ffffffff80015860>] do_exit+0x31c/0x911
[<ffffffff800491a7>] cpuset_exit+0x0/0x88
[<ffffffff8005d116>] system_call+0x7e/0x83
INFO: task dbc:9054 blocked for more than 600 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
dbc D ffff810001036b20 0 9054 1 3272 1795 (L-TLB)
ffff81028e0c3d38 0000000000000046 0000000000000000 00000000000001d0
0000000000000000 0000000000000009 ffff8107d60ff100 ffff81011c9ed080
00011edea224420a 00000000000ebaee ffff8107d60ff2e8 0000000600000000
Call Trace:
[<ffffffff80063c6f>] __mutex_lock_slowpath+0x60/0x9b
[<ffffffff8003db0d>] lock_timer_base+0x1b/0x3c
[<ffffffff80063cb9>] .text.lock.mutex+0xf/0x14
[<ffffffff8009dc66>] flush_workqueue+0x3f/0x87
[<ffffffff801a96ee>] release_dev+0x503/0x67b
[<ffffffff8837dd1d>] :xfs:xfs_free_eofblocks+0x9d/0x1fe
[<ffffffff800537af>] tty_release+0x11/0x1a
[<ffffffff80012ad9>] __fput+0xd3/0x1bd
[<ffffffff80023c39>] filp_close+0x5c/0x64
[<ffffffff80038f19>] put_files_struct+0x63/0xae
[<ffffffff80015860>] do_exit+0x31c/0x911
[<ffffffff800491a7>] cpuset_exit+0x0/0x88
[<ffffffff8006149d>] sysenter_do_call+0x1e/0x76
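To match these messages against the list of stuck PIDs, a small parser along the following lines may help. It is only a sketch: the regular expression is based solely on the two "INFO: task ... blocked for more than ... seconds" lines shown above, and the names HUNG_TASK and hung_tasks are hypothetical.

```python
import re

# Pattern inferred from the hung-task lines in the dmesg excerpt above.
HUNG_TASK = re.compile(
    r"INFO: task (.+):(\d+) blocked for more than (\d+) seconds"
)


def hung_tasks(log_text):
    """Return (task name, pid, seconds blocked) for each hung-task message."""
    found = []
    for line in log_text.splitlines():
        match = HUNG_TASK.search(line)
        if match:
            name, pid, seconds = match.groups()
            found.append((name, int(pid), int(seconds)))
    return found


if __name__ == "__main__":
    sample = (
        "INFO: task in.telnetd:12210 blocked for more than 600 seconds.\n"
        "Call Trace:\n"
        "INFO: task dbc:9054 blocked for more than 600 seconds.\n"
    )
    for name, pid, seconds in hung_tasks(sample):
        print(name, pid, seconds)
```

Feeding it the full dmesg text (for instance, the output of the dmesg command saved to a file) gives the PIDs to compare with the process list.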