
I have a multiuser ERP application running on a CentOS 5.5 platform. The hardware is HP ProLiant DL380 G6. This has been a stable system for the past year, but there have been problems over the past week. The issue is a gradual rise in system load over the course of several hours to levels of 60+. The system remains responsive, but at some point, the HP ASR watchdog timer kicks in and reboots the server.

I have plenty of these systems in the field, but haven't had this particular issue before. The crash has occurred four times over the past week, but I was able to catch it this morning before the system became completely unresponsive.

This time, I found that the system load was about 75 and that there were 14 zombie processes I could not kill. Their PPIDs were 1, so I knew a reboot was in order. Interestingly, the dmesg output contained messages that I hadn't seen in the previous crashes; an excerpt follows. What do these entries mean? The PIDs correspond to the zombie processes I could not kill; in.telnetd is the Telnet daemon for a particular user session, and dbc is the per-user ERP application instance.

INFO: task in.telnetd:12210 blocked for more than 600 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
in.telnetd    D ffff81000102e4a0     0 12210   8899         16297 12762 (L-TLB)
 ffff8103848d7d38 0000000000000046 ffff8108272c1738 ffff81082328ae80
 ffff81082644d9c0 0000000000000009 ffff8102d8d7a080 ffff81011c9df100
 00011ef007c2d74b 0000000000002358 ffff8102d8d7a268 0000000500000001
Call Trace:
 [<ffffffff80063c6f>] __mutex_lock_slowpath+0x60/0x9b
 [<ffffffff8003db0d>] lock_timer_base+0x1b/0x3c
 [<ffffffff80063cb9>] .text.lock.mutex+0xf/0x14
 [<ffffffff8009dc66>] flush_workqueue+0x3f/0x87
 [<ffffffff801a96ee>] release_dev+0x503/0x67b
 [<ffffffff800646f9>] __down_failed+0x35/0x3a
 [<ffffffff80225b40>] sock_destroy_inode+0x0/0x10
 [<ffffffff800537af>] tty_release+0x11/0x1a
 [<ffffffff80012ad9>] __fput+0xd3/0x1bd
 [<ffffffff80023c39>] filp_close+0x5c/0x64
 [<ffffffff80038f19>] put_files_struct+0x63/0xae
 [<ffffffff80015860>] do_exit+0x31c/0x911
 [<ffffffff800491a7>] cpuset_exit+0x0/0x88
 [<ffffffff8005d116>] system_call+0x7e/0x83


INFO: task dbc:9054 blocked for more than 600 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
dbc           D ffff810001036b20     0  9054      1          3272  1795 (L-TLB)
 ffff81028e0c3d38 0000000000000046 0000000000000000 00000000000001d0
 0000000000000000 0000000000000009 ffff8107d60ff100 ffff81011c9ed080
 00011edea224420a 00000000000ebaee ffff8107d60ff2e8 0000000600000000
Call Trace:
 [<ffffffff80063c6f>] __mutex_lock_slowpath+0x60/0x9b
 [<ffffffff8003db0d>] lock_timer_base+0x1b/0x3c
 [<ffffffff80063cb9>] .text.lock.mutex+0xf/0x14
 [<ffffffff8009dc66>] flush_workqueue+0x3f/0x87
 [<ffffffff801a96ee>] release_dev+0x503/0x67b
 [<ffffffff8837dd1d>] :xfs:xfs_free_eofblocks+0x9d/0x1fe
 [<ffffffff800537af>] tty_release+0x11/0x1a
 [<ffffffff80012ad9>] __fput+0xd3/0x1bd
 [<ffffffff80023c39>] filp_close+0x5c/0x64
 [<ffffffff80038f19>] put_files_struct+0x63/0xae
 [<ffffffff80015860>] do_exit+0x31c/0x911
 [<ffffffff800491a7>] cpuset_exit+0x0/0x88
 [<ffffffff8006149d>] sysenter_do_call+0x1e/0x76
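
For reference, the zombie and blocked tasks can be listed with plain ps; this is a generic sketch, nothing in it is specific to this box.

    # zombie (Z state) tasks with their parent PIDs
    ps -eo pid,ppid,stat,comm | awk '$3 ~ /^Z/'

    # tasks stuck in uninterruptible sleep (D state) -- the ones the
    # hung-task detector is complaining about
    ps -eo pid,ppid,stat,wchan:32,comm | awk '$3 ~ /^D/'
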
ewwhite

1 Answer


Those messages come from the kernel's hung-task detector: the named task has been stuck in uninterruptible sleep (the D state) for more than 600 seconds. In practice, that means something is consuming all available I/O. This is either due to (a) failing hardware (disc/controller/etc.), which results in the available I/O effectively being zero, or (b) a process (or processes) using all of the I/O.

The guilty process may not be the one listed in dmesg; the listed task is just the victim.
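
If it happens again before the box wedges completely, a SysRq task dump will show where every task is stuck, not just the two the hung-task detector happened to flag. A minimal sketch, assuming you still have a root shell:

    # make sure the SysRq interface is enabled (it may already be)
    echo 1 > /proc/sys/kernel/sysrq

    # dump the state and kernel stack of every task to dmesg/syslog;
    # look for other tasks parked in D state
    echo t > /proc/sysrq-trigger

The output looks like the traces you already pasted, but for every task, so you can see what else is queued behind the same lock or device.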

Since I doubt in.telnetd (aside: why would you ever have telnetd running?) touches /data, and you have other systems that don't experience the issue, I'm guessing c0d0 is bad or the firmware needs to be updated.
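
If the ProLiant Support Pack is installed, you can also query the Smart Array from the shell with hpacucli; a rough sketch, where the slot number is only an assumption and needs adjusting for your controller:

    # overall controller/array status
    hpacucli ctrl all show status

    # per-disc status behind the controller (slot=0 is a guess)
    hpacucli ctrl slot=0 physicaldrive all show
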

Run the HP Insight Diagnostics on it. Otherwise, the next time it happens, see if you can run iostat to check whether the disc is actually being overwhelmed.
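
Something along these lines from the sysstat package is usually enough; if %util on the cciss/c0d0 line sits near 100 with large await values, the array really is the bottleneck:

    # extended per-device statistics, refreshed every 5 seconds
    iostat -x 5

    # the 'b' column counts processes blocked on I/O, 'wa' is iowait CPU time
    vmstat 5
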

Mark Wagner
  • Telnetd is in use because this is an ERP application with point-of-sale terminals and wireless RF guns (which require telnet). The server was entirely responsive through this, so there were no problems with disk access to the onboard Smart Array controller. I've narrowed this down to a system board/PCI card issue... I think it's a bad SCSI adapter (for the tape drive). – ewwhite Jan 27 '11 at 22:45
  • It was a bad SCSI HBA. – ewwhite Feb 09 '11 at 13:49