
I'm posting this question here after getting no replies on the Ubuntu forums.

Recently I upgraded an HP ProLiant DL380 G4 server from Ubuntu 10.04 server to Ubuntu 12.04 server.

Since the upgrade, the server will, at random times, climb to a load of 400+ and then become completely non-responsive. I use an SNMP graphing program (cacti), and the load steadily increases by about 10 every five minutes until it gets over 400 and the graphing stops.

The graphs may not be accurate, but CPU usage averages about 3% before this happens. Right when the load starts increasing, it jumps to about 25% for 15 minutes, then drops sharply to less than 1% (about 0.3%) until the graphing stops.

I'm not able to open an SSH session to the server to do anything. I've checked /var/log/syslog, and all logging stops at that time as well, with nothing else in there.

The odd thing is - the server still responds to DNS queries for the zones it is authoritative on during this time - and at normal speed.

Just not sure what the next step would be in order to find out what is going on - and how this issue can be corrected. The server cannot stay with Ubuntu 10.04 Server and needs to stay upgraded.

ewwhite
Brian S

1 Answer


This looks like an I/O-related issue, since the disks and all write activity stop. The kernel and networking stack are running in RAM, which is why the server is still pingable.

The main things I would check are the system's BIOS/firmware and the firmware revision of the Smart Array controller on the system. This is an old ProLiant DL380 G4 (circa 2005), so you either have the onboard Smart Array 6i controller, a Smart Array 641 controller or a Smart Array 6400-series controller.

Can you tell us more?

The rapid load rise is due to processes being blocked waiting for I/O. You don't say what type of application is running on the system, but it seems like you probably have, say, 380+ processes waiting for disk :)
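To check this the next time the load starts climbing, and while a shell is still usable, you could list the processes in uninterruptible sleep (state `D`), which Linux counts toward the load average. The following is only a minimal sketch, not something taken from this system: the function names `parse_state` and `blocked_processes` are made up for illustration, and it assumes a Linux `/proc` filesystem.

```python
import os


def parse_state(stat_line):
    """Return the one-letter process state from a /proc/<pid>/stat line."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or parentheses, so split on the LAST ')' before reading fields.
    fields_after_comm = stat_line[stat_line.rfind(")") + 1:].split()
    return fields_after_comm[0]


def blocked_processes(proc_root="/proc"):
    """Return a sorted list of (pid, command) for processes in state 'D'.

    'D' is uninterruptible sleep, which usually means waiting on disk I/O.
    proc_root is a parameter only so the function can be tried on a copy.
    """
    blocked = []
    try:
        entries = os.listdir(proc_root)
    except OSError:
        return blocked  # no /proc available (not Linux, or not mounted)
    for entry in entries:
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "stat")) as handle:
                line = handle.read()
        except OSError:
            continue  # the process exited between listdir() and open()
        if parse_state(line) == "D":
            command = line[line.find("(") + 1:line.rfind(")")]
            blocked.append((int(entry), command))
    return sorted(blocked)
```

Calling `print(blocked_processes())` every minute or so from a root shell should show whether the number of `D`-state processes tracks the load figure in the graphs. If it does, that supports the theory that the processes are stuck waiting on the controller rather than burning CPU.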

-- edit --

So, I deployed lots of those servers over the years. Do you have access to the firmware? Are you running the HP Management Agents? This will give you more insight into what you need here and get proper drivers in place.

And finally... this is reallllly old gear... Would you consider an upgrade?

See: HP Proliant DL380 G4 - Can this server still perform in 2011?

-- edit --

Try # modinfo cciss and post the result.

[root@MDMarra ~]# modinfo cciss
filename:       /lib/modules/2.6.32-279.14.1.el6.x86_64/kernel/drivers/block/cciss.ko
license:        GPL
version:        3.6.28
description:    Driver for HP Smart Array Controllers
author:         Hewlett-Packard Company
srcversion:     712C176F5D360D8C1166F22
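If you want to compare the driver version across several servers or kernels, output like the above can be parsed with a few lines of Python. This is a rough sketch under assumptions: the helper names `parse_modinfo` and `version_tuple` are hypothetical, and it only expects the `key: value` layout shown above.

```python
def parse_modinfo(text):
    """Turn `modinfo` output into a dict, keeping the first value per key."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        key = key.strip()
        if sep and key and key not in fields:
            fields[key] = value.strip()
    return fields


def version_tuple(version):
    """Convert a dotted version such as '3.6.28' to a comparable tuple.

    Anything after a '-' (a package release suffix) is ignored.
    """
    return tuple(int(part) for part in version.split("-")[0].split("."))


sample = """filename:       /lib/modules/2.6.32-279.14.1.el6.x86_64/kernel/drivers/block/cciss.ko
license:        GPL
version:        3.6.28
description:    Driver for HP Smart Array Controllers
"""

info = parse_modinfo(sample)
print(info["version"], version_tuple(info["version"]))
```

On a live system the text would come from running `modinfo cciss` and capturing its output, for example with `subprocess.check_output`.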
ewwhite
  • Thank you for the fast reply. I was thinking this may be the issue as well. This server runs web services, mysql, mail services, pop3/imap services, dns, snmp, amongst others. It is basically a shared hosting server. As you indicated, this is the onboard Smart Array 6i controller. Firmware appears to be updated as well - so I'm at a loss on how this could be resolved. – Brian S Dec 07 '12 at 15:52
  • I understand about how old the gear is - but these servers seem to be very rock solid and have worked well. With only averaging about 5% CPU usage, I have these servers well under-utilized. I can only see the need to upgrade if I notice performance issues - or - an issue like this one if it cannot be resolved. I find it a bit sad though that just upgrading software or an OS on a server would cause a problem like this and points more towards something in the new software that should be fixed. – Brian S Dec 07 '12 at 16:03
  • Just got the download for the HP Management Agents to a CD. Looking at the install instructions, however, I'm not seeing any kind of program that would check BIOS/firmware versions and provide information on what may need to be upgraded. – Brian S Dec 07 '12 at 16:10
  • @BrianS The Smart Array 6i wasn't the fastest controller. Especially without having any cache. Does the system eventually recover, or do you need to hard-reboot? The big difference between versions here may be the CCISS driver/module. See my update above. – ewwhite Dec 07 '12 at 16:11
  • No, the system does not recover even after letting it run for 10+ hours after it occurs. As for running the modinfo cciss - it returns version 3.6.26. I've downloaded the updated firmware but it will not run in linux - as it complains that libstdc++-libc6.2-2.so.3 is not found. I found an ia64 version of that but it will not install in my amd64 architecture. – Brian S Dec 07 '12 at 16:36
  • I was able to download the i386 version along with some other dependencies. It flashed the 6i SmartArray successfully - so now it is a waiting game to see if that corrects the problem. – Brian S Dec 07 '12 at 16:51
  • Alright - server has been cold booted after the firmware update. One thing to note is the modinfo cciss. On the older version of Ubuntu, it shows a version of 3.6.20. On Ubuntu 12.04, it shows version 3.6.26. Still not the 3.6.28 that you show above. – Brian S Dec 07 '12 at 17:04
  • That's fine... [No major changes between the versions](http://cciss.sourceforge.net/#downloads). Hopefully, firmware took care of it. Please mark an answer. – ewwhite Dec 07 '12 at 17:09
  • Thank you. I found a much newer version of the cciss driver at http://cciss.sourceforge.net/. Version 4.6.28-14 released October 3, 2011. But anytime that there is a kernel upgrade, I don't want to compile the kernel with the new driver each time. – Brian S Dec 07 '12 at 17:14
  • Don't use that version... that won't help you. – ewwhite Dec 07 '12 at 17:15