
Had the misfortune of upgrading an HP ProLiant DL380 G4 from SLES 10 SP2 (i586) to SLES 10 SP4 (x86_64). Although the installation completed smoothly, the server becomes unresponsive after a couple of days of uptime. It responds to PING, but SSH and even console access fail. The only way to recover is to cold boot the server.

The syslogs do not log anything from the period when the server is unresponsive. Searching around, I found similar instances reported for various flavors of Linux, usually resolved by upgrading the BIOS and/or firmware of the server.

I also tried booting with both acpi=ht and acpi=off, without any success; a sketch of where the parameter goes is shown below.
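On SLES 10 the parameter is appended to the kernel line in GRUB legacy's /boot/grub/menu.lst. A minimal sketch of such a stanza; the root device (root=/dev/cciss/c0d0p2) and partition below are placeholders, not values taken from this server:

# /boot/grub/menu.lst (GRUB legacy, as shipped with SLES 10)
title SLES 10 SP4 (acpi=off test)
    root (hd0,0)
    kernel /boot/vmlinuz-2.6.16.60-0.85.1-smp root=/dev/cciss/c0d0p2 acpi=off
    initrd /boot/initrd-2.6.16.60-0.85.1-smp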

I upgraded the server BIOS to the version available from the HP Passport site at this link, but this did not resolve it.

I then tried upgrading the firmware of the storage controller from here.

I have rebooted the server and am waiting to see if this resolves the issue. Any suggestions/recommendations on what the root cause might be and how I can go about fixing it?

I did find one post that comes pretty close to what I am seeing: Ubuntu 12.04 - HP ProLiant DL380 G4 - Load Maxes Out / Unresponsive

Server info:

Linux hostname 2.6.16.60-0.85.1-smp #1 SMP Thu Mar 17 11:45:06 UTC 2011 x86_64 x86_64 x86_64 GNU/Linux

> lscpu
Architecture:          x86_64
CPU(s):                4
Thread(s) per core:    2 
Core(s) per socket:    1
CPU socket(s):         2
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            15
Model:                 4
Stepping:              1
CPU MHz:               3200.225
L1d cache:             16K
L2 cache:              1024K


> modinfo cciss
filename:       /lib/modules/2.6.16.60-0.85.1-smp/updates/cciss.ko
license:        GPL
description:    Driver for HP Smart Array Controllers version 3.6.28-24 (d927/s1461)
author:         Hewlett-Packard Company
srcversion:     737C49390DD1F6FB9BC03F7

> slabtop
 Active / Total Objects (% used)    : 331966 / 339552 (97.8%)
 Active / Total Slabs (% used)      : 20306 / 20315 (100.0%)
 Active / Total Caches (% used)     : 98 / 136 (72.1%)
 Active / Total Size (% used)       : 78133.61K / 79253.95K (98.6%)
 Minimum / Average / Maximum Object : 0.02K / 0.23K / 128.00K

 OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
 191752 191637  99%    0.09K   4358       44     17432K buffer_head
 44916  44891  99%    0.20K   2364       19      9456K dentry_cache
 35620  35561  99%    0.78K   7124        5     28496K ext3_inode_cache
 15064  15035  99%    0.52K   2152        7      8608K radix_tree_node
  6510   5859  90%    0.18K    310       21      1240K vm_area_struct
  5782   5689  98%    0.06K     98       59       392K size-64
  3840   3747  97%    0.08K     80       48       320K sysfs_dir_cache
  3288   3271  99%    0.61K    548        6      2192K proc_inode_cache
  3015   2259  74%    0.25K    201       15       804K filp
  2304   2043  88%    0.02K     16      144        64K anon_vma
  2304   1911  82%    0.02K     16      144        64K dm_tio
  2208   1899  86%    0.04K     24       92        96K dm_io
  2106   2096  99%    0.58K    351        6      1404K inode_cache
  1710   1633  95%    0.12K     57       30       228K size-128
  1680   1515  90%    0.03K     15      112        60K size-32
  1480   1169  78%    0.09K     37       40       148K journal_head

Any pointers would be appreciated.

watman

1 Answer


I had many HP Smart Array 6400/641/6i SCSI RAID controllers hang in a similar manner back in 2003-2006 with Red Hat/CentOS systems. You're losing the RAID controller and underlying storage system. I/O stalls and even console login fails because the OS can't read from disk. The networking stack is in memory, so the system responds to ping.

Some of it is a driver interaction. Some of it is simply that these systems are not meant to be used with the OSes in place today. There aren't any options other than using more modern hardware (or maybe buying another unit for $18 on eBay to test). This didn't happen to all of my servers during that era, but it definitely impacted some more than others.

There are no firmware updates of consequence, as the last revision is from 2008. The system BIOS also stopped receiving updates in 2008. There's no support for this storage issue or the server hardware in any way.

The DL380 G4 is a 10-year-old server built on ancient technology (PCI-X, Ultra SCSI, etc.). Would you expect support for a desktop of that vintage?

See: HP Proliant DL380 G4 - Can this server still perform in 2011?

ewwhite
  • Thanks @ewwhite, and I tend to agree with you on the age of the server and its hardware limitations. Unfortunately, since the issue only shows up after the upgrade to the 64-bit OS, while the server performed fairly well on the 32-bit OS, I am wondering if there is a workaround to keep the RAID controller from stuffing up. Do you have any suggestions on how I could check how and when the RAID controller is lost? I would like to correlate it with any I/O-intensive process running at that time, perhaps with something like the remote-logging sketch appended after these comments. – watman Oct 30 '14 at 08:20
  • The other worst-case scenario I could think of is disabling the RAID on the server completely, but I already know from your other posts that there is no light at the end of that tunnel. – watman Oct 30 '14 at 08:22
  • @watman You don't have any options at this point. You can try another DL380 G4 chassis to see, but the better plan is to virtualize or replace this hardware. – ewwhite Oct 30 '14 at 11:46
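Since the OS can still reach the network but not the disks when this happens, about the only way to capture evidence of the controller dropping out is to push logs off the box beforehand. A minimal sketch, assuming the stock syslog-ng shipped with SLES 10 and a reachable log host; the host address, port, source name (src) and config path are assumptions, not values from this server:

# /etc/syslog-ng/syslog-ng.conf -- forward everything to a remote log host
destination d_remote { udp("192.168.1.50" port(514)); };
log { source(src); destination(d_remote); };

#!/bin/sh
# hangwatch.sh -- periodic snapshots routed through syslog (and thus off-box)
while true; do
    logger -t hangwatch "load: $(cat /proc/loadavg)"   # current load average
    vmstat 1 2 | tail -n 1 | logger -t hangwatch       # memory/IO/CPU sample
    dmesg | tail -n 5 | logger -t hangwatch            # last kernel messages (cciss errors)
    sleep 60
done

If the box hangs again, the last snapshots on the log host show roughly when the cciss controller stopped responding and what was running at the time.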