How to solve intermittent server hangs? Writes to (and reads from) disk stop completely

Question

We have had a LAMP server running CentOS 7.0 for about 6 months.

It ran nonstop without a restart for the first 3 months, then it hung.

Then it ran for the next 2 months (also nonstop without a restart), then it hung again.

Then it ran for 14 days, then it hung.

After each hang we had to restart the server. We haven't added/updated any system software.

The hang symptoms are the same in all these cases:

Writes to (and reads from) disk stop completely.

The web server and MySQL database stop working. We cannot log in via the physical console or remotely via ssh.

However, when the hang happened I had remote ssh shell sessions open with the Linux "top" and "mytop" commands running, and these kept working (refreshing) until the server was restarted.

So this proves that the server was not completely frozen. Some of the software was still running.

The server could not restart gracefully.

I found nothing in the logs. All logs stopped at the same time.

The last entries on the physical console (KVM) when the hang happened mentioned errors with the Adaptec RAID controller. Please see below:

00001
[1143965.194144] 0000000000000246 000000014423ecb4 ffff880869b6b740 ffff880000c00040
[1143965.194786] Call Trace:
[1143965.195044] [<ffffffffa007f46b>] aac_fib_send+0x3db/0x510 [aacraid]
[1143965.195307] [<ffffffffa00794d8>] aac_get_adapter_info+0xc8/0xb70 [aacraid]
[1143965.195573] [<ffffffffa007e990>] _aac_reset_adapter+0x430/0x620 [aacraid]
[1143965.195838] [<ffffffffa0071a79>] aac_reset_adapter+0xa9/0x290 [aacraid]
[1143965.196101] [<ffffffffa0076214>] aac_eh_reset+0x1a4/0x1e0 [aacraid]
[1143965.196368] [<ffffffff813d6d83>] scsi_try_host_reset+0x43/0x100
[1143965.196628] [<ffffffff813d812,17>] scsi_eh_ready_devs+0x887/0xc20
[1143965.196889] [<ffffffff813da43c>] scsi_error_handler+0x52c/0x820
[1143965.197151] [<ffffffff813d9110>] ? scsi_eh_get_sense+0x2a0/0x2a0
[1143965.197415] [<ffffffff81085aff>] kthread+0xcf/0xe0
[1143965.197675] [<ffffffff81085a30>] ? kthread_create_on_node+0x140/0x140
[1143965.197939] [<ffffffff8151316c>] ret_from_fork+0x7c/0xb0
[1143965.198200] [<ffffffff81085a30>] ? kthread_create_on_node+0x140/0x140
[1143965.198461] Code: 48 c? 87 b8 00 00 00 00 30 08 a0 5d c3 Al 11 84 00 00 00 00 00 Of 11 44 00 00 55 48 8b 87 90 01 00 00 48 89 e5 8b 80 be 00 00 00 <a8> 04 75 14 f6 c4 01 75 14 25 80 00 00 00 83 f8 01 19 c0 83 e0
[1143974.082729] aacraid: aac_fib_send: first asynchronous command timed out.
[1143974.082729] Usually a result of a PCI interrupt routing problem;
[1143974.082729] update mother board BIOS or consider utilizing one of
[1143974.082729] the SAFE mode kernel options (acpi, apic etc)

We have replaced the RAID controller card, but it did not solve the problem; the server hung again with the same symptoms.

I now keep a remote ssh shell running all the time with "dmesg -wH", hoping to catch more of the dmesg log when the hang happens again.
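Since local disk writes stop during the hang, anything that captures the kernel log has to keep it on the other end of the ssh session. A minimal Python sketch of that idea follows: it filters storage-related lines out of kernel log output. It assumes raw `[seconds.microseconds]` timestamps (as printed by `dmesg -w` without `-H`); the keyword list and the function names are hypothetical and only for illustration.

```python
import re

# Matches the raw kernel timestamp prefix, e.g. "[1143974.082729] message".
KMSG_LINE = re.compile(r"^\[\s*(\d+\.\d+)\]\s?(.*)$")

# Hypothetical keyword list (an assumption, not from the thread); adjust as needed.
STORAGE_KEYWORDS = ("aacraid", "aac_", "scsi", "timed out")


def parse_kmsg_line(line):
    """Return (timestamp_seconds, message), or None if the line has no timestamp."""
    match = KMSG_LINE.match(line.rstrip("\n"))
    if match is None:
        return None
    return float(match.group(1)), match.group(2).rstrip()


def is_storage_event(message):
    """True if the message mentions one of the storage-related keywords."""
    lowered = message.lower()
    return any(keyword in lowered for keyword in STORAGE_KEYWORDS)


def iter_storage_events(lines):
    """Yield (timestamp, message) for storage-related lines, one at a time.

    A generator is used so it can follow a live stream without buffering.
    """
    for line in lines:
        parsed = parse_kmsg_line(line)
        if parsed is not None and is_storage_event(parsed[1]):
            yield parsed
```

One possible use is to run this on the workstation rather than on the server, passing `sys.stdin` to `iter_storage_events` while piping in the output of `dmesg -w` from the ssh session, so the filtered lines survive even when the server can no longer write to its own disks.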

The server has an Adaptec RAID card with two 960 GB SATA SSDs in RAID 1 and two 500 GB SATA HDDs in RAID 1.

S.M.A.R.T. attributes are OK for all the drives.

Any advice?

Edit #1 9/13/2015:
There is plenty of free space on all of the partitions.
Logs are rotating properly.

Edit #2 9/13/2015:
RAID controller: Adaptec ASR71605
BIOS : 7.5-0 (32069)
Firmware : 7.5-0 (32069)
Driver : 1.2-0 (30300)
Boot Flash : 7.5-0 (32069)

Update the operating system. Also update the system firmware. — Michael Hampton, Sep 13 '15 at 01:25
Also how much free space is on the disk? The endless shrinking spiral of reboots might point to logs filling up the system and not being properly rotated. — Giacomo1968, Sep 13 '15 at 06:18
Thank you both. There is plenty of free space. The log partition is only 7% full. Logs are rotating properly. — alxsr, Sep 13 '15 at 16:02
If you use drivers from Adaptec, make sure you use the latest version, or uninstall them and use the driver provided by the OS. — dtoubelis, Sep 13 '15 at 16:14
The Adaptec drivers we use came with CentOS 7.0 and their version is 1.2-0[30300], while on the Adaptec site v1.2.1-41024 is available for download. I did not run fsck on the partitions. — alxsr, Sep 13 '15 at 19:12
Every 14 days is a very strange pattern. I recommend you schedule maintenance on your service and run fsck. Please note it may run for a long time, depending on your disk size! — BaZZiliO, Sep 13 '15 at 19:45

score 0 · Accepted Answer · answered Nov 07 '16 at 00:08

The solution was to use Adaptec's own driver (it can be downloaded from their site) instead of the open-source driver that came with CentOS. The server ran for about 11 months with the Adaptec driver (then it hung for an unknown reason), which is a vast improvement over the 14 days of uptime with the open-source driver.
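To confirm which driver build is actually loaded after such a swap, something like the following Python sketch could help. It compares the trailing build numbers of the two version strings quoted in this thread. Two assumptions are made here: that the build number grows with newer releases, and that the loaded module exposes its version at `/sys/module/aacraid/version`. The function names are hypothetical.

```python
import re
from pathlib import Path

# Assumption: the loaded aacraid module publishes its version string here.
VERSION_FILE = Path("/sys/module/aacraid/version")


def build_number(version):
    """Extract the trailing build number, e.g. 30300 from '1.2-0[30300]'."""
    numbers = re.findall(r"\d+", version)
    if not numbers:
        raise ValueError(f"no digits in version string: {version!r}")
    return int(numbers[-1])


def is_older(installed, available):
    """True if the installed build is lower than the available one.

    Assumes build numbers increase monotonically between releases.
    """
    return build_number(installed) < build_number(available)


def loaded_driver_version(path=VERSION_FILE):
    """Return the loaded driver's version string, or None if it cannot be read."""
    try:
        return path.read_text().strip()
    except OSError:
        return None
```

With the strings from the comments, `is_older("1.2-0[30300]", "1.2.1-41024")` evaluates to true, which matches the observation that the driver shipped with CentOS 7.0 was older than the one offered for download.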
