We have had a LAMP server for about 6 months. It runs CentOS 7.0.
It ran non-stop without a restart for the first 3 months, then it hung.
Then it ran for the next 2 months (also non-stop without a restart), then it hung again.
Then it ran for 14 days, then it hung.
Then it ran for another 14 days, then it hung again.
After each hang we had to restart the server. We haven't added or updated any system software.
The hang symptoms are the same in all these cases:
Writes to (and reads from) the disks stop completely.
The web server and the MySQL database stop working. We cannot log in via the physical console or remotely via ssh.
However, when one of these hangs happened I had remote ssh shell sessions open with the Linux "top" and "mytop" commands running, and they kept working (refreshing) until the server was restarted.
So this shows that the server was not completely frozen. Some of the software was still running.
The server could not be restarted gracefully.
I found nothing in the logs. All logs stopped at the same time.
The last entries on the physical console (KVM) when the hang happens mention errors with the Adaptec RAID controller. Please see below:
00001
[1143965.194144] 0000000000000246 000000014423ecb4 ffff880869b6b740 ffff880000c00040
[1143965.194786] Call Trace:
[1143965.195044] [<ffffffffa007f46b>] aac_fib_send+0x3db/0x510 [aacraid]
[1143965.195307] [<ffffffffa00794d8>] aac_get_adapter_info+0xc8/0xb70 [aacraid]
[1143965.195573] [<ffffffffa007e990>] _aac_reset_adapter+0x430/0x620 [aacraid]
[1143965.195838] [<ffffffffa0071a79>] aac_reset_adapter+0xa9/0x290 [aacraid]
[1143965.196101] [<ffffffffa0076214>] aac_eh_reset+0x1a4/0x1e0 [aacraid]
[1143965.196368] [<ffffffff813d6d83>] scsi_try_host_reset+0x43/0x100
[1143965.196628] [<ffffffff813d812,17>] scsi_eh_ready_devs+0x887/0xc20
[1143965.196889] [<ffffffff813da43c>] scsi_error_handler+0x52c/0x820
[1143965.197151] [<ffffffff813d9110>] ? scsi_eh_get_sense+0x2a0/0x2a0
[1143965.197415] [<ffffffff81085aff>] kthread+0xcf/0xe0
[1143965.197675] [<ffffffff81085a30>] ? kthread_create_on_node+0x140/0x140
[1143965.197939] [<ffffffff8151316c>] ret_from_fork+0x7c/0xb0
[1143965.198200] [<ffffffff81085a30>] ? kthread_create_on_node+0x140/0x140
[1143965.198461] Code: 48 c7 87 b8 00 00 00 00 30 08 a0 5d c3 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 8b 87 90 01 00 00 48 89 e5 8b 80 be 00 00 00 <a8> 04 75 14 f6 c4 01 75 14 25 80 00 00 00 83 f8 01 19 c0 83 e0
[1143974.082729] aacraid: aac_fib_send: first asynchronous command timed out.
[1143974.082729] Usually a result of a PCI interrupt routing problem;
[1143974.082729] update mother board BIOS or consider utilizing one of
[1143974.082729] the SAFE mode kernel options (acpi, apic etc)
We have replaced the RAID controller card, but that did not solve the problem: the server hung again with the same symptoms.
I now keep a remote ssh shell open all the time running "dmesg -wH", hoping to catch more of the dmesg log when the hang happens again.
The server has an Adaptec RAID card with two 960 GB SATA SSDs in RAID 1 and two 500 GB SATA HDDs in RAID 1.
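A minimal sketch of that capture, assuming it runs on a separate workstation so the output is saved somewhere the stalled disks cannot affect. The host name, the log file name and the helper names stamp_lines and capture are placeholders for illustration, not the real setup:

```shell
#!/bin/sh
# Sketch only: SERVER and LOG are hypothetical placeholders.
SERVER="${SERVER:-root@lamp-server.example}"
LOG="${LOG:-dmesg-capture.log}"

# Prefix every line read from stdin with the workstation's clock
# and append it to the file given as $1.
stamp_lines() {
    while IFS= read -r line; do
        printf '%s %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$line"
    done >> "$1"
}

# Follow the server's kernel log over ssh and store it locally.
# ServerAliveInterval makes ssh notice a dead connection.
capture() {
    ssh -o ServerAliveInterval=30 "$SERVER" 'dmesg -wH' | stamp_lines "$LOG"
}
```

Calling capture from a terminal on the workstation starts the logging. Because the file lives on the workstation, anything the kernel prints before the ssh session dies should still be readable after the server is restarted.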
S.M.A.R.T. attributes are OK for all the drives.
Any advice?
Edit #1 9/13/2015:
There is plenty of free space on all of the partitions.
Logs are rotating properly.
Edit #2 9/13/2015:
RAID controller: Adaptec ASR71605
BIOS : 7.5-0 (32069)
Firmware : 7.5-0 (32069)
Driver : 1.2-0 (30300)
Boot Flash : 7.5-0 (32069)