I have a server I built years ago that has worked like a champ, but within the past few months it has become seriously unstable with no discernible pattern. I have been debugging it and swapping out parts to no avail. I have replaced almost everything in the system that I can think of that might be the cause, except for the drives used for storage.
Note that the system is running CentOS 7.5.
The symptoms are that the machine will spontaneously perform a hard reset, as if the power supply were cycling or there was a sudden loss of power. It can happen once every few days or twice in a single day, with the system idle or under load. There is no pattern.
I have removed everything but the bare essentials. Note that I have replaced:
The motherboard, CPU, RAM, and PSU.
If any of the RAM sticks were defective, I would expect to see logged corrected/uncorrectable ECC errors, which I do not (the checks I used are sketched just below). If it were the CPU, I would expect something a bit more random, with some logging from a possible kernel panic. I suspected a fault with the power supply and replaced it. The problem persisted, so I tried replacing the motherboard. No change.
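To be concrete, these are roughly the places I checked for ECC and machine-check evidence. This is a sketch, assuming the EDAC driver is loaded and the stock CentOS 7 log locations:

```
# Corrected / uncorrected ECC counters exposed by the EDAC driver
grep . /sys/devices/system/edac/mc/mc*/ce_count \
       /sys/devices/system/edac/mc/mc*/ue_count

# Any EDAC or machine-check messages in the kernel and system logs
dmesg | grep -iE 'edac|mce|machine check'
grep -iE 'edac|mce' /var/log/messages
```

Both the counters and the logs come up clean.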
The system was configured with two processors and 16 sticks of identical memory, so I removed one CPU and half of the RAM, waited to see if it crashed, then swapped in the other set. No change in symptoms.
I started removing extra components and have arrived at the bare minimum with no change in symptoms.
- There is never anything suggesting a hardware failure in the logs; they simply end at the point of reset.
- There is nothing in the IPMI logs.
- There is nothing in the UPS logs (removing the UPS did not help either).
- The processors are not overheating; I logged lm_sensors output and saw no abnormalities.
- I monitored system temperature, CPU and memory Vcore, fan RPM, and PSU voltages via ipmitool (the commands are sketched after this list).
- All SMART tests report PASSED.
- I swapped the primary disk used for the OS (/ root, boot, swap) to another SSD by mirroring it with mdadm and installing grub.
- Both RAID arrays (see specs below) are ZFS and do not report any faults. There are no issues when scanning for bit rot or corruption.
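For completeness, these are the kinds of commands behind the checks above. This is a sketch rather than the exact logging scripts I ran, and the device name is just an example:

```
# Sensor readings from the BMC and from lm_sensors
ipmitool sdr list      # temperatures, fan RPM, voltage rails
ipmitool sel list      # BMC event log (nothing around any reset)
sensors                # lm_sensors readout

# SMART health per drive (sda as an example)
smartctl -H /dev/sda
smartctl -a /dev/sda

# ZFS pool health, scrub results, and any data errors
zpool status -v

# System log from the previous boot (needs a persistent journal);
# it simply ends at the moment of the reset
journalctl -b -1 -e
```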
I am now at a complete and utter loss. Other than the few remaining drives in the system and the case itself, I've run out of things to try replacing.
What could possibly be causing my server to reset itself? What else can I test for? Could the fault really be coming from one of the drives?
Currently the system is specced as follows:
Base components:
- SuperMicro H8DG6-F (Motherboard)
- 1x AMD Opteron Processor 6328 (CPU)
- 16GB x 8 Hynix DDR3 ECC HMT42GR7BMR4C-G7 (Memory)
Storage:
- 1x Samsung SSD 850 PRO 128GB XFS (/ root, boot, swap)
- 2x Samsung SSD 850 PRO 512GB ZFS RAID-1 (/data)
- 8x Western Digital RED 4TB WD40EFRX-68WT0N0 ZFS RAID-Z3 (/backup)
The Western Digital RED drives are connected to the case backplane, which is connected to the onboard SAS controller. All of the SSDs are in a ToughArmor MB998SP-B backplane mounted in a 5.25" bay at the front of the case and are connected to the motherboard SATA controller.
Cooling:
- NH-U12DO A3 (CPU)
- Fans added to chipset heatsinks (they get very hot)
- Small heatsink added to Intel Gigabit chip
- Thermal paste on ALL heatsinks has been replaced with Noctua NT-H1, with the exception of the small heatsinks around the CPUs, which have thermal pads
Case:
Power Supply:
- PWS-865-PQ 865W
UPS:
UPDATE:
I have been able to trace the stability issue to an unlikely source: software. This was not previously entertained during differential diagnosis because a software issue (even in a kernel module) should at worst trigger a kernel panic, not a silent hard reset.
The source has been identified as the ZFS arrays (ZFS on Linux). I can replicate the crash by removing all disks except for the OS disk and a ZFS array and then performing a scrub on that array while there are simultaneous reads on any ZFS array (the same one or another) on the system.
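In rough terms the reproduction looks like the following. This is a sketch: it assumes the pools are named data and backup to match their mountpoints, and the read loop is just a generic stand-in for the Minecraft server I/O.

```
# Start a scrub on one pool
zpool scrub backup

# Meanwhile, generate sustained reads against a dataset on either pool
# (crude stand-in for the Minecraft server I/O)
while true; do
    tar cf /dev/null /data 2>/dev/null
done

# Pool health and scrub progress can be watched from another shell
zpool status -v
```

The hard reset follows shortly after the scrub starts.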
Basic testing setup:
- 1 CPU
- 16GB x 8 Memory
- 128GB SSD for CentOS 7.5 (Boot/Swap/Root)
- SuperMicro H8DG6-F Motherboard
- PWS-865-PQ 865W PSU
- Onboard Matrox G200 Video
All disks are connected to the motherboard. No PCIe slots are populated.
Elimination of other sources:
- CPU (swapped with a second CPU)
- Memory (swapped with a second set of memory)
- Motherboard (swapped with another identical board; BIOS is updated)
- OS Hard Disk (swapped between Crucial and Samsung 128GB SSDs)
- PSU (certified for use with this motherboard; tested against two units)
ZFS activity:
- Scrub on a single array
- Read/write access on the same array or on another (one or the other, not both at once)
Test 1: !! CRASH !!
- Basic setup (as described above)
- 2x Samsung SSD 850 PRO 512GB ZFS RAID-1 (/data)
- 8x Western Digital RED 4TB WD40EFRX-68WT0N0 ZFS RAID-Z3 (/backup)
ZFS scrub on /backup. Several Minecraft servers run on /data.
Server reboots shortly thereafter.
This is similar to the system's normal configuration, but stripped down to a minimal set of components for testing and analysis.
Test 2: !! STABLE !!
- Basic setup (as described above)
- 8x Western Digital RED 4TB WD40EFRX-68WT0N0 ZFS RAID-Z3 (/backup)
ZFS scrub on /backup. NO Minecraft servers active and no access to any ZFS disk.
Server is stable for over 24h and scrub completes.
At this point I suspect the /data array is at fault.
Test 3: !! CRASH !!
- Basic setup (as described above)
- 8x Western Digital RED 4TB WD40EFRX-68WT0N0 ZFS RAID-Z3 (/backup)
ZFS scrub on /backup. Several Minecraft servers run on /backup.
Server reboots shortly thereafter.
At this point I suspect the /backup array may be the real fault, as the /data array is no longer present and the system crashed exactly as it always has.
Test 4: !! CRASH !!
- Basic setup (as described above)
- 2x Samsung SSD 850 PRO 512GB ZFS RAID-1 (/data)
ZFS scrub on /data. Several Minecraft servers run on /data.
Server reboots shortly thereafter.
So the instability seems to be related to ZFS?
Test 5: !! STABLE !!
- Basic setup (as described above)
- 1x Samsung SSD 850 PRO 512GB XFS (/data-testing)
Several Minecraft servers run on /data-testing.
Server has been stable for weeks.
I am now confident that the source of the instability is related to the ZFS arrays. This is very strange, as I've run ZFS on this system for years without issue until now. It's also very strange that a fault would cause the entire system to reboot without a kernel panic or a log entry.
I welcome any additional insight that anyone might be able to provide.