I have a repurposed PC running as a server. It was assembled in early 2014 and contains an Intel Core i7-4770 on a Gigabyte Z87-HD3. It worked reliably until early 2017, when it started to freeze intermittently (every few weeks to months). There are no kernel logs; neither pstore crash data nor netconsole produced anything meaningful. The physical screen is blank, the network is non-responsive, and metrics at 10 s granularity show no correlation with load on CPU, RAM or disk. All LEDs and drives are still running, but there is obviously no IO anymore. RAM has been tested and verified good, and there are no spurious segfaults or anything else that would indicate an intermittent hardware problem. Just hard freezes.
Now on to the very interesting part: once the system enters this state, the physical reset button stops working completely. When I press it, nothing happens. The button itself is definitely fine, since it works every time when the system is not in that state. I checked the voltages from the PSU with a multimeter and they are all fine. I can still reset the server by holding the power button for 5 s, and it boots up fine after that.
So I'm pretty much at a loss as to what is happening here and which piece of hardware is to blame. I have logic analyzers and I could get access to USB scopes, but nothing that samples above 100 MSPS, so I can't probe the actual buses. I would be very grateful for any insights into what might be going on.
I can appreciate your electronic background and desire to really dig in and troubleshoot things. But this isn't really how it works any more with computers. Troubleshooting is done through a process of elimination. And with an intermittent issue it can be tedious and take a long time. However, the basic procedure is to test and swap components until you figure out what it is. In this case, most likely it's your power supply or motherboard, because it doesn't respond to the reset button. I would say most likely the motherboard. But you'll need extra parts to test with to know for sure. – Appleoddity – 2017-12-15T07:20:03.957
Thermal expansion or motherboard layers coming apart, causing the physical separation of one of the motherboard's power lines? It's not the CPU; the reset button would keep working if it was the CPU or memory. The lack of disk IO blinkenlights (confirm?) says that it really is hard frozen, and not the GPU freezing. I'm not certain how to dive deeper into it without deep knowledge of the motherboard, unfortunately. – Christopher Hostage – 2017-12-15T07:24:01.767
Network non-responsive: so no replies to ping, no successful SSH connection attempts, and no new remote connections? Does the router still see the computer's network interface as connected? Has it ever happened while a remote connection was already open (did that freeze too)? Just a guess: what about the temperature? Is it possible that the temperature triggers this? – Hastur – 2017-12-15T08:42:44.803
Thanks for your comments. I'm quickly going to address all of them. Appleoddity: I know, but the freezes are very intermittent, which makes this a very slow process (and a tedious one, since there are a ton of PCIe cards and disks). But yeah, the motherboard is the most likely culprit. Christopher Hostage: Yes, no disk IO lights or any other IO for that matter. Hastur: The machine works as a router (multiple IB + Ethernet interfaces). All ongoing connections die; the network itself (PHY layer) doesn't go down, however, just the packet processing. – Lorenz – 2017-12-15T11:24:48.807