Hard freeze stops physical reset button from working

6

I have a repurposed PC running as a server. It was assembled in early 2014 and contains an Intel Core i7-4770 on a Gigabyte Z87-HD3. It worked pretty reliably until early 2017, when it started to freeze intermittently (every few weeks to months). There are no kernel logs, and neither pstore crash data nor netconsole produced anything meaningful. The physical screen is blank, the network is non-responsive, and metrics at 10s granularity show no correlation with load on CPU, RAM or disk. All LEDs and drives are still running, but there is obviously no IO anymore. The RAM has been tested and verified good, and there are no spurious segfaults or anything else that would indicate an intermittent hardware problem. Just hard freezes.
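For anyone wanting to repeat the pstore check after a forced power cycle, here is a minimal sketch. It assumes pstore is mounted at the usual `/sys/fs/pstore` location; the function name `list_pstore_records` is just illustrative. An empty result after a freeze is consistent with the kernel never getting the chance to write anything.

```python
from pathlib import Path


def list_pstore_records(pstore_dir="/sys/fs/pstore"):
    """Return the sorted names of crash records found in the pstore directory."""
    root = Path(pstore_dir)
    if not root.is_dir():
        # pstore is not mounted or not supported on this machine.
        return []
    return sorted(entry.name for entry in root.iterdir() if entry.is_file())


if __name__ == "__main__":
    records = list_pstore_records()
    if records:
        print("pstore records found:", ", ".join(records))
    else:
        print("no pstore records found")
```

Reading the directory usually requires root, so an empty list from an unprivileged run does not prove much.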

Now on to the very interesting part: once the system enters this state, the physical reset button stops working completely. When I press it, nothing happens. The button itself is definitely fine, since it works every time when the system is not in that state. I checked the voltages from the PSU with a multimeter and they are all fine. I can still reset the server by holding the power button for 5s, and it boots up fine after that.

So I'm pretty much at a loss as to what is happening here and which piece of hardware is to blame. I have logic analyzers and I could get access to USB scopes, but nothing that samples above 100MSPS, so I can't probe the actual buses. I would be very grateful for any insights into what might be going on.

Lorenz

Posted 2017-12-15T05:52:50.787

Reputation: 303

I can appreciate your electronics background and desire to really dig in and troubleshoot things. But this isn't really how it works any more with computers. Troubleshooting is done through a process of elimination, and with an intermittent issue it can be tedious and take a long time. The basic procedure is to test and swap components until you figure out which one it is. In this case, it's most likely your power supply or motherboard, because it doesn't respond to the reset button. I would say most likely the motherboard. But you'll need extra parts to test with to know for sure. – Appleoddity – 2017-12-15T07:20:03.957

Thermal expansion or motherboard layers coming apart, causing the physical separation of one of the motherboard's power lines? It's not the CPU; the reset button would keep working if it was the CPU or memory. The lack of disk IO blinkenlights (confirm?) says that it really is hard frozen, and not the GPU freezing. I'm not certain how to dive deeper into it without deep knowledge of the motherboard, unfortunately. – Christopher Hostage – 2017-12-15T07:24:01.767

Network non-responsive: so no answers to ping, no response to ssh connection attempts, and no new remote connections. Does the router still see the computer's network interface as connected? Has it ever happened while a remote connection was already open (did that freeze too)? Just a guess: what about temperature? Is it possible that heat triggers this reaction? – Hastur – 2017-12-15T08:42:44.803

Thanks for your comments. I'm quickly going to address all of them. Appleoddity: I know, but the freezes are very intermittent, which makes this a very slow process (and a tedious one, since there are a ton of PCIe cards and disks). But yeah, the motherboard is the most likely. Christopher Hostage: Yes, no disk IO lights or any other IO for that matter. Hastur: The machine works as a router (multiple IB + Ethernet interfaces). All ongoing connections die. The network itself (PHY layer) doesn't go down, however, just the packet processing. – Lorenz – 2017-12-15T11:24:48.807

Answers

1

So after a lot of strategic swapping (mainboard, PSUs, CPU) I have a differential confirmation that the CPU is bad: the test system now experiences the problem and the original no longer does. This is a very unexpected result, since no MCEs were ever fired, and usually you get MCEs way before hard lockups.
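The swap reasoning above can be sketched in a few lines. This is only an illustration under a single-fault assumption, and the part labels (`cpu_A`, `board_B`, ...) and the function name `suspects` are made up for the example. Note that with an intermittent fault, a system that "stayed up" only clears its parts tentatively; it may just not have frozen yet.

```python
def suspects(observations):
    """Narrow down faulty parts from swap experiments.

    observations: list of (parts, froze) pairs, where parts is a set of
    component labels installed in a system and froze says whether that
    system locked up during the observation window.
    """
    failing = [set(parts) for parts, froze in observations if froze]
    healthy = [set(parts) for parts, froze in observations if not froze]
    if not failing:
        return set()
    # Assuming a single faulty part, it must be in every system that froze.
    candidates = set.intersection(*failing)
    # Parts seen in a system that stayed up are tentatively cleared.
    for parts in healthy:
        candidates -= parts
    return candidates


if __name__ == "__main__":
    # Hypothetical labels: "A" parts are the originals, "B" parts the spares.
    observations = [
        ({"cpu_A", "board_A", "psu_A"}, True),   # original build froze
        ({"cpu_A", "board_B", "psu_B"}, True),   # test system with old CPU froze
        ({"cpu_B", "board_A", "psu_A"}, False),  # original with spare CPU stayed up
    ]
    print(suspects(observations))
```

The observation window for each configuration has to be longer than the typical gap between freezes, which here means months per swap.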

Since this board sadly doesn't have a Trace Hub / JTAG connector and the built-in USB3 debugging is not available on the Haswell platform I have no idea what is actually going wrong. It's pretty certain that the chip ends up in a state where it fails to be released from reset (self-test failure, power rail not coming up, ...). Could be related to the introduction of FIVR (Fully Integrated Voltage Regulator) in Haswell, but that's just speculation.

If you hit this problem, it doesn't need to be the CPU, it could just as well be a failing motherboard or PSU (or something else entirely). I just wanted to post this for completeness and for people to see that it can indeed also be a CPU fault (although it is still pretty unlikely).

Lorenz

Posted 2017-12-15T05:52:50.787

Reputation: 303

Had a CPU failure once in the last decade; it was throwing intermittent/random lockups and ECC correction notices on boot. Wasted time focusing on memory. Finally had a hunch, swapped CPUs (dual socket), got errors to move channel with CPU, and was able to fix the problem. What a pain though! – Damon – 2018-04-23T06:08:59.270

-1

I have seen this behavior twice before, both times on x86 laptops. When this happens, the screen freezes and the LEDs stay on, but no buttons work. The only button that does work is the power button, and only when held down for 5 seconds.

Laptops usually have no reset button, so I can't be exactly sure of your issue, but the evidence points to a hardware fault. What I saw was solder joints on the board becoming cracked, whether by defect, time, or mechanical stress (enough hot/cold cycles). Each bad joint will inject electrical noise. Get enough of them, or get them in the right places, and digital circuits will lock up, causing the entire board to freeze. This is not at the OS or BIOS level; it's lower down, in the hardware. In this state, only the power button's hold-down feature will work, because that uses an analog circuit that doesn't lock up.

The fix is to put the board through a reheat cycle (inside a machine) that quickly melts the solder causing the cracks to re-weld and disappear.

I found a firm that specializes in this kind of repair.

On eBay, navigate to Specialty Services -> Restoration & Repair Services -> Computer Restoration & Repair Services. The seller is "NYClaptoptech". I searched for the make/model, and they had a matching "item for sale". I purchased this service the same way I would buy a PC, using the same checkout process. (It did seem odd to set up a service call using the purchase method.) I shipped the motherboard and got it back in 2 weeks. Cost: US$120. Their service is generic and you can simply call them to arrange a repair.

user855923

Posted 2017-12-15T05:52:50.787

Reputation: 1