I have a home-built Linux server (Ubuntu 12.04.5 LTS, Intel i5-3570K, 8GB RAM) acting primarily as a mail and web server. It operates in console mode only (no GUI). I will SSH to it now and then, and almost never operate it from the console. It tends to work fine for many days, even weeks, but then sometimes crashes hard without warning. And when I say "crashes hard", I mean the PC suddenly becomes completely unresponsive:
- It leaves no log entries
- It doesn't emit an "Oops", kernel panic message or core dump
- It doesn't display any message on the screen.
- It doesn't respond to any keyboard or mouse input (The NumLock light is also unresponsive to that key)
- It cannot be accessed by SSH
- **The case's reset switch will not operate**
The only solution is to hold the case power button in till it turns off, then restart it.
Of course this screams "hardware problem", but which component is the most likely culprit? Memtest86+ shows no errors, so that would seem to leave the Big Three: motherboard, CPU or power supply. (The PC is not overclocked, and the last sensor readings logged before the crash indicate no overheating or fan problems.)
Statistically, which of these components is most likely to be the problem?
I put the last criterion in bold above because it seemed unusual to me. Usually, even with a hard crash, a PC can still be rebooted with the case's reset switch. Does this suggest a problem with the PSU or the motherboard? (Holding in the power switch for 4-5 seconds to turn off the PC does still work.)
Is there a way to test them without simply ordering new parts one at a time until I'm confident (after several weeks of no crashing) that the problem is resolved?
Thanks to anyone who can help.
Does S.M.A.R.T. report any errors on any installed hard drives? Note: Use the utility "Disks" to check SMART reporting. – Steven – 2015-12-16T19:32:57.250
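Since the server runs without a GUI, the "Disks" utility may not be available; the same health verdict can be read from the command line with `smartctl -H` from the smartmontools package. The following is only a rough sketch of how that check could be scripted. The function names `parse_smart_health` and `check_drive` are made up for illustration, `/dev/sda` is an assumed device name, and the parser only looks for the overall verdict line, not individual attributes.

```python
import re
import subprocess

# Verdict lines printed by `smartctl -H` for ATA and SCSI drives respectively.
_HEALTH_RE = re.compile(
    r"(?:SMART overall-health self-assessment test result"
    r"|SMART Health Status):\s*(\S+)"
)


def parse_smart_health(output):
    """Return True if the drive reports healthy, False if it reports
    failing, and None if no verdict line is found in the output."""
    match = _HEALTH_RE.search(output)
    if match is None:
        return None
    return match.group(1).upper() in ("PASSED", "OK")


def check_drive(device):
    """Run `smartctl -H` on one device and parse the verdict.

    Assumes smartmontools is installed and the script runs as root.
    Popen is used instead of check_output because smartctl signals
    problems through non-zero exit codes, which would raise there.
    """
    proc = subprocess.Popen(
        ["smartctl", "-H", device],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    output, _ = proc.communicate()
    return parse_smart_health(output)


# Example (hypothetical device name, adjust to the actual drives):
#   print(check_drive("/dev/sda"))
```

A passing verdict does not rule the drive out entirely, but a failing one would be a strong hint worth following up before swapping other parts.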
The malfunctioning reset switch is unusual. The only time I recall seeing a flakey reset was long ago on a mil-spec ruggedized computer (go figure); the board contacts would go bad, so that all the boards would have to be pulled out and re-inserted. Otherwise reset circuits tend to be rather simple involving just the mobo and CPU (although on PCs the ACPI might be involved?). – sawdust – 2015-12-16T20:38:08.417
Steven, there are no S.M.A.R.T. errors I'm aware of, but I'll look again when I'm back at the server. @sawdust, the interesting thing is that the reset switch works fine when the machine is not in its hard-crashed state (not that you'd ever want to use it then, but it does work...) – George Adams – 2015-12-16T20:56:12.807
"reset switch works fine..." -- Yeah, I was wondering about that, but your analysis/writeup is very good, so I assumed it did. The worst-case scenario would be a combination of SW+HW issues that puts the machine in this condition. I have no idea how reset works on a PC (versus industrial SBCs) (e.g. is it really a HW reset or an NMI, non-maskable interrupt?). Since the PSU is probably the easiest component to substitute, you could try that just to eliminate it as a cause. – sawdust – 2015-12-16T23:28:20.810
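Because the hang leaves no log entries, it may help to at least pin down when it happens and what the machine was doing just before. Below is a minimal sketch of a heartbeat logger, not a tested diagnostic tool: it appends a timestamp and the load average to a file every few seconds and calls `fsync` so each line is pushed to disk right away. The log path, the interval and the function names are all assumptions chosen for illustration.

```python
import os
import time


def read_loadavg(path="/proc/loadavg"):
    """Return the raw load-average line, or a placeholder if unreadable."""
    try:
        with open(path) as f:
            return f.read().strip()
    except (IOError, OSError):
        return "loadavg unavailable"


def write_heartbeat(log_path, now=None):
    """Append one timestamped line and flush it to disk before returning.

    `now` is seconds since the epoch; None means the current time.
    """
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(now))
    line = "{0} {1}\n".format(stamp, read_loadavg())
    with open(log_path, "a") as log:
        log.write(line)
        log.flush()
        # Ask the kernel to write the data out now, so the line is much
        # less likely to be lost if the machine freezes moments later.
        os.fsync(log.fileno())
    return line


def run(log_path="/var/log/heartbeat.log", interval=10):
    """Write a heartbeat forever; path and interval are assumed values."""
    while True:
        write_heartbeat(log_path)
        time.sleep(interval)
```

After the next forced power-off, the last line in the file gives the time of the hang to within one interval, which can then be compared against cron jobs, mail traffic or the sensor readings.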