How do I diagnose a hard Linux crash?

I have a home-built Linux server (Ubuntu 12.04.5 LTS, Intel i5-3570K, 8GB RAM) acting primarily as a mail and web server. It operates in console mode only (no GUI). I will SSH to it now and then, and almost never operate it from the console. It tends to work fine for many days, even weeks, but then sometimes crashes hard without warning. And when I say "crashes hard", I mean the PC suddenly becomes completely unresponsive:

It leaves no log entries
It doesn't emit an "Oops", kernel panic message or core dump
It doesn't display any message on the screen.
It doesn't respond to any keyboard or mouse input (The NumLock light is also unresponsive to that key)
It cannot be accessed by SSH
The case's reset switch will not operate

The only solution is to hold the case power button in till it turns off, then restart it.

Of course this screams "hardware problem", but which component is the most likely? Memtest86+ shows no errors, so that would seem to leave the Big Three: motherboard, CPU or power supply. (The PC is not overclocked, and the last sensor readings before the crash indicate no overheating or fan problems.)

Is there a statistical likelihood which of these components is likely to be the problem?
I put the last criterion in bold above because it seemed unusual to me. Usually, even with a hard crash, a PC can still be rebooted with the case's reset switch. Does this suggest a problem with the PSU, or the motherboard? (Holding in the power switch for 4-5 seconds to turn off the PC does still work.)
Is there a way to test them without simply ordering new parts one at a time until I'm confident (after several weeks of no crashing) that the problem is resolved?

Thanks to anyone who can help.

George Adams

Posted 2015-12-16T19:31:07.253

Reputation: 171

Does S.M.A.R.T. report any errors on any installed hard drives? Note: Use the utility "Disks" to check SMART reporting. – Steven – 2015-12-16T19:32:57.250

The malfunctioning reset switch is unusual. The only time I recall seeing a flaky reset was long ago on a mil-spec ruggedized computer (go figure); the board contacts would go bad, so that all the boards would have to be pulled out and re-inserted. Otherwise, reset circuits tend to be rather simple, involving just the mobo and CPU (although on PCs the ACPI might be involved?). – sawdust – 2015-12-16T20:38:08.417

Steven, there are no S.M.A.R.T. errors I'm aware of, but I'll look again when I'm back at the server. @sawdust, the interesting thing is that the reset switch works fine when the machine is not in its hard-crashed state (not that you'd ever want to use it then, but it does work...) – George Adams – 2015-12-16T20:56:12.807

"reset switch works fine..." -- Yeah, I was wondering about that, but your analysis/writeup is very good, so I assumed it did. Worse case scenario would be a combination of SW+HW issues puts the machine in this condition. I have no idea how reset works on a PC (versus industrial SBCs) (e.g. is it really a HW reset or an NMI, non-maskable interrupt?). Since the PSU is probably the easiest component to substitute, you could try that just to eliminate that as a cause. – sawdust – 2015-12-16T23:28:20.810

Answers

1: Is your Ubuntu stable? Did you download a stable version of Ubuntu? If not, try downgrading to the latest stable build.

2: Have you tried it on another virtual or physical machine? It could very well be a script error. Testing it in a VM such as VirtualBox would more than likely prevent any hard crash and, if you haven't tried this already, it would also give you an environment where you could debug and monitor the OS.

3: RAM failure? It is very unlikely to be the local SSD/HDD/SSHD, because the Linux OS is loaded into RAM, and it would post a warning if it were unable to contact the kernel, and only then crash. However, if the RAM were to lock up because it is faulty or defective, the operating system would freeze completely, unable to post (or even be aware of) any errors, which might explain why there are no logs. It is, however, VERY possible that it could be something else.

4: Have a look at the forums. I'm not the most proficient Linux user out there, and there is a lot that I don't really know, though I have had similar hardware and software issues. However, I don't really know what your home-brew server does, so it's hard to pinpoint the flaw. I'd browse the forums.

Shadowforce62

Posted 2015-12-16T19:31:07.253

Reputation: 41

1. Yes, as I mentioned, this is Ubuntu 12.04.5 LTS (long-term support version). 2. This is a physical server. I have no other hardware that would allow me to virtualize and move it. 3. As I mentioned, Memtest86+ shows no errors. – George Adams – 2015-12-16T20:57:57.607

I am a bit surprised no one has suggested the use of the SysRq magic key.

First of all, it should be used instead of the power switch to force a reboot, because this gives programs a chance to save unsaved data to the disk; failure to do so might cause considerable problems upon reboot (not to mention the crashing bore of having to wait for the usual fsck check). This is done as follows: keeping Alt and SysRq pressed simultaneously, type r e i s u b, pausing a few seconds between keys (the famous English mnemonic is Raising Elephants Is So Utterly Boring; I prefer Running Errands Is So Utterly Boring; try to come up with a better one if you can).

Even apart from this, when the system freezes, Alt + SysRq + X (where X is a letter) allows you to run some diagnostics: for instance, X=d displays all locks currently held, which may help diagnose a software problem; X=j thaws frozen filesystems; X=l (a lowercase L) shows a stack backtrace for all active CPUs; X=t outputs a list of current tasks to the console; X=w displays the tasks that are blocked (in uninterruptible sleep).

You can find more codes on Wikipedia.
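Note that the key combinations above only work if the kernel permits them: the `kernel.sysrq` sysctl is a bitmask, and a distribution may enable only part of it. The following is a minimal Python sketch that decodes such a value. The bit meanings follow the kernel's SysRq documentation; the example value 176 is only an assumption for illustration (check your own with `cat /proc/sys/kernel/sysrq`), and the helper names `decode_sysrq` and `key_allowed` are hypothetical.

```python
# Sketch: decode a kernel.sysrq value to see which SysRq keys it permits.
from typing import List

# Bit values as documented for the kernel.sysrq sysctl.
SYSRQ_BITS = {
    2: "console logging level",
    4: "keyboard control (SAK, unraw)",
    8: "debugging dumps of processes etc.",
    16: "sync",
    32: "remount read-only",
    64: "signalling of processes (term, kill, oom-kill)",
    128: "reboot/poweroff",
    256: "nicing of all RT tasks",
}

# Bit each key needs (only keys mentioned in this answer; j is left out).
KEY_BIT = {"r": 4, "e": 64, "i": 64, "s": 16, "u": 32, "b": 128,
           "d": 8, "l": 8, "t": 8, "w": 8}


def decode_sysrq(value: int) -> List[str]:
    """Return the function groups enabled by a kernel.sysrq value."""
    if value == 1:  # 1 means "everything enabled"
        return list(SYSRQ_BITS.values())
    return [name for bit, name in SYSRQ_BITS.items() if value & bit]


def key_allowed(value: int, key: str) -> bool:
    """True if the given SysRq key is permitted under this value."""
    return value == 1 or bool(value & KEY_BIT[key])


example = 176  # assumed example value: 16 + 32 + 128
print(decode_sysrq(example))
print([k for k in "reisub" if key_allowed(example, k)])
```

With a value like 176, only the s, u and b steps of the reboot sequence take effect, and the diagnostic dumps (d, l, t, w) are refused. Running `sysctl kernel.sysrq=1` as root before the next freeze enables all of them.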

While I cannot say this will be a decisive step (there are situations where even this fails), it is the next step in the investigation: it will help point to a software or hardware problem and narrow the range of possible culprits.

MariusMatutiae

Posted 2015-12-16T19:31:07.253

Reputation: 41 321

The best you can do is look at the logs near the time of the lock up and see if you can correlate the lockup with any system event of any type. It's a difficult thing to do and you may not be able to find anything that could be a direct cause this way.
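One way to find the moments worth inspecting is to look for long silences in a log that is normally chatty: the last line before the silence is the last thing the system recorded before the lockup. Below is a rough Python sketch, assuming the traditional syslog timestamp format (`Dec 16 19:31:07 ...`), which carries no year, so the year is passed in. The function name `find_log_gaps` and the sample lines are made up for illustration and are not taken from the poster's server.

```python
# Sketch: find long silences in a syslog-style file to locate lockups.
from datetime import datetime, timedelta


def find_log_gaps(lines, year, min_gap=timedelta(minutes=10)):
    """Return (last_seen, next_seen, last_line) for each gap >= min_gap.

    Lines without a parsable timestamp are skipped. A year rollover
    (Dec -> Jan) yields a negative difference and is ignored here.
    """
    gaps = []
    prev_ts = prev_line = None
    for line in lines:
        try:
            # The first 15 characters hold "Mon DD HH:MM:SS".
            ts = datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
        except ValueError:
            continue
        if prev_ts is not None and ts - prev_ts >= min_gap:
            gaps.append((prev_ts, ts, prev_line.rstrip()))
        prev_ts, prev_line = ts, line
    return gaps


# Hypothetical sample lines, for illustration only.
sample = [
    "Dec 16 03:10:01 server app: periodic job ran",
    "Dec 16 03:15:01 server app: periodic job ran",
    "Dec 16 09:42:17 server app: first message after restart",
]
for last_seen, next_seen, last_line in find_log_gaps(sample, 2015):
    print(f"silent from {last_seen} to {next_seen}; last line: {last_line}")
```

On a real system you would pass the lines of a file such as `/var/log/syslog` instead of `sample`. The gap also includes the time the machine sat frozen before someone restarted it, so only the start of each gap says anything about when the crash happened.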

Some hints for diagnosing hardware problems:

The easiest thing to eliminate is firmware issues/settings:

Make sure your system has the latest firmware/BIOS updates from the manufacturer.
Make sure any storage devices are also updated to latest firmware.
Try disabling any CPU or other power management options in the firmware/BIOS.
Try disabling virtualization in the firmware if you don't use it.
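Before hunting for firmware updates, it helps to know which BIOS version and board the machine currently reports. As a small sketch, the Python below reads the DMI fields that Linux exposes under `/sys/class/dmi/id`. The function name `read_dmi` is hypothetical, and the code assumes those sysfs files are present and readable, which may not hold on every system.

```python
# Sketch: report the BIOS and board identification exposed via sysfs.
from pathlib import Path

DMI_FIELDS = ("bios_vendor", "bios_version", "bios_date",
              "board_vendor", "board_name")


def read_dmi(base="/sys/class/dmi/id"):
    """Return a dict of DMI fields; a field is None if it cannot be read."""
    info = {}
    for field in DMI_FIELDS:
        try:
            info[field] = (Path(base) / field).read_text().strip()
        except OSError:
            info[field] = None
    return info


for name, value in read_dmi().items():
    print(f"{name}: {value}")
```

Compare the reported `bios_version` and `bios_date` with the latest release listed by the board manufacturer.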

Problems with RAM can cause hard lockups even if they don't show on a memory test. It could be something very intermittent. Actual servers have ECC RAM that prevents rare/transient RAM errors from causing problems but if this is a non-server PC it doesn't have this. Try swapping out the RAM if you can.

A power issue from your wall power could cause problems like this. If you are serious about running a home server you should have a battery backup that also filters out transient power issues.

If problems persist thereafter, try replacing the power supply or using another one.

After that, assume the motherboard is flaky and look into replacing it.

LawrenceC

Posted 2015-12-16T19:31:07.253

Reputation: 63 487

The logs are always very innocuous. It's just going about its normal business when suddenly BAM - no more log entries. No weirdo scripts firing up just before sudden death. The server is being run on a UPS. ECC RAM is an interesting idea. I'll have to look into that pricing. – George Adams – 2015-12-16T20:59:29.680