-2

Every now and then, one of our remote Linux servers crashes: it becomes unavailable on the network (sometimes still responding to ping, but not to ssh/http) and doesn't respond to mouse or keyboard input.

The servers are high-quality consumer grade hardware running Ubuntu 20.04.3 LTS.

Since these crashes happen infrequently, I'm collecting the common reasons a server might crash like that. The goal is to set up monitoring (munin) so that I have all the information I need when it happens again, and to implement countermeasures (e.g. periodic restarts?).

Question:

What are the reasons for a Linux computer to become unresponsive, what info can I track to diagnose these issues, and what can I do to fix them?

I believe this question and answers will be most useful if there's one answer per cause of failure and I'll be posting answers myself as I find such causes.

  • Sometimes the kernel records the reason it hangs on the console and/or creates a crash dump, so definitely make sure those get a) generated, b) captured, and c) investigated. But *"servers"* & *"consumer grade hardware"*, sigh: one of the nice things about server hardware is the out-of-band management console, which can collect and report hardware errors and other events that kill the operating system (before the OS can record the error). – HBruijn Sep 01 '22 at 10:03
  • OOB management is a very good point, @HBruijn . Thank you. – Johannes Bauer Sep 01 '22 at 10:15
  • I don't see how this will be all that useful. There's *nothing* in your logs? – ceejayoz Sep 01 '22 at 13:01
  • @ceejayoz I didn't find anything, but then it hasn't happened very often, so far, and I got notified only days after the incidents with no precise information about when it happened. – Johannes Bauer Sep 01 '22 at 13:48
  • It's more about a general list of things to track and log, and of causes to look for after it happens, both in this particular setting and in others where a system suddenly stops working. – Johannes Bauer Sep 01 '22 at 14:08

4 Answers

0

Reason: Excessive swapping

can cause a system freeze (though this would usually be transient).

Track: RAM and swap usage

Fix: Increase RAM, tune services, (maybe) increase swap

See here
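For the tracking part, here is a minimal Python sketch that turns the contents of `/proc/meminfo` into RAM and swap usage fractions. The function names (`parse_meminfo`, `memory_pressure`, `sample`) are made up for this example, and it assumes a kernel recent enough to report `MemAvailable` (the 5.x kernels on Ubuntu 20.04 do). It is not a replacement for munin's own memory graphs, just a way to get timestamped numbers into a log.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB)."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Key: value" line
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def memory_pressure(meminfo):
    """Return (ram_used_fraction, swap_used_fraction) from a parsed dict."""
    total = meminfo["MemTotal"]
    available = meminfo["MemAvailable"]
    swap_total = meminfo.get("SwapTotal", 0)
    swap_free = meminfo.get("SwapFree", 0)
    ram_used = (total - available) / total
    # A machine without swap reports SwapTotal as 0; avoid dividing by it.
    swap_used = (swap_total - swap_free) / swap_total if swap_total else 0.0
    return ram_used, swap_used


def sample(path="/proc/meminfo"):
    """Read the live values from the running system (Linux only)."""
    with open(path) as f:
        return memory_pressure(parse_meminfo(f.read()))
```

Calling `sample()` once a minute from cron and appending the result to a file on disk should leave a trail that survives a hard freeze, so you can see whether swap usage climbed just before the machine went away.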

0

Reason: Excessive RAM/CPU usage

Track: RAM and swap usage, resource-hungry processes, their logs

Fix: Increase RAM, tune services, debug resource-hungry processes to see under which conditions their resource consumption spikes
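To find the resource-hungry processes, one option is to rank processes by resident memory using `/proc/<pid>/status`. The sketch below is only an illustration: the helper names (`parse_status`, `scan_proc`, `top_by_rss`) are invented here, and it covers memory only, not CPU.

```python
import os


def parse_status(text):
    """Extract (name, rss_kb) from /proc/<pid>/status text.

    Kernel threads have no VmRSS line, so their RSS is reported as 0.
    """
    name, rss_kb = "?", 0
    for line in text.splitlines():
        if line.startswith("Name:"):
            name = line.split(":", 1)[1].strip()
        elif line.startswith("VmRSS:"):
            rss_kb = int(line.split()[1])
    return name, rss_kb


def scan_proc(root="/proc"):
    """Return a list of (pid, name, rss_kb) for all visible processes."""
    procs = []
    for entry in os.listdir(root):
        if not entry.isdigit():
            continue  # only numeric directories are processes
        try:
            with open(os.path.join(root, entry, "status")) as f:
                name, rss_kb = parse_status(f.read())
        except OSError:
            continue  # the process exited while we were scanning
        procs.append((int(entry), name, rss_kb))
    return procs


def top_by_rss(procs, n=5):
    """Return the n entries with the largest resident set size."""
    return sorted(procs, key=lambda p: p[2], reverse=True)[:n]
```

Logging `top_by_rss(scan_proc())` alongside the overall RAM and swap numbers shows not just that memory ran out but which process was growing at the time.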

0

Reason: HD Write Failures

Track: SMART diagnostics

Fix: Replace failing disks
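As a rough sketch of how the SMART check could be automated, the Python below calls `smartctl` with JSON output and flags a failed health check or non-zero counts on a few attributes commonly associated with failing disks. It assumes smartmontools 7.0 or later (needed for `--json`) and root privileges. The JSON field names used here (`smart_status.passed`, `ata_smart_attributes.table`) match what I have seen for ATA disks, so check them against the output of your own `smartctl` version. The names `WATCHED_IDS`, `disk_warnings` and `read_report` are made up for this example.

```python
import json
import subprocess

# Attribute IDs often watched on ATA disks: reallocated sectors,
# current pending sectors, offline uncorrectable sectors.
WATCHED_IDS = {5, 197, 198}


def disk_warnings(report):
    """Return a list of warning strings for one parsed smartctl JSON report."""
    warnings = []
    if report.get("smart_status", {}).get("passed") is False:
        warnings.append("overall health check failed")
    # Assumed layout: a list of attribute dicts with "id", "name", "raw.value".
    table = report.get("ata_smart_attributes", {}).get("table", [])
    for attr in table:
        raw = attr.get("raw", {}).get("value", 0)
        if attr.get("id") in WATCHED_IDS and raw > 0:
            warnings.append(f"{attr.get('name', attr['id'])} raw value {raw}")
    return warnings


def read_report(device):
    """Run smartctl on a device such as /dev/sda and parse its JSON output."""
    # smartctl's exit status is a bit mask and can be non-zero even when
    # it produced a usable report, so the return code is not checked here.
    result = subprocess.run(
        ["smartctl", "--json", "-H", "-A", device],
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)
```

Running `disk_warnings(read_report("/dev/sda"))` from a daily cron job and mailing any non-empty result is one way to hear about a degrading disk before it takes the system down.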

-1

You have nice pointers here from the previous comments.

You might also want to take your server offline for a weekend (if possible) and test the RAM with Memtest86.

Burn the image to a CD, or write the ISO to a USB key, and boot the machine from it. I understand you have physical access to it.

yield