9

It has already happened to me twice within a few days that my server goes down completely: http, ssh, ftp, dns, smtp, basically ALL services stop responding, as if the server had been turned off, except that it still responds to ping, which is what baffles me the most.

I do have some php scripts that put a huge load (cpu and memory) on the server in short bursts, used by a small group of users, but usually the server "survives" these bursts perfectly well, and when it goes down it never coincides with such usage peaks (I'm not saying it can't be related, but it doesn't happen right after them).

I'm not asking you to magically tell me the ultimate cause of these crashes; my question is: is there a single process whose death could cause all of these services to go down simultaneously? The strange thing is that every network service goes down except ping. If some process were eating 100% of the CPU, the server wouldn't respond to ping either. If apache crashed because of, say, a broken php script, that would affect http only, not ssh, dns, etc.

My OS is CentOS 5.6.

Most importantly, after hard-rebooting the server, what system logs should I look at? /var/log/messages doesn't reveal anything suspicious.

matteo

2 Answers

8

(tl;dr: still responding to ping is expected behaviour; check your memory usage)

ICMP echo requests (i.e. ping) are handled by the in-kernel networking stack, with no other dependency.

The kernel is "memory resident", meaning it is always kept in RAM and can't be swapped out to disk the way a regular application can.

This means that when you run out of physical memory, applications get swapped out to disk, but the kernel stays where it is. When both physical memory and swap are full (and the system can no longer manage your programs), the machine falls over. However, because a) the kernel is still in memory and b) it can answer ping requests without help from anything else, the system keeps responding to ping even though everything else is dead.
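If you want to watch this happening live during one of the bursts, a minimal sketch using nothing beyond the stock `free` and `vmstat` tools would be:

```
# snapshot of physical and swap usage, in MB
free -m

# memory, swap-in/swap-out (si/so) and CPU stats every 5 seconds;
# sustained non-zero si/so columns mean the box is thrashing
vmstat 5
```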

In regard to your problem, I'd strongly suspect memory issues. Install "sysstat" and use the "sar" command to see a history of memory/cpu/load/io usage etc. I would expect that at the time of a crash you'd see both physical memory and swap at 100%.
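A rough sketch, assuming the stock CentOS 5 sysstat package (which samples every ten minutes via a cron job it ships with):

```
# install and enable the sadc data collector
yum install sysstat
chkconfig sysstat on              # the packaged cron job does the periodic sampling

# after the next incident, read the collected history back
sar -r                            # memory and swap usage for today
sar -q                            # run-queue length / load average
sar -u                            # CPU utilisation
sar -r -f /var/log/sa/saNN        # the same, for day NN of the month
```

If memory really is the culprit, the `sar -r` samples leading up to the crash should show %memused and %swpused both pinned near 100.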

I would also look at dmesg or /var/log/messages for any sign of the OOM-killer (out-of-memory killer) being invoked. This is the kernel's emergency mechanism, which starts killing processes when memory is exhausted. Its effectiveness depends largely on which processes get killed: a single process eating up the memory will be killed and the memory freed, whereas an apache-based website will spawn replacement processes as soon as a child process is killed.
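The exact wording of the OOM-killer's log entry varies between kernel versions, so it's worth grepping with a few patterns rather than a single keyword; something along these lines:

```
# the persistent log files survive a hard reboot; the dmesg ring buffer does not
grep -iE 'out of memory|oom-killer|killed process' /var/log/messages*

# only useful if the machine has NOT been rebooted since the incident
dmesg | grep -iE 'out of memory|oom'
```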

Coops
  • Thanks a lot, I'm almost sure this is the problem, as both the RAM and the swap were full prior to the server failure (I can see it in OVH's Manager stats). And it's probably some of my crazy php scripts using a lot of memory. It does puzzle me, however, for a couple of reasons: (1) it looks like the memory eaten up by php is not freed afterwards, which wouldn't make sense; (2) in any case, I wouldn't expect a proper operating system to die completely just because one (or even a few) processes use too much memory... I would expect it to – matteo Oct 21 '12 at 13:53
  • refuse to allocate memory to programs asking for it when there isn't enough RAM for the system to keep working correctly... I mean, a buggy or even malicious program should never be able to bring down the whole system... – matteo Oct 21 '12 at 13:53
  • 3
    @matteo Linux has what it calls "overcommit": just because you `malloc()` 1GB of ram doesn't actually mean you're going to use it, so the memory manager keeps track of how much memory your program thinks it has and how much memory the program has actually used, and it actually works well, most of the time. At least, until more than one program actually wants to use all of the 1GB it thinks it has. – DerfK Oct 21 '12 at 14:13
  • no sign of oom-killer in either dmesg or messages, btw - at least I grepped (case insensitive) for both oom and killer – matteo Oct 21 '12 at 16:34
  • @matteo The log message would appear as `Out of Memory: Killed process [PID] [process name].`, so grepping for `oom` or `killer` wouldn't find it. – Jonathan Callen Oct 21 '12 at 18:16
  • @matteo - more details of finding the log entry here: http://stackoverflow.com/questions/624857/finding-which-process-was-killed-by-linux-oom-killer – Coops Oct 21 '12 at 21:56
  • @JonathanCallen: thank you, but "grep -i kill /var/messages*" returned nothing. – matteo Oct 22 '12 at 06:25
  • So, I guess the OOM-killer never kicked in at all. On the one hand I'm curious why (why didn't it kill any process when there was a sudden spike in RAM usage to 100% and swap to 80%?), but that's just curiosity. On the other hand, is there something I could do so that, next time this happens, I'll be able (after the crash and reboot) to figure out WHAT process ate up so much memory? – matteo Oct 22 '12 at 08:33
  • 1
    @matteo I see *no* indication that this is an OOM issue. Typically, the OOM-killer will pick specific processes that meet certain criteria, but it wouldn't always kill a daemon like ssh. This is definitely on the I/O side. You didn't explain your hardware situation/specs as I requested in my answer. – ewwhite Oct 22 '12 at 12:22
  • @ewwhite if by "OOM issue" you mean "many processes like ssh, httpd and the like having been killed by the OOM-killer", then no, it is most surely not an OOM issue; but if by "OOM issue" you mean "my server ran out of memory", then I am absolutely certain it was an OOM issue, because I saw a graph with a spike in RAM usage from normal levels to 100%, and in swap from almost 0 to about 80%, within a matter of minutes (an "instant" at the resolution of the graph) – matteo Oct 22 '12 at 15:36
  • my server is a dedicated server at OVH; they have a thing they call the Manager on their site which remotely monitors your server. That's where I see the memory usage graph I'm talking about (which stopped collecting data when the server stalled, of course), and it's also where I (hard-)rebooted it from – matteo Oct 22 '12 at 15:42
5

Usually, it's an I/O or disk-subsystem issue. Often this is coupled with an extremely high system load average. For example, the system shown in the graph below became unresponsive (yet was still pingable) when a script ran awry, locked a bunch of files, and the load rose to 36... on a 4-CPU system.

[Graph: system load average climbing to ~36 on a 4-CPU system as it became unresponsive]

The services that are running in RAM and don't require disk access continue to run... Thus, the network stack (ping) stays up, but the other services stall as soon as disk access is required: SSH when a key is referenced or a password lookup is needed, for instance. SMTP tends to shut down when the load average hits 30 or so...

When the system is in this state, try a remote nmap against the server's IP to see what's up.
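For example, from another machine (the hostname and port list here are just placeholders):

```
# skip the ping check (-Pn) and probe a handful of service ports directly
nmap -Pn -p 22,25,53,80,443 server.example.com
```

Ports that still show as open mean the kernel is completing the TCP handshake even though the daemon behind them never answers; closed or filtered ports suggest the listening process itself is gone.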

Your logging probably doesn't work if this is a disk or storage issue...

Can you describe the hardware setup? Is this a virtual machine? What is the storage layout?

More than logging, you want to see if you can graph the system's performance and understand when this is happening. See if it correlates with a specific activity.
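If nothing is graphing the box already, one crude sketch is a cron entry that snapshots the load average and the biggest memory consumers every minute, so that after the next crash and reboot you can see what was growing just before the stall (the file and log names here are just examples):

```
# /etc/cron.d/perfsnap  (example name)
* * * * * root (date; cat /proc/loadavg; ps -eo pid,rss,args --sort=-rss | head -10) >> /var/log/perfsnap.log 2>&1
```

One entry per minute adds up, so prune or rotate the log once the culprit has been found.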

ewwhite
  • Supposing this is the issue, is there a way to tell SSH to keep the password(s) in memory, so that even if the server is in this state I can at least log into it via ssh and run some commands to see what's going on? – matteo Oct 21 '12 at 15:20
  • 1
    If it's I/O, you need to get to the bottom of the issue. If it's a disk array timeout or driver interaction, that's different from a script that executes poorly or a resource-contention issue. – ewwhite Oct 22 '12 at 12:24