
A few days ago, my CentOS 6.2 web server with ISPConfig 3 went down to the extent that I wasn't able to log in via SSH or the console. The console was full of messages along the lines of "Out of memory: kill process ... or sacrifice child". The login prompt over SSH appeared after a minute of waiting, the password prompt after another minute or two, and so on. The system was clearly heavily overloaded. I wasn't able to restart it cleanly, so I hard-reset it. I thought it was some isolated failure, but the same situation repeated a few hours ago. It is a production server and I couldn't afford to experiment, so I just increased the RAM (it is a Hyper-V virtual machine) from 1 GB to 2 GB and restarted it once more. It has now been running fine for about two days. The next day the same situation repeated on another, similar machine running CentOS 6.3. I just restarted that one without increasing RAM and it is running fine now.

I'm not sure what it was, why it occurred, or how to avoid it. It seems to me that too much memory was allocated, so the system started paging everything in and out, which dropped performance to the point of virtually stopping the machine. This is a sar log from the second machine:

12:03:14 AM     CPU     %user     %nice   %system   %iowait    %steal     %idle
06:40:52 AM     all      0.10      0.00      1.59     98.31      0.00      0.00
07:37:29 AM     all      0.09      0.00      1.37     98.54      0.00      0.00
09:51:37 AM     all      0.07      0.00      1.34     98.59      0.00      0.00
11:01:13 AM     all      0.05      0.00      1.35     98.61      0.00      0.00
12:57:39 PM     all      0.09      0.00      1.60     98.31      0.00      0.00

Is it possible it was some kind of DoS attack? Both machines have numerically consecutive IP addresses, so maybe it's something that takes addresses one by one? Does it point to some weakness in my security setup? Is there any way I can tell more precisely what happened and why?

The biggest surprise was that I wasn't able to log in and operate the system at all. Is Linux supposed to do this? Or does it mean my configuration is somehow wrong? Do I need some setting to prevent any single process from eating too much memory? Is this something that can simply happen, or does it mean I have mangled the installation?
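Once a system like this is reachable again, OOM-killer activity can usually be confirmed from the kernel log. A minimal check, assuming the default CentOS log location (`/var/log/messages`) and that the relevant entries haven't rotated away yet:

```shell
# Look for OOM-killer activity in the kernel log (CentOS writes kernel
# messages to /var/log/messages by default; adjust the path if needed).
grep -iE "out of memory|oom-killer|sacrifice child" /var/log/messages

# The same messages stay in the kernel ring buffer until the next reboot:
dmesg | grep -i "out of memory"
```

Each hit names the process the kernel killed and its memory usage at the time, which helps identify the runaway process.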

EDIT - more information about setup:

Both machines have the following installed: ISPConfig 3, Apache, MySQL, PHP, Postfix, Courier, PureFTPd and BIND (the default ISPConfig installation on CentOS). They act as web servers with quite a low load - the second machine, from which the sar excerpt above is taken, served 8,000 files on the day it happened.

The sar log above is the excerpt for the period during which the problem occurred. Immediately after the reboot it reverted to normal operation, which looks like this (current sar log; iotop now shows nearly zero reads and zero writes):

07:20:01 AM     CPU     %user     %nice   %system   %iowait    %steal     %idle
07:30:01 AM     all      1.15      0.00      0.27      0.38      0.00     98.20
07:40:01 AM     all      0.96      0.00      0.23      0.24      0.00     98.57
07:50:01 AM     all      1.71      0.00      0.37      1.86      0.00     96.07

According to the Apache logs, there was no unusual load or exceptional request count. I found just this one unusual line in the error log:

[Tue Feb 19 05:39:30 2013] [error] server reached MaxClients setting, consider raising the MaxClients setting

This seems to be the root of the problem, doesn't it?
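The `MaxClients` message fits the symptoms: with mod_php under the prefork MPM, each Apache child can easily use 20-50 MB of resident memory, so the usual default cap of 256 children can demand far more RAM than a 1-2 GB VM has, pushing it deep into swap. A hedged sizing sketch for a small VM follows - the specific numbers are assumptions and should be derived from your own per-child memory usage (for example, the RES column of `top` for `httpd` processes):

```apache
# /etc/httpd/conf/httpd.conf (CentOS) - prefork MPM sizing sketch.
# Rule of thumb: MaxClients ~= (RAM available for Apache) / (RES per child).
# With roughly 1 GB free for Apache and ~40 MB per child, that is ~25 clients.
<IfModule prefork.c>
    StartServers          5
    MinSpareServers       5
    MaxSpareServers      10
    ServerLimit          25
    MaxClients           25
    MaxRequestsPerChild 4000   # recycle children to limit memory growth/leaks
</IfModule>
```

With a cap like this, excess requests queue (up to `ListenBacklog`) instead of spawning children the machine cannot hold in RAM, so an overload degrades gracefully rather than driving the box into swap.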

Jan Svab
    Once you run out of physical memory Linux starts dipping into disk-based swap space which is several orders of magnitude slower than RAM. If your disk is already busy with normal IO the whole thing becomes a massive clustercuss like you've described. So yes, this is expected behaviour for a system on its knees. However, you need to figure out what exactly is causing this. Start by reading logs. – Sammitch Feb 20 '13 at 21:07
    Your `%iowait` is high. Try running `iotop`, you might need to install it, and see which processes have high I/O usage. – Daniel t. Feb 20 '13 at 21:37
  • To make an educated guess as to what's happening it would be necessary to a) look at memory consumption BEFORE this occurs (e.g., run top in batch mode, redirect output to a remotely mounted file-system so you can get to it at any time), b) to learn more about your set-up (what software is installed, what's the machines purpose in life?), c) have you go through the relevant logs (syslog, app logs, ... ). – tink Feb 20 '13 at 22:08
  • Ok, so that's the answer to the second part - "yes, it is normal behavior, you cannot log in even via console when the server is overloaded". I'm adding more information to the question. – Jan Svab Feb 21 '13 at 07:44
  • You should check with sar to extract also memory statistics. Perhaps `sar -A` is a bit overkill but should bring us enough information to find what exactly happened, though it won't tell us the process(es) which went berserk. – Huygens Feb 27 '13 at 20:57
  • Unfortunately, I hadn't run sar -A while the problem was occurring, and now it isn't in the log anymore. Or am I wrong and the sar log doesn't rotate? Can I instruct sar to produce output for a given time and date? – Jan Svab Feb 28 '13 at 14:03
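Regarding that last comment: sysstat keeps one binary data file per day under `/var/log/sa/` (named `saDD`, where DD is the day of the month), retained for a limited number of days (the `HISTORY` setting in `/etc/sysconfig/sysstat`, commonly 7 or 28). As long as that day's file still exists, sar can replay it for a chosen time window. A sketch, assuming the incident happened on the 19th between 05:00 and 08:00 (file name and times are placeholders):

```shell
# Replay historical statistics from the binary file for the 19th,
# restricted to the 05:00-08:00 window:
sar -r -f /var/log/sa/sa19 -s 05:00:00 -e 08:00:00   # memory utilization
sar -W -f /var/log/sa/sa19 -s 05:00:00 -e 08:00:00   # swapping (pswpin/s, pswpout/s)
sar -q -f /var/log/sa/sa19 -s 05:00:00 -e 08:00:00   # run queue and load average
```

Sustained nonzero pswpout/s during the incident window would confirm the paging theory suggested by the ~98% %iowait figures above.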

1 Answer


OK, no more hints or answers, so the question probably isn't specific enough. I'll take "web server overload" as the conclusion. However, I still don't know why it occurred, or whether it might have been some sort of DoS attack. Perhaps it is simply a property of Apache that it eventually exhausts its MaxClients limit when it has been up for too long? Nevertheless, the solution "restart the server, increase RAM" is probably something I can live with (or at least will have to).

Jan Svab