
Forgive me for the length of this question... it is mostly details... only attempt to follow if you also enjoy reading log files... or drinking coffee.

I'll state the questions first:

1) How the heck did a nano process fire off, given what I've stated below?

2) How did nano manage to consume so much memory?

3) Working with ossec restarts surely isn't a coincidence, so is that related?

This is a Red Hat 4.1.2-46 Xen environment with three cluster members. We updated our Hurricane monitoring code manually on Jan 17 at 11:34 am. Two files were changed (using nano) while ossec was running:

preloaded-vars.conf
ossec.conf

ossec was then restarted, and the root user logged off.
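
For reference, a rough sketch of that session, assuming the stock ossec-control script that ships with OSSEC (the ossec.conf path is taken from the ps output below; the preloaded-vars.conf path is my assumption):

# nano /opt/ossec/etc/preloaded-vars.conf
# nano /opt/ossec/etc/ossec.conf
# /opt/ossec/bin/ossec-control restart
# exit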

Unfortunately, the three servers went offline (ssh still worked) because a nano process ran away (I imagine the same thing would have happened had I used vi, so the choice of editor is not in question). Oddly, no cron job started nano, no one was logged into the servers at the time, and I'm sure I properly closed out of nano. Before I killed the PID, top provided the following insight:

Mem:  28359680k total, 28325064k used,    34616k free,     3424k buffers
Swap:  4194296k total,  4194296k used,        0k free,    70208k cached
PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
26351 root      18   0 29.7g  25g  784 R 100.1 95.6   4424:38 nano

Note: the nano process took up ~28 GB of RAM.

It took just over three days for this to take our servers down. I found something else before I killed the process: the nano process began two hours after the file was first edited and root logged off. Also notice that the tty = ?.

# ps -ef | grep nano
root      7836  7689  0 13:19 pts/5    00:00:00 grep nano
root     26351     1 99 Jan17 ?        3-01:44:46 nano /opt/ossec/etc/ossec.conf

Thankfully, after I killed the PID, I had:

Mem:  28359680k total,  1189924k used, 27169756k free,     4584k buffers
Swap:  4194296k total,   260284k used,  3934012k free,   104352k cached

I first expected to find that the process status would be stopped or traced, but it was running (see the R state flag just before the %CPU stat).
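
For anyone retracing this, the state code can also be read straight from ps (R = running/runnable, T = stopped or traced):

# ps -o pid,stat,etime,cmd -p 26351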

Additional notes: The preloaded-vars.conf file was extracted from a .tar file (hence the 1000:1000 ownership). It was edited by root. The .save file was created when I killed nano (and it's smaller than the main file). On two of the Xen servers nano was stuck editing preloaded-vars.conf, and on the third it was stuck editing ossec.conf. No ossec.conf.save was created when nano was killed.

-rwxr-xr-x  1 1000 1000  2918 Jan 17 11:04 preloaded-vars.conf
-rw-------  1 root root  2909 Jan 20 13:13 preloaded-vars.conf.save

Further findings: I've discovered that if I open preloaded-vars.conf and then kill the PID from another terminal, nano's default behavior is to write a preloaded-vars.conf.save file when it receives a SIGHUP or SIGTERM. I still don't understand what caused it to go off the rails to begin with.
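
A minimal way to reproduce that, assuming pgrep is available (otherwise ps -ef | grep nano works): open the file in nano in one terminal, then from a second terminal run:

# kill -TERM $(pgrep -x nano)
# ls -l /opt/ossec/etc/preloaded-vars.conf.save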

Patrick R
  • I'm starting to think that our Xen environment was partly to blame. No real proof yet. I'm just amazed that an editor was able to use 28 GB of RAM when the save file was 2909 bytes. – Patrick R Jan 24 '11 at 13:37

1 Answer


Well, the answer to (2) is probably "You don't have any resource limits configured" - check out ulimit to solve that one.
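
For example, to see the current limits and sketch a cap on a process's address space (the 8 GB value is purely illustrative, not a recommendation):

# ulimit -a
# ulimit -v 8388608

To make such a cap persistent, assuming the usual pam_limits setup on RHEL, an /etc/security/limits.conf entry along these lines would do it:

*    hard    as    8388608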

No clue on the others though.

voretaq7
  • +1 I've been in this new environment for about a month. I've checked a dozen machines so far, and ulimit reports unlimited on every one of them. – Patrick R Jan 21 '11 at 13:29
  • The other options show various defaults. I'll spend some time getting adjustments approved. – Patrick R Jan 21 '11 at 13:37