Forensic Analysis of the OOM-Killer

Question

Ubuntu's Out-Of-Memory Killer wreaked havoc on my server, quietly assassinating my applications, sendmail, apache and others.

I've managed to learn what the OOM Killer is, and about its "badness" rules. While my machine is small, my applications are even smaller, and typically only half of my physical memory is in use, let alone swap-space, so I was surprised. I am trying to work out the culprit, but I don't know how to read the OOM-Killer logs.

Can anyone please point me to a tutorial on how to read the data in the logs (what are ve, free and gen?), or help me parse these logs?

Apr 20 20:03:27 EL135 kernel: kill_signal(13516.0): selecting to kill, queued 0, seq 1, exc 2326 0 goal 2326 0...
Apr 20 20:03:27 EL135 kernel: kill_signal(13516.0): task ebb0c6f0, thg d33a1b00, sig 1
Apr 20 20:03:27 EL135 kernel: kill_signal(13516.0): selected 1, signalled 1, queued 1, seq 1, exc 2326 0 red 61795 745
Apr 20 20:03:27 EL135 kernel: kill_signal(13516.0): selecting to kill, queued 0, seq 2, exc 122 0 goal 383 0...
Apr 20 20:03:27 EL135 kernel: kill_signal(13516.0): task ebb0c6f0, thg d33a1b00, sig 1
Apr 20 20:03:27 EL135 kernel: kill_signal(13516.0): selected 1, signalled 1, queued 1, seq 2, exc 383 0 red 61795 745
Apr 20 20:03:27 EL135 kernel: kill_signal(13516.0): task ebb0c6f0, thg d33a1b00, sig 2
Apr 20 20:03:27 EL135 kernel: OOM killed process watchdog (pid=14490, ve=13516) exited, free=43104 gen=24501.
Apr 20 20:03:27 EL135 kernel: OOM killed process tail (pid=4457, ve=13516) exited, free=43104 gen=24502.
Apr 20 20:03:27 EL135 kernel: OOM killed process ntpd (pid=10816, ve=13516) exited, free=43104 gen=24503.
Apr 20 20:03:27 EL135 kernel: OOM killed process tail (pid=27401, ve=13516) exited, free=43104 gen=24504.
Apr 20 20:03:27 EL135 kernel: OOM killed process tail (pid=29009, ve=13516) exited, free=43104 gen=24505.
Apr 20 20:03:27 EL135 kernel: OOM killed process apache2 (pid=10557, ve=13516) exited, free=49552 gen=24506.
Apr 20 20:03:27 EL135 kernel: OOM killed process apache2 (pid=24983, ve=13516) exited, free=53117 gen=24507.
Apr 20 20:03:27 EL135 kernel: OOM killed process apache2 (pid=29129, ve=13516) exited, free=68493 gen=24508.
Apr 20 20:03:27 EL135 kernel: OOM killed process sendmail-mta (pid=941, ve=13516) exited, free=68803 gen=24509.
Apr 20 20:03:27 EL135 kernel: OOM killed process tail (pid=12418, ve=13516) exited, free=69330 gen=24510.
Apr 20 20:03:27 EL135 kernel: OOM killed process python (pid=22953, ve=13516) exited, free=72275 gen=24511.
Apr 20 20:03:27 EL135 kernel: OOM killed process apache2 (pid=6624, ve=13516) exited, free=76398 gen=24512.
Apr 20 20:03:27 EL135 kernel: OOM killed process python (pid=23317, ve=13516) exited, free=94285 gen=24513.
Apr 20 20:03:27 EL135 kernel: OOM killed process tail (pid=29030, ve=13516) exited, free=95339 gen=24514.
Apr 20 20:03:28 EL135 kernel: OOM killed process apache2 (pid=20583, ve=13516) exited, free=101663 gen=24515.
Apr 20 20:03:28 EL135 kernel: OOM killed process logger (pid=12894, ve=13516) exited, free=101694 gen=24516.
Apr 20 20:03:28 EL135 kernel: OOM killed process bash (pid=21119, ve=13516) exited, free=101849 gen=24517.
Apr 20 20:03:28 EL135 kernel: OOM killed process atd (pid=991, ve=13516) exited, free=101880 gen=24518.
Apr 20 20:03:28 EL135 kernel: OOM killed process apache2 (pid=14649, ve=13516) exited, free=102748 gen=24519.
Apr 20 20:03:28 EL135 kernel: OOM killed process grep (pid=21375, ve=13516) exited, free=132167 gen=24520.
Apr 20 20:03:57 EL135 kernel: kill_signal(13516.0): selecting to kill, queued 0, seq 4, exc 4215 0 goal 4826 0...
Apr 20 20:03:57 EL135 kernel: kill_signal(13516.0): task ede29370, thg df98b880, sig 1
Apr 20 20:03:57 EL135 kernel: kill_signal(13516.0): selected 1, signalled 1, queued 1, seq 4, exc 4826 0 red 189481 331
Apr 20 20:03:57 EL135 kernel: kill_signal(13516.0): task ede29370, thg df98b880, sig 2
Apr 20 20:04:53 EL135 kernel: kill_signal(13516.0): selecting to kill, queued 0, seq 5, exc 3564 0 goal 3564 0...
Apr 20 20:04:53 EL135 kernel: kill_signal(13516.0): task c6c90110, thg cdb1a100, sig 1
Apr 20 20:04:53 EL135 kernel: kill_signal(13516.0): selected 1, signalled 1, queued 1, seq 5, exc 3564 0 red 189481 331
Apr 20 20:04:53 EL135 kernel: kill_signal(13516.0): task c6c90110, thg cdb1a100, sig 2
Apr 20 20:07:14 EL135 kernel: kill_signal(13516.0): selecting to kill, queued 0, seq 6, exc 8071 0 goal 8071 0...
Apr 20 20:07:14 EL135 kernel: kill_signal(13516.0): task d7294050, thg c03f42c0, sig 1
Apr 20 20:07:14 EL135 kernel: kill_signal(13516.0): selected 1, signalled 1, queued 1, seq 6, exc 8071 0 red 189481 331
Apr 20 20:07:14 EL135 kernel: kill_signal(13516.0): task d7294050, thg c03f42c0, sig 2

Watchdog is a watchdog task, that was idle; nothing in the logs to suggest it had done anything for days. Its job is to restart one of the applications if it dies, so a bit ironic that it is the first to get killed.

Tail was monitoring a few logs files. Unlikely to be consuming memory madly.

The apache web-server only serves pages to ~~a little old lady who only uses it to get to church on Sundays~~ a couple of developers who were in bed asleep, and hadn't visited a page on the site for a few weeks. The only traffic it might have had is from the port-scanners; all the content is password-protected and not linked from anywhere, so no spiders are interested.

Python is running two separate custom applications. Nothing in the logs to suggest they weren't humming along as normal. One of them was a relatively recent implementation, which makes suspect #1. It doesn't have any data-structures of any significance, and normally uses only about 8% of the total physical RAW. It hasn't misbehaved since.

The grep is suspect #2, and the one I want to be guilty, because it was a once-off command. The command (which piped the output of a grep -r to another grep) had been started at least 30 minutes earlier, and the fact it was still running is suspicious. However, I wouldn't have thought grep would ever use a significant amount of memory. It took a while for the OOM killer to get to it, which suggests it wasn't going mad, but the OOM killer stopped once it was killed, suggesting it may have been a memory-hog that finally satisfied the OOM killer's blood-lust.

possible duplicate of http://serverfault.com/questions/134669/how-to-diagnose-causes-of-oom-killer-killing-processes — Dennis Williamson, Apr 22 '10 at 03:04
I had looked at that question, and decided that it wasn't a duplicate. He's looking for general info about the OOM Killer, which I think I am past. I'm looking for specific information about how to read some of the log files - particularly what some parameters might mean. — Oddthinking, Apr 22 '10 at 05:16
Following thread might be useful for you. http://serverfault.com/questions/134669/ — vasco.debian, Apr 09 '13 at 13:49
While your apps might be small I would like to ask - do you have any trending statistics on your server? It might be useful to watch memory usage growth patterns. E.g. does the memory get used up all at once or slowly over a period of time? Does it always happen at the same time of day? Based on that you might have more clues as to where to look. A tool like [Munin](http://munin-monitoring.org/) is easy to get started with. — Friedrich 'Fred' Clausen, Aug 08 '13 at 08:53
@Fred: When you wipe away the flop-sweat of panic from this three-year old question, you'll see I wasn't asking for general instructions on how to debug, but explicitly for what clues I could garner from the OOM Killer logs. I believe the log file explains a lot about what happened on that foggy night, but there's no translations available to explain what it says without reading oodles of source code for the Linux core. — Oddthinking, Aug 08 '13 at 12:19
@Oddthinking Oops, next time I'll check when the question was asked. — Friedrich 'Fred' Clausen, Aug 08 '13 at 12:29
Install `atop`. It will write an overview of system activity every 10 minutes. From that, you can easily detect slowly growing processes, etc. It won't always catch a fast culprit, but it has helped me several times. — BraveNewCurrency, Sep 24 '13 at 01:50
Thanks for the tip, @BraveNewCurrency. Clearly, it doesn't address the question though. The OOM Killer provides a log of what it did, and it appears the knowledge to make sense of that log for debugging purposes is an arcane skill lost to modern society. — Oddthinking, Sep 24 '13 at 04:57
The OOM Killer log isn't really that helpful. The problem is that Linux over-commits memory. This means that a random process will get an error when it tries to **use** previously allocated memory. If you turn off overcommit, then you will get an error right when an application requests too much memory instead of much later and in a different process. — BraveNewCurrency, Sep 25 '13 at 19:14

score 1 · Answer 1 · answered Sep 24 '13 at 06:26

I'm new to ServerFault and just saw this post. It seems to have resurfaced near the front of the queue even though it is old. Let's put this scary one to bed maybe?

First of all, I have an interest in this topic as I am optimizing systems with limited RAM to run many user processes in a secure way.

It is my opinion that the error messages in this log are referring to OpenVZ Linux containers.

A "ve" is a virtual environment and also known as a container in OpenVZ. Each container is given an ID and the number you are seeing is that ID. More on this here:

https://openvz.org/Container

The term "free" refers to free memory in bytes at that moment in time. You can see the free memory gradually increasing as processes are killed.

The term "gen" I am a little unsure of. I believe this refers to generation. That is, it starts out at 1 and increases by one for every generation of a process in a container. So, for your system, it seems there were 24K+ processes executed since boot. Please correct me if I'm wrong. That should be easy to test.

As to why it killed processes, that's because of your OOM killer configuration. It's trying to bring the free memory back to the expected amount (which looks to be 128 Kb). Oracle has a good write-up of how-to configure this to something you might like better:

http://www.oracle.com/technetwork/articles/servers-storage-dev/oom-killer-1911807.html

Additionally, if you'd like to see the memory configuration for each of your containers, check this out:

https://openvz.org/Setting_UBC_parameters

Forensic Analysis of the OOM-Killer

1 Answers1