Why are applications in a memory-limited LXC container writing large files to disk being killed by the OOM?

Question

EDIT2: This problem appears to exist as well under 3.8.0-25-generic #37-Ubuntu SMP

EDIT: I modified the question from the original title of "Why would the Linux Out of Memory manager be triggered by writing to a file with dd?" to better reflect that I am worried about the general issue described below:

I'm running into a troublesome scenario where the OOM killer is hard killing processes in my LXC container when I write a file with size exceeding the memory limitation (set to 300MB). The problem does not occur when I run the application on a Xen virtual machine (an EC2 t1.micro) that actually only has 512 MB of RAM, so there appears to be some issue with the file buffering respecting the containers memory limit.

As a simple example, I can demonstrate how a large file written by dd will cause issues. Again, this issue plagues all applications. I am looking to solve the general problem of the application's cache getting too large; I understand how I can make the "dd" work.

Scenario:

I have an LXC container where memory.limit_in_bytes is set to 300 MB.

I attempt to dd a ~500 MB file as follows:

dd if=/dev/zero of=test2 bs=100k count=5010

Roughly 20% of the time, the Linux OOM manager is triggered by this command and a process is killed. Needless to say, this is highly unintended behavior; dd is meant to simulate an actual "useful" file write by a program running inside the container.

Details: While file caches get large (260 MB), rss and file map seem to stay quite low. Here's an example of what memory.stat may look like during the write:

cache 278667264
rss 20971520
mapped_file 24576
pgpgin 138147
pgpgout 64993
swap 0
pgfault 55054
pgmajfault 2
inactive_anon 10637312
active_anon 10342400
inactive_file 278339584
active_file 319488
unevictable 0
hierarchical_memory_limit 300003328
hierarchical_memsw_limit 300003328
total_cache 278667264
total_rss 20971520
total_mapped_file 24576
total_pgpgin 138147
total_pgpgout 64993
total_swap 0
total_pgfault 55054
total_pgmajfault 2
total_inactive_anon 10637312
total_active_anon 10342400
total_inactive_file 278339584
total_active_file 319488
total_unevictable 0

Here's a paste from dmesg where the OOM triggered a kill. I'm not too familiar with the distinctions between the types of memory; one thing that stands out is that while "Node 0 Normal" is very low, there is plenty of Node 0 DMA32 memory free. Can anyone explain why a file write is causing the OOM? How do I prevent this from happening?

The log:

[1801523.686755] Task in /lxc/c-7 killed as a result of limit of /lxc/c-7
[1801523.686758] memory: usage 292972kB, limit 292972kB, failcnt 39580
[1801523.686760] memory+swap: usage 292972kB, limit 292972kB, failcnt 0
[1801523.686762] Mem-Info:
[1801523.686764] Node 0 DMA per-cpu:
[1801523.686767] CPU    0: hi:    0, btch:   1 usd:   0
[1801523.686769] CPU    1: hi:    0, btch:   1 usd:   0
[1801523.686771] CPU    2: hi:    0, btch:   1 usd:   0
[1801523.686773] CPU    3: hi:    0, btch:   1 usd:   0
[1801523.686775] CPU    4: hi:    0, btch:   1 usd:   0
[1801523.686778] CPU    5: hi:    0, btch:   1 usd:   0
[1801523.686780] CPU    6: hi:    0, btch:   1 usd:   0
[1801523.686782] CPU    7: hi:    0, btch:   1 usd:   0
[1801523.686783] Node 0 DMA32 per-cpu:
[1801523.686786] CPU    0: hi:  186, btch:  31 usd: 158
[1801523.686788] CPU    1: hi:  186, btch:  31 usd: 114
[1801523.686790] CPU    2: hi:  186, btch:  31 usd: 133
[1801523.686792] CPU    3: hi:  186, btch:  31 usd:  69
[1801523.686794] CPU    4: hi:  186, btch:  31 usd:  70
[1801523.686796] CPU    5: hi:  186, btch:  31 usd: 131
[1801523.686798] CPU    6: hi:  186, btch:  31 usd: 169
[1801523.686800] CPU    7: hi:  186, btch:  31 usd:  30
[1801523.686802] Node 0 Normal per-cpu:
[1801523.686804] CPU    0: hi:  186, btch:  31 usd: 162
[1801523.686806] CPU    1: hi:  186, btch:  31 usd: 184
[1801523.686809] CPU    2: hi:  186, btch:  31 usd:  99
[1801523.686811] CPU    3: hi:  186, btch:  31 usd:  82
[1801523.686813] CPU    4: hi:  186, btch:  31 usd:  90
[1801523.686815] CPU    5: hi:  186, btch:  31 usd:  99
[1801523.686817] CPU    6: hi:  186, btch:  31 usd: 157
[1801523.686819] CPU    7: hi:  186, btch:  31 usd: 138
[1801523.686824] active_anon:60439 inactive_anon:28841 isolated_anon:0
[1801523.686825]  active_file:110417 inactive_file:907078 isolated_file:64
[1801523.686827]  unevictable:0 dirty:164722 writeback:1652 unstable:0
[1801523.686828]  free:445909 slab_reclaimable:176594
slab_unreclaimable:14754
[1801523.686829]  mapped:4753 shmem:66 pagetables:3600 bounce:0
[1801523.686831] Node 0 DMA free:7904kB min:8kB low:8kB high:12kB
active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB
unevictable:0kB isolated(anon):0kB isolated(file):0kB present:7648kB
mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB
slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB
unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0
all_unreclaimable? no
[1801523.686841] lowmem_reserve[]: 0 4016 7048 7048
[1801523.686845] Node 0 DMA32 free:1770072kB min:6116kB low:7644kB
high:9172kB active_anon:22312kB inactive_anon:12128kB active_file:4988kB
inactive_file:2190136kB unevictable:0kB isolated(anon):0kB
isolated(file):256kB present:4112640kB mlocked:0kB dirty:535072kB
writeback:6452kB mapped:4kB shmem:4kB slab_reclaimable:72888kB
slab_unreclaimable:1100kB kernel_stack:120kB pagetables:832kB unstable:0kB
bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[1801523.686855] lowmem_reserve[]: 0 0 3031 3031
[1801523.686859] Node 0 Normal free:5660kB min:4616kB low:5768kB
high:6924kB active_anon:219444kB inactive_anon:103236kB
active_file:436680kB inactive_file:1438176kB unevictable:0kB
isolated(anon):0kB isolated(file):0kB present:3104640kB mlocked:0kB
dirty:123816kB writeback:156kB mapped:19008kB shmem:260kB
slab_reclaimable:633488kB slab_unreclaimable:57916kB kernel_stack:2800kB
pagetables:13568kB unstable:0kB bounce:0kB writeback_tmp:0kB
pages_scanned:0 all_unreclaimable? no
[1801523.686869] lowmem_reserve[]: 0 0 0 0
[1801523.686873] Node 0 DMA: 2*4kB 3*8kB 0*16kB 2*32kB 4*64kB 3*128kB
2*256kB 1*512kB 2*1024kB 2*2048kB 0*4096kB = 7904kB
[1801523.686883] Node 0 DMA32: 129*4kB 87*8kB 86*16kB 89*32kB 87*64kB
65*128kB 12*256kB 5*512kB 2*1024kB 13*2048kB 419*4096kB = 1769852kB
[1801523.686893] Node 0 Normal: 477*4kB 23*8kB 1*16kB 5*32kB 0*64kB 3*128kB
3*256kB 1*512kB 0*1024kB 1*2048kB 0*4096kB = 5980kB
[1801523.686903] 1017542 total pagecache pages
[1801523.686905] 0 pages in swap cache
[1801523.686907] Swap cache stats: add 0, delete 0, find 0/0
[1801523.686908] Free swap  = 1048572kB
[1801523.686910] Total swap = 1048572kB
[1801523.722319] 1837040 pages RAM
[1801523.722322] 58337 pages reserved
[1801523.722323] 972948 pages shared
[1801523.722324] 406948 pages non-shared
[1801523.722326] [ pid ]   uid  tgid total_vm      rss cpu oom_adj
oom_score_adj name
[1801523.722396] [31266]     0 31266     6404      511   6       0
    0 init
[1801523.722445] [32489]     0 32489    12370      688   7     -17
-1000 sshd
[1801523.722460] [32511]   101 32511    10513      325   0       0
    0 rsyslogd
[1801523.722495] [32625]     0 32625    17706      838   2       0
    0 sshd
[1801523.722522] [32652]   103 32652     5900      176   0       0
    0 dbus-daemon
[1801523.722583] [  526]     0   526     1553      168   5       0
    0 getty
[1801523.722587] [  530]     0   530     1553      168   1       0
    0 getty
[1801523.722593] [  537]  2007   537    17706      423   5       0
    0 sshd
[1801523.722629] [  538]  2007   538    16974     5191   1       0
    0 python
[1801523.722650] [  877]  2007   877     2106      157   7       0
    0 dd
[1801523.722657] Memory cgroup out of memory: Kill process 538 (python)
score 71 or sacrifice child
[1801523.722674] Killed process 538 (python) total-vm:67896kB,
anon-rss:17464kB, file-rss:3300kB

I'm running on Linux ip-10-8-139-98 3.2.0-29-virtual #46-Ubuntu SMP Fri Jul 27 17:23:50 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux on Amazon EC2.

As a quick summary to all who read it, this is a linux kernel bug — UsAaR33, Jun 25 '13 at 20:49

alexphilipp · Accepted Answer · 2013-06-20T16:27:39.850

13

Edit: I'll keep my original answer below, but I'll try to explain what's happening here and provide a general solution for you.

Edit 2: Provided another option.

The problem that you're hitting here has to do with how the kernel manages I/O. When you make a write to your filesystem, that write isn't immediately committed to disk; that would be incredibly inefficient. Instead, writes are cached in an area of memory referred to as the page cache, and periodically written in chunks out to disk. The "dirty" section of your log describes the size of this page cache that hasn't been written out to disk yet:

dirty:123816kB

So what empties this dirty cache? Why isn't it doing it's job?

'Flush' on Linux is responsible for writing dirty pages out to disk. It's a daemon that wakes up periodically to determine if writes to disk are required, and, if so, performs them. If you are a C type of guy, start here. Flush is incredibly efficient; it does a great job of flushing stuff to disk when needed. And it's working exactly how it is supposed to.

Flush runs outside of your LXC container, since your LXC container doesn't have its own kernel. LXC containers exist as a construct around cgroups, which is a feature of the Linux kernel that allows better limitations and isolation of process groups, but not its own kernel or flush daemon.

Since your LXC has a memory limit lower than the memory the kernel has available, weird things happen. Flush assumes it has the full memory of the host to cache writes in. A program in your LXC starts to write a big file, it buffers...buffers...and eventually hits it's hard limit, and starts calling the OOM manager. This isn't a failure of any particular component; it's expected behavior. Kind of. This sort of thing should be handled by cgroups, but it doesn't seem like it is.

This completely explains the behavior you see between instance sizes. You'll start flushing to disk much sooner on the micro instance (with 512MB RAM) vs on a large instance

Ok, that makes sense. But it's useless. I still need to write me a big-ass file.

Well, flush isn't aware of your LXC limit. So instead of patching the kernel, there are a few options here for things you can try to tweak:

/proc/sys/vm/dirty_expire_centiseconds

This controls how long a page can be held in the dirty cache and written to disk. By default it's 30 seconds; try setting it lower to start pushing it out faster.

/proc/sys/vm/dirty_background_ratio

This controls what percentage of active memory flush is allowed to fill up before it starts forcing writes. There is a bit of fiddling that goes into sorting out the exact total here, but the easiest explanation is to just look at your total memory. By default it's 10% (on some distros it's 5%). Set this lower; it'll force writes out to disk sooner and may keep your LXC from running out of it's limits.

Can't I just screw with the filesystem a bit?

Well, yeah. But make sure you test this out.. you could affect performance. On your mounts in /etc/fstab where you'll be writing this to, add the 'sync' mount option.

Original answer:

Try reducing the blocksize used by DD:
dd if=/dev/zero of=test2 bs=512 count=1024000
You can only write one sector at a time (512 bytes on older HDDs, 4096 on newer). If DD is pushing writes to disk faster than the disk can accept them, it will start caching the writes in memory. That's why your file cache is growing.

edited Jun 20 '13 at 16:27

answered Jun 15 '13 at 19:08

alexphilipp

708
3
11

I should note that if I run similar tests in python where I manually flush the file object, the error still occurs with similar probability. The cache grows of course, but that should be getting purged one would think rather than the process being killed. – UsAaR33 Jun 15 '13 at 19:11
1

I'd give it a shot anyway. I've found forcing an fsync() with Python doesn't always do what you expect. – alexphilipp Jun 15 '13 at 19:12
Just gave it a test; 10% of processes are still crashing. – UsAaR33 Jun 15 '13 at 19:16
That's an improvement at least. Drop the block size to 256; EBS is unpredictable. Are you using provisioned IOPS? – alexphilipp Jun 15 '13 at 19:42
blocksize of 256 has a 7% crash rate. 128 brought it down to 0%. The thing though is that arbitrary programs (in python and otherwise) need to be able to run. How do I make this work in the general case? – UsAaR33 Jun 15 '13 at 20:41
1

@UsAaR33 Get faster disk. – tink Jun 15 '13 at 20:46
@UsAaR33 You could add more RAM to allow for more caching. The kernel should be plenty capable of managing the write on its own, but DD can force certain suboptimal parameters (in this case, block size). Are you seeing issues when you write a file of this size in Python, or is the DD causing the Python process to crash? – alexphilipp Jun 15 '13 at 21:30
@tink/@alex_sf: Cannot change parameters of the system. I am seeing issues in general on the system when applications write files that are larger than memory size (again, I am showing dd as it is a very simple example). Regardless, I don't understand why this triggers OOM issues. If there is no more cache memory, why is the application just not waiting for the disk to catch up, rather than crashing out? – UsAaR33 Jun 15 '13 at 21:32
1

@UsAaR33 An application will write as quickly as it can; it expects the kernel to handle IO. I haven't used an LXC container before, but from a cursory glance it looks like it does not provide it's own kernel in the chroot it creates? If that's the case, the kernel is providing IO with the assumption that it has the full memory of the host system available. It has no idea that you rate limited it to 300MB. Once that limit is hit, OOM starts killing processes. – alexphilipp Jun 15 '13 at 21:46
1

@UsAaR33: Bad settings cause bad results. One part of the system is told lots of memory can be used as a cache, another part of the system is told to kill processes if the cache is too big. Why should it wait for the disk when there's plenty of RAM available? And if there's plenty of RAM available, why not let it use it? – David Schwartz Jun 16 '13 at 03:28
@david/alex: I should point out that I'm using LXC effectively to have multiple "virtual machines" running on my system. I can run this test perfectly with a 300 MB RAM Xen VM (a t1.micro), but it fails when I partition a c1.xlarge instance into multiple 300 MB RAM LXC containers. This seems very wrong - LXC is supposed to some level of virtualization - the application/kernel/etc. managing I/O should be aware of the container's memory limit. I don't know if I have a faulty setting or this is an LXC bug, but it is very problematic. – UsAaR33 Jun 16 '13 at 19:35
micro instances have 512MB of RAM, so the kernel wouldn't allow a file cache to grow to 300MB. It *would* allow it on a c1.xlarge(7GB RAM). This could easily be a bug in LXC. – alexphilipp Jun 17 '13 at 02:09
LXC is claiming it may be a bug with the underlying cgroup memory management. Afaik, no patch since 3.8.0 has addressed this issue; as it is failing even in ubuntu 13.04, I'm considering this an open issue (or I'm just doing something dumb). – UsAaR33 Jun 20 '13 at 01:59
@UsAaR33 I agree that it's a problem with cgroups. I've updated my answer above. – alexphilipp Jun 20 '13 at 02:30
Thanks for the detailed response. Unfortunately, tweaking the dirty parameters to extremes (dirty_background_ratio of 1 and dirty_expire_centiseconds at 300, things still crash out (quite strange given that the memory available for dirty is now ~50% of the quotas I've set). sync may work, but it slows down my system performance so much that it is unusable. I'll be filing a bug on the linux kernel. If no one else gives a solution that works out of the box, you'll be getting the bounty credit. – UsAaR33 Jun 25 '13 at 05:15
That's unfortunate. It sounds like you're hitting the limits of standard EBS. The kernel is trying to flush, but it just can't get the writes out fast enough. Go for provisioned IOPS and you should be able to make it happen. – alexphilipp Jun 25 '13 at 07:08
Oh, I'm writing to instance storage. /shrug. Anyway, this has been confirmed as a bug on the linux kernel mailing list, so I'm assigning you the bounty. Thanks! – UsAaR33 Jun 25 '13 at 20:45

score 3 · Answer 2 · answered Jun 15 '13 at 17:38

3

Is your file writing to /tmp? If so, it might not be on an actual filesystem but resident on disk. Thus as you write to it, more and more memory is taken away to meed the needs of the file. Eventually, you run out of memory+swap space and your performance deteriorates to the point of utter frustration.

answered Jun 15 '13 at 17:38

mdpc

11,698
28
51
65

It is writing to $HOME, which is on an AUFS mount which triggers writes to the underlying disk. (EC2 EBS) – UsAaR33 Jun 15 '13 at 19:10

score 2 · Answer 3 · answered Jun 16 '13 at 06:42

2

unless you are writing to RAM disk, you can avoid caching by using oflag=direct

dd if=/dev/zero of=test2 bs=100k oflag=direct count=5010

answered Jun 16 '13 at 06:42

Kevin Parker

757
1
13
29

direct causes an "Invalid argument" error, but using oflag=dsync does work. – UsAaR33 Jun 16 '13 at 19:18
i am sorry if it didn't work for you, as per man page "direct use direct I/O for data" – Kevin Parker Jun 17 '13 at 07:31

Why are applications in a memory-limited LXC container writing large files to disk being killed by the OOM?

3 Answers3