14

We've been running into the OOM killer despite nearly half of our memory being used for the FS cache. We've been logging memory stats once per minute (as reported by top), yet there seems to be plenty of memory available.

...

Mem:  15339640k total, 15268304k used,    71336k free,     3152k buffers
Swap:        0k total,        0k used,        0k free,  6608384k cached

Mem:  15339640k total, 14855280k used,   484360k free,    13748k buffers
Swap:        0k total,        0k used,        0k free,  6481852k cached

[OOM killer: postgres killed]

Mem:  15339640k total,  8212200k used,  7127440k free,    32776k buffers
Swap:        0k total,        0k used,        0k free,  2394444k cached

...

OOM details from syslog:

...
Jun 10 05:45:25 db kernel: [11209156.840462] wal-e invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0, oom_score_adj=0
Jun 10 05:45:25 db kernel: [11209156.840469] wal-e cpuset=/ mems_allowed=0
Jun 10 05:45:25 db kernel: [11209156.840474] Pid: 7963, comm: wal-e Not tainted 3.2.0-43-virtual #68-Ubuntu
Jun 10 05:45:25 db kernel: [11209156.840477] Call Trace:
Jun 10 05:45:25 db kernel: [11209156.840498]  [<ffffffff81119711>] dump_header+0x91/0xe0
Jun 10 05:45:25 db kernel: [11209156.840502]  [<ffffffff81119a95>] oom_kill_process+0x85/0xb0
Jun 10 05:45:25 db kernel: [11209156.840506]  [<ffffffff81119e3a>] out_of_memory+0xfa/0x220
Jun 10 05:45:25 db kernel: [11209156.840511]  [<ffffffff8111f823>] __alloc_pages_nodemask+0x8c3/0x8e0
Jun 10 05:45:25 db kernel: [11209156.840520]  [<ffffffff81216e00>] ? noalloc_get_block_write+0x30/0x30
Jun 10 05:45:25 db kernel: [11209156.840528]  [<ffffffff811566c6>] alloc_pages_current+0xb6/0x120
Jun 10 05:45:25 db kernel: [11209156.840534]  [<ffffffff81116637>] __page_cache_alloc+0xb7/0xd0
Jun 10 05:45:25 db kernel: [11209156.840539]  [<ffffffff81118602>] filemap_fault+0x212/0x3c0
Jun 10 05:45:25 db kernel: [11209156.840553]  [<ffffffff81138c32>] __do_fault+0x72/0x550
Jun 10 05:45:25 db kernel: [11209156.840557]  [<ffffffff8113c2ea>] handle_pte_fault+0xfa/0x200
Jun 10 05:45:25 db kernel: [11209156.840562]  [<ffffffff8100638e>] ? xen_pmd_val+0xe/0x10
Jun 10 05:45:25 db kernel: [11209156.840567]  [<ffffffff81005309>] ? __raw_callee_save_xen_pmd_val+0x11/0x1e
Jun 10 05:45:25 db kernel: [11209156.840571]  [<ffffffff8113d559>] handle_mm_fault+0x269/0x370
Jun 10 05:45:25 db kernel: [11209156.840576]  [<ffffffff8100a56d>] ? xen_force_evtchn_callback+0xd/0x10
Jun 10 05:45:25 db kernel: [11209156.840581]  [<ffffffff8100ad42>] ? check_events+0x12/0x20
Jun 10 05:45:25 db kernel: [11209156.840589]  [<ffffffff8165b3cb>] do_page_fault+0x14b/0x520
Jun 10 05:45:25 db kernel: [11209156.840594]  [<ffffffff81160d64>] ? kmem_cache_free+0x104/0x110
Jun 10 05:45:25 db kernel: [11209156.840600]  [<ffffffff811ba2c8>] ? ep_remove+0xa8/0xc0
Jun 10 05:45:25 db kernel: [11209156.840604]  [<ffffffff811bb133>] ? sys_epoll_ctl+0xb3/0x3d0
Jun 10 05:45:25 db kernel: [11209156.840614]  [<ffffffff81658035>] page_fault+0x25/0x30
Jun 10 05:45:25 db kernel: [11209156.840617] Mem-Info:
Jun 10 05:45:25 db kernel: [11209156.840618] Node 0 DMA per-cpu:
Jun 10 05:45:25 db kernel: [11209156.840622] CPU    0: hi:    0, btch:   1 usd:   0
Jun 10 05:45:25 db kernel: [11209156.840624] CPU    1: hi:    0, btch:   1 usd:   0
Jun 10 05:45:25 db kernel: [11209156.840627] CPU    2: hi:    0, btch:   1 usd:   0
Jun 10 05:45:25 db kernel: [11209156.840629] CPU    3: hi:    0, btch:   1 usd:   0
Jun 10 05:45:25 db kernel: [11209156.840631] Node 0 DMA32 per-cpu:
Jun 10 05:45:25 db kernel: [11209156.840634] CPU    0: hi:  186, btch:  31 usd:  30
Jun 10 05:45:25 db kernel: [11209156.840637] CPU    1: hi:  186, btch:  31 usd:  47
Jun 10 05:45:25 db kernel: [11209156.840639] CPU    2: hi:  186, btch:  31 usd:  15
Jun 10 05:45:25 db kernel: [11209156.840641] CPU    3: hi:  186, btch:  31 usd:   2
Jun 10 05:45:25 db kernel: [11209156.840643] Node 0 Normal per-cpu:
Jun 10 05:45:25 db kernel: [11209156.840646] CPU    0: hi:  186, btch:  31 usd:   0
Jun 10 05:45:25 db kernel: [11209156.840648] CPU    1: hi:  186, btch:  31 usd:  14
Jun 10 05:45:25 db kernel: [11209156.840650] CPU    2: hi:  186, btch:  31 usd:   0
Jun 10 05:45:25 db kernel: [11209156.840653] CPU    3: hi:  186, btch:  31 usd:   1
Jun 10 05:45:25 db kernel: [11209156.840658] active_anon:3616567 inactive_anon:4798 isolated_anon:0
Jun 10 05:45:25 db kernel: [11209156.840660]  active_file:98 inactive_file:168 isolated_file:20
Jun 10 05:45:25 db kernel: [11209156.840661]  unevictable:1597 dirty:73 writeback:0 unstable:0
Jun 10 05:45:25 db kernel: [11209156.840662]  free:16921 slab_reclaimable:17631 slab_unreclaimable:7534
Jun 10 05:45:25 db kernel: [11209156.840663]  mapped:1614529 shmem:1613928 pagetables:124012 bounce:0
Jun 10 05:45:25 db kernel: [11209156.840666] Node 0 DMA free:7888kB min:4kB low:4kB high:4kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:7632kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
Jun 10 05:45:25 db kernel: [11209156.840681] lowmem_reserve[]: 0 4016 15112 15112
Jun 10 05:45:25 db kernel: [11209156.840686] Node 0 DMA32 free:48368kB min:4176kB low:5220kB high:6264kB active_anon:3776804kB inactive_anon:28kB active_file:0kB inactive_file:20kB unevictable:932kB isolated(anon):0kB isolated(file):0kB present:4112640kB mlocked:932kB dirty:0kB writeback:0kB mapped:1458536kB shmem:1458632kB slab_reclaimable:17604kB slab_unreclaimable:8088kB kernel_stack:1872kB pagetables:190616kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:437 all_unreclaimable? yes
Jun 10 05:45:25 db kernel: [11209156.840698] lowmem_reserve[]: 0 0 11095 11095
Jun 10 05:45:25 db kernel: [11209156.840703] Node 0 Normal free:11428kB min:11548kB low:14432kB high:17320kB active_anon:10689464kB inactive_anon:19164kB active_file:528kB inactive_file:652kB unevictable:5456kB isolated(anon):0kB isolated(file):80kB present:11362176kB mlocked:5456kB dirty:292kB writeback:0kB mapped:4999580kB shmem:4997080kB slab_reclaimable:52920kB slab_unreclaimable:22048kB kernel_stack:2584kB pagetables:305432kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:1974 all_unreclaimable? yes
Jun 10 05:45:25 db kernel: [11209156.840715] lowmem_reserve[]: 0 0 0 0
Jun 10 05:45:25 db kernel: [11209156.840720] Node 0 DMA: 2*4kB 3*8kB 1*16kB 3*32kB 3*64kB 3*128kB 2*256kB 1*512kB 2*1024kB 2*2048kB 0*4096kB = 7888kB
Jun 10 05:45:25 db kernel: [11209156.840752] Node 0 DMA32: 5813*4kB 2636*8kB 114*16kB 15*32kB 5*64kB 1*128kB 1*256kB 0*512kB 1*1024kB 0*2048kB 0*4096kB = 48372kB
Jun 10 05:45:25 db kernel: [11209156.840776] Node 0 Normal: 1888*4kB 10*8kB 46*16kB 4*32kB 3*64kB 2*128kB 1*256kB 1*512kB 0*1024kB 1*2048kB 0*4096kB = 11760kB
Jun 10 05:45:25 db kernel: [11209156.840788] 1615243 total pagecache pages
Jun 10 05:45:25 db kernel: [11209156.840790] 0 pages in swap cache
Jun 10 05:45:25 db kernel: [11209156.840801] Swap cache stats: add 0, delete 0, find 0/0
Jun 10 05:45:25 db kernel: [11209156.840803] Free swap  = 0kB
Jun 10 05:45:25 db kernel: [11209156.840805] Total swap = 0kB
Jun 10 05:45:25 db kernel: [11209156.909794] 3934192 pages RAM
Jun 10 05:45:25 db kernel: [11209156.909804] 99282 pages reserved
Jun 10 05:45:25 db kernel: [11209156.909809] 18899146 pages shared
Jun 10 05:45:25 db kernel: [11209156.909811] 2198511 pages non-shared
Jun 10 05:45:25 db kernel: [11209156.909817] [ pid ]   uid  tgid total_vm      rss cpu oom_adj oom_score_adj name
Jun 10 05:45:25 db kernel: [11209156.909835] [  332]     0   332     4308      109   1       0             0 upstart-udev-br
Jun 10 05:45:25 db kernel: [11209156.909845] [  346]     0   346     5384      271   2     -17         -1000 udevd
Jun 10 05:45:25 db kernel: [11209156.909851] [  408]     0   408     5364      174   2     -17         -1000 udevd
...
Jun 10 05:45:25 db kernel: [11209156.910703] [ 7963]   111  7963    17456     2966   0       0             0 wal-e
Jun 10 05:45:25 db kernel: [11209156.910707] [ 7968]   111  7968  1639372     2351   3       0             0 postgres
Jun 10 05:45:25 db kernel: [11209156.910711] [ 7969]   111  7969  1639371     1934   2       0             0 postgres
Jun 10 05:45:25 db kernel: [11209156.910716] Out of memory: Kill process 12443 (postgres) score 418 or sacrifice child
Jun 10 05:45:25 db kernel: [11209156.910733] Killed process 12443 (postgres) total-vm:6555152kB, anon-rss:4600kB, file-rss:6396572kB
Jun 10 05:45:30 db kernel: [11209159.293083] postgres invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0, oom_score_adj=0
Jun 10 05:45:31 db kernel: [11209159.293091] postgres cpuset=/ mems_allowed=0
Jun 10 05:45:31 db kernel: [11209159.293095] Pid: 6508, comm: postgres Not tainted 3.2.0-43-virtual #68-Ubuntu
Jun 10 05:45:31 db kernel: [11209159.293098] Call Trace:
Jun 10 05:45:31 db kernel: [11209159.293111]  [<ffffffff81119711>] dump_header+0x91/0xe0
Jun 10 05:45:31 db kernel: [11209159.293115]  [<ffffffff81119a95>] oom_kill_process+0x85/0xb0
Jun 10 05:45:31 db kernel: [11209159.293119]  [<ffffffff81119e3a>] out_of_memory+0xfa/0x220
...

We can try increasing the logging resolution to roughly once per second, but is there any reason an OOM should be happening here at all? (We've seen http://bl0rg.krunch.be/oom-frag.html, but we're working with much larger absolute amounts of memory, most of which is taken by the kernel's FS cache.)
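
A minimal sketch of that higher-resolution logging (illustrative only; it just samples a standard Linux /proc/meminfo once per second, and the fields chosen here are arbitrary):

import time

FIELDS = ('MemTotal', 'MemFree', 'Buffers', 'Cached')

while True:
    stats = {}
    with open('/proc/meminfo') as f:
        for line in f:
            key, _, value = line.partition(':')
            if key in FIELDS:
                stats[key] = value.strip()
    # e.g. 2013-06-10 05:45:25 {'MemTotal': '15339640 kB', ...}
    print(time.strftime('%Y-%m-%d %H:%M:%S'), stats, flush=True)
    time.sleep(1)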

We're also including the relevant parts of our postgresql.conf below:

shared_buffers = 6GB
effective_cache_size = 8GB
Yang
  • [Related chat discussion](http://chat.stackexchange.com/rooms/9206/discussion-between-yang-and-david-schwartz) – Yang Jun 12 '13 at 00:14
  • Um... `3.2.0-43`? Update time. The OOM killer has been through many (too many) changes over time. Some versions were really quite brain-dead about accounting for shared memory use... like PostgreSQL 9.2 and older's `shared_buffers`. – Craig Ringer Jun 12 '13 at 04:02
  • @CraigRinger Unfortunately there are other considerations, including sticking with the kernel in the Ubuntu 12.04 distro for LTS support, compatibility, security updates, etc. We're just interested in whether someone can explain what's happening here - if it helps, pretend we're not interested in changing the status quo / making the problem go away. BTW `shared_buffers` is still in PG 9.3. – Yang Jun 12 '13 at 09:56
  • Yes `shared_buffers` is still in 9.3, but it's no longer *POSIX shared memory* in 9.3. It's been replaced with an anonymous `mmap()ed` region. That gets around some icky kernel config issues and pinning problems, though I doubt it'll make the OOM killer any less confused. – Craig Ringer Jun 12 '13 at 10:39
  • Possibly a duplicate of http://serverfault.com/questions/288319/linux-not-freeing-large-disk-cache-when-memory-demand-goes-up, which has a potential answer. – richvdh Sep 26 '16 at 09:06

2 Answers

9
  1. For the love of everything good in the world, configure swap space on your servers.
    You really need swap space. I'm not the only one who says so; it's pretty much a universal truth around here.
    You should of course have enough RAM that your database server isn't swapping regularly -- if you don't, the solution is money (which you take to your vendor and use to acquire more RAM).

  2. Since you now have adequate RAM, and swap to use if something goes wrong, you can disable the OOM killer (by disabling memory overcommit), like the Postgres people tell you to (a sysctl sketch follows this list).
    (You can also apply their alternate solution and tell the OOM-Killer to never kill Postgres - but then you're just playing Russian Roulette with the rest of your system's processes...)

  3. (optional) Write an answer on Server Fault detailing why the default behavior in most Linux distributions is Bad, Wrong, and violates the POSIX specification for how malloc() is supposed to behave. Repeat it until everyone is sick of hearing about it.
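
The knobs the Postgres documentation points at for step 2 are the overcommit sysctls; a minimal sketch follows, with an illustrative ratio that you would size to your own RAM/swap split:

# /etc/sysctl.conf (apply with `sysctl -p`)
vm.overcommit_memory = 2   # refuse allocations beyond CommitLimit instead of overcommitting
vm.overcommit_ratio = 80   # CommitLimit is roughly swap + 80% of RAM; the default is 50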


Also note that the kernel's cached memory is available to postgres (or any other application) to use - you should count it as free/available memory in your calculations.

If I had to hazard a guess at what's happening here, I'd say you've got a complex query, Postgres is requesting RAM to execute it, and rather than saying "I don't have that RAM", Linux tells Postgres "Sure, you can have it."
Then, when Postgres actually tries to use the RAM it was (allegedly) given, Linux realizes it doesn't HAVE the RAM it promised (because it's overcommitted), so the OOM killer is told to free up RAM, and it dutifully kills the program using the most memory -- your database server.
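
To make that concrete, here is a small illustrative sketch (not taken from the system above; the chunk count and sizes are arbitrary, and actually running it will provoke the OOM killer, so use a disposable VM):

import mmap

GiB = 1 << 30

# Reserve a lot of anonymous memory in 1 GiB chunks without touching it.
# Under the default overcommit heuristic each individual request looks
# reasonable, so the kernel keeps saying yes long after the total exceeds
# physical RAM (and this box has no swap to fall back on).
chunks = [mmap.mmap(-1, GiB,
                    flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS,
                    prot=mmap.PROT_READ | mmap.PROT_WRITE)
          for _ in range(32)]

# Physical pages are only allocated when first written. Once the machine
# genuinely runs out, this loop does not get a clean error back -- the OOM
# killer picks a victim instead, quite possibly another process entirely.
for chunk in chunks:
    for off in range(0, len(chunk), mmap.PAGESIZE):
        chunk[off] = 1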

Postgres is a well-designed program. If it's told it can't have the RAM it's requesting it will handle that gracefully (either by making do with less, or aborting with a message to the user).

voretaq7
  • Thanks for the elaboration on swap, but this **doesn't answer my question of why this is happening in the first place**. Yes, I understand the basic premise that Linux overcommits by default and that OOM is when we're out of RAM - I could have stated this much in my original question. But the question is **why is it kicking in when I still have plenty of RAM** (much of it just sitting in the FS cache)? Assume I'm not even interested in changing anything - that the OOM killer is fine, so long as I understand why it's being triggered. – Yang Jun 12 '13 at 02:43
  • After reviewing the links, there are unfortunately a number of assertions without backing evidence or concrete technical explanations. There are certainly many Linux environments where swap is not even an option (example: look no further than a Live CD, where there isn't already an existing local swap partition to reuse). We are furthermore not interested in enabling swappiness based on our own experiences and environment - we would rather have the OOM. An answer to the original question would be appreciated. – Yang Jun 12 '13 at 09:52
  • @Yang I assume that if you're building a server for Postgres you would want to follow the recommendations of the Postgres project. My answer is to do what they tell you (shut off the OOM killer). If you would like to wait and see if someone offers you a different answer you're certainly free to do so, but I'm not comfortable offering any other solution - in my opinion the OOM killer is Bad, Wrong, and violates POSIX. It may be acceptable on a desktop/workstation, but disabling it on servers is, IMO, The Right Thing To Do. – voretaq7 Jun 12 '13 at 15:46
  • I've run across this situation on a server which does have swap, and after saturating available memory+swap the OOM killer was used instead of the kernel reclaiming "cached" memory, which was obviously somehow locked. I never resolved the issue, but @Yang's original question is not answered here. – Patrick Jun 20 '14 at 05:37
  • agreed, this doesn't answer the question. – richvdh Sep 26 '16 at 08:51
  • Swap is not the answer, it only makes the problem appear later. You need swap when RAM is full, and you need the OOM Killer when RAM+swap gets full. If the amount of swap is zero, you need the OOM Killer sooner, but you cannot avoid the OOM Killer with swap. – Mikko Rantalainen Oct 10 '18 at 11:07
  • If you truly can avoid the OOM Killer with swap, you just have too little RAM. – Mikko Rantalainen Oct 10 '18 at 11:09
  • About `malloc()`: yes, instead of using the OOM Killer you can in fact configure Linux to return `NULL` from `malloc()` when memory is gone. However, you cannot escape the fact that (1) most software will not handle this case correctly, and (2) stack space is not reserved using `malloc()`. The OS will try to acquire more RAM automatically when more stack space is needed, and you still need the OOM Killer because there's no API to report that stack space is full. – Mikko Rantalainen Oct 10 '18 at 11:14
5

It appears you (and I in a case with very similar symptoms) have truly run out of memory and have been confused by the cached number.

There apparently are cases where Linux does not free the large disk cache when memory demand goes up (http://serverfault.com/questions/288319/linux-not-freeing-large-disk-cache-when-memory-demand-goes-up).

In particular (I don't really understand why), postgres' shared_buffers may be reported under "Cached" (the page cache). In your case the 6481852k cached in top matches this line in the OOM-killer's log:

Jun 10 05:45:25 db kernel: [11209156.840788] 1615243 total pagecache pages

(1615243*4KB ~= 6481852k), meaning the page cache indeed was not dropped before the OOM killer was invoked.

Yet there are few file-backed pages (I assume active_file:98 inactive_file:168 is similar to /proc/meminfo's Active(file)/Inactive(file)), so it's not the discardable pages we know and love.

The post at https://www.depesz.com/2012/06/09/how-much-ram-is-postgresql-using/ demonstrates an example session where shutting down postgres leads to a reduction of "cached" by the size of shared_buffers (scroll to "And most of it came off disk cache – as expected, because it was used for shared_buffers.") - unfortunately it doesn't indicate the postgres version or the kernel used for the experiment.

I'm using 3.13.0-67 x86_64 with PG 9.3. In 9.3 they switched from Sys V shared memory (shmget) to anonymous mmap(...R+W, MAP_SHARED|MAP_ANONYMOUS|MAP_HASSEMAPHORE...)+fork() (in 9.4 this became configurable via dynamic_shared_memory_type). I couldn't find any explanation of why these mmap()s are supposed to show up in "cached"; the only thing I found is https://access.redhat.com/solutions/406773, which says "Cached: Memory in the pagecache (Diskcache and Shared Memory)".
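
As a rough illustration of that accounting (a sketch of mine, not from any of the linked posts; it assumes /proc/meminfo exposes Cached and Shmem, which these kernels do), an anonymous MAP_SHARED mapping of the kind 9.3 uses is backed by shmem pages, and both counters grow once it is touched:

import mmap, re

def meminfo_kb(field):
    # Pull a single "<Field>:   <value> kB" figure out of /proc/meminfo.
    with open('/proc/meminfo') as f:
        return int(re.search(r'^%s:\s+(\d+)' % field, f.read(), re.M).group(1))

before = {k: meminfo_kb(k) for k in ('Cached', 'Shmem')}

size = 512 * 1024 * 1024  # 512 MiB, purely illustrative
buf = mmap.mmap(-1, size, flags=mmap.MAP_SHARED | mmap.MAP_ANONYMOUS)
for off in range(0, size, mmap.PAGESIZE):
    buf[off] = 1          # touch each page so it is actually allocated

after = {k: meminfo_kb(k) for k in ('Cached', 'Shmem')}
# Both Cached and Shmem should have grown by roughly size/1024 kB.
print({k: after[k] - before[k] for k in after})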

Given that there are many kinds of shared memory, I'm both enlightened and confused...

Nickolay
  • After several years, this is a much better answer, thank you. There's still the question of why this is accounted as cached, but I am marking this accepted for now. – Yang Jun 16 '17 at 01:31