We're running a linux server hard, but it seems ok, as of 6 GiB RAM the kernel sees fit to allocate 1.6 GiB to cache, and we have 9.9 GiB free swap. When we ran out of memory, instead of freeing up cache or swapping out processes, the kernel oom decided to kill the database server. We have swappiness set to 0 (longer side conversation). We also have 1.7 GiB database server memory pinned in RAM, presumably allocated in large contiguous chunks.
While it's quite nice to have spare RAM used for cache, we'd highly prefer that be freed for applications, prior to kneecapping the server. We were caught quite by surprise.
Any help in predicting and managing this (aside from adding RAM to the system) would be much appreciated! (Our immediate response was to dial back swappiness, resulting in increased free memory, same cache allocation, and concomitant swap IO and increased cpu WaitIO. It appears wasteful to have free memory while users wait for swap, but its better than a dead system.)
Relevant stats.
sar -s 15:00:00 -e 16:00:00 Linux 2.6.32-358.2.1.el6.x86_64 (system) 04/23/2014 _x86_64_ (3 CPU) 03:00:02 PM CPU %user %nice %system %iowait %steal %idle 03:10:01 PM all 17.90 0.00 35.27 3.29 0.12 43.42 03:20:02 PM all 19.94 0.00 36.40 8.73 0.12 34.82 03:30:02 PM all 20.18 0.00 42.10 20.37 0.16 17.19 03:40:01 PM all 5.75 0.00 19.93 5.05 0.06 69.20 03:50:01 PM all 22.21 0.00 34.31 8.72 0.16 34.59 Average: all 17.17 0.00 33.58 9.23 0.12 39.90 sar -s 15:00:00 -e 16:00:00 Linux 2.6.32-358.2.1.el6.x86_64 (system) 04/23/2014 _x86_64_ (3 CPU) 03:00:02 PM CPU %user %nice %system %iowait %steal %idle 03:10:01 PM all 17.90 0.00 35.27 3.29 0.12 43.42 03:20:02 PM all 19.94 0.00 36.40 8.73 0.12 34.82 03:30:02 PM all 20.18 0.00 42.10 20.37 0.16 17.19 03:40:01 PM all 5.75 0.00 19.93 5.05 0.06 69.20 03:50:01 PM all 22.21 0.00 34.31 8.72 0.16 34.59 Average: all 17.17 0.00 33.58 9.23 0.12 39.90 sar -r -s 15:00:00 -e 16:00:00 Linux 2.6.32-358.2.1.el6.x86_64 (system) 04/23/2014 _x86_64_ (3 CPU) 03:00:02 PM kbmemfree kbmemused %memused kbbuffers kbcached kbcommit %commit 03:10:01 PM 125396 5843216 97.90 22160 1989352 23282984 141.51 03:20:02 PM 44788 5923824 99.25 19092 1759124 23759172 144.40 03:30:02 PM 34400 5934212 99.42 11864 1649488 23856192 144.99 03:40:01 PM 4139748 1828864 30.64 22972 341264 9558784 58.10 03:50:01 PM 1364632 4603980 77.14 30784 2576044 11827612 71.88 Average: 1141793 4826819 80.87 21374 1663054 18456949 112.18 sar -R -s 15:00:00 -e 16:00:00 Linux 2.6.32-358.2.1.el6.x86_64 (system) 04/23/2014 _x86_64_ (3 CPU) 03:00:02 PM frmpg/s bufpg/s campg/s 03:10:01 PM -24.49 -0.54 26.00 03:20:02 PM -34.14 -1.30 -97.51 03:30:02 PM -4.43 -3.08 -46.70 03:40:01 PM 1732.71 4.69 -552.15 03:50:01 PM -1188.87 3.35 957.39 Average: 100.36 0.62 55.05 sar -S -s 15:00:00 -e 16:00:00 Linux 2.6.32-358.2.1.el6.x86_64 (system) 04/23/2014 _x86_64_ (3 CPU) 03:00:02 PM kbswpfree kbswpused %swpused kbswpcad %swpcad 03:10:01 PM 10471148 13892 0.13 4300 30.95 03:20:02 PM 10471148 13892 0.13 4300 30.95 03:30:02 PM 10471148 13892 0.13 4300 30.95 03:40:01 PM 10475820 9220 0.09 3484 37.79 03:50:01 PM 10475820 9220 0.09 3484 37.79 Average: 10473017 12023 0.11 3973 33.04
And kernel log
Apr 23 15:31:01 kernel: proc_count invoked oom-killer: gfp_mask=0xd0, order=1, oom_adj=0, oom_score_adj=0 Apr 23 15:31:01 kernel: proc_count cpuset=/ mems_allowed=0 Apr 23 15:31:01 kernel: Pid: 1713, comm: proc_count Not tainted 2.6.32-358.2.1.el6.x86_64 #1 Apr 23 15:31:01 kernel: Call Trace: Apr 23 15:31:01 kernel: [] ? cpuset_print_task_mems_allowed+0x91/0xb0 Apr 23 15:31:01 kernel: [] ? dump_header+0x90/0x1b0 Apr 23 15:31:01 kernel: [] ? oom_kill_process+0x82/0x2a0 Apr 23 15:31:01 kernel: [] ? select_bad_process+0xe1/0x120 Apr 23 15:31:01 kernel: [] ? out_of_memory+0x220/0x3c0 Apr 23 15:31:01 kernel: [] ? __alloc_pages_nodemask+0x8ac/0x8d0 Apr 23 15:31:01 kernel: [] ? xen_force_evtchn_callback+0xd/0x10 Apr 23 15:31:01 kernel: [] ? alloc_pages_current+0xaa/0x110 Apr 23 15:31:01 kernel: [] ? __get_free_pages+0xe/0x50 Apr 23 15:31:01 kernel: [] ? copy_process+0xe4/0x1450 Apr 23 15:31:01 kernel: [] ? kmem_cache_alloc+0x182/0x190 Apr 23 15:31:01 kernel: [] ? do_fork+0x94/0x460 Apr 23 15:31:01 kernel: [] ? alloc_fd+0x92/0x160 Apr 23 15:31:01 kernel: [] ? fd_install+0x47/0x90 Apr 23 15:31:01 kernel: [] ? _spin_lock_irq+0x10/0x40 Apr 23 15:31:01 kernel: [] ? sys_clone+0x28/0x30 Apr 23 15:31:01 kernel: [] ? stub_clone+0x13/0x20 Apr 23 15:31:01 kernel: [] ? system_call_fastpath+0x16/0x1b Apr 23 15:31:01 kernel: Mem-Info: Apr 23 15:31:01 kernel: Node 0 DMA per-cpu: Apr 23 15:31:01 kernel: CPU 0: hi: 0, btch: 1 usd: 0 Apr 23 15:31:01 kernel: CPU 1: hi: 0, btch: 1 usd: 0 Apr 23 15:31:01 kernel: CPU 2: hi: 0, btch: 1 usd: 0 Apr 23 15:31:01 kernel: Node 0 DMA32 per-cpu: Apr 23 15:31:01 kernel: CPU 0: hi: 186, btch: 31 usd: 29 Apr 23 15:31:01 kernel: CPU 1: hi: 186, btch: 31 usd: 0 Apr 23 15:31:01 kernel: CPU 2: hi: 186, btch: 31 usd: 0 Apr 23 15:31:01 kernel: Node 0 Normal per-cpu: Apr 23 15:31:01 kernel: CPU 0: hi: 186, btch: 31 usd: 58 Apr 23 15:31:01 kernel: CPU 1: hi: 186, btch: 31 usd: 0 Apr 23 15:31:01 kernel: CPU 2: hi: 186, btch: 31 usd: 0 Apr 23 15:31:01 kernel: active_anon:657282 inactive_anon:234243 isolated_anon:0 Apr 23 15:31:01 kernel: active_file:1582 inactive_file:1577 isolated_file:0 Apr 23 15:31:01 kernel: unevictable:415617 dirty:0 writeback:4 unstable:0 Apr 23 15:31:01 kernel: free:7393 slab_reclaimable:13322 slab_unreclaimable:36105 Apr 23 15:31:01 kernel: mapped:410321 shmem:409696 pagetables:86995 bounce:0 Apr 23 15:31:01 kernel: Node 0 DMA free:7760kB min:12kB low:12kB high:16kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:7756kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes Apr 23 15:31:01 kernel: lowmem_reserve[]: 0 4024 6044 6044 Apr 23 15:31:01 kernel: Node 0 DMA32 free:17736kB min:6620kB low:8272kB high:9928kB active_anon:2256636kB inactive_anon:564276kB active_file:2088kB inactive_file:2180kB unevictable:624700kB isolated(anon):0kB isolated(file):0kB present:4120800kB mlocked:0kB dirty:0kB writeback:16kB mapped:624832kB shmem:624832kB slab_reclaimable:32124kB slab_unreclaimable:90036kB kernel_stack:32696kB pagetables:206100kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no Apr 23 15:31:01 kernel: lowmem_reserve[]: 0 0 2020 2020 Apr 23 15:31:01 kernel: Node 0 Normal free:4076kB min:3320kB low:4148kB high:4980kB active_anon:372492kB inactive_anon:372696kB active_file:4240kB inactive_file:4128kB unevictable:1037768kB isolated(anon):0kB isolated(file):0kB present:2068480kB mlocked:24776kB dirty:0kB writeback:0kB mapped:1016452kB shmem:1013952kB slab_reclaimable:21164kB slab_unreclaimable:54384kB kernel_stack:4328kB pagetables:141880kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no Apr 23 15:31:01 kernel: lowmem_reserve[]: 0 0 0 0 Apr 23 15:31:01 kernel: Node 0 DMA: 2*4kB 1*8kB 2*16kB 1*32kB 0*64kB 2*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 1*4096kB = 7760kB Apr 23 15:31:01 kernel: Node 0 DMA32: 4083*4kB 12*8kB 4*16kB 3*32kB 2*64kB 0*128kB 0*256kB 0*512kB 1*1024kB 0*2048kB 0*4096kB = 17740kB Apr 23 15:31:01 kernel: Node 0 Normal: 615*4kB 11*8kB 1*16kB 1*32kB 0*64kB 0*128kB 0*256kB 1*512kB 1*1024kB 0*2048kB 0*4096kB = 4132kB Apr 23 15:31:01 kernel: 414562 total pagecache pages Apr 23 15:31:01 kernel: 1075 pages in swap cache Apr 23 15:31:01 kernel: Swap cache stats: add 22900328, delete 22899253, find 492200942/494634739 Apr 23 15:31:01 kernel: Free swap = 10471148kB Apr 23 15:31:01 kernel: Total swap = 10485040kB Apr 23 15:31:01 kernel: 1572863 pages RAM Apr 23 15:31:01 kernel: 80710 pages reserved Apr 23 15:31:01 kernel: 2926057 pages shared Apr 23 15:31:01 kernel: 1058807 pages non-shared Apr 23 15:31:01 kernel: [ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name
skipping the long process lists
Apr 23 15:31:01 kernel: Out of memory: Kill process 1357 (oninit) score 99 or sacrifice child Apr 23 15:31:01 kernel: Killed process 1359, UID 501, (oninit) total-vm:2056000kB, anon-rss:116kB, file-rss:10132kB Apr 23 15:31:20 kernel: awk invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0, oom_score_adj=0 Apr 23 15:31:20 kernel: awk cpuset=/ mems_allowed=0 Apr 23 15:31:20 kernel: Pid: 22899, comm: awk Not tainted 2.6.32-358.2.1.el6.x86_64 #1 Apr 23 15:31:20 kernel: Call Trace: Apr 23 15:31:20 kernel: [] ? cpuset_print_task_mems_allowed+0x91/0xb0 Apr 23 15:31:20 kernel: [] ? dump_header+0x90/0x1b0 Apr 23 15:31:20 kernel: [] ? oom_kill_process+0x82/0x2a0 Apr 23 15:31:20 kernel: [] ? select_bad_process+0xe1/0x120 Apr 23 15:31:20 kernel: [] ? out_of_memory+0x220/0x3c0 Apr 23 15:31:20 kernel: [] ? __alloc_pages_nodemask+0x8ac/0x8d0 Apr 23 15:31:20 kernel: [] ? alloc_pages_current+0xaa/0x110 Apr 23 15:31:20 kernel: [] ? __page_cache_alloc+0x87/0x90 Apr 23 15:31:20 kernel: [] ? find_get_page+0x1e/0xa0 Apr 23 15:31:20 kernel: [] ? filemap_fault+0x1a7/0x500 Apr 23 15:31:20 kernel: [] ? xen_set_pte_at+0xaf/0x170 Apr 23 15:31:20 kernel: [] ? __do_fault+0x54/0x530 Apr 23 15:31:20 kernel: [] ? handle_pte_fault+0xf7/0xb50 Apr 23 15:31:20 kernel: [] ? retint_restore_args+0x5/0x6 Apr 23 15:31:20 kernel: [] ? xen_hypervisor_callback+0x1b/0x20 Apr 23 15:31:20 kernel: [] ? error_exit+0x2a/0x60 Apr 23 15:31:20 kernel: [] ? retint_restore_args+0x5/0x6 Apr 23 15:31:20 kernel: [] ? xen_hypervisor_callback+0x1b/0x20 Apr 23 15:31:20 kernel: [] ? __raw_callee_save_xen_pmd_val+0x11/0x1e Apr 23 15:31:20 kernel: [] ? handle_mm_fault+0x23a/0x310 Apr 23 15:31:20 kernel: [] ? __do_page_fault+0x139/0x480 Apr 23 15:31:20 kernel: [] ? xen_force_evtchn_callback+0xd/0x10 Apr 23 15:31:20 kernel: [] ? check_events+0x12/0x20 Apr 23 15:31:20 kernel: [] ? xen_force_evtchn_callback+0xd/0x10 Apr 23 15:31:20 kernel: [] ? check_events+0x12/0x20 Apr 23 15:31:20 kernel: [] ? do_page_fault+0x3e/0xa0 Apr 23 15:31:20 kernel: [] ? page_fault+0x25/0x30 Apr 23 15:31:20 kernel: Mem-Info: Apr 23 15:31:20 kernel: Node 0 DMA per-cpu: Apr 23 15:31:20 kernel: CPU 0: hi: 0, btch: 1 usd: 0 Apr 23 15:31:20 kernel: CPU 1: hi: 0, btch: 1 usd: 0 Apr 23 15:31:20 kernel: CPU 2: hi: 0, btch: 1 usd: 0 Apr 23 15:31:20 kernel: Node 0 DMA32 per-cpu: Apr 23 15:31:20 kernel: CPU 0: hi: 186, btch: 31 usd: 32 Apr 23 15:31:20 kernel: CPU 1: hi: 186, btch: 31 usd: 0 Apr 23 15:31:20 kernel: CPU 2: hi: 186, btch: 31 usd: 0 Apr 23 15:31:20 kernel: Node 0 Normal per-cpu: Apr 23 15:31:20 kernel: CPU 0: hi: 186, btch: 31 usd: 90 Apr 23 15:31:20 kernel: CPU 1: hi: 186, btch: 31 usd: 0 Apr 23 15:31:20 kernel: CPU 2: hi: 186, btch: 31 usd: 0 Apr 23 15:31:20 kernel: active_anon:659392 inactive_anon:234908 isolated_anon:0 Apr 23 15:31:20 kernel: active_file:1093 inactive_file:1129 isolated_file:0 Apr 23 15:31:20 kernel: unevictable:415617 dirty:0 writeback:0 unstable:0 Apr 23 15:31:20 kernel: free:6348 slab_reclaimable:11215 slab_unreclaimable:36129 Apr 23 15:31:20 kernel: mapped:410358 shmem:409696 pagetables:86756 bounce:0 Apr 23 15:31:20 kernel: Node 0 DMA free:7760kB min:12kB low:12kB high:16kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:7756kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes Apr 23 15:31:20 kernel: lowmem_reserve[]: 0 4024 6044 6044 Apr 23 15:31:20 kernel: Node 0 DMA32 free:14596kB min:6620kB low:8272kB high:9928kB active_anon:2264272kB inactive_anon:566132kB active_file:1476kB inactive_file:1444kB unevictable:624700kB isolated(anon):0kB isolated(file):0kB present:4120800kB mlocked:0kB dirty:0kB writeback:0kB mapped:624944kB shmem:624832kB slab_reclaimable:24184kB slab_unreclaimable:89940kB kernel_stack:32688kB pagetables:206172kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:1620 all_unreclaimable? no Apr 23 15:31:20 kernel: lowmem_reserve[]: 0 0 2020 2020 Apr 23 15:31:20 kernel: Node 0 Normal free:3036kB min:3320kB low:4148kB high:4980kB active_anon:373296kB inactive_anon:373500kB active_file:2896kB inactive_file:3072kB unevictable:1037768kB isolated(anon):0kB isolated(file):2568kB present:2068480kB mlocked:24776kB dirty:0kB writeback:0kB mapped:1016488kB shmem:1013952kB slab_reclaimable:20676kB slab_unreclaimable:54576kB kernel_stack:4368kB pagetables:140852kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:9666 all_unreclaimable? no Apr 23 15:31:20 kernel: lowmem_reserve[]: 0 0 0 0 Apr 23 15:31:20 kernel: Node 0 DMA: 2*4kB 1*8kB 2*16kB 1*32kB 0*64kB 2*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 1*4096kB = 7760kB Apr 23 15:31:20 kernel: Node 0 DMA32: 3297*4kB 12*8kB 4*16kB 3*32kB 2*64kB 0*128kB 0*256kB 0*512kB 1*1024kB 0*2048kB 0*4096kB = 14596kB Apr 23 15:31:20 kernel: Node 0 Normal: 356*4kB 1*8kB 0*16kB 1*32kB 0*64kB 0*128kB 0*256kB 1*512kB 1*1024kB 0*2048kB 0*4096kB = 3000kB Apr 23 15:31:20 kernel: 414714 total pagecache pages Apr 23 15:31:20 kernel: 1097 pages in swap cache Apr 23 15:31:20 kernel: Swap cache stats: add 22900397, delete 22899300, find 492200999/494634809 Apr 23 15:31:20 kernel: Free swap = 10471368kB Apr 23 15:31:20 kernel: Total swap = 10485040kB Apr 23 15:31:20 kernel: 1572863 pages RAM Apr 23 15:31:20 kernel: 80710 pages reserved Apr 23 15:31:20 kernel: 2926158 pages shared Apr 23 15:31:20 kernel: 1059316 pages non-shared Apr 23 15:31:20 kernel: [ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name
again skiping the long process list
Apr 23 15:31:20 kernel: Out of memory: Kill process 1357 (oninit) score 99 or sacrifice child Apr 23 15:31:20 kernel: Killed process 1357, UID 501, (oninit) total-vm:2056000kB, anon-rss:648kB, file-rss:1637852kB