CentOS 6: strange page allocation failure messages

Question

I set up a new server with CentOS 6.4 (final) as the successor to an old MySQL server, and I'm facing some problems with it. From time to time MySQL connections are dropped. In addition, the transfer of large backup tar files to an FTP storage sometimes fails. Neither problem is reliably reproducible.

While investigating, I found some strange messages in /var/log/messages that I cannot interpret:

Mar 30 13:09:24 s16838172 kernel: swapper: page allocation failure. order:1, mode:0x20
Mar 30 13:09:24 s16838172 kernel: Pid: 0, comm: swapper Not tainted 2.6.32-358.0.1.el6.x86_64 #1
Mar 30 13:09:24 s16838172 kernel: Call Trace:
Mar 30 13:09:24 s16838172 kernel: <IRQ>  [<ffffffff8112c207>] ? __alloc_pages_nodemask+0x757/0x8d0
Mar 30 13:09:24 s16838172 kernel: [<ffffffff81166ab2>] ? kmem_getpages+0x62/0x170
Mar 30 13:09:24 s16838172 kernel: [<ffffffff811676ca>] ? fallback_alloc+0x1ba/0x270
Mar 30 13:09:24 s16838172 kernel: [<ffffffff8116711f>] ? cache_grow+0x2cf/0x320
Mar 30 13:09:24 s16838172 kernel: [<ffffffff81167449>] ? ____cache_alloc_node+0x99/0x160
Mar 30 13:09:24 s16838172 kernel: [<ffffffff811683cb>] ? kmem_cache_alloc+0x11b/0x190
Mar 30 13:09:24 s16838172 kernel: [<ffffffff81439c18>] ? sk_prot_alloc+0x48/0x1c0
Mar 30 13:09:24 s16838172 kernel: [<ffffffff8143acf2>] ? sk_clone+0x22/0x2e0
Mar 30 13:09:24 s16838172 kernel: [<ffffffff81489bc6>] ? inet_csk_clone+0x16/0xd0
Mar 30 13:09:24 s16838172 kernel: [<ffffffff814a2ad3>] ? tcp_create_openreq_child+0x23/0x450
Mar 30 13:09:24 s16838172 kernel: [<ffffffff814a02cd>] ? tcp_v4_syn_recv_sock+0x4d/0x310
Mar 30 13:09:24 s16838172 kernel: [<ffffffff814a2876>] ? tcp_check_req+0x226/0x460
Mar 30 13:09:24 s16838172 kernel: [<ffffffff8149fd6b>] ? tcp_v4_do_rcv+0x35b/0x430
Mar 30 13:09:24 s16838172 kernel: [<ffffffffa03b4557>] ? ipv4_confirm+0x87/0x1d0 [nf_conntrack_ipv4]
Mar 30 13:09:24 s16838172 kernel: [<ffffffff814a157e>] ? tcp_v4_rcv+0x4fe/0x8d0
Mar 30 13:09:24 s16838172 kernel: [<ffffffff8147f290>] ? ip_local_deliver_finish+0x0/0x2d0
Mar 30 13:09:24 s16838172 kernel: [<ffffffff8147f36d>] ? ip_local_deliver_finish+0xdd/0x2d0
Mar 30 13:09:24 s16838172 kernel: [<ffffffff8147f5f8>] ? ip_local_deliver+0x98/0xa0
Mar 30 13:09:24 s16838172 kernel: [<ffffffff8147eabd>] ? ip_rcv_finish+0x12d/0x440
Mar 30 13:09:24 s16838172 kernel: [<ffffffff8147f045>] ? ip_rcv+0x275/0x350
Mar 30 13:09:24 s16838172 kernel: [<ffffffff8144827b>] ? __netif_receive_skb+0x4ab/0x750
Mar 30 13:09:24 s16838172 kernel: [<ffffffff8144a658>] ? netif_receive_skb+0x58/0x60
Mar 30 13:09:24 s16838172 kernel: [<ffffffff8144a760>] ? napi_skb_finish+0x50/0x70
Mar 30 13:09:24 s16838172 kernel: [<ffffffff8144cd09>] ? napi_gro_receive+0x39/0x50
Mar 30 13:09:24 s16838172 kernel: [<ffffffffa00f933b>] ? e1000_receive_skb+0x5b/0x90 [e1000e]
Mar 30 13:09:24 s16838172 kernel: [<ffffffffa00fc601>] ? e1000_clean_rx_irq+0x241/0x4c0 [e1000e]
Mar 30 13:09:24 s16838172 kernel: [<ffffffffa0103bbd>] ? e1000e_poll+0xbd/0x380 [e1000e]
Mar 30 13:09:24 s16838172 kernel: [<ffffffffa00f9eca>] ? e1000_put_txbuf+0x6a/0xa0 [e1000e]
Mar 30 13:09:24 s16838172 kernel: [<ffffffff8144ce23>] ? net_rx_action+0x103/0x2f0
Mar 30 13:09:24 s16838172 kernel: [<ffffffff8109b153>] ? hrtimer_get_next_event+0xc3/0x100
Mar 30 13:09:24 s16838172 kernel: [<ffffffff81076fb1>] ? __do_softirq+0xc1/0x1e0
Mar 30 13:09:24 s16838172 kernel: [<ffffffff810e1720>] ? handle_IRQ_event+0x60/0x170
Mar 30 13:09:24 s16838172 kernel: [<ffffffff8100c1cc>] ? call_softirq+0x1c/0x30
Mar 30 13:09:24 s16838172 kernel: [<ffffffff8100de05>] ? do_softirq+0x65/0xa0
Mar 30 13:09:24 s16838172 kernel: [<ffffffff81076d95>] ? irq_exit+0x85/0x90
Mar 30 13:09:24 s16838172 kernel: [<ffffffff81516d75>] ? do_IRQ+0x75/0xf0
Mar 30 13:09:24 s16838172 kernel: [<ffffffff8100b9d3>] ? ret_from_intr+0x0/0x11
Mar 30 13:09:24 s16838172 kernel: <EOI>  [<ffffffff812d388e>] ? intel_idle+0xde/0x170
Mar 30 13:09:24 s16838172 kernel: [<ffffffff812d3871>] ? intel_idle+0xc1/0x170
Mar 30 13:09:24 s16838172 kernel: [<ffffffff81414fd7>] ? cpuidle_idle_call+0xa7/0x140
Mar 30 13:09:24 s16838172 kernel: [<ffffffff81009fc6>] ? cpu_idle+0xb6/0x110
Mar 30 13:09:24 s16838172 kernel: [<ffffffff814f300a>] ? rest_init+0x7a/0x80
Mar 30 13:09:24 s16838172 kernel: [<ffffffff81c27f7b>] ? start_kernel+0x424/0x430
Mar 30 13:09:24 s16838172 kernel: [<ffffffff81c2733a>] ? x86_64_start_reservations+0x125/0x129
Mar 30 13:09:24 s16838172 kernel: [<ffffffff81c27438>] ? x86_64_start_kernel+0xfa/0x109

Blocks of messages like this appear about 2-10 times within 5 minutes; after that they are gone for a few hours.

Can somebody help me with that? I hope it's not a hardware problem.

Update: This seems to be reproducible by transferring big files over the network (backups to the FTP storage). The FTP upload aborts after a few GB and the messages above appear in /var/log/messages.

Thanks!

score 3 · Answer 1 · edited Mar 31 '13 at 19:33

A workaround for https://bugzilla.redhat.com/show_bug.cgi?id=713546 is to set the following sysctl values:

vm.min_free_kbytes = 512000
vm.zone_reclaim_mode = 1

The same settings were also suggested as a potential workaround in this CentOS mailing list thread: http://lists.centos.org/pipermail/centos/2012-October/129844.html
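Before changing anything, it can help to see how the running values differ from the suggested ones. The following is only a minimal sketch: the helper names `read_sysctl` and `compare_with_suggested` are made up for this illustration, and it assumes the usual Linux layout where the key `vm.min_free_kbytes` is exposed as the file /proc/sys/vm/min_free_kbytes.

```python
from pathlib import Path

# Values proposed in the answer above (not defaults, not a recommendation).
SUGGESTED = {"vm.min_free_kbytes": 512000, "vm.zone_reclaim_mode": 1}


def read_sysctl(name, proc_sys=Path("/proc/sys")):
    """Return the integer value of a sysctl key such as 'vm.min_free_kbytes'.

    Assumes the key maps to a file below proc_sys with dots replaced by
    directory separators, which is how /proc/sys is normally laid out.
    """
    path = Path(proc_sys).joinpath(*name.split("."))
    return int(path.read_text().split()[0])


def compare_with_suggested(proc_sys=Path("/proc/sys")):
    """Map each key to (current, suggested); current is None if unreadable."""
    result = {}
    for name, wanted in SUGGESTED.items():
        try:
            current = read_sysctl(name, proc_sys)
        except (OSError, ValueError, IndexError):
            current = None
        result[name] = (current, wanted)
    return result


if __name__ == "__main__":
    for name, (current, wanted) in compare_with_suggested().items():
        print(f"{name}: current={current} suggested={wanted}")
```

The script only reads values. Applying them still means editing /etc/sysctl.conf and reloading with sysctl -p, as described in the comments below.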

edited Mar 31 '13 at 19:33

slm


answered Mar 30 '13 at 13:45

Rajat


"You are not authorized to access bug #713546." :-( Can you share more information about what is discussed there? I also read that zone_reclaim_mode=1 can cause performance issues on database servers? – steveh80 Mar 30 '13 at 14:20
I added these settings to /etc/sysctl.conf and reloaded them via sysctl -p. That didn't solve the problem. – steveh80 Mar 31 '13 at 09:21
https://access.redhat.com/solutions/90883 – Alpha01 Jan 07 '15 at 18:36

score 1 · Answer 2 · answered Mar 30 '13 at 15:36

Please upgrade to the CentOS equivalent of kernel-2.6.32-358.el6. The bug has been fixed in that version.

Essentially this is about memory allocation in interrupt context. If you want, you can check gfp.h in include/linux. The mode 0x20 means that the allocation can't wait: it happens in interrupt context, so the wait bit for the allocation is not set. Therefore, if the memory can't be allocated immediately, the allocation fails. The fix is quite substantial.
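To make the "order:1, mode:0x20" part of the log line concrete, here is a small sketch that decodes both numbers. The bit table is written from memory of include/linux/gfp.h in mainline 2.6.32 and covers only the low eight flags; treat it as an assumption and verify it against the gfp.h of the kernel actually running, since vendor kernels may differ. The function names are invented for this example.

```python
# Low GFP flag bits as recalled from mainline Linux 2.6.32
# include/linux/gfp.h. Assumption: the CentOS kernel uses the same values.
GFP_BITS = {
    0x01: "__GFP_DMA",
    0x02: "__GFP_HIGHMEM",
    0x04: "__GFP_DMA32",
    0x08: "__GFP_MOVABLE",
    0x10: "__GFP_WAIT",  # allocation may sleep and reclaim memory
    0x20: "__GFP_HIGH",  # may dip into emergency reserves
    0x40: "__GFP_IO",
    0x80: "__GFP_FS",
}


def decode_gfp(mode):
    """Return the names of the known flags set in mode, lowest bit first."""
    names = [name for bit, name in sorted(GFP_BITS.items()) if mode & bit]
    unknown = mode & ~sum(GFP_BITS)  # bits outside the small table above
    if unknown:
        names.append(f"unknown:{unknown:#x}")
    return names


def may_sleep(mode):
    """True if the wait bit is set, i.e. the allocator is allowed to block."""
    return bool(mode & 0x10)


def alloc_size_bytes(order, page_size=4096):
    """An order-n request asks for 2**n physically contiguous pages."""
    return (1 << order) * page_size


if __name__ == "__main__":
    print(decode_gfp(0x20), may_sleep(0x20), alloc_size_bytes(1))
```

Under these assumptions, mode 0x20 has only `__GFP_HIGH` set and no wait bit, and order 1 is a request for two contiguous 4 KiB pages (8192 bytes). That matches the explanation above: the network receive path asks for memory in softirq context, cannot wait for reclaim, and fails if no suitable free block exists at that moment.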

answered Mar 30 '13 at 15:36

Soham Chakraborty


Ok, thanks for this information. Do you know if this kernel upgrade will be available via the standard centos repos? Yum tells me nothing to update... – steveh80 Mar 30 '13 at 19:13
I see, I am already on 2.6.32-358.0.1.el6.x86_64. The bug seems not to be fixed in this version... – steveh80 Mar 30 '13 at 19:19
Oh, hold on a day. Let me search a bit more. – Soham Chakraborty Mar 31 '13 at 16:57

score 0 · Answer 3 · answered Mar 30 '13 at 14:46

Yeah, this looks like an out-of-memory error. But the cause might be a buggy driver and not necessarily that you have too little memory. Can you provide more details about what hardware you're using on this box?

This bug, which is related to bug #713546, is one we can actually see, and it says the same thing I am: https://bugzilla.redhat.com/show_bug.cgi?id=729229

I'd go through the hardware that is installed on this system and make sure that all the OS-related software is at its latest version.

Once you've confirmed that, you'll have to try to nail down what is running at the times this error shows up in the logs.
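One way to start that correlation is to reduce the log to a per-minute count of failures and compare it with the times of backups, FTP uploads, or cron jobs. The sketch below assumes syslog lines in the same format as the excerpt in the question; the regular expression and the function name `summarize_failures` are illustrative and not part of any existing tool.

```python
import re
from collections import Counter

# Matches the first line of each failure block, e.g.
# "Mar 30 13:09:24 host kernel: swapper: page allocation failure. order:1, mode:0x20"
FAILURE_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}):\d{2}\s+\S+\s+kernel:\s+"
    r"(?P<comm>\S+): page allocation failure\. "
    r"order:(?P<order>\d+), mode:(?P<mode>0x[0-9a-fA-F]+)"
)


def summarize_failures(lines):
    """Count failures per (minute, process name, order, mode)."""
    counts = Counter()
    for line in lines:
        m = FAILURE_RE.match(line)
        if m:
            key = (m["stamp"], m["comm"], int(m["order"]), m["mode"])
            counts[key] += 1
    return counts


# Usage sketch (path taken from the question; reading it needs root):
#     with open("/var/log/messages") as fh:
#         for key, n in sorted(summarize_failures(fh).items()):
#             print(n, *key)
```

If the busy minutes line up with the backup transfers, that supports the suspicion that heavy network receive traffic triggers the failures.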

answered Mar 30 '13 at 14:46

slm


OK: that's a dedicated server running CentOS 6.4, and everything is updated to the latest versions (from the official CentOS repos). Intel Xeon E3-1220, 12 GB DDR3 ECC RAM, 1 TB software RAID. The only thing I can assume is that this error comes up under heavy network traffic (transferring big backup files over the network via FTP). What else do you need? – steveh80 Mar 31 '13 at 09:25
What hardware are we dealing with here? Custom box or a Dell server, or what? You're going to have to go through the box piece by piece and see if there are any open issues with the various components I'm afraid. – slm Mar 31 '13 at 11:44
I don't know. It's a dedicated root server from 1und1.de with a preinstalled and preconfigured minimal CentOS system. That should be pretty standard and nothing special. – steveh80 Mar 31 '13 at 15:28
It probably wouldn't hurt to enlist 1und1.de's help here. At this point without more info about the make-up of the hardware it's a guessing game for any of us here to try and help. There are a number of patches that have addressed specific issues with Linux kernels and heavy network traffic, but they are dependent on specific hardware like this [one](http://lists.openwall.net/netdev/2011/11/28/25) or [this one](https://groups.google.com/forum/#!msg/fa.linux.kernel/xMrkt6NiMVo/7xNqXwptBAEJ). – slm Mar 31 '13 at 16:41
