
In a recent upgrade (from OpenStack Diablo on Ubuntu Lucid to OpenStack Essex on Ubuntu Precise), we found that DNS packets were frequently (almost always) being dropped on the bridge interface (br100). On our compute-node hosts, the NIC behind that bridge is a Mellanox MT26428 using the mlx4_en driver module.
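
For reference, this is roughly how we confirm which driver and bridge are involved on a compute node (the ethernet interface name eth2 below is just an example, not necessarily what your hardware gets):

brctl show br100        # lists the physical NIC enslaved to br100
ethtool -i eth2         # reports driver: mlx4_en for the Mellanox port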

We've found two workarounds for this:

  1. Use an old Lucid kernel (e.g. 2.6.32-41-generic). This causes other problems, in particular the lack of cgroups and the old versions of the kvm and kvm_amd modules (we suspect the kvm module version is the source of a bug we're seeing where a VM will occasionally use 100% CPU). We've been running with this for the last few months, but can't stay here forever.

  2. With the newer Ubuntu Precise kernels (3.2.x), we've found that if we use sysctl to disable netfilter on the bridge (see the sysctl settings below), DNS starts working perfectly again. We thought this was the solution to our problem until we realised that turning off netfilter on the bridge interface will, of course, mean that the DNAT rule that redirects VM requests for the nova-api-metadata server (i.e. redirects packets destined for 169.254.169.254:80 to compute-node's-IP:8775; the rule is sketched below) is completely bypassed.
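
For context, that metadata DNAT rule looks roughly like the following. This is an illustrative sketch, not the exact rule nova-network installs (nova manages it in its own nat chains, and 10.1.2.3 here is a made-up compute-node IP):

iptables -t nat -A PREROUTING -d 169.254.169.254/32 -p tcp --dport 80 \
    -j DNAT --to-destination 10.1.2.3:8775

With bridge-nf-call-iptables=0, bridged packets from the VMs never traverse the iptables nat PREROUTING hook, so a rule like this never sees them.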

Long story short: with 3.x kernels, we can have reliable networking and a broken metadata service, or we can have broken networking and a metadata service that would work fine if there were any VMs to service. We haven't yet found a way to have both.

Has anyone seen this problem, or anything like it, before? Got a fix? Or a pointer in the right direction?

Our suspicion is that it's specific to the Mellanox driver, but we're not sure of that. We've tried several different versions of the mlx4_en driver, from the version built into the 3.2.x kernels all the way up to the latest 1.5.8.3 driver from the Mellanox web site (the mlx4_en driver in the 3.5.x kernel from Quantal doesn't work at all).

BTW, our compute nodes have Supermicro H8DGT motherboards with a built-in Mellanox NIC:

02:00.0 InfiniBand: Mellanox Technologies MT26428 [ConnectX VPI PCIe 2.0 5GT/s - IB QDR / 10GigE] (rev b0)

We're not using the other two NICs in the system; only the Mellanox NIC and the IPMI card are connected.

Bridge netfilter sysctl settings:

net.bridge.bridge-nf-call-arptables = 0
net.bridge.bridge-nf-call-iptables = 0
net.bridge.bridge-nf-call-ip6tables = 0
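
These can be applied at runtime with sysctl -w and persisted via /etc/sysctl.conf (reload with sysctl -p):

sysctl -w net.bridge.bridge-nf-call-arptables=0
sysctl -w net.bridge.bridge-nf-call-iptables=0
sysctl -w net.bridge.bridge-nf-call-ip6tables=0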

Since discovering this bridge-nf sysctl workaround, we've found a few pages on the net recommending exactly this (including OpenStack's latest network troubleshooting page, and a launchpad bug report that linked to a blog post with a great description of the problem and the solution). It's easier to find stuff when you know what to search for :) but we haven't found anything on the DNAT issue that the workaround causes.


Update 2012-09-12:

Something I should have mentioned earlier - this happens even on machines that don't have any openstack or even libvirt packages installed. Same hardware, same everything, but with not much more than the Ubuntu 12.04 base system installed.

On kernel 2.6.32-41-generic, the bridge works as expected.

On kernel 3.2.0-29-generic, using the ethernet interface directly works perfectly. Using a bridge on that same NIC fails unless net.bridge.bridge-nf-call-iptables=0.
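
For anyone wanting to reproduce it, the test is roughly the following on a bare 12.04 install (interface name and addresses are examples, not our real ones):

brctl addbr br100
brctl addif br100 eth2
ip link set eth2 up
ip addr add 10.1.2.3/24 dev br100
ip link set br100 up

sysctl -w net.bridge.bridge-nf-call-iptables=1   # the default: most DNS queries over br100 time out
dig @10.1.2.1 example.com
sysctl -w net.bridge.bridge-nf-call-iptables=0   # queries succeed reliably
dig @10.1.2.1 example.com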

So, it seems pretty clear that the problem is either in the Mellanox driver, in the updated kernel's bridging or netfilter code, or in some interaction between them.

Interestingly, I have other machines (without a Mellanox card) with a bridge interface that don't exhibit this problem, with NICs ranging from cheap r8169 cards to better-quality Broadcom tg3 Gbit cards in some Sun Fire X2200 M2 servers and Intel Gbit cards in Supermicro motherboards. Like our OpenStack compute nodes, they all use the bridge interface as their primary (or sometimes only) interface with an IP address; they're configured that way so we can run VMs using libvirt & kvm with real IP addresses rather than NAT.
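
On Ubuntu, that kind of bridge-as-primary-interface setup looks roughly like this in /etc/network/interfaces (names and addresses are illustrative, not our actual config):

auto br100
iface br100 inet static
    address 10.1.2.3
    netmask 255.255.255.0
    gateway 10.1.2.1
    bridge_ports eth2
    bridge_stp off
    bridge_fd 0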

So, that points to the problem being specific to the Mellanox driver, although the blog post I mentioned above described a similar problem with some Broadcom NICs that used the bnx2 driver.

  • Packets shouldn't be dropping like this. Your problem doesn't seem to have been debugged to the end. How far do the packets get? Do they enter the hardware but not leave it? Perhaps H/W offloading is to blame. Are they discarded before being sent? Perhaps rp_filter hates you, perhaps it is a driver/kernel buglet. – David Schmitt Sep 10 '12 at 11:49
  • No, they shouldn't. tcpdump shows the DNS request packets going out AND the reply packets coming back in. In 3.x kernels they're dropped (not all, but most, which implies a bug) if net.bridge.bridge-nf-call-iptables=1 (the default), but not dropped if we set it to 0 (which, of course, disables the DNAT we need). In 2.6.32 they're not dropped at all, even with the default bridge-nf-call-iptables=1. – cas Sep 10 '12 at 22:46
  • A seemingly trivial question, but do you have any other iptables rules besides the DNAT? – David Schmitt Sep 11 '12 at 07:58
  • Yeah, quite a few (and each VM started by openstack has its own chain with its own rules). We tested that other iptables rules work; the only one affected adversely by setting net.bridge.bridge-nf-call-iptables=0 is the DNAT rule, I think because it's the only rule that affects packets that never leave the compute node's bridge interface. – cas Sep 11 '12 at 14:05
  • have you tried contacting... Mellanox? – dyasny Sep 12 '12 at 09:02
  • Yes. We've also contacted the supplier of our systems, who have contacted Supermicro on our behalf. In my experience, you're better off proactively looking for a solution yourself than passively waiting for a vendor. Maybe they will come up with a solution first. If so, great. If not, then we haven't wasted time waiting for them. – cas Sep 12 '12 at 09:06

0 Answers