Weird answer of Linux DHCP server to gratuitous ARP causes DHCP failure

Question

I recently ran into this problem on a rather heavily configured Linux server. The machine runs Samba as an Active Directory domain controller, a mail server, a web server, two virtual machines (using KVM/QEMU, with one of their virtualized Ethernet interfaces connected to one of the machine's real Ethernet interfaces via a virtual bridge set up with brctl), and a few more services. On a private VLAN, it also operates a DHCP server. This had been working fine, but recently Apple devices started failing to get onto the network. A simple WiFi access point, which is configured to obtain its IP address via DHCP, also fails to receive one.

The endless loop in which devices try to obtain an IP address looks as follows (captured via tcpdump -e). Here, 2c:30:33:2b:68:d0 is the MAC address of the remote box, and 74:d0:2b:99:52:bc is that of the Linux server. After each packet, I have added my interpretation of it:

15:48:33.350358 2c:30:33:2b:68:d0 > ff:ff:ff:ff:ff:ff, ethertype IPv4 (0x0800), length 345: 0.0.0.0.68 > 255.255.255.255.67: BOOTP/DHCP, Request from 2c:30:33:2b:68:d0, length 303
(DHCPDISCOVER from 2c:30:33:2b:68:d0)

15:48:34.351523 74:d0:2b:99:52:bc > 2c:30:33:2b:68:d0, ethertype IPv4 (0x0800), length 345: 172.17.9.1.67 > 172.17.9.7.68: BOOTP/DHCP, Reply, length 303
(DHCPOFFER on 172.17.9.7 to 2c:30:33:2b:68:d0 via eth0.9)

15:48:34.366366 2c:30:33:2b:68:d0 > ff:ff:ff:ff:ff:ff, ethertype IPv4 (0x0800), length 357: 0.0.0.0.68 > 255.255.255.255.67: BOOTP/DHCP, Request from 2c:30:33:2b:68:d0, length 315
(DHCPREQUEST for 172.17.9.7 (172.17.9.1) from 2c:30:33:2b:68:d0 via eth0.9)

15:48:34.492289 74:d0:2b:99:52:bc > 2c:30:33:2b:68:d0, ethertype IPv4 (0x0800), length 345: 172.17.9.1.67 > 172.17.9.7.68: BOOTP/DHCP, Reply, length 303
(DHCPACK on 172.17.9.7 to 2c:30:33:2b:68:d0 via eth0.9)

15:48:34.492707 2c:30:33:2b:68:d0 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 60: Request who-has 172.17.9.7 tell 172.17.9.7, length 46
(gratuitous ARP of the newly registered box)

15:48:34.492761 74:d0:2b:99:52:bc > 2c:30:33:2b:68:d0, ethertype ARP (0x0806), length 42: Reply 172.17.9.7 is-at 74:d0:2b:99:52:bc, length 28
(this is the packet that I don't understand)

15:48:34.526375 2c:30:33:2b:68:d0 > ff:ff:ff:ff:ff:ff, ethertype IPv4 (0x0800), length 346: 0.0.0.0.68 > 255.255.255.255.67: BOOTP/DHCP, unknown (0x00), length 304
(DHCPDECLINE of 172.17.9.7 from 2c:30:33:2b:68:d0 via eth0.9; the box abandons the IP address 172.17.9.7 because it appears to be in use)
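tcpdump printed the last packet only as "unknown (0x00)", so the DHCPDECLINE reading comes from the interpretation above rather than from the capture line itself. One way to double-check it from raw capture bytes is to read DHCP option 53 (message type) directly. The following is a minimal Python sketch, assuming you already have the BOOTP/DHCP payload (the bytes after the UDP header); the function name `dhcp_message_type` and the synthetic payload in the demo are illustrative and not taken from the capture.

```python
# DHCP message types carried in option 53 (RFC 2132).
DHCP_MESSAGE_TYPES = {
    1: "DHCPDISCOVER", 2: "DHCPOFFER", 3: "DHCPREQUEST", 4: "DHCPDECLINE",
    5: "DHCPACK", 6: "DHCPNAK", 7: "DHCPRELEASE", 8: "DHCPINFORM",
}

MAGIC_COOKIE = b"\x63\x82\x53\x63"
FIXED_HEADER_LEN = 236  # fixed BOOTP fields that precede the magic cookie


def dhcp_message_type(bootp_payload: bytes) -> str:
    """Return the name of the DHCP message type found in option 53."""
    cookie = bootp_payload[FIXED_HEADER_LEN:FIXED_HEADER_LEN + 4]
    if cookie != MAGIC_COOKIE:
        raise ValueError("magic cookie missing, not a DHCP payload")
    options = bootp_payload[FIXED_HEADER_LEN + 4:]
    i = 0
    while i < len(options):
        code = options[i]
        if code == 0:        # pad option: single byte, no length field
            i += 1
            continue
        if code == 255:      # end option
            break
        length = options[i + 1]
        value = options[i + 2:i + 2 + length]
        if code == 53 and length == 1:
            return DHCP_MESSAGE_TYPES.get(value[0], f"unknown ({value[0]})")
        i += 2 + length
    raise ValueError("no option 53 in payload")


if __name__ == "__main__":
    # Synthetic payload (an assumption for the demo): empty BOOTP header,
    # cookie, option 53 with value 4, then the end option.
    demo = bytes(FIXED_HEADER_LEN) + MAGIC_COOKIE + b"\x35\x01\x04\xff"
    print(dhcp_message_type(demo))
```

Running tcpdump with `-v` should decode the same option, so this is mainly useful when working from a saved dump.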

DHCP clients that don't send the gratuitous ARP mentioned above can register without problems. So I assume the problem is the Linux server's response to that gratuitous ARP, in which it claims that the address belongs to the Linux machine itself.
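To make the suspected conflict concrete, here is a small Python sketch that parses two 28-byte ARP payloads and checks whether a reply claims the announced IP for a different MAC. The MAC and IP values are the ones from the trace; the target fields of the reply are not shown by tcpdump, so the values used for them are an assumption, and the helper names (`build_arp`, `parse_arp`, `is_gratuitous`, `claims_same_ip`) are made up for illustration.

```python
import socket
import struct
from typing import NamedTuple

ARP_REQUEST, ARP_REPLY = 1, 2


class Arp(NamedTuple):
    op: int
    sender_mac: str
    sender_ip: str
    target_mac: str
    target_ip: str


def build_arp(op, sender_mac, sender_ip, target_mac, target_ip) -> bytes:
    """Build an Ethernet/IPv4 ARP payload (28 bytes)."""
    def mac(s):
        return bytes.fromhex(s.replace(":", ""))
    return (struct.pack("!HHBBH", 1, 0x0800, 6, 4, op)
            + mac(sender_mac) + socket.inet_aton(sender_ip)
            + mac(target_mac) + socket.inet_aton(target_ip))


def parse_arp(payload: bytes) -> Arp:
    """Parse an Ethernet/IPv4 ARP payload into its address fields."""
    htype, ptype, hlen, plen, op = struct.unpack("!HHBBH", payload[:8])
    if (htype, ptype, hlen, plen) != (1, 0x0800, 6, 4):
        raise ValueError("not an Ethernet/IPv4 ARP payload")

    def mac(raw):
        return ":".join(f"{b:02x}" for b in raw)
    return Arp(op, mac(payload[8:14]), socket.inet_ntoa(payload[14:18]),
               mac(payload[18:24]), socket.inet_ntoa(payload[24:28]))


def is_gratuitous(pkt: Arp) -> bool:
    """Sender IP equals target IP: the host is announcing its own address."""
    return pkt.sender_ip == pkt.target_ip


def claims_same_ip(announcement: Arp, reply: Arp) -> bool:
    """True if the reply claims the announced IP for a different MAC."""
    return (reply.op == ARP_REPLY
            and reply.sender_ip == announcement.sender_ip
            and reply.sender_mac != announcement.sender_mac)


if __name__ == "__main__":
    client = parse_arp(build_arp(ARP_REQUEST, "2c:30:33:2b:68:d0", "172.17.9.7",
                                 "00:00:00:00:00:00", "172.17.9.7"))
    # Target fields of the server's reply are assumed, not taken from the trace.
    server = parse_arp(build_arp(ARP_REPLY, "74:d0:2b:99:52:bc", "172.17.9.7",
                                 "2c:30:33:2b:68:d0", "172.17.9.7"))
    print("gratuitous:", is_gratuitous(client))
    print("conflict:", claims_same_ip(client, server))
```

From the client's point of view, a reply like this is indistinguishable from another host already owning 172.17.9.7, which would explain the subsequent DHCPDECLINE.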

I have made sure that the IP addresses the DHCP server is supposed to hand out are not assigned to the Linux server itself. So I frankly don't know why that packet is sent.

All attempts to play with /proc/sys/net/ipv4/conf/eth0.9/arp_accept, /proc/sys/net/ipv4/conf/eth0.9/arp_announce and so on have failed. Even trying to filter out the bad ARP packets with arptables was unsuccessful, and turning ARP off entirely on the interface doesn't have the desired effect either.
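When experimenting with these settings, it can help to dump all ARP-related values for the interface next to the `all` scope, because for several of them the kernel combines the two (for example, proxy ARP is active if it is enabled in either place). Below is a minimal Python sketch that does this; the list of sysctl names is my own selection, and the function name `read_arp_sysctls` and its `root` parameter are made up for illustration.

```python
from pathlib import Path

# ARP-related per-interface sysctls under /proc/sys/net/ipv4/conf/<iface>/.
ARP_SYSCTLS = (
    "arp_accept", "arp_announce", "arp_filter", "arp_ignore",
    "arp_notify", "proxy_arp", "proxy_arp_pvlan",
)


def read_arp_sysctls(iface: str, root: str = "/proc/sys/net/ipv4/conf") -> dict:
    """Read the ARP sysctls for 'all' and for one interface.

    Returns a dict such as {"eth0.9/proxy_arp": "0", ...}; entries that
    cannot be read (missing file, no permission) are set to None.
    """
    values = {}
    for scope in ("all", iface):
        for name in ARP_SYSCTLS:
            path = Path(root) / scope / name
            try:
                values[f"{scope}/{name}"] = path.read_text().strip()
            except OSError:
                values[f"{scope}/{name}"] = None
    return values


if __name__ == "__main__":
    # eth0.9 is the VLAN interface from the question.
    for key, value in sorted(read_arp_sysctls("eth0.9").items()):
        print(f"{key} = {value}")
```

Comparing the output before and after a change makes it easier to rule out a setting under `all` overriding the per-interface value.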

Any idea why this weird packet is generated? Where could I look further?

If one of those other services is a router, I suppose it might be a [proxy ARP](https://en.wikipedia.org/wiki/Proxy_ARP)? That could happen, for example, if two subnets separated by a router are accidentally being bridged somehow. — Harry Johnston, Sep 22 '18 at 21:59
... there's a [question and answer here](https://unix.stackexchange.com/a/410674) (about identifying which process is generating an ARP request) which recommends using sysdig. Perhaps it would also work to identify the process generating an ARP response? — Harry Johnston, Sep 22 '18 at 22:01
I bet that what you describe as "gratuitous ARP" is actually an "ARP request" issued (to the broadcast Ethernet address) by the host that was just assigned 172.17.9.7, to check whether that IP is already assigned to someone else on the same LAN. That host _DOES_ receive an "ARP reply" and... that's why (I guess) it DECLINEs the previous DHCP assignment. So, please, triple-check those ARP packets (`wireshark` should be a better troubleshooting tool). `arping` might help as well. Afterwards, let us know the findings. — Damiano Verzulli, Sep 23 '18 at 09:03
@HarryJohnston: Proxy ARP is definitely off. I thought of that, too. — Kai Petzke, Sep 24 '18 at 12:40
@DamianoVerzulli I agree with you that the gratuitous ARP packet is issued by the box to which 172.17.9.7 was newly assigned. I call it gratuitous ARP because it has the same IP address both as source and as the requested address. As such, it serves a double role: it announces which MAC the newly assigned IP belongs to, and it triggers responses from any other hosts that have the same IP assigned, to avoid duplicate assignment. Unfortunately the Linux server answers wrongly, and the box then does the (in my opinion correct) DHCPDECLINE. — Kai Petzke, Sep 24 '18 at 12:54
"Unfortunately the linux server answers wrongly": I find it hard to believe that your Linux box is "wrong". It's much more probable that some strange configuration is leading it to behave this way. Please: 1) with your intended DHCP client turned OFF and DISCONNECTED from the network, launch (on the Linux box) an `arping -I eth0 172.17.9.7` (where eth0 is the physical interface connecting the Linux box to the LAN); 2) launch a `tcpdump -s 0 -w /tmp/sniff.dump -n -i eth0` on the Linux box, repeat the DHCP assignment and.... — Damiano Verzulli, Sep 24 '18 at 17:53
.... send us the `sniff.dump` file? Please NOTE that it will capture ALL the traffic flowing through the `eth0` interface. So, please, take **EXTRA_CARE** to **NOT** do any other activities, in order to lower the risk of having sensitive data in the file. Please, AGAIN, **TAKE CARE** and **DOUBLE CHECK** that it doesn't contain anything privacy-relevant — Damiano Verzulli, Sep 24 '18 at 17:54
@SusanW Yes, we found the reason after all. I think it was related to IPsec (with the StrongSwan implementation) running on the same box, and IPsec wrongly assuming that it was doing some bridging. I don't fully recall whether it was an admin error in the IPsec configuration or whether the StrongSwan implementation is buggy in certain cases. Moving from StrongSwan to Wireguard resolved the issue. — Kai Petzke, Dec 27 '20 at 18:09
@KaiPetzke Wow, ok that sounds nasty! Hmm, well, that's not my situation (Arduino, no IPSec involved) - and there are a few differences in your packet traces compared to mine... But thanks very much for the reply! — SusanW, Dec 28 '20 at 13:43
