Large AXFR through dnsmasq causes dig to hang with partial results

Question

I'm trying to set up dnsmasq as a local DNS cache in front of Consul. Normal dig queries work fine, but zone transfers (AXFR) through dnsmasq return only partial results.

My resolv.conf:

search x.domain.com y.domain.com z.domain.com domain.com
nameserver 127.0.0.1
nameserver 10.0.0.1
nameserver 10.0.0.2
nameserver 10.0.0.3
options timeout:2 attempts:3

My dnsmasq configuration only forwards requests for .consul to port 3000 on the same box, which is my Consul DNS port. Otherwise, I'm using the dnsmasq defaults (which forward requests to the other DNS servers in resolv.conf).

server=/consul/127.0.0.1#3000

This works fine for normal queries and returns the result from the local server, e.g. dig consul.service.consul +short returns:

10.22.1.15
10.22.1.16
10.22.1.17

as expected. Normal DNS (forwarded to the BIND DNS servers) also works fine, e.g. dig host.hosts.domain.com +short returns 10.22.1.23.

However, when I run a zone transfer with dig axfr us1.domain.com, dig returns about 700 lines and then hangs, always at the same place. If I add +retry=0, dig prints ;; connection timed out; no servers could be reached after those 700 lines.

When I request the zone transfer directly from the upstream (BIND) DNS server, it returns all 22k records as expected.

With memory debugging turned on for dig (-m flag), it seems to hang at

del 0x7f5c8131e010 size 152 file timer.c line 390 mctx 0x17572d0

When I press Ctrl+C, dig prints a backtrace, which I tracked down to dig believing the request isn't finished (which I suppose is true):

dighost.c:3831: REQUIRE(sockcount == 0) failed, back trace
#0 0x7f5c802c4227 in ??
#1 0x7f5c802c417a in ??
#2 0x41212d in ??
#3 0x405e0e in ??
#4 0x7f5c7de2f445 in ??
#5 0x405e6e in ??
Aborted (core dumped)

With extra dnsmasq logging enabled, I can see this for the axfr:

Oct 04 12:17:41 hostname.hosts.domain.com dnsmasq[16055]: forwarded us1.domain.com to 10.0.0.1
Oct 04 12:17:41 hostname.hosts.domain.com dnsmasq[16055]: reply _kerberos.us1.domain.com is DOMAIN.COM
Oct 04 12:17:41 hostname.hosts.domain.com dnsmasq[16055]: reply consul-acl.prod.us1.domain.com is us4

And in the upstream bind logs:

Oct  4 12:20:07 upstreamdns named[17388]: client 10.22.10.20#42228: transfer of 'us1.domain.com/IN': AXFR started
Oct  4 12:20:07 upstreamdns named[17388]: client 10.22.10.20#42228: transfer of 'us1.domain.com/IN': AXFR ended

I suspect this has something to do with maximum packet sizes, but I've tried varying several settings, from different cache sizes to increasing DNS forwards and edns-packet-max, without success. It's very strange that requesting the AXFR from the upstream DNS server works fine, but through dnsmasq it only returns a partial result before causing dig to hang.

Edit: I tried dnsmasq version 2.76, and also compiled the latest official commit 7cbf497da4100ea0d1c1974b59f9503e15a0cf80, with the same results.

I'm running CentOS Linux release 7.5.1804 (Core).

New Information

After running tcpdump both with and without dnsmasq, I can see that the response is split into two packets. For some reason, dig never receives the second packet when querying through dnsmasq, so it just hangs. The packets are 2521 bytes and 189 bytes, if that means anything to anybody.

score 0 · Accepted Answer · answered Oct 11 '18 at 09:47

According to Simon Kelley (dnsmasq creator), this issue is caused by the zone transfer not fitting in a single DNS message (at most 65,535 bytes over TCP, because of the two-byte length prefix), and dnsmasq doesn't implement the continuation methods used to spread a transfer across multiple messages.

Therefore, large zone transfers will not work through dnsmasq, and the advised workaround is to talk directly to the upstream authoritative server, e.g. dig axfr us1.domain.com @10.0.0.1.
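
To illustrate why a large transfer needs more than one message, here is a minimal Python sketch of the DNS-over-TCP framing described in RFC 1035 (section 4.2.2), where every message is preceded by a two-byte length field. The function names (frame, split_tcp_dns_messages) are made up for this illustration and are not taken from dnsmasq or dig; the payloads are dummy bytes, not real DNS messages.

```python
import struct


def frame(message: bytes) -> bytes:
    """Prefix one DNS message with its two-byte big-endian length.

    Because the length field is 16 bits, struct.pack raises struct.error
    for any message longer than 65535 bytes.
    """
    return struct.pack("!H", len(message)) + message


def split_tcp_dns_messages(stream: bytes):
    """Split a raw DNS-over-TCP byte stream into individual messages.

    Returns (complete_messages, leftover_bytes). Leftover bytes belong to a
    message that has not fully arrived yet, so a client has to keep reading.
    """
    messages = []
    offset = 0
    while offset + 2 <= len(stream):
        (length,) = struct.unpack_from("!H", stream, offset)
        end = offset + 2 + length
        if end > len(stream):
            break  # incomplete message: wait for more data
        messages.append(stream[offset + 2:end])
        offset = end
    return messages, stream[offset:]


if __name__ == "__main__":
    # Two dummy "messages" standing in for the parts of a large AXFR reply.
    stream = frame(b"\x01" * 300) + frame(b"\x02" * 40)
    messages, leftover = split_tcp_dns_messages(stream)
    print(len(messages), len(leftover))
```

A zone that doesn't fit in one message has to be sent as a sequence of such frames on the same TCP connection, and the client keeps reading frames until the transfer is complete. A forwarder that only relays the first frame would leave the client waiting for the rest, which is consistent with the hang described in the question.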
