
I have a standalone, isolated network running mixed Windows and Linux systems, with a Windows 2008 R2 server performing AD duties and DNS.

I'm seeing 5-second delays with the use of getaddrinfo on the Linux systems.

In Wireshark I see (C->S means client to DNS server):

t=0.000   C->S Query A     foo.example.com    ID=0x1111
t=0.000   C->S Query AAAA  foo.example.com    ID=0x2222
t=0.004   S->C Response to 0x2222, No error
          (Query is echoed)
          Authoritative nameservers:
             example.com: type SOA, class IN, mname svr01.example.com
               Name: example.com
               Type: SOA
               Class: IN
               TTL: 1 hour
               Primary name server: svr01.example.com
               Refresh interval: 15 minutes
               Retry interval: 10 minutes
               Expiration limit: 1 day
               Minimum TTL: 1 hour

[5 second delay]

t=5.004   C->S Query A     foo.example.com    ID=0x1111
t=5.005   S->C Query response A  192.168.1.17

If I make the same request again, shortly thereafter, I will see no delay, as expected:

t=0.000   C->S Query A     foo.example.com    ID=0x3333
t=0.000   C->S Query AAAA  foo.example.com    ID=0x4444
t=0.001   S->C Query response A  192.168.1.17

I can continue to get immediate responses for some period of time. After a while (still experimenting) the delay will return.

What is going on here? If I use gethostbyname() (which only does IPv4) or nslookup foo.example.com, there is no delay.
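The difference between the two lookup styles can also be timed directly, without a packet capture. The following is only a minimal sketch, assuming Python is available on one of the Linux clients; `time_lookup` is a hypothetical helper name, and `foo.example.com` is the placeholder name from the capture above. With glibc, an `AF_UNSPEC` lookup normally sends both the A and AAAA queries, while `AF_INET` sends only the A query, much like gethostbyname().

```python
import socket
import time


def time_lookup(host, family=socket.AF_UNSPEC):
    """Return (elapsed_seconds, sorted addresses) for a single getaddrinfo call."""
    start = time.monotonic()
    infos = socket.getaddrinfo(host, None, family, socket.SOCK_STREAM)
    elapsed = time.monotonic() - start
    # Each result is (family, type, proto, canonname, sockaddr); sockaddr[0] is the address.
    addresses = sorted({info[4][0] for info in infos})
    return elapsed, addresses


if __name__ == "__main__":
    # Placeholder host name; substitute a name served by the DNS server under test.
    host = "foo.example.com"
    for label, family in (("AF_UNSPEC", socket.AF_UNSPEC), ("AF_INET", socket.AF_INET)):
        try:
            elapsed, addresses = time_lookup(host, family)
            print(f"{label}: {elapsed:.3f}s {addresses}")
        except socket.gaierror as exc:
            print(f"{label}: lookup failed: {exc}")
```

If the behaviour described above is present, the `AF_UNSPEC` line would be expected to show roughly the 5-second delay on a cold lookup, while the `AF_INET` line would not.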

Additional info:

  • IPv6 is disabled on the server NICs

Update:

This answer on Ask Ubuntu suggested adding

options single-request

to /etc/resolv.conf. This seemed to correct the problem for me.
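When rolling this workaround out to several clients, it may help to verify which resolver options each one actually has configured. The sketch below is an illustration under assumptions, not a full resolv.conf parser: `resolver_options` is a hypothetical helper, it only skips comment lines that start with `#` or `;`, and it treats multiple `options` lines as cumulative. The nameserver address in the sample text is a made-up documentation address.

```python
def resolver_options(resolv_conf_text):
    """Collect every token that follows an 'options' keyword in resolv.conf text."""
    options = []
    for line in resolv_conf_text.splitlines():
        # Comment lines begin with '#' or ';' in the first column.
        if line.startswith(("#", ";")):
            continue
        fields = line.split()
        if fields and fields[0] == "options":
            options.extend(fields[1:])
    return options


if __name__ == "__main__":
    # Hypothetical file contents; on a real client, read /etc/resolv.conf instead.
    sample = "nameserver 192.0.2.1\noptions single-request timeout:2\n"
    opts = resolver_options(sample)
    print("single-request enabled:", "single-request" in opts)
```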

However, I'm still curious:

  • What the SOA record actually means
  • Why the server doesn't respond the first time to the A query
Jonathon Reinhart
  • To clarify, is this an issue on all of the Linux systems using this DNS server, or only some? Depending on the number of requests generated, where they are generated, etc., there could be DNS caching issues with IPv6 record lookups. My experience, though, is mostly with Linux-based servers running DNS. – Matt Jul 14 '15 at 14:18
  • All of the Linux systems (CentOS 6.2) are affected. There's no caching on the client-side (`nscd` is not enabled). – Jonathon Reinhart Jul 14 '15 at 14:20
  • And all of the Linux systems also have IPv6 disabled? On a somewhat related note is there any connected networking equipment like switches or routers that could be generating IPv6 requests? Managed switches can exhibit some odd behavior under the right circumstances. – Matt Jul 14 '15 at 14:22
  • There are only link-local addresses configured. No DHCPv6 or SLAAC. – Jonathon Reinhart Jul 14 '15 at 14:26
  • This sounds an awful lot like a buggy DNS server. Notice how the client waits and then retransmits the request because it received no response from the server. (Five seconds sounds like too large a delay, but the real problem isn't the timeout but rather that no response is generated in the first place.) I suspect the condition which triggers the bug is two requests in parallel for different record types on the same domain. – kasperd Jul 14 '15 at 14:31
  • I reread the answer and realized that it solved an issue I had in the past but was not related to this one, so I deleted it. sorry about the mis post and thanks for the update. @kasperd, this seems like a reasonable root cause. It would be interesting to run a secondary DNS server on one of the Linux system to see if it exhibits the same behavior – Matt Jul 14 '15 at 14:32
  • @kasperd I agree, it seems like something is wrong with the DNS server. I see no options to further disable IPv6 records, etc., but I'm not really a Windows server guy either. `options single-request` seemed to make the problem go away. – Jonathon Reinhart Jul 14 '15 at 14:41
  • @JonathonReinhart Disabling IPv6 may have been a valid workaround in 2003, but it is now 2015. IPv4 addresses ran out 4 years ago, and in 3 years half the world will be running IPv6. Taking any steps to disable IPv6 rather than fixing root causes means you are not doing your job. It will come back and bite you later once you have to debug a problem caused by disabling IPv6. – kasperd Jul 14 '15 at 15:43
  • @kasperd I agree with your general sentiment, but this is an isolated network, not connected to the outside world. With the amount of legacy equipment connected, keeping IPv6 disabled was by far the best route in this case. – Jonathon Reinhart Jul 14 '15 at 15:49
  • @JonathonReinhart For legacy systems so old that IPv6 is not supported, it makes no difference whether you enable it on your network or not. For legacy systems not quite that old, you should have enabled IPv6 on your network soon enough to notice such problems while you could still get them fixed. Btw questions about systems that are no longer supported are considered off-topic on this site. – kasperd Jul 14 '15 at 16:02
  • Do you have a firewall between the systems? Some firewalls (e.g. [Juniper](http://serverfault.com/a/411178/126632)) screw with DNS traffic in this manner. – Michael Hampton Jul 14 '15 at 17:26

2 Answers


Your DNS server appears to be buggy. Two requests are sent to the DNS server, but it sends only a single reply. The client does what clients are supposed to do in that case: it waits a short while and then retransmits the request.

An initial delay of 5 seconds may be reasonable for non-interactive usage. But for interactive usage I would consider that to be way too high.

The proper fix would be to upgrade the DNS server to a version without the bug or to contact the vendor if no fix has been released yet. Everything else is a workaround.

Running man resolv.conf on an Ubuntu system will explain what the single-request and single-request-reopen options do. Those are two different variations of a workaround for a known bug in certain DNS servers. The drawback of those options is that they slow down name resolution by roughly a factor of two. However, given that the bug appears to slow down name resolution by a factor of about 1000, you may still be better off using the workaround.

When requesting a nonexistent record you may receive a response with a SOA record instead. The reason for sending not just an error code but also a SOA record is that the SOA record contains information which will allow the negative result to be cached.

kasperd
  • I'm pretty sure this is _not_ the fault of the Windows DNS server. It is more than capable of handling this traffic, since Server 2008. – Michael Hampton Jul 14 '15 at 17:29
  • Shouldn't the response with the SOA record I'm seeing also give an error code? This certainly sounds like a misconfiguration of the Windows DNS server, but I can't seem to pinpoint it. – Jonathon Reinhart Jul 14 '15 at 17:39
  • @JonathonReinhart When requesting a record on a nonexisting domain you will get an `NXDOMAIN` error code and usually a `SOA` record. But requesting a nonexisting record type on an existing domain will give you a successful status (`NOERROR`) back and usually a `SOA` record. Notice that in neither case will the `SOA` record be in the answer section of the reply. – kasperd Jul 14 '15 at 18:15
  • @MichaelHampton The `resolv.conf` documentation isn't specific about which DNS servers are buggy. All it says is: `Some appliance DNS servers` and `Some hardware`. The observed symptoms are consistent with the described bug, and the workaround appears to work. I am not saying Windows is to blame for this. I won't even claim to know for sure what OS the DNS server is really running. – kasperd Jul 14 '15 at 18:31
  • It's also consistent with an interfering firewall, which I've also seen before. The OP has yet to respond to the question I asked him about this, though. As for Windows, MS says it should work fine on 2008 and later. – Michael Hampton Jul 14 '15 at 18:35
  • @MichaelHampton Sure a firewall can interfere with traffic in more or less every imaginable way. I have even seen one configured to hijack all port 53 traffic and send it to a different DNS server than the specified destination IP address. – kasperd Jul 14 '15 at 18:41

The correct way to interpret your packet capture is that you're seeing dropped reply packets for both the A and AAAA record responses.

The SOA record seems to be confusing you and is worth elaboration:

  • The SOA record is actually in the authority section, not the answer section.
  • NXDOMAIN means "there are no records that have that name". If there are other records with the same name, but different types, the response you will see is NOERROR with zero records in the answer section.
  • What you're seeing is a NOERROR response with zero answers and an authority section telling you what zone that answer came from. You can ignore the SOA component entirely. This reply is telling you that there is no AAAA record.
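The distinction drawn in the list above can be read straight from the 12-byte DNS header. The sketch below is only an illustration: `classify_response` is a hypothetical helper, the label "NODATA" is the common name for a NOERROR reply with an empty answer section, and the function deliberately ignores referrals and everything after the header.

```python
import struct

RCODE_NOERROR = 0
RCODE_NXDOMAIN = 3


def classify_response(message):
    """Classify a raw DNS message using only its 12-byte header."""
    if len(message) < 12:
        raise ValueError("a DNS header is 12 bytes long")
    # Header fields: ID, flags, QDCOUNT, ANCOUNT, NSCOUNT, ARCOUNT (network byte order).
    _msg_id, flags, _qdcount, ancount, _nscount, _arcount = struct.unpack(
        "!6H", message[:12]
    )
    if not flags & 0x8000:  # QR bit clear: this is a query, not a response
        return "query"
    rcode = flags & 0x000F
    if rcode == RCODE_NXDOMAIN:
        return "NXDOMAIN"  # no records of any type exist for the name
    if rcode == RCODE_NOERROR:
        # The name exists; an empty answer section means no record of this type.
        return "answer" if ancount else "NODATA"
    return f"rcode {rcode}"


if __name__ == "__main__":
    # Header resembling the AAAA reply in the question: NOERROR, 0 answers, 1 authority (SOA).
    header = struct.pack("!6H", 0x2222, 0x8580, 1, 0, 1, 0)
    print(classify_response(header))
```

Under these assumptions, the AAAA reply in the capture falls into the NODATA case: a valid negative answer rather than an error.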

Now that we've established that the AAAA reply is a correctly formatted packet and what you should be seeing in this scenario, it changes the context of what you're looking at entirely. You are seeing cases where A record replies are being lost, in addition to AAAA replies being lost. Your research suggests that AAAA responses are being lost more frequently, but not exclusively.

Based on the information supplied, we're not going to be able to explain what is going on here. You need to set up packet captures on the DNS servers themselves and identify the following factors:

  1. Do the queries associated with the missing replies actually arrive at the DNS server?
  2. If the queries are arriving at the DNS server, are replies actually being sent?
  3. If the server is not sending replies, is your DNS server having to get this information from a different DNS server that is taking a long time to respond? (times out on the initial attempt, but has the query in cache for the second attempt) Are you seeing heavy enough query load to overflow your socket queue?
  4. If the server is sending the replies, what devices between the server and the client could be losing the packets? Does one of your DNS servers have a routing problem compared to the others? Does it seem like packets are being lost from all of the DNS servers, suggesting a network problem somewhere between the client and server?

As you can see, there are a lot of things that could be going on here. You're going to need to narrow in on the problem to rule out possibilities. I apologize for this answer not being conclusive, but this was far more than could be covered within a few comments. Feel free to update your question.

Andrew B
  • Thanks for your explanation. This problem is so consistent in nature, that I'm sure it's not random overflowing of packet queues, etc. It's a fairly small network, with only one DNS server, and nothing complicated going on in terms of routing. I'll try to get a packet capture at the server to see if it's seeing all of the requests (which I'm ~sure it is), and if it's sending responses. – Jonathon Reinhart Jul 14 '15 at 18:19
  • It's *more frequent*, but not exclusive to `AAAA` records. That's rather telling to me if your packet capture is accurate. Can you review the update I just made to item #3? If this DNS server (recursive) is having to talk to another DNS server (authoritative) for the answer, it's possible that you're looking at a communication problem between the two. `AAAA` records would be impacted more frequently than `A` records because there is a higher probability of the `A` records being in cache. – Andrew B Jul 14 '15 at 18:24