
After upgrading BIND to 9.8.2rc1-RedHat-9.8.2-0.37.rc1.el6_7.2 on a few caching nameservers, I've noticed they are issuing a large number of outgoing NS queries, with no change in incoming traffic volume or patterns. As a result, the servers are consuming much more CPU and network bandwidth, which has led to performance and capacity issues.

This did not happen with the previously installed versions 9.8.2rc1-RedHat-9.8.2-0.30.rc1.el6_6.1 and 9.8.2-0.30.rc1.el6_6.3 (the last version on CentOS 6.6), and the change in the graphs matches the time of the upgrade.

The graphs are below; the brown band corresponds to NS queries. The gaps are due to the server restarts after upgrading BIND.

incoming queries: queries_in

outgoing queries: queries_out
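
For reference, the per-type counters behind these graphs can also be dumped with rndc; a rough sketch, assuming the statistics-file and chroot paths from the configuration further down:

    # dump server statistics to the configured statistics-file
    rndc stats
    # the "Outgoing Queries" section breaks outgoing queries down by RR type;
    # with the chroot, the file ends up under /var/named/chroot on the host
    grep -A 15 'Outgoing Queries' /var/named/chroot/var/named/named.stats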

A tcpdump shows thousands of queries per second asking for NS records for every queried hostname. This is odd, as I expected to see an NS query for the domain (example.com) and not for the host (www.example.com).

16:19:42.299996 IP xxx.xxx.xxx.xxx.xxxxx > 198.143.63.105.53:  45429% [1au] NS? e2svi.x.incapdns.net. (49)
16:19:42.341638 IP xxx.xxx.xxx.xxx.xxxxx > 198.143.61.5.53:    53265% [1au] NS? e2svi.x.incapdns.net. (49)
16:19:42.348086 IP xxx.xxx.xxx.xxx.xxxxx > 173.245.59.125.53:  38336% [1au] NS? www.e-monsite.com. (46)
16:19:42.348503 IP xxx.xxx.xxx.xxx.xxxxx > 205.251.195.166.53: 25752% [1au] NS? moneytapp-api-us-1554073412.us-east-1.elb.amazonaws.com. (84)
16:19:42.367043 IP xxx.xxx.xxx.xxx.xxxxx > 205.251.194.120.53: 24002% [1au] NS? LB-lomadee-adservernew-678401945.sa-east-1.elb.amazonaws.com. (89)
16:19:42.386563 IP xxx.xxx.xxx.xxx.xxxxx > 205.251.194.227.53: 40756% [1au] NS? ttd-euwest-match-adsrvr-org-139334178.eu-west-1.elb.amazonaws.com. (94)
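
To get a quick rate estimate, one can count the NS queries leaving the server over a short window; a rough sketch (the interface name and the 10-second window are only illustrative):

    # rough count of NS queries sent towards port 53 in ~10 seconds
    timeout 10 tcpdump -i eth0 -nn 'udp dst port 53' 2>/dev/null | grep -c ' NS? '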

A tcpdump of a single client request shows:

## client query
17:30:05.862522 IP <client> > <my_server>.53: 1616+ A? cid-29e117ccda70ff3b.users.storage.live.com. (61)

    ## recursive resolution (OK)
    17:30:05.866190 IP <my_server> > 134.170.107.24.53: 64819% [1au] A? cid-29e117ccda70ff3b.users.storage.live.com. (72)
    17:30:05.975450 IP 134.170.107.24.53 > <my_server>: 64819*- 1/0/1 A 134.170.111.24 (88)

    ## garbage NS queries
    17:30:05.984892 IP <my_server> > 134.170.107.96.53: 7145% [1au] NS? cid-29e117ccda70ff3b.users.storage.live.com. (72)
    17:30:06.105388 IP 134.170.107.96.53 > <my_server>: 7145- 0/1/1 (158)

    17:30:06.105727 IP <my_server> > 134.170.107.72.53: 36798% [1au] NS? cid-29e117ccda70ff3b.users.storage.live.com. (72)
    17:30:06.215747 IP 134.170.107.72.53 > <my_server>: 36798- 0/1/1 (158)

    17:30:06.218575 IP <my_server> > 134.170.107.48.53: 55216% [1au] NS? cid-29e117ccda70ff3b.users.storage.live.com. (72)
    17:30:06.323909 IP 134.170.107.48.53 > <my_server>: 55216- 0/1/1 (158)

    17:30:06.324969 IP <my_server> > 134.170.107.24.53: 53057% [1au] NS? cid-29e117ccda70ff3b.users.storage.live.com. (72)
    17:30:06.436166 IP 134.170.107.24.53 > <my_server>: 53057- 0/1/1 (158)

## response to client (OK)
17:30:06.438420 IP <my_server>.53 > <client>: 1616 1/1/4 A 134.170.111.24 (188)
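
The pattern can be reproduced with a single lookup against the resolver while capturing the upstream traffic; a sketch using the same name as above (<my_server> is the placeholder from the captures):

    # one recursive lookup is enough to trigger the extra NS queries
    dig @<my_server> cid-29e117ccda70ff3b.users.storage.live.com A +noall +answer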

I thought this could be a cache population issue, but it did not subside even after the server had been running for a week.

Some details:

  • The issue did not happen on fully patched CentOS 6.6 x86_64.
  • The servers are running CentOS 6.7 x86_64 (fully patched as of 2015-08-13).
  • BIND runs in a chrooted environment with ROOTDIR=/var/named/chroot and OPTIONS="-4 -n4 -S 8096" (see the sketch after this list).
  • The redacted named.conf contents are below.
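
On RHEL/CentOS 6 those daemon settings normally live in /etc/sysconfig/named; a minimal sketch matching the values above (the file contents are assumed, not copied from the actual host):

    # /etc/sysconfig/named (assumed location on RHEL/CentOS 6)
    ROOTDIR=/var/named/chroot
    OPTIONS="-4 -n4 -S 8096"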

What is going on here? Is there a way to change the configuration to avoid this behaviour?

acl xfer {
(snip)
};

acl bogusnets {
0.0.0.0/8; 1.0.0.0/8; 2.0.0.0/8; 192.0.2.0/24; 224.0.0.0/3;
};

acl clients {
(snip)
};

acl privatenets {
127.0.0.0/24; 10.0.0.0/8; 172.16.0.0/12; 192.168.0.0/16;
};

acl ops {
(snip)
};

acl monitoring {
(snip)
};

include "/etc/named.root.key";
key rndckey {
        algorithm       hmac-md5;
        secret          (snip);
};

key "monitor" {
        algorithm hmac-md5;
        secret (snip);
};

controls { inet 127.0.0.1 allow { localhost; } keys { rndckey; };
           inet (snip) allow { monitoring; } keys { monitor; }; };

logging {
        channel default_syslog { syslog local6; };
        category lame-servers { null; };
        channel update_debug {
                 file "/var/log/named-update-debug.log";
                 severity  debug 3;
                 print-category yes;
                 print-severity yes;
                 print-time     yes;
        };
        channel security_info    {
                 file "/var/log/named-auth.info";
                 severity  info;
                 print-category yes;
                 print-severity yes;
                 print-time     yes;
        };
        channel querylog {
                file "/var/log/named-querylog" versions 3 size 10m;
                severity info;
                print-category yes;
                print-time     yes;
        };

        category queries { querylog; };
        category update { update_debug; };
        category security { security_info; };
        category query-errors { security_info; };
};

options {
        directory "/var/named";
        pid-file "/var/run/named/named.pid";
        statistics-file "/var/named/named.stats";
        dump-file "/var/named/named_dump.db";
        zone-statistics yes;
        version "Not disclosed";

        listen-on-v6 { any; };
        allow-query { clients; privatenets; };
        recursion yes;                             // default
        allow-recursion { clients; privatenets; };
        allow-query-cache { clients; privatenets; };
        recursive-clients 10000;
        resolver-query-timeout 5;
        dnssec-validation no;
        querylog no;

        allow-transfer { xfer; };
        transfer-format many-answers;
        max-transfer-time-in 10;
        notify yes;                                // default

        blackhole { bogusnets; };

        response-policy {
                zone "rpz";
                zone "netrpz";
        };
};

include "/etc/named.rfc1912.zones";
include "/etc/named.zones";

statistics-channels { inet (snip) port 8053 allow { ops; }; inet 127.0.0.1 port 8053 allow { 127.0.0.1; }; };

zone "rpz" { type slave; file "slaves/rpz"; masters { (snip) }; };
zone "netrpz" { type slave; file "slaves/netrpz"; masters { (snip) }; };
  • Can you verify the servers are actually still caching? This behaviour looks like a pure forwarding name server. – tomstephens89 Aug 13 '15 at 16:06
  • Also, can you please tell us if you're logging the QUERY your DNS is **RECEIVING** from legitimate clients? A brief description (should it be needed) is available [here](https://help.ubuntu.com/community/BIND9ServerHowto#Logging). I'm asking because, from your tcpdump, it looks like you've obfuscated your DNS server's main IP address, while it would be even more interesting to check "who" is asking "what" of it. – Damiano Verzulli Aug 13 '15 at 16:15
  • 2
    Also, please can you elaborate a bit more about the upgrade you've performed? Are you absolutely sure that **none** of the configuration files have been replaced by the ones included in the new version/RPMs? – Damiano Verzulli Aug 13 '15 at 16:25
  • If no config has changed, I suppose you'll want to look really closely at Red Hat's changelog between these versions for anything that could possibly be relevant to this problem. It would probably have been easier to tell which change caused it if the updates had been installed as they were released, instead of leaping ahead a year's worth of updates at once. – Håkan Lindqvist Aug 13 '15 at 16:31
  • My C6.7 bind isn't doing this, so I have to think it may be configuration, in which case we could really do with seeing the (unredacted) config files. – MadHatter Aug 13 '15 at 16:56
  • To answer some of the comments: yes, the servers are caching; none of the config files (including /etc/named.conf) were changed by the upgrade; I've added details to the question. – André Fernandes Aug 13 '15 at 17:08
  • As for the 2nd tcpdump (client request), please either: 1) confirm the response never got back, or 2) add ALL the packets related to the name resolution process (even better, add the tcpdump command you launched for the capture). – Damiano Verzulli Aug 13 '15 at 19:39
  • Added the query responses to the tcpdump output in the question. The command was simply `tcpdump -i eth0 -nn port 53`. – André Fernandes Aug 14 '15 at 10:03

1 Answer


The change in behaviour seems to be related to this changelog entry (from Red Hat's site):

2015-02-19 12:00:00
Tomas Hozza <thozza@redhat.com> 32:9.8.2-0.35.rc1:
- Enable RPZ-NSIP and RPZ-NSDNAME during compilation (#1176476)

NSDNAME enables a filtering policy based on the authoritative nameserver; for example, one can write:

a.ns.facebook.com.rpz-nsdname CNAME .

which blocks responses for any record that has a.ns.facebook.com as its authoritative server.
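
In context, such a trigger sits in an otherwise ordinary RPZ zone file; a minimal sketch (the SOA/NS values and serial are placeholders, not taken from the actual zone):

    $TTL 300
    @   IN SOA  localhost. root.localhost. ( 1 3600 600 86400 300 )
        IN NS   localhost.
    ; block any answer whose authoritative server matches this name
    a.ns.facebook.com.rpz-nsdname   CNAME   .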

We had a stray entry at the top of our RPZ zone file:

ns.none.somewhere.rpz-nsdname   CNAME   .

Removing this entry makes the behaviour stop.
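
Since rpz is a slave zone in this setup, the record has to be removed on the master (with the serial bumped); a sketch of how one might pull in the fix and check that the NS storm subsides (the grep count is only a quick indicator):

    # force a fresh transfer of the fixed zone from the master
    rndc retransfer rpz
    # the outgoing NS query rate should drop back to normal within seconds
    timeout 10 tcpdump -i eth0 -nn 'udp dst port 53' 2>/dev/null | grep -c ' NS? '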

Unfortunately, adding any NSDNAME entry triggers the same behaviour again.

According to this article, the CPU consumption of the RPZ feature is optimized in BIND 9.10; a patch for this will only be available in RHEL 7.
