I haven't touched my BIND configuration, but it looks like the upgrade to Debian Jessie has broken it. Maybe some new options were introduced, or existing ones now behave differently, but I cannot find what's going wrong.

I get SERVFAIL errors in my /var/log/bind/bind.log all the time.

I've checked my zones with named-checkzone and they are all 'OK'. I've disabled IPv6 system-wide. I recreated the rndc key and even created /etc/rndc.conf. Nothing works.

Here are some configs:

/etc/bind/named.conf

include "/etc/bind/named.conf.options";
include "/etc/bind/named.conf.log";
include "/etc/bind/named.conf.local";
//include "/etc/bind/named.conf.default-zones";

acl localhost_acl {
        127.0.0.0/8;
};

acl internal_10_acl {
        192.168.10.0/24;
};

acl internal_150_acl {
        192.168.150.0/24;
};

acl vpn_acl {
        192.168.200.2;
        192.168.200.5;
};

key "rndc-key" {
        algorithm hmac-md5;
        secret "somesecretkey==";
};

controls {
        inet 127.0.0.1 port 953
                allow { 127.0.0.1; } keys { "rndc-key"; };
};

/etc/bind/named.conf.options

options {
        directory "/var/cache/bind";
        dnssec-validation auto;
        auth-nxdomain no;    # conform to RFC1035
        listen-on-v6 { none; };
        listen-on {
                127.0.0.1;
                192.168.10.1;
                192.168.150.1;
                192.168.200.1;
        };
        allow-transfer { none; };
        max-recursion-queries 200;
};

/etc/bind/named.conf.log

logging {

    channel update_debug {

            file "/var/log/bind/update_debug.log" versions 3 size 100k;
            severity debug;
            print-severity  yes;
            print-time      yes;

    };

    channel security_info {

            file "/var/log/bind/security_info.log" versions 1 size 100k;
            severity debug;
            print-severity  yes;
            print-time      yes;

    };

    channel bind_log {

            file "/var/log/bind/bind.log" versions 3 size 1m;
            severity debug;
            print-category  yes;
            print-severity  yes;
            print-time      yes;

    };

    category default { bind_log; };
    category lame-servers { security_info; };
    category update { update_debug; };
    category update-security { update_debug; };
    category security { security_info; };

};

/etc/bind/named.conf.local (this is a long one):

// 1
view "internal_10_view" {

        allow-query-on { 127.0.0.1; 192.168.10.1; };
        allow-query { localhost_acl; internal_10_acl; };
        match-clients { localhost_acl; internal_10_acl; };

        zone "myhost.tld" {
                type master;
                file "/etc/bind/db.myhost.tld_10";
        };

        zone "168.192.in-addr.arpa" {
                type master;
                notify no;
                file "/etc/bind/db.192.168.10";
        };

        // formerly named.conf.default-zones

        zone "." {
                type hint;
                file "/etc/bind/db.root";
        };

        zone "localhost" {
                type master;
                file "/etc/bind/db.local";
        };

        zone "127.in-addr.arpa" {
                type master;
                file "/etc/bind/db.127";
        };

        zone "0.in-addr.arpa" {
                type master;
                file "/etc/bind/db.0";
        };

        zone "255.in-addr.arpa" {
                type master;
                file "/etc/bind/db.255";
        };

        // formerly zones.rfc1918

        zone "10.in-addr.arpa"      { type master; file "/etc/bind/db.empty"; };
        zone "16.172.in-addr.arpa"  { type master; file "/etc/bind/db.empty"; };
        zone "17.172.in-addr.arpa"  { type master; file "/etc/bind/db.empty"; };
        zone "18.172.in-addr.arpa"  { type master; file "/etc/bind/db.empty"; };
        zone "19.172.in-addr.arpa"  { type master; file "/etc/bind/db.empty"; };
        zone "20.172.in-addr.arpa"  { type master; file "/etc/bind/db.empty"; };
        zone "21.172.in-addr.arpa"  { type master; file "/etc/bind/db.empty"; };
        zone "22.172.in-addr.arpa"  { type master; file "/etc/bind/db.empty"; };
        zone "23.172.in-addr.arpa"  { type master; file "/etc/bind/db.empty"; };
        zone "24.172.in-addr.arpa"  { type master; file "/etc/bind/db.empty"; };
        zone "25.172.in-addr.arpa"  { type master; file "/etc/bind/db.empty"; };
        zone "26.172.in-addr.arpa"  { type master; file "/etc/bind/db.empty"; };
        zone "27.172.in-addr.arpa"  { type master; file "/etc/bind/db.empty"; };
        zone "28.172.in-addr.arpa"  { type master; file "/etc/bind/db.empty"; };
        zone "29.172.in-addr.arpa"  { type master; file "/etc/bind/db.empty"; };
        zone "30.172.in-addr.arpa"  { type master; file "/etc/bind/db.empty"; };
        zone "31.172.in-addr.arpa"  { type master; file "/etc/bind/db.empty"; };

};

// 2
view "internal_150_view" {

        allow-query-on { 192.168.150.1; };
        allow-query { internal_150_acl; };
        match-clients { internal_150_acl; };

        zone "myhost.tld" {
                type master;
                file "/etc/bind/db.myhost.tld_150";
        };

        zone "168.192.in-addr.arpa" {
                type master;
                notify no;
                file "/etc/bind/db.192.168.150";
        };

        // formerly named.conf.default-zones

        zone "." {
                type hint;
                file "/etc/bind/db.root";
        };

        zone "localhost" {
                type master;
                file "/etc/bind/db.local";
        };

        zone "127.in-addr.arpa" {
                type master;
                file "/etc/bind/db.127";
        };

        zone "0.in-addr.arpa" {
                type master;
                file "/etc/bind/db.0";
        };

        zone "255.in-addr.arpa" {
                type master;
                file "/etc/bind/db.255";
        };

        // formerly zones.rfc1918

        zone "10.in-addr.arpa"      { type master; file "/etc/bind/db.empty"; };
        zone "16.172.in-addr.arpa"  { type master; file "/etc/bind/db.empty"; };
        zone "17.172.in-addr.arpa"  { type master; file "/etc/bind/db.empty"; };
        zone "18.172.in-addr.arpa"  { type master; file "/etc/bind/db.empty"; };
        zone "19.172.in-addr.arpa"  { type master; file "/etc/bind/db.empty"; };
        zone "20.172.in-addr.arpa"  { type master; file "/etc/bind/db.empty"; };
        zone "21.172.in-addr.arpa"  { type master; file "/etc/bind/db.empty"; };
        zone "22.172.in-addr.arpa"  { type master; file "/etc/bind/db.empty"; };
        zone "23.172.in-addr.arpa"  { type master; file "/etc/bind/db.empty"; };
        zone "24.172.in-addr.arpa"  { type master; file "/etc/bind/db.empty"; };
        zone "25.172.in-addr.arpa"  { type master; file "/etc/bind/db.empty"; };
        zone "26.172.in-addr.arpa"  { type master; file "/etc/bind/db.empty"; };
        zone "27.172.in-addr.arpa"  { type master; file "/etc/bind/db.empty"; };
        zone "28.172.in-addr.arpa"  { type master; file "/etc/bind/db.empty"; };
        zone "29.172.in-addr.arpa"  { type master; file "/etc/bind/db.empty"; };
        zone "30.172.in-addr.arpa"  { type master; file "/etc/bind/db.empty"; };
        zone "31.172.in-addr.arpa"  { type master; file "/etc/bind/db.empty"; };

};

// 3
view "vpn_view" {

        allow-query-on { 192.168.200.1; };
        allow-query { vpn_acl; };
        match-clients { vpn_acl; };

        zone "myhost.tld" {
                type master;
                file "/etc/bind/db.myhost.tld_vpn";
        };

        // formerly named.conf.default-zones

        zone "." {
                type hint;
                file "/etc/bind/db.root";
        };

        zone "localhost" {
                type master;
                file "/etc/bind/db.local";
        };

        zone "127.in-addr.arpa" {
                type master;
                file "/etc/bind/db.127";
        };

        zone "0.in-addr.arpa" {
                type master;
                file "/etc/bind/db.0";
        };

        zone "255.in-addr.arpa" {
                type master;
                file "/etc/bind/db.255";
        };

        // formerly zones.rfc1918

        zone "10.in-addr.arpa"      { type master; file "/etc/bind/db.empty"; };
        zone "16.172.in-addr.arpa"  { type master; file "/etc/bind/db.empty"; };
        zone "17.172.in-addr.arpa"  { type master; file "/etc/bind/db.empty"; };
        zone "18.172.in-addr.arpa"  { type master; file "/etc/bind/db.empty"; };
        zone "19.172.in-addr.arpa"  { type master; file "/etc/bind/db.empty"; };
        zone "20.172.in-addr.arpa"  { type master; file "/etc/bind/db.empty"; };
        zone "21.172.in-addr.arpa"  { type master; file "/etc/bind/db.empty"; };
        zone "22.172.in-addr.arpa"  { type master; file "/etc/bind/db.empty"; };
        zone "23.172.in-addr.arpa"  { type master; file "/etc/bind/db.empty"; };
        zone "24.172.in-addr.arpa"  { type master; file "/etc/bind/db.empty"; };
        zone "25.172.in-addr.arpa"  { type master; file "/etc/bind/db.empty"; };
        zone "26.172.in-addr.arpa"  { type master; file "/etc/bind/db.empty"; };
        zone "27.172.in-addr.arpa"  { type master; file "/etc/bind/db.empty"; };
        zone "28.172.in-addr.arpa"  { type master; file "/etc/bind/db.empty"; };
        zone "29.172.in-addr.arpa"  { type master; file "/etc/bind/db.empty"; };
        zone "30.172.in-addr.arpa"  { type master; file "/etc/bind/db.empty"; };
        zone "31.172.in-addr.arpa"  { type master; file "/etc/bind/db.empty"; };

        // somedomain.tld
        zone "somedomain.tld" {
                type forward;
                forward first;
                forwarders { 192.168.34.110; 192.168.34.100; };
        };

};

/etc/rndc.conf

key "rndc-key" {
        algorithm hmac-md5;
        secret "somesecretkey==";
};

options {
        default-key "rndc-key";
        default-server 127.0.0.1;
        default-port 953;
};

me@jessie:~$ sudo netstat -lnptu | grep "named\W*$"

tcp        0      0 192.168.10.1:53         0.0.0.0:*               LISTEN      1871/named      
tcp        0      0 127.0.0.1:53            0.0.0.0:*               LISTEN      1871/named      
tcp        0      0 127.0.0.1:953           0.0.0.0:*               LISTEN      1871/named      
udp        0      0 192.168.200.1:53        0.0.0.0:*                           1871/named      
udp        0      0 192.168.10.1:53         0.0.0.0:*                           1871/named      
udp        0      0 127.0.0.1:53            0.0.0.0:*                           1871/named 

me@jessie:~$ ps aux | grep named

bind      5843  0.0  1.0 297780 84412 ?        Ssl  00:52   0:16 /usr/sbin/named -f -u bind -4

me@jessie:/etc/bind$ named -V

BIND 9.9.5-9-Debian (Extended Support Version) <id:f9b8a50e> built by make with '--prefix=/usr' '--mandir=/usr/share/man' '--infodir=/usr/share/info' '--sysconfdir=/etc/bind' '--localstatedir=/var' '--enable-threads' '--enable-largefile' '--with-libtool' '--enable-shared' '--enable-static' '--with-openssl=/usr' '--with-gssapi=/usr' '--with-gnu-ld' '--with-geoip=/usr' '--with-atf=no' '--enable-ipv6' '--enable-rrl' '--enable-filter-aaaa' 'CFLAGS=-fno-strict-aliasing -fno-delete-null-pointer-checks -DDIG_SIGCHASE -O2'                                                                                
compiled by GCC 4.9.2                                                                                                   
using OpenSSL version: OpenSSL 1.0.1k 8 Jan 2015                                                                        
using libxml2 version: 2.9.2    

me@jessie's_client:~$ dig @192.168.10.1 launchpad.net

; <<>> DiG 9.9.5-9-Debian <<>> @192.168.10.1 launchpad.net
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 19673
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;launchpad.net.                 IN      A

;; Query time: 0 msec
;; SERVER: 192.168.10.1#53(192.168.10.1)
;; WHEN: Thu May 07 23:29:38 MSK 2015
;; MSG SIZE  rcvd: 42

And finally some logs at /var/log/bind/bind.log

07-May-2015 22:52:49.287 resolver: debug 1: createfetch: _xmpp-server._tcp.pandion.im SRV
07-May-2015 22:52:49.287 resolver: debug 1: createfetch: . NS
07-May-2015 22:52:49.954 resolver: debug 1: createfetch: _xmpp-server._tcp.pandion.im SRV
07-May-2015 22:52:50.353 resolver: debug 1: createfetch: launchpad.net A
07-May-2015 22:52:51.288 resolver: debug 1: createfetch: _xmpp-server._tcp.pandion.im SRV
07-May-2015 22:52:51.575 query-errors: debug 1: client 127.0.0.1#47208 (pandion.im): view internal_10_view: query failed (SERVFAIL) for pandion.im/IN/AAAA at query.c:7004
07-May-2015 22:52:53.138 query-errors: debug 1: client 127.0.0.1#55548 (_jabber._tcp.none.su): view internal_10_view: query failed (SERVFAIL) for _jabber._tcp.none.su/IN/SRV at query.c:7004
07-May-2015 22:52:53.955 resolver: debug 1: createfetch: _jabber._tcp.pandion.im SRV
07-May-2015 22:52:54.622 resolver: debug 1: createfetch: _jabber._tcp.pandion.im SRV
07-May-2015 22:52:55.353 query-errors: debug 1: client 192.168.10.2#37375 (launchpad.net): view internal_10_view: query failed (SERVFAIL) for launchpad.net/IN/A at query.c:7004
07-May-2015 22:52:55.354 resolver: debug 1: createfetch: launchpad.net A
07-May-2015 22:52:55.956 resolver: debug 1: createfetch: _jabber._tcp.pandion.im SRV

/var/log/bind/security_info.log

07-May-2015 00:45:26.055 warning: using built-in root key for view vpn_view
07-May-2015 12:31:37.603 warning: using built-in root key for view internal_10_view
07-May-2015 12:31:37.769 warning: using built-in root key for view internal_150_view
07-May-2015 12:31:37.773 warning: using built-in root key for view vpn_view
07-May-2015 12:31:44.859 warning: using built-in root key for view internal_10_view
07-May-2015 12:31:44.865 warning: using built-in root key for view internal_150_view
07-May-2015 12:31:44.871 warning: using built-in root key for view vpn_view
07-May-2015 12:31:46.005 warning: using built-in root key for view internal_10_view
07-May-2015 12:31:46.011 warning: using built-in root key for view internal_150_view
07-May-2015 12:31:46.016 warning: using built-in root key for view vpn_view
07-May-2015 12:31:47.108 warning: using built-in root key for view internal_10_view
07-May-2015 12:31:47.114 warning: using built-in root key for view internal_150_view
07-May-2015 12:31:47.121 warning: using built-in root key for view vpn_view
07-May-2015 12:31:48.946 warning: using built-in root key for view internal_10_view
07-May-2015 12:31:48.951 warning: using built-in root key for view internal_150_view
07-May-2015 12:31:48.957 warning: using built-in root key for view vpn_view
07-May-2015 14:07:39.729 warning: using built-in root key for view internal_10_view
07-May-2015 14:07:39.737 warning: using built-in root key for view internal_150_view
07-May-2015 14:07:39.743 warning: using built-in root key for view vpn_view
07-May-2015 14:12:05.871 warning: using built-in root key for view internal_10_view
07-May-2015 14:12:05.880 warning: using built-in root key for view internal_150_view
07-May-2015 14:12:05.890 warning: using built-in root key for view vpn_view
07-May-2015 14:27:07.630 warning: using built-in root key for view internal_10_view
07-May-2015 14:27:07.638 warning: using built-in root key for view internal_150_view
07-May-2015 14:27:07.644 warning: using built-in root key for view vpn_view

Any suggestions as to what might be wrong?

Neurotransmitter
  • You mention that you have checked your zones with `named-checkzone`, can you clarify if the names that you get `SERVFAIL` errors for are in your own zones or if those errors are encountered when looking up other names? Also, what does your logging configuration look like? I get the feeling that maybe you only have some specific categories in that log, possibly removing the log entries that hint at the reason for the failures? – Håkan Lindqvist May 07 '15 at 20:21
  • The names in my zones resolve successfully, but other names (on the internet, such as `launchpad.net`) do not. I've just added `/etc/bind/named.conf.log`; check the updated question. – Neurotransmitter May 07 '15 at 20:24
  • Ok, so if the failures relate to `security` or `lame-servers` (or `update-*` but that seems irrelevant to the question) it wouldn't be in that log. Can you check that, just to make sure you don't actually have helpful things being logged? – Håkan Lindqvist May 07 '15 at 20:37
  • Sorry, I'm not really skilled with BIND (otherwise this question wouldn't exist). Can you please clarify what you are asking me to do? Should I change some options in `/etc/bind/named.conf.log`? – Neurotransmitter May 07 '15 at 20:41
  • Can you add the following to the options section of named.conf and see if this fixes the problem? I'll provide an answer explaining the fix if it works. `max-recursion-queries 200;` – Andrew B May 07 '15 at 20:43
  • @AndrewB just tried. No luck: `/etc/bind/named.conf:41: unknown option 'max-recursion-queries'` – Neurotransmitter May 07 '15 at 20:44
  • Well, you have configured named to log `security` to a separate file, so you may want to have a look there as a first step. Also, you have configured it to throw away all log messages for `lame-servers`, you may want to at least temporarily undo that in case that is actually relevant to your problems. – Håkan Lindqvist May 07 '15 at 20:46
  • Strange, that suggests that Jessie's version of BIND does not include the upstream fix for [CVE-2014-8500](https://kb.isc.org/article/AA-01216/). I'll do some testing on my own. Please edit the output of `named -V` into your question. – Andrew B May 07 '15 at 20:49
  • @HåkanLindqvist I've just changed `lame-servers` to `security_info` and checked `/var/log/bind/security_info.log`, nothing strange to me there (check the updated question). – Neurotransmitter May 07 '15 at 20:55
  • @AndrewB I just added the output of `named -V` to the question. I noticed `--enable-ipv6`, though I have disabled IPv6 in multiple places. Is that all right, or should I disable it somewhere else too? – Neurotransmitter May 07 '15 at 20:57
  • I've been able to reproduce your SERVFAIL problem in my lab consistently when that `max-recursion-queries` option is not set. Adding the option fixes it, and I find it very unlikely that your version of BIND does not have the CVE-2014-8500 fix. Please upload a copy of your config to a webserver so that I can see where you're adding that option to the config. – Andrew B May 07 '15 at 20:59
  • @TranslucentCloud Did you actually put `max-recursion-queries` inside the `options` section when you tried Andrew's suggestion? – Håkan Lindqvist May 07 '15 at 20:59
  • @HåkanLindqvist No, I added it to `named.conf` as suggested. I guess I should place it in `named.conf.options` instead. After testing, I keep getting `SERVFAIL`, along with a new `info: error (network unreachable)` error. – Neurotransmitter May 07 '15 at 21:01
  • @TranslucentCloud Well, he suggested you add it to the `options` section but you happen to have that split out into a separate file which you `include`. So yes, that's where you'd add it. – Håkan Lindqvist May 07 '15 at 21:04
  • If it is of interest, yesterday I downgraded a bunch of packages from their `testing` versions to the `jessie` versions. Those testing packages were carried over from an earlier `Wheezy` installation, but now I wanted to go all stable. Maybe BIND depends closely on some package whose downgrade broke it? – Neurotransmitter May 07 '15 at 21:10
  • By the way, the `(network unreachable)` status is shown for IPv6 addresses. – Neurotransmitter May 07 '15 at 21:17
  • @TranslucentCloud Was it actually started with `-4` then? Or is that part of your configuration not working? (Is the /etc/defaults file even used when the service is started via systemd?) – Håkan Lindqvist May 07 '15 at 21:22
  • Nah, it's pretty squarely because you're running a newer version of BIND with the CVE-2014-8500 fix. Full explanation below. – Andrew B May 07 '15 at 21:23
  • @HåkanLindqvist If I understood it right, `named -V` merely shows which options were used during the build. I have the `-4` option in `/etc/default/bind9` and `listen-on-v6 { none; };` in `/etc/bind/named.conf.options`, so I believe IPv6 is not used. – Neurotransmitter May 07 '15 at 21:28
  • @TranslucentCloud Yes, but I asked if `/etc/default/bind9` is even used when the service is started by systemd rather than sysvinit. Ie, does the running `named` process have `-4` in its command line? – Håkan Lindqvist May 07 '15 at 21:30
  • @HåkanLindqvist It seems `/etc/default/bind9` is actually ignored. I just ran `ps aux | grep named` and there is no `-4`: `/usr/sbin/named -f -u bind`. – Neurotransmitter May 07 '15 at 21:32
  • I've figured out how to provide the `-4` option. It has to be done by editing `/etc/systemd/system/multi-user.target.wants/bind9.service`. At least the `(network unreachable)` error has vanished now; thanks for the heads-up. – Neurotransmitter May 07 '15 at 21:41
  • Well, this is weird. Yesterday I left everything as it was and went to sleep. Today I checked BIND's logs, and it seems the `SERVFAIL` errors vanished at around 5 A.M. Since then everything has been clean, with no errors at all, and BIND works flawlessly. I run a number of public services (e.g. an XMPP/Jabber server) and wonder whether some `ejabberd` user could have abused my internal `bind9` by sending malformed DNS queries? – Neurotransmitter May 08 '15 at 10:04
  • You might want to check myhost.tld and make sure it's populated with valid data. Also, your db.root. Make sure it is populated with valid data. You might want to also check /etc/network/interfaces and make sure all is in order. –  May 21 '15 at 23:28
  • @techies as I already mentioned in the comments, the issue is resolved. – Neurotransmitter May 22 '15 at 13:23
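
The check discussed in the comments above (whether the running `named` was actually started with `-4`) can be sketched in shell. This is only an illustration: it assumes procps `ps` as found on Debian, and `has_ipv4_only_flag` is a hypothetical helper name, not part of BIND.

```shell
#!/bin/sh
# Sketch: detect whether a named command line carries the standalone -4 flag.
# has_ipv4_only_flag is a hypothetical helper; "ps -C" is procps syntax.

has_ipv4_only_flag() {
    # Pad with spaces so "-4" only matches as a separate argument.
    case " $1 " in
        *" -4 "*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Command line(s) of any running named, joined on one line; empty if none.
cmdline=$(ps -o args= -C named 2>/dev/null | tr '\n' ' ')

if has_ipv4_only_flag "$cmdline"; then
    echo "named is running with -4"
else
    echo "no running named with -4 found"
fi
```

If the flag is missing even though it is set in `/etc/default/bind9`, that file is probably not being read by the service unit, which is what the comments above concluded for this system.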

2 Answers

This one is a real pain to troubleshoot if you aren't familiar with the new max-recursion-queries option or why it was added.

CVE-2014-8500 was identified in late 2014 as impacting multiple nameserver products, including BIND. The exploit allows a malicious nameserver to craft a chain of referrals that will be followed indefinitely, eventually leading to resource exhaustion. ISC's fix for this issue was to add upper limits on how much work the server is willing to perform on behalf of a single query. The limit that matters here is controlled by a new max-recursion-queries option, which caps the number of iterative queries sent while servicing one recursive query and defaults to 75.

As it turns out, a limit of 75 is not very friendly to an empty nameserver cache, which is what you always have after a full process restart. There are many domains that will fail to resolve with this default due to how many levels of referrals end up being traversed between a requested record and . (root). The pandion.im. domain happens to be one of those, and it probably has something to do with the glueless delegation from the TLD. Here's an excerpt from dig +trace +additional pandion.im:

im.                     172800  IN      NS      ns4.ja.net.
im.                     172800  IN      NS      hoppy.iom.com.
im.                     172800  IN      NS      barney.advsys.co.uk.
im.                     172800  IN      NS      pebbles.iom.com.
ns4.ja.net.             172800  IN      A       193.62.157.66
hoppy.iom.com.          172800  IN      A       217.23.163.140
barney.advsys.co.uk.    172800  IN      A       217.23.160.50
pebbles.iom.com.        172800  IN      A       80.168.83.242
ns4.ja.net.             172800  IN      AAAA    2001:630:0:47::42
;; Received 226 bytes from 199.7.83.42#53(199.7.83.42) in 29 ms

pandion.im.             259200  IN      NS      ed.ns.cloudflare.com.
pandion.im.             259200  IN      NS      jill.ns.cloudflare.com.
;; Received 81 bytes from 80.168.83.242#53(80.168.83.242) in 98 ms

The nameservers for im. are delegating pandion.im. to Cloudflare's nameservers without providing IP address glue. On an empty cache, this means that the server has to initiate a separate referral traversal to obtain the IP addresses of those nameservers, and all of the queries made along the way count against the limit for the original query. At that point the query will only succeed if the server already knows the IP addresses of those nameservers from other queries:

# service named restart && sleep 1 && dig @localhost pandion.im | grep status
Checking named config:
Stopping named:                                        [  OK  ]
Starting named:                                        [  OK  ]
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 63173

Trying again, this time with attempts to look up those nameservers before pandion.im.:

# service named restart && sleep 1 && dig @localhost ed.ns.cloudflare.com jill.ns.cloudflare.com pandion.im | grep status
Checking named config:
Stopping named:                                        [  OK  ]
Starting named:                                        [  OK  ]
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 26428
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 30491
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 22162

Long story short, this problem is very non-intuitive to identify, especially since it will seem to eventually "go away" over time if the process is left running. One of our partners has recommended a value of 200 based on real world usage scenarios. Start with 200, and season to taste if it's too high for your liking.
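
For reference, a minimal sketch of where the setting goes, assuming the Debian layout from the question, where the `options` block lives in `/etc/bind/named.conf.options`. The other two options are simply copied from the question and are not required for the fix:

```
// /etc/bind/named.conf.options (path as used in the question)
options {
        directory "/var/cache/bind";
        dnssec-validation auto;

        // Raise the per-query limit introduced with the CVE-2014-8500 fix.
        // 200 is the starting value suggested above; tune to taste.
        max-recursion-queries 200;
};
```

Placing the statement at the top level of `named.conf`, outside any block, is rejected as an unknown option. Running `named-checkconf` before restarting is a cheap way to catch that kind of mistake.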

Andrew B
  • Thanks for this detailed explanation; I hope it will help someone else. If I understood you correctly, you suggest enabling this option (which I did) and then just waiting to see whether the error vanishes? – Neurotransmitter May 07 '15 at 21:30
  • Based on my testing, your SERVFAIL problem should immediately go away if you tune `max-recursion-queries` to 200. This setting needs to be added within your `options { };` block, which is why you saw a syntax error the first time. – Andrew B May 07 '15 at 21:31
  • @TranslucentCloud Can you still reproduce the problem? It ought to stop happening immediately if this was the reason. Fwiw, I tried to reproduce the above examples on Debian 8 but couldn't, so I'm not sure. – Håkan Lindqvist May 07 '15 at 21:32
  • @Håkan Interesting. I didn't have a Debian VM available, but being able to reproduce this on RHEL6 was enough for me since it should be the same code. My odds are 0/10 when trying to resolve that domain on an empty cache, even when I switch to using `rndc flush` instead of a process bounce. Immediate fix if I uncomment my `max-recursion-queries 200;` line and flush+reload. – Andrew B May 07 '15 at 21:38
  • I've set this option to `200` (you can check my updated config in the question body), but unfortunately the `SERVFAIL` is still there. Maybe I need to do the `flush` you mentioned? How do I do that? – Neurotransmitter May 07 '15 at 21:43
  • @TranslucentCloud Saddening. :( Make sure you didn't forget to restart the daemon one last time, but I'm pretty sure you wouldn't have missed that at this point. (also, `rndc reload` or `service named reload` wouldn't be enough here as you also need to flush the failure out of cache) – Andrew B May 07 '15 at 21:44
  • @TranslucentCloud RE: your edit, a flush automatically happens when you perform a full process restart (`service named restart`) or run `rndc flush`. A flush does *not* happen on a config reload, which is done via either syntax mentioned in my last comment. – Andrew B May 07 '15 at 21:47
  • I restarted the daemon plenty of times and even ran `rndc flush`, but still no luck. Anyway, thanks for the support, guys, it is really appreciated. Time to go to bed now. – Neurotransmitter May 07 '15 at 21:53
  • @TranslucentCloud False positive then, sorry about that. I guess that means we have multiple potential problems in play here. – Andrew B May 07 '15 at 21:54
  • @AndrewB I don't know what actually stopped my `SERVFAIL`s this morning, whether it was your solution or the absence of some malicious activity from other services on my server, but I thank you for your time and effort and am marking this answer as the solution. – Neurotransmitter May 08 '15 at 10:08

Check your logging directory permissions.

I found in my logs:

Jun  5 18:46:38 xsystem named[1116]: isc_stdio_open '/var/log/named.debug.log' failed: permission denied
Jun  5 18:46:38 xsystem named[1116]: configuring logging: permission denied
Jun  5 18:46:38 xsystem named[1116]: loading configuration: permission denied
Jun  5 18:46:38 xsystem named[1116]: exiting (due to fatal error)

While researching, I found that my /var/log had been changed to group read-only. Before the upgrade, it was set to group adm read-write. Since I have both bind and /var/log using the adm group, named failed to start.

Solution: run sudo chmod g+w /var/log and restart BIND (named).

P.S. My BIND logs to /var/log, not /var/log/named/.
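
A minimal sketch for checking this before and after the `chmod`. It assumes GNU `stat` (as shipped on Debian), and `dir_group_writable` is a hypothetical helper name. The path `/var/log` is the one from this answer and should be changed to wherever your `logging` channels write:

```shell
#!/bin/sh
# Sketch: report whether a log directory is group-writable.
# dir_group_writable is a hypothetical helper; "stat -c" is GNU coreutils syntax.

dir_group_writable() {
    # %A prints the symbolic mode, e.g. drwxrwxr-x; character 6 is the group write bit.
    perms=$(stat -c '%A' "$1" 2>/dev/null) || return 2
    case "$perms" in
        ?????w*) return 0 ;;
        *)       return 1 ;;
    esac
}

log_dir=/var/log   # adjust to the directory your logging channels use

if dir_group_writable "$log_dir"; then
    echo "$log_dir is group-writable"
else
    echo "$log_dir is not group-writable (or could not be inspected)"
fi
```

The group bit only matters if the user that named runs as reaches the directory through its group, as in the adm setup described here; if that user owns the directory, the owner bits apply instead.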

sebix
Dave