
I'm trying to set up a cluster using a private network on subnet 10. One machine has two interfaces: one connects to the regular network and the other connects to all the nodes on subnet 10. This CentOS 6 machine (let's call it "zaza.domain.com") runs DHCP and DNS, and currently both of these are managed by Cobbler, which may or may not be part of the problem (although disabling it and doing everything manually still gives me problems).

If I SSH into zaza and then try to SSH from zaza into node1, I get a warning message like the following:

[root@zaza ~]# ssh node1
reverse mapping checking getaddrinfo for node1.cluster.local [10.69.0.1] failed - POSSIBLE BREAK-IN ATTEMPT! 

I still get a password prompt and can still log in OK.

I know from sshd warning, "POSSIBLE BREAK-IN ATTEMPT!" for failed reverse DNS, from "POSSIBLE BREAK-IN ATTEMPT!" in /var/log/secure — what does this mean?, and from a bunch of other searching that the typical cause of this error is a missing PTR record. However, the PTR record is set - consider the following:

[root@zaza ~]# nslookup node1.cluster.local   
Server:     10.69.0.69   
Address:    10.69.0.69#53

Name:   node1.cluster.local   
Address: 10.69.0.1

[root@zaza ~]# nslookup 10.69.0.1   
Server:     10.69.0.69   
Address:    10.69.0.69#53

1.0.69.10.in-addr.arpa  name = node1.cluster.local.

The 10.69.0.69 IP address is the second interface of zaza.

If I try a different tool like dig to actually view the PTR record, I get the following output:

[root@zaza ~]# dig ptr 1.0.69.10.in-addr.arpa    
; <<>> DiG 9.8.2rc1-RedHat-9.8.2-0.47.rc1.el6_8.4 <<>> ptr 1.0.69.10.in-addr.arpa
;; global options: +cmd
;; Got answer:   
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 29499   
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 1, ADDITIONAL: 1

;; QUESTION SECTION:
;1.0.69.10.in-addr.arpa.    IN  PTR

;; ANSWER SECTION:  
1.0.69.10.in-addr.arpa. 300 IN  PTR node1.cluster.local.

;; AUTHORITY SECTION:  
10.in-addr.arpa.    300 IN  NS  zaza.cluster.local.

;; ADDITIONAL SECTION:
zaza.cluster.local.    300 IN  A   10.69.0.69

;; Query time: 0 msec
;; SERVER: 10.69.0.69#53(10.69.0.69)
;; WHEN: Wed Mar  1 17:05:44 2017   
;; MSG SIZE  rcvd: 110

It looks to me like the PTR record is set, so I don't know why SSH would throw a hissy fit when I try to connect to one of the node machines. To give all the information, here are the relevant config files:

/etc/named.conf

[root@zaza ~]# cat /etc/named.conf 
options {
          listen-on port 53 { any; };
          directory       "/var/named";
          dump-file       "/var/named/data/cache_dump.db";
          statistics-file "/var/named/data/named_stats.txt";
          memstatistics-file "/var/named/data/named_mem_stats.txt";
          allow-query     { any; }; # was localhost
          recursion yes;

          # setup DNS forwarding
          forwarders {1.2.3.4;}; # Real IP goes in here
};

logging {
        channel default_debug {
                file "data/named.run";
                severity dynamic;
        };
};

zone "cluster.local." {
    type master;
    file "cluster.local";

    # allow dynamic DNS updates (e.g. from Cobbler/DHCP); don't send notifies
    allow-update { any; };
    notify no;
};

zone "10.in-addr.arpa." {
    type master;
    file "10";

    # allow dynamic DNS updates (e.g. from Cobbler/DHCP); don't send notifies
    allow-update { any; };
    notify no;
};

/var/named/cluster.local

[root@zaza ~]# cat /var/named/cluster.local 
$TTL 300
@                       IN      SOA     zaza.cluster.local. nobody.example.com. (
                                        2017030100   ; Serial
                                        600         ; Refresh
                                        1800         ; Retry
                                        604800       ; Expire
                                        300          ; TTL
                                        )

                        IN      NS      zaza.cluster.local.

zaza     IN  A     10.69.0.69



node1    IN  A     10.69.0.1
node2    IN  A     10.69.0.2

/var/named/10

[root@zaza ~]# cat /var/named/10 
$TTL 300
@                       IN      SOA     zaza.cluster.local. root.zaza.cluster.local. (
                                        2017030100   ; Serial
                                        600         ; Refresh
                                        1800         ; Retry
                                        604800       ; Expire
                                        300          ; TTL
                                        )

                        IN      NS      zaza.cluster.local.

69.0.69 IN  PTR  zaza.cluster.local.



1.0.69  IN  PTR  node1.cluster.local.
2.0.69  IN  PTR  node2.cluster.local.
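
For what it's worth, after editing any of these I reload BIND with service named restart, and the zone files seem to load cleanly. This is roughly how I've been checking them (stock BIND tooling on CentOS 6; the zone and file names just mirror the config above):

[root@zaza ~]# named-checkconf /etc/named.conf
[root@zaza ~]# named-checkzone cluster.local /var/named/cluster.local
[root@zaza ~]# named-checkzone 10.in-addr.arpa /var/named/10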

If you have any ideas, it'd be much appreciated!

  • On zaza, what output do you see when you run `getent hosts node1.cluster.local`? – Andrew B Mar 01 '17 at 21:09
  • I saw none, which surprised me. But if I did `getent hosts node1` I did get a result of `10.69.0.1 node1.cluster.local`. From this I realised the problem was something to do with /etc/nsswitch.conf and I've posted the whole answer below. Cheers for pointing me in the right direction. – Biggles Mar 06 '17 at 13:11

1 Answer


It was all about Avahi and the .local domain and nothing to do with PTR records.

I did a bunch more searching, having realised that resolving the short hostname worked but resolving the host by FQDN was failing. This eventually led me to https://superuser.com/questions/704785/ping-cant-resolve-hostname-but-nslookup-can and, from there, to http://www.lowlevelmanager.com/2011/09/fix-linux-dns-issues-with-local.html which solved everything for me.

Ultimately the problem is that /etc/nsswitch.conf contains the line:

hosts: files mdns4_minimal [NOTFOUND=return] dns

Changing this to:

hosts: files dns

made the problem disappear, and I no longer get the error about possible break-in attempts.
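
A quick way to see the difference (essentially what Andrew B's getent hint boils down to) is to compare the system resolver, which honours /etc/nsswitch.conf and is what sshd uses via getaddrinfo, against a direct DNS query:

[root@zaza ~]# getent hosts node1.cluster.local   # goes through nsswitch; returned nothing while mdns4_minimal was in place
[root@zaza ~]# nslookup node1.cluster.local       # queries BIND directly, bypassing nsswitch; always worked

After the nsswitch change, both return 10.69.0.1.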

Another solution I tested was simply to rename the domain, since this behaviour is specific to the .local domain. By renaming cluster.local to cluster.bob, the error message also disappeared.
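
In named.conf terms the rename is just the forward zone statement plus the matching names inside the zone files (a sketch; the reverse zone keeps its 10.in-addr.arpa name, only its PTR targets change, and the nodes need the new search domain from DHCP or resolv.conf):

zone "cluster.bob." {
    type master;
    file "cluster.bob";
    allow-update { any; };
    notify no;
};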

Another solution would be to move Avahi from .local to something like .alocal, so that multicast DNS doesn't apply to the .local domain and the default nsswitch configuration would seem to work. I suppose removing the [NOTFOUND=return] parameter would also work, since it would stop multicast DNS from ending the lookup when a .local host isn't found; however, that's probably a bad idea.
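
For reference, the Avahi side of that would be something like the following in /etc/avahi/avahi-daemon.conf, followed by a service avahi-daemon restart (a sketch based on avahi-daemon.conf(5); I didn't actually test this route):

[server]
# publish mDNS names under .alocal instead of the default .local (untested sketch)
domain-name=.alocal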

Ultimately this was an edge case that came about because I didn't fully understand the significance of the .local domain; I just viewed it as a good convention for an internal network.
