
My current setup: 2 NFS servers sharing the same directory with identical content, 1 keepalived server acting as SLB (or rather, for failover in this scenario), and 1 NFSv4 client mounting through the VIP. All systems run CentOS 6 (2.6.32-573.26.1.el6.x86_64). Because this is a testing environment, all machines are on the same subnet (192.168.66.xx). For reference, the IPs are as follows.

192.168.66.99    VIP
192.168.66.100   nfs01
192.168.66.101   nfs02
192.168.66.102   client
192.168.66.103   keepalived01

The NFS servers are configured as such:

/root/share 192.168.66.0/24(ro,fsid=0,sync,no_root_squash)
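For reference, after editing /etc/exports the share gets re-exported with the usual commands (nothing specific to this setup):

# re-read /etc/exports and re-export without restarting nfsd
exportfs -ra
# list what is currently exported, for verification
exportfs -v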

As for keepalived, I am running it in DR mode (NAT mode fails to work at all).

vrrp_instance nfs {
        interface eth0
        state MASTER
        virtual_router_id 51
        priority 103    # 103 on master, 101 on backup
        authentication {
                auth_type PASS
                auth_pass hiServer
        }
        virtual_ipaddress {
                192.168.66.99/24 dev eth0
        }
}

virtual_server 192.168.66.99 2049 {
    delay_loop 6
    lb_algo wlc
    lb_kind DR
    protocol TCP

    real_server 192.168.66.100 2049 {
            weight 100
            TCP_CHECK {
                    connect_timeout 6
                    connect_port 2049
            }
    }

    real_server 192.168.66.101 2049 {
            weight 102
            TCP_CHECK {
                    connect_port 2049
                    connect_timeout 6
            }
    }
}
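For DR mode to actually work, the real servers also have to accept traffic addressed to the VIP without answering ARP for it. On nfs01 and nfs02 that amounts to roughly the following sketch (the interface names and sysctl scope here are assumptions, adjust as needed):

# hold the VIP on loopback so nfsd accepts packets for 192.168.66.99
ip addr add 192.168.66.99/32 dev lo
# do not answer or advertise ARP for the VIP on the real servers
sysctl -w net.ipv4.conf.all.arp_ignore=1
sysctl -w net.ipv4.conf.all.arp_announce=2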

Lastly the client mounts via this command:

mount -t nfs4 192.168.66.99:/ /nfsdata

The NFSv4 mount seems to function, though I haven't stress-tested it. One thing I notice is that during failover (i.e. when I shut down one of the NFS servers, forcing keepalived to move the service to the other NFS server), the client appears to hang for a while before responding again. I believe this is due to the 90-second grace period.

The problem that nags me is that on the NFS servers, the following log line keeps appearing every few seconds, flooding the logs.

kernel: nfsd: peername failed (err 107)!

I've tried using tcpdump to see what is causing the traffic and spotted repeating exchanges between the NFS server and the keepalived server. At first I thought iptables could be the culprit, but flushing the rules on both machines does not stop the error.
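The capture was along these lines (the exact filter is reconstructed from memory, so treat it as a sketch):

# on the NFS server: watch port 2049 traffic to/from the keepalived box
tcpdump -nn -i eth0 host 192.168.66.103 and tcp port 2049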

If there is a way to suppress the error I may call it a day (is there?), but my curiosity asks: does the NFS server have any reason to communicate with the keepalived server in this scenario? Or is there something fundamentally wrong with setting up NFS HA this way, even though it seems to work?

yongtw123

1 Answer


Upon further inspection, the error kernel: nfsd: peername failed (err 107)! appears approximately every 6 seconds. That interval seems to correspond to the 6-second delay_loop / connect_timeout values in the conf file, and indeed stopping the keepalived service makes the error stop appearing altogether.

It seems that with TCP_CHECK on port 2049, the NFS servers log these "bad" connection attempts, since keepalived opens and closes the TCP connection without speaking the NFS protocol.

In the end I switched to MISC_CHECK to check the NFS servers' health, using a custom shell script that calls rpcinfo.
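Roughly what the replacement looks like (the script path and rpcinfo arguments here are illustrative, not my exact ones):

real_server 192.168.66.100 2049 {
        weight 100
        MISC_CHECK {
                # exit 0 keeps the real server in the pool, non-zero removes it
                misc_path "/etc/keepalived/check_nfs.sh 192.168.66.100"
                misc_timeout 10
        }
}

And the check script itself:

#!/bin/sh
# ask rpcbind on the given host whether the NFS v4 service responds over TCP
rpcinfo -t "$1" nfs 4 >/dev/null 2>&1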

yongtw123