
I set up an NFSv4 server and a client, both running Debian 11.3 on Linux 5.10.0-13. It basically works, i.e. I see all files with correct permissions and can open, modify, etc. However, opening a file causes a 5-second delay.

The server exports directories like this in /etc/exports:

/path/to          gss/krb5(rw,sync,fsid=0,crossmnt,no_subtree_check)
/path/to/NFS      gss/krb5(rw,sync,no_subtree_check)
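
(As user1686 points out in the comments, the gss/krb5 client specification is the old syntax; using the sec= export option from exports(5), the same exports would presumably look roughly like the following, although I have not switched to it yet.)

/path/to          *(rw,sync,sec=krb5,fsid=0,crossmnt,no_subtree_check)
/path/to/NFS      *(rw,sync,sec=krb5,no_subtree_check)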

The NFS4 export is mounted using this kind of /etc/fstab entry:

server:/NFS      /path/to/nfs                 nfs4 defaults,auto,sec=krb5,proto=tcp,port=2049,nolock 0 3
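
To confirm what the client actually negotiated (NFS version, security flavor, transport), the active mount options can be listed on the client, e.g.:

nfsstat -m                  # per-mount NFS options as negotiated by the kernel
grep ' nfs4 ' /proc/mounts  # raw mount entry, including sec= and proto=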

The following data refers to opening a file in Emacs and closing Emacs as soon as the file is displayed.

Analyzing an strace I see that the corresponding openat() syscall takes a fairly consistent 5.1 seconds! Other openat(), stat(), ... syscalls touching the NFS mount execute within microseconds, which matches what I see when doing e.g. ls: it doesn't feel any different from the local file system. This is an excerpt of the syscalls relating to the file in question, each line prefixed with the time spent in the syscall:

0.000014        openat(AT_FDCWD, "/path/to/nfs/file", O_RDONLY|O_CLOEXEC|O_PATH|O_DIRECTORY) = -1 ENOTDIR (Not a directory)
0.000005        readlinkat(AT_FDCWD, "/path/to/nfs/file", 0x7fff202f6880, 1024) = -1 EINVAL (Invalid argument)
0.000006        openat(AT_FDCWD, "/path/to/nfs/file", O_RDONLY|O_NOFOLLOW|O_CLOEXEC|O_PATH) = 14
0.000005        readlinkat(AT_FDCWD, "/path/to/nfs/file", 0x7fff202f64d0, 1024) = -1 EINVAL (Invalid argument)
0.000005        openat(AT_FDCWD, "/path/to/nfs/file", O_RDONLY|O_NOFOLLOW|O_CLOEXEC|O_PATH) = 14
5.108682        openat(AT_FDCWD, "/path/to/nfs/file", O_RDONLY|O_CLOEXEC) = 14
0.000006        faccessat(AT_FDCWD, "/path/to/nfs/file", W_OK) = 0
0.000005        stat("/path/to/nfs/file", {st_mode=S_IFREG|0640, st_size=2192, ...}) = 0
0.000007        faccessat(AT_FDCWD, "/path/to/nfs/file,v", F_OK) = -1 ENOENT (No such file or directory)
0.000009        openat(AT_FDCWD, "/path/to/nfs/file", O_RDONLY|O_CLOEXEC|O_PATH|O_DIRECTORY) = -1 ENOTDIR (Not a directory)
0.000008        openat(AT_FDCWD, "/path/to/nfs/file", O_RDONLY|O_CLOEXEC|O_PATH|O_DIRECTORY) = -1 ENOTDIR (Not a directory)
0.000008        openat(AT_FDCWD, "/path/to/nfs/file", O_RDONLY|O_CLOEXEC|O_PATH|O_DIRECTORY) = -1 ENOTDIR (Not a directory)
0.000008        openat(AT_FDCWD, "/path/to/nfs/file", O_RDONLY|O_CLOEXEC|O_PATH|O_DIRECTORY) = -1 ENOTDIR (Not a directory)
0.000008        openat(AT_FDCWD, "/path/to/nfs/file", O_RDONLY|O_CLOEXEC|O_PATH|O_DIRECTORY) = -1 ENOTDIR (Not a directory)
0.000012        stat("/path/to/nfs/file", {st_mode=S_IFREG|0640, st_size=2192, ...}) = 0
0.000015        write(14, "/path/to/nfs/file"..., 90) = 90 
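
For reference, per-syscall timings like the above can be captured with strace's -T option; an invocation along these lines should reproduce them (the path is of course a placeholder):

# -T prints the time spent in each syscall, -f follows child processes
strace -f -T -o /tmp/open.strace emacs /path/to/nfs/file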

In Wireshark I see the following:

No. Time        Source      Destination Protocol Length Info
7   0.623097    client.ip   server.ip   NFS      276    V4 Call (Reply In 8) GETATTR FH: 0xecf8d891
8   0.624231    server.ip   client.ip   NFS      376    V4 Reply (Call In 7) GETATTR
9   0.624239    client.ip   server.ip   TCP      72     951 → 2049 [ACK] Seq=601 Ack=917 Win=4176 Len=0 TSval=1071244404 TSecr=3950562910
10  0.624696    client.ip   server.ip   NFS      344    V4 Call (Reply In 11) OPEN DH: 0xecf8d891/
11  0.625669    server.ip   client.ip   NFS      452    V4 Reply (Call In 10) OPEN StateID: 0xb42f
12  0.625693    client.ip   server.ip   TCP      72     951 → 2049 [ACK] Seq=873 Ack=1297 Win=4176 Len=0 TSval=1071244405 TSecr=3950562911
13  5.742166    client.ip   server.ip   NFS      340    V4 Call (Reply In 14) CLOSE StateID: 0xb42f
14  5.743331    server.ip   client.ip   NFS      232    V4 Reply (Call In 13) SEQUENCE | PUTFH Status: NFS4ERR_STALE
15  5.743359    client.ip   server.ip   TCP      72     951 → 2049 [ACK] Seq=1141 Ack=1457 Win=4176 Len=0 TSval=1071249523 TSecr=3950568029
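
(For reference, the capture can be taken with tcpdump on a plain port-2049 filter and opened in Wireshark afterwards; the interface name below is a placeholder.)

tcpdump -i eth0 -s 0 -w /tmp/nfs.pcap tcp port 2049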

I do not know whether this NFS4ERR_STALE condition is a hint to the issue. According to RFC 7530 it indicates that the file system object has been removed, but the file in the delayed openat() is definitely not deleted, so I'm unsure what it refers to. In any case, the capture also shows the 5.1-second delay, between frames 12 and 13.

I must admit that I do not quite understand what I see here. However, I see the same kind of delays with other programs as well, i.e. this is not a quirk of Emacs. Saving in LibreOffice can even freeze until the file is accessed by another program.

Since I've seen reports of issues with krb5p in some environments, I dropped from krb5p to krb5, but this did not change anything.
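
Concretely, that only meant changing the sec= value in the fstab entry above and remounting, roughly:

# after editing /etc/fstab (sec=krb5p -> sec=krb5)
umount /path/to/nfs
mount /path/to/nfs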

Both client and server run gssproxy. On the client I see debug entries from nfsidmap, and after setting sysctl sunrpc.rpc_debug=0xFFFF I see the following for the Emacs scenario:

[423544.865600] RPC:       gss_get_mic_v2
[423544.865628] RPC:       xs_tcp_send_request(200) = 0
[423544.867049] RPC:       xs_data_ready...
[423544.867309] RPC:       gss_verify_mic_v2
[423545.373665] RPC:       gss_get_mic_v2
[423545.373691] RPC:       xs_tcp_send_request(200) = 0
[423545.374692] RPC:       xs_data_ready...
[423545.374748] RPC:       gss_verify_mic_v2
[423545.375009] RPC:       gss_get_mic_v2
[423545.375025] RPC:       xs_tcp_send_request(268) = 0
[423545.375909] RPC:       xs_data_ready...
[423545.375957] RPC:       gss_verify_mic_v2
[423550.467227] RPC:       gss_get_mic_v2
[423550.467307] RPC:       xs_tcp_send_request(216) = 0
[423550.468395] RPC:       xs_data_ready...
[423550.468513] RPC:       gss_verify_mic_v2
[423550.468621] RPC:       gss_get_mic_v2
[423550.468646] RPC:       gss_get_mic_v2
[423550.468689] RPC:       xs_tcp_send_request(264) = 0
[423550.469403] RPC:       xs_data_ready...
[423550.469541] RPC:       gss_verify_mic_v2
[423550.469564] RPC:       gss_verify_mic_v2
[423550.472794] RPC:       gss_get_mic_v2
[423550.472849] RPC:       xs_tcp_send_request(208) = 0
[423550.473758] RPC:       xs_data_ready...
[423550.473903] RPC:       gss_verify_mic_v2
[423550.474234] RPC:       gss_get_mic_v2
[423550.474290] RPC:       xs_tcp_send_request(200) = 0
[423550.475147] RPC:       xs_data_ready...
[423550.475257] RPC:       gss_verify_mic_v2

I have no idea how to interpret this log in detail, but it also clearly shows the 5-second delay, and it looks to me like each individual RPC call is processed within about 100 ms.
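
For anyone trying to reproduce this: the verbose SUNRPC logging above ends up in dmesg and can be toggled via sysctl:

sysctl -w sunrpc.rpc_debug=0xFFFF   # enable verbose SUNRPC logging
sysctl -w sunrpc.rpc_debug=0        # turn it off again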

Any ideas what is going on, or at least how to further pin down the issue?

Lars Hanke
  • Is the client's rpc.gssd logging anything in debug mode? Is the server's gssproxy (or rpc.svcgssd, whichever Debian 11 uses) logging anything? If you run `sysctl sunrpc.rpc_debug=0xFFFF` on the client, does the kernel start logging anything to dmesg? Your exports file is using obsolete syntax; `*(sec=krb5:krb5i:krb5p,rw)` is the normal way to write it. – user1686 Apr 15 '22 at 15:40
  • Thanks @user1686 for your help. I have not yet rewritten the server's export syntax. However, from the RPC debug info I suspect a client-side issue, and I have no idea what that could be either. – Lars Hanke Apr 18 '22 at 08:42
  • Does the client run rpc.gssd? gssproxy only replaces the server-side rpc.svcgssd, but doesn't fully replace rpc.gssd on clients. – user1686 Apr 18 '22 at 08:46
  • Yes, the client runs rpc.gssd. – Lars Hanke Apr 18 '22 at 09:24

1 Answer


I solved the issue when the VPN to my second site broke down some time ago: the Kerberos slave server used by the NFS server had its DNS server pointing to the other site, i.e. reachable only through a slow VPN. This is still puzzling, since the VPN is not that slow and I did not see 5-second delays e.g. during logon, but after changing the DNS to the local slave the issue disappeared.
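
In case someone runs into the same symptom: the detour can be spotted by checking which resolver the affected machine actually uses and timing a lookup against it (hostname and server address below are placeholders):

cat /etc/resolv.conf                          # which DNS server is actually configured
time dig +short kdc.example.org @192.0.2.53   # lookup latency against that server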

Meanwhile I rolled out the new server and it runs without issues.

Lars Hanke