4

Does anyone understand the NFS protocol (Version 4) or the Linux NFS server implementation of it well enough to explain how setclientid negotiates authentication?

Background:

I have a CentOS 7 server that runs an NFS4 server with Kerberos authentication, the Kerberos integration is provided by PowerBroker Identity Services Open. This server is in production and is being used on a daily basis. I have been trying to create an identical server, but I just don't seem to be able to. I configured everything on the old server about six months ago, and I believe I've documented what I've done rather well - but it just doesn't seem to work anymore. In addtion to that I've tried to simplify the setup as much as possible, and it just won't behave. Under no circumstances am I able to mount any exports using Kerberos. I am not sure why this is, but after troubleshooting this for what seems like an eternity, I believe I have identified where the problem occurs.

Observed setclientid behaviour:

Below are two snippes (names and IPs replaced) from the syslog on a test system I tried mounting from, it runs Ubuntu 16.04 Desktop (ubuntu-1604x, 192.168.1.30).

Both servers run CentOS 7 and are seemingly identical (I have tried using different versions of software on Server B without luck)

Server A (192.168.1.125):

Nov  8 19:36:04 ubuntu-1604x kernel: [316817.783379] NFS call  setclientid auth=RPCSEC_GSS, 'Linux NFSv4.0 192.168.1.30/192.168.1.125 tcp'
Nov  8 19:36:04 ubuntu-1604x kernel: [316817.854777] NFS reply setclientid: 0
Nov  8 19:36:04 ubuntu-1604x kernel: [316817.854781] NFS call  setclientid_confirm auth=RPCSEC_GSS, (client ID 11cfcd5b138b872c)
Nov  8 19:36:04 ubuntu-1604x kernel: [316817.855413] NFS reply setclientid_confirm: 0

Server B (192.168.1.250):

Nov  8 19:34:59 ubuntu-1604x kernel: [316752.414152] NFS call  setclientid auth=RPCSEC_GSS, 'Linux NFSv4.0 192.168.1.30/192.168.1.250 tcp'
Nov  8 19:34:59 ubuntu-1604x kernel: [316752.455212] NFS reply setclientid: -13
Nov  8 19:34:59 ubuntu-1604x kernel: [316752.455217] NFS call  setclientid auth=RPCSEC_GSS, 'Linux NFSv4.0 192.168.1.30/192.168.1.250 tcp'
Nov  8 19:34:59 ubuntu-1604x kernel: [316752.490325] NFS reply setclientid: -13
Nov  8 19:34:59 ubuntu-1604x kernel: [316752.490358] NFS call  setclientid auth=UNIX, 'Linux NFSv4.0 192.168.1.30/192.168.1.250 tcp
Nov  8 19:34:59 ubuntu-1604x kernel: [316752.490768] NFS reply setclientid: 0
Nov  8 19:34:59 ubuntu-1604x kernel: [316752.490772] NFS call  setclientid_confirm auth=UNIX, (client ID 1580e45bf5783af5)
Nov  8 19:34:59 ubuntu-1604x kernel: [316752.491354] NFS reply setclientid_confirm: 0

Question:

The bottom one results in a 'Permission denied' error which makes perfect sense to me as the export only allows Kerberos authentication, but I can not understand the -13 reply and the switching to a different authentication mechanism. Can anyone explain this to me?

Edit: After 7 updates and errors that seemingly have been fixed, the question still that remains the same as above.

Sources I've looked into:

As well as various posts I've found scattered across the internet, I've looked into RFC 7530, 7531 and 7931. I haven't read the whole documents in detail yet, but I've carefully read the part on 'setclientid' and 'setclientid_confirm' in RFC 7530 multiple times, and I just can't get the above to make sense - I think my understanding is flawed.

Update

I believe the -13 is an RPC error. Looking at the source code of nfs4proc I gather that the -13 in the output comes from rpc_run_task. Looking into rpc_run_task I gather it runs rpc_execute - I don't know what exactly is being executed, but presumably the implementation follow RFC 5531 which conveniently has 14 different authentication errors, of which number 13 is 'no credentials for user', an error I've seen rpc.gssd output multiple times throughout testing.

Update 2

The error I've seen from rpc.gssd is:

ERROR: No credentials found for connection to server nfsserver.example.org

Thinking back, I've only seen it in testing on one host, that I never confirmed could connect to another server. I can't find it documented anywhere, but I see a lot of references to another RPC error 13 elsewhere: RPC error 13 (RPC was aborted due to timeout) Given that I can mount the other server just fine, could this perhaps be the error I am getting?

Update 3

I ran strace on the rpc.gssd process on both servers and did another mount attempt. Server A outputted nothing, presumably because I've connected before. Server B outputs the following:

strace: Process 843 attached
epoll_wait(4, [{EPOLLIN, {u32=8, u64=8}}], 32, -1) = 1
read(8, "\r\0\0\0\0\2\0\0\0\0\0\0\20\0\0\0info\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 80
inotify_rm_watch(8, 13)                 = -1 EINVAL (Invalid argument)
inotify_add_watch(8, "nfsd4_cb/clnt3", IN_CREATE|IN_DELETE) = 14
openat(7, "nfsd4_cb/clnt3", O_RDONLY)   = 3
openat(3, "gssd", O_RDWR|O_NONBLOCK)    = -1 ENOENT (No such file or directory)
openat(3, "krb5", O_RDWR|O_NONBLOCK)    = -1 ENOENT (No such file or directory)
close(3)                                = 0
read(8, 0x7ffd2945e280, 4096)           = -1 EAGAIN (Resource temporarily unavailable)
epoll_wait(4,

I located the folder mentioned (nfsd4_cb) in /var/lib/nfs/rpc_pipefs and the contents are not identical on the two servers. The working one has a long list of folders, they seem to represent clients that have previously connected/are currently connected - possibly all assigned clientids. Perhaps this is a loose end.

I do see a gssd directory in /var/lib/nfs/rpc_pipefs, so the 'No such file or directory' message is confusing. Edit: I suppose it could be looking in a different folder, I don't know what the first argument of openat means, and it is different. Does anyone understand this?

Update 4

I captured the network traffic from my testing and inspected the setclientid packet. The credentials that are sent to server A are of the flavour RPCSEC_GSS, but the credentials that are sent server B are of the flavour 'AUTH_UNIX'. In the logfile it looked like 3 setclientid calls were being made and the server replied, but only one is sent - so the 'rejection' must happen locally on the client.

I'm using the exact same mount command with the hostname replaced.

Update 5

With all RPC debug output enabled, I looked through the log file and found what I believe to be the output of the RPC task that SETCLIENTID started (if my understanding of this is correct, that is. See 'Update').

So comparing the output of a working mount and the one that fails, I believe I've tracked down where this RPC thing goes wrong, however I don't understand exactly what is going on.

Snippet from working mount:

Nov 11 10:57:48 ubuntu-1604x kernel: [544916.150180] NFS call  setclientid auth=RPCSEC_GSS, 'Linux NFSv4.0 192.168.1.30/192.168.1.125 tcp'
Nov 11 10:57:48 ubuntu-1604x kernel: [544916.150181] RPC:       new task initialized, procpid 12556
Nov 11 10:57:48 ubuntu-1604x kernel: [544916.150182] RPC:       allocated task ffff880034f92500
Nov 11 10:57:48 ubuntu-1604x kernel: [544916.150183] RPC:  2251 __rpc_execute flags=0x5280
Nov 11 10:57:48 ubuntu-1604x kernel: [544916.150184] RPC:  2251 call_start nfs4 proc SETCLIENTID (sync)
Nov 11 10:57:48 ubuntu-1604x kernel: [544916.150185] RPC:  2251 call_reserve (status 0)
Nov 11 10:57:48 ubuntu-1604x kernel: [544916.150186] RPC:  2251 reserved req ffff880211697600 xid 29a970ab
Nov 11 10:57:48 ubuntu-1604x kernel: [544916.150187] RPC:       wake_up_first(ffff880130755190 "xprt_sending")
Nov 11 10:57:48 ubuntu-1604x kernel: [544916.150187] RPC:  2251 call_reserveresult (status 0)
Nov 11 10:57:48 ubuntu-1604x kernel: [544916.150188] RPC:  2251 call_refresh (status 0)
Nov 11 10:57:48 ubuntu-1604x kernel: [544916.150190] RPC:       gss_create_cred for uid 0, flavor 390004
Nov 11 10:57:48 ubuntu-1604x kernel: [544916.150191] RPC:       gss_create_upcall for uid 0
Nov 11 10:57:48 ubuntu-1604x kernel: [544916.150196] RPC:       __gss_find_upcall found nothing
Nov 11 10:57:48 ubuntu-1604x kernel: [544916.171925] RPC:       __gss_find_upcall found msg ffff8800db521000
Nov 11 10:57:48 ubuntu-1604x kernel: [544916.171938] RPC:       krb5_encrypt returns 0
Nov 11 10:57:48 ubuntu-1604x kernel: [544916.171939] RPC:       krb5_encrypt returns 0
Nov 11 10:57:48 ubuntu-1604x kernel: [544916.171944] RPC:       krb5_encrypt returns 0
Nov 11 10:57:48 ubuntu-1604x kernel: [544916.171944] RPC:       krb5_encrypt returns 0
Nov 11 10:57:48 ubuntu-1604x kernel: [544916.171950] RPC:       krb5_encrypt returns 0
Nov 11 10:57:48 ubuntu-1604x kernel: [544916.171951] RPC:       krb5_encrypt returns 0
Nov 11 10:57:48 ubuntu-1604x kernel: [544916.171954] RPC:       krb5_encrypt returns 0
Nov 11 10:57:48 ubuntu-1604x kernel: [544916.171954] RPC:       krb5_encrypt returns 0
Nov 11 10:57:48 ubuntu-1604x kernel: [544916.171957] RPC:       krb5_encrypt returns 0
Nov 11 10:57:48 ubuntu-1604x kernel: [544916.171958] RPC:       krb5_encrypt returns 0
Nov 11 10:57:48 ubuntu-1604x kernel: [544916.171961] RPC:       krb5_encrypt returns 0
Nov 11 10:57:48 ubuntu-1604x kernel: [544916.171962] RPC:       krb5_encrypt returns 0
Nov 11 10:57:48 ubuntu-1604x kernel: [544916.171964] RPC:       gss_import_sec_context_kerberos: returning 0
Nov 11 10:57:48 ubuntu-1604x kernel: [544916.171965] RPC:       gss_fill_context Success. gc_expiry 4438759401 now 4431124651 timeout 30539 acceptor
Nov 11 10:57:48 ubuntu-1604x kernel: [544916.171968] RPC:       gss_pipe_downcall returning 80
Nov 11 10:57:48 ubuntu-1604x kernel: [544916.171985] RPC:       gss_create_upcall for uid 0 result 0
Nov 11 10:57:48 ubuntu-1604x kernel: [544916.171987] RPC:  2251 refreshing RPCSEC_GSS cred ffff8802121513c0

Snippet from non working mount:

Nov 11 10:59:41 ubuntu-1604x kernel: [545029.738317] NFS call  setclientid auth=RPCSEC_GSS, 'Linux NFSv4.0 192.168.1.30/192.168.1.250 tcp'
Nov 11 10:59:41 ubuntu-1604x kernel: [545029.738318] RPC:       new task initialized, procpid 12672
Nov 11 10:59:41 ubuntu-1604x kernel: [545029.738319] RPC:       allocated task ffff880158862200
Nov 11 10:59:41 ubuntu-1604x kernel: [545029.738320] RPC:  2280 __rpc_execute flags=0x5280
Nov 11 10:59:41 ubuntu-1604x kernel: [545029.738322] RPC:  2280 call_start nfs4 proc SETCLIENTID (sync)
Nov 11 10:59:41 ubuntu-1604x kernel: [545029.738323] RPC:  2280 call_reserve (status 0)
Nov 11 10:59:41 ubuntu-1604x kernel: [545029.738324] RPC:  2280 reserved req ffff8801b5ef2800 xid ab6b57ae
Nov 11 10:59:41 ubuntu-1604x kernel: [545029.738325] RPC:       wake_up_first(ffff880130751990 "xprt_sending")
Nov 11 10:59:41 ubuntu-1604x kernel: [545029.738326] RPC:  2280 call_reserveresult (status 0)
Nov 11 10:59:41 ubuntu-1604x kernel: [545029.738327] RPC:  2280 call_refresh (status 0)
Nov 11 10:59:41 ubuntu-1604x kernel: [545029.738328] RPC:       gss_create_cred for uid 0, flavor 390004
Nov 11 10:59:41 ubuntu-1604x kernel: [545029.738330] RPC:       gss_create_upcall for uid 0
Nov 11 10:59:41 ubuntu-1604x kernel: [545029.738332] RPC:       __gss_find_upcall found nothing
Nov 11 10:59:42 ubuntu-1604x kernel: [545029.835573] RPC:       __gss_find_upcall found msg ffff8801b5ef3400
Nov 11 10:59:42 ubuntu-1604x kernel: [545029.835576] RPC:       gss_fill_context returns error 13
Nov 11 10:59:42 ubuntu-1604x kernel: [545029.835580] RPC:       gss_pipe_downcall returning 16
Nov 11 10:59:42 ubuntu-1604x kernel: [545029.835593] RPC:       gss_create_upcall for uid 0 result -13
Nov 11 10:59:42 ubuntu-1604x kernel: [545029.835597] RPC:  2280 call_refreshresult (status -13)
Nov 11 10:59:42 ubuntu-1604x kernel: [545029.835598] RPC:  2280 call_refreshresult: refresh creds failed with error -13

The -13 error must be this one, and it could very well be RPCSEC_GSS_CREDPROBLEM = 13, /* no credentials for user */ referenced in RFC 5531. __gss_find_upcall found msg differs from the working one, but I'm not sure what that means. With the working mount, it is explicitly logged that the credentials are validated later on, so it can't be that the credentials fail validation, it must be that they are not found.

Nov 11 10:57:48 ubuntu-1604x kernel: [544916.172735] RPC:  2251 validating RPCSEC_GSS cred ffff8802121513c0

The mount command used is the EXACT same, except for the hostname. How does the hostname influence the search for 'credentials'? Where is it looking? Does anyone have any idea what is going on?

Update 6

Right before the setclientid NFS call, there is some traffic to the KDC server. A 'TGS-REQ' request is sent, and the server replies with a 'KRB5KDC_ERR_S_PRINCIPAL_UNKNOWN'. I don't understand this well enough to figure out what exactly is being sent to the KDC, and the traffic is encrypted. Both servers have the same SPNs (with their own hostname in it). Does anyone understand this well enough to help determine how the hostname of my mount command affects the TGS-REQ that is sent? I do not have access to the KDC, so I can't troubleshoot there.

Update 7

The above error was caused by a wrong reverse DNS record. Having fixed that, the KDC now returns a ticket upon receiving the TGS-REQ. But the observed behaviour is the same, the mount fails, the NFS setclientid output in the log is the exact same, the RPC output looks the same. Something appears to be wrong with the ticket that is received - does anyone have any clue on how to debug this?

Update 8

The ticket appears to be the correct one. I used Wireshark to capture the traffic and compared the working one with the non working one again. What I notice is that before the setclientid call, 'NULL' calls are being made.

First one with no credentials, then one with what appears to be the correct credentials. Both servers receive a reply to the first NULL call with no credentials, but only the working server receives a reply to the second one, the other one tries 4 times with no replies.

What are these NULL calls I see, and how can I debug the lack of a reply? The destination port is port 2049, and other traffic to 2049 does get a reply.

Strace of rpc.gssd on the servers still give the same result as in Update 3.

Furthermore i used lsof to confirm the working directory of rpc.gssd:

rpc.gssd    892                 root  cwd       DIR               0,39         0      12726 /var/lib/nfs/rpc_pipefs

Here is the result of an ls -la on that folder:

total 0
dr-xr-xr-x. 11 root root   0 Nov 14 14:33 .
drwxr-xr-x.  6 root root 179 Nov 14 15:17 ..
dr-xr-xr-x.  2 root root   0 Nov 14 14:33 cache
dr-xr-xr-x.  3 root root   0 Nov 14 14:33 gssd
dr-xr-xr-x.  2 root root   0 Nov 14 14:33 lockd
dr-xr-xr-x.  2 root root   0 Nov 14 14:33 mount
dr-xr-xr-x.  2 root root   0 Nov 14 14:33 nfs
dr-xr-xr-x.  2 root root   0 Nov 14 14:33 nfsd
dr-xr-xr-x.  2 root root   0 Nov 14 22:31 nfsd4_cb
dr-xr-xr-x.  2 root root   0 Nov 14 14:33 portmap
dr-xr-xr-x.  2 root root   0 Nov 14 14:33 statd

There is a gssd folder, so why the ENOENT error?

openat(3, "gssd", O_RDWR|O_NONBLOCK)    = -1 ENOENT (No such file or directory)
openat(3, "krb5", O_RDWR|O_NONBLOCK)    = -1 ENOENT (No such file or directory)

Edit: The above also happens on the working server, when first mounting.

Tobias
  • 86
  • 1
  • 7

0 Answers0