2 machines are mounting the same NFS folder(which is on a single 3rd machine).
Files get uploaded from either client server to this mount, usually in chunks. Server A might handle a chunk, then server B the next, all with start and end points defined so it all ads up in the end.
Still, there's been a few instances where if you run md5sum
from serverA, you get a different result than with serverB.
Though in reality, the file is on the NFS server and there should be only 1 version pushed to all clients, far as I'm aware.
And it's not fixing itself over time.
I'm currently assuming this is a race condition, related to the chunks not being added in order and NFS caching, you might get one of the servers thinking the file is at a certain length while it isn't, causing a lot of 0000 0000
padding to be added.
So, how come this is happening? Is there a mount setting I need to use to prevent this? Is there a way to tell the NFS server to re-sync the file to all clients?
And just in general, how should this be dealt with?
EDIT: Mounting options on the clients:
machine1:~$ nfsstat -m
/mnt/dirA from <SERVER_IP>:/dirA
Flags: rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,acregmin=1,acregmax=1,acdirmin=1,acdirmax=1,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=<LOCALHOST_IP>,local_lock=none,addr=<SERVER_IP>
/mnt/dirB from <SERVER_IP>:/dirB
Flags: rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,acregmin=1,acregmax=1,acdirmin=1,acdirmax=1,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=<LOCALHOST_IP>,local_lock=none,addr=<SERVER_IP>
machine1:~$ cat /proc/mounts | grep <SERVER_IP>
<SERVER_IP>:/dirA /mnt/dirA nfs4 rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,acregmin=1,acregmax=1,acdirmin=1,acdirmax=1,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=<LOCALHOST_IP>,local_lock=none,addr=<SERVER_IP> 0 0
<SERVER_IP>:/dirB /mnt/dirB nfs4 rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,acregmin=1,acregmax=1,acdirmin=1,acdirmax=1,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=<LOCALHOST_IP>,local_lock=none,addr=<SERVER_IP> 0 0
machine2:~$ nfsstat -m
/mnt/dirA from <SERVER_IP>:/dirA
Flags: rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,acregmin=1,acregmax=1,acdirmin=1,acdirmax=1,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=<LOCALHOST_IP>,local_lock=none,addr=<SERVER_IP>
/mnt/dirB from <SERVER_IP>:/dirB
Flags: rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,acregmin=1,acregmax=1,acdirmin=1,acdirmax=1,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=<LOCALHOST_IP>,local_lock=none,addr=<SERVER_IP>
<SERVER_IP>:/dirA /mnt/dirA nfs4 rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,acregmin=1,acregmax=1,acdirmin=1,acdirmax=1,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=<LOCALHOST_IP>,local_lock=none,addr=<SERVER_IP> 0 0
<SERVER_IP>:/dirB /mnt/dirB nfs4 rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,acregmin=1,acregmax=1,acdirmin=1,acdirmax=1,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=<LOCALHOST_IP>,local_lock=none,addr=<SERVER_IP> 0 0
EDIT2: The machines are both Ubuntu 18.04, fresh install, the md5sum tool is version 8.28 on both machines.
EDIT3:
I found this note that I kept on the files. I performed an xxd
to get the hexdump, from both machines, from the mount to the local filesystem of the machines.
As to be certain it was captured from the point of view of the individual machines.
As you can see, according to machine01, there's empty padding in the file, but not according to machine02.
This is the result:
root@machine01:/home/kdguser# grep -C 5 '2ddd5000' output01
2ddd4fb0: 0a78 95ff c53e e2c4 f79a db05 0a59 d7d1 .x...>.......Y..
2ddd4fc0: 85a8 1192 26a6 a25a d741 db3c a61f e72e ....&..Z.A.<....
2ddd4fd0: 4d0b 97b6 93cc 7845 6ef4 0cca f9aa 9390 M.....xEn.......
2ddd4fe0: 9f00 bacd 707f 2398 f419 e49e 8073 67fb ....p.#......sg.
2ddd4ff0: 89f5 9450 99f5 808f 4b21 3154 f97f 1271 ...P....K!1T...q
2ddd5000: 0000 0000 0000 0000 0000 0000 0000 0000 ................
2ddd5010: 0000 0000 0000 0000 0000 0000 0000 0000 ................
2ddd5020: 0000 0000 0000 0000 0000 0000 0000 0000 ................
2ddd5030: 0000 0000 0000 0000 0000 0000 0000 0000 ................
2ddd5040: 0000 0000 0000 0000 0000 0000 0000 0000 ................
2ddd5050: ba34 fb76 5af3 69d2 9af0 4711 8a0c eae8 .4.vZ.i...G.....
root@machine02:/home/kdguser# grep -C 5 '2ddd5000' output02
2ddd4fb0: 0a78 95ff c53e e2c4 f79a db05 0a59 d7d1 .x...>.......Y..
2ddd4fc0: 85a8 1192 26a6 a25a d741 db3c a61f e72e ....&..Z.A.<....
2ddd4fd0: 4d0b 97b6 93cc 7845 6ef4 0cca f9aa 9390 M.....xEn.......
2ddd4fe0: 9f00 bacd 707f 2398 f419 e49e 8073 67fb ....p.#......sg.
2ddd4ff0: 89f5 9450 99f5 808f 4b21 3154 f97f 1271 ...P....K!1T...q
2ddd5000: c969 a259 431e 2a17 12b4 8365 07cb 5e56 .i.YC.*....e..^V
2ddd5010: fa61 327f eb63 1b13 bc30 eb4b c8f0 af14 .a2..c...0.K....
2ddd5020: 6ebe 3f79 9012 7ece 1662 e104 be19 b249 n.?y..~..b.....I
2ddd5030: 9b9c f61d 180b e92a b93b 9980 aba4 ba41 .......*.;.....A
2ddd5040: 0929 fece fc8a 5309 3883 2562 fe2a 459a .)....S.8.%b.*E.
2ddd5050: ba34 fb76 5af3 69d2 9af0 4711 8a0c eae8 .4.vZ.i...G.....
While the actual file is the one as seen from machine02. Yet machine01 is showing something else.
EDIT4: Just to be clear, the length of the files is identical, the md5 is different on each client.