On an NFS mount (standard options on RedHat 5.6 with its ancient 2.6.18 kernel), it seems to me that large or concurrent write operations delay smaller read operations. For example, a simple ls in a directory takes seconds (or minutes) if a cp or dd is running at the same time.
The issue is somewhat mitigated because Linux caches metadata for a few seconds, but as soon as there is a lot of data to write, the NFS mount becomes unusable.
At first I thought this was only an NFS server issue, but running something like this:
for ((i=0; i<60; i++)); do
    strace -f -t -o strace.$i.log time stat /mnt/nfs/data > out.$i.log 2>&1
    sleep 1
    if ((i == 30)); then
        dd if=/dev/zero of=/mnt/nfs/data bs=1M count=1000 &
    fi
done
wait
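To quantify the stall, a quick pass over the strace logs can print the gap between the lstat and the syscall that follows it. This is just a sketch (the awk script and glob are mine, not part of the test above), assuming the -t line format "PID HH:MM:SS syscall(...)":

```shell
# Sketch: for each strace log, print the seconds elapsed between the
# lstat of the NFS file and the next syscall in the same log.
# Assumes strace -t output ("PID HH:MM:SS syscall(...)"), same-day times.
for log in strace.*.log; do
    [ -f "$log" ] || continue
    awk '
        function secs(t) { split(t, p, ":"); return p[1]*3600 + p[2]*60 + p[3] }
        /lstat\(/ { t0 = secs($2); seen = 1; next }
        seen      { print FILENAME ": " secs($2) - t0 " s"; seen = 0 }
    ' "$log"
done
```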
together with a tcpdump capture running in parallel, tells me the following:
1) whenever the dd starts, the next stat that misses the cache takes 15 seconds:
23261 16:41:24 munmap(0x2ad024d0e000, 4096) = 0
23261 16:41:24 lstat("/mnt/fermat_emctest/data", {st_mode=S_IFREG|0600, st_size=1048576000, ...}) = 0
23261 16:41:40 open("/proc/filesystems", O_RDONLY) = 3
2) the tcpdump shows that while the dd is running and WRITE calls are being issued, not a single GETATTR is sent.
Given that RPC is asynchronous, I would have expected the GETATTR calls to be multiplexed with the WRITEs, but that's not the case.
It's not the GETATTR itself that is slow (it takes only a few microseconds once actually submitted); it's the kernel that queues it behind all the WRITEs.
That's why the stat takes ages: it is waiting for the kernel to submit the GETATTR call.
Am I right?
This looks like a bufferbloat-style issue: the kernel is starving my stat because the client-side operation queue for this mount (or for this server?) is full.
I think this is somehow related to my other question, How to achieve multiple NFS/TCP connections to the same server?
Is there a way to tune the kernel NFS ops queue ?
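The only client-side knob I've found so far is the sunrpc slot table size, which caps the number of in-flight RPC requests per transport; I'm not sure it addresses the queueing I'm seeing, and the value of 128 below is just an example. Something like:

```shell
# sunrpc.tcp_slot_table_entries caps the number of concurrent in-flight
# RPCs per transport (default 16 on this kernel).
sysctl sunrpc.tcp_slot_table_entries
# Raise it, then remount: the value only applies to mounts established
# after it is set.
sysctl -w sunrpc.tcp_slot_table_entries=128
umount /mnt/nfs && mount /mnt/nfs
```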