On an NFS mount (standard options on RedHat 5.6 with its ancient 2.6.18 kernel), it seems to me that large or concurrent write operations delay smaller read operations. For example, a simple ls in a directory takes seconds (or minutes) if a cp or dd is running at the same time.
The issue is somewhat mitigated because Linux caches metadata for a few seconds, but as soon as there is a lot of data to write, the NFS mount becomes unusable.
At first I thought this was only an NFS server issue, but running something like this:
for ((i=0; i<60; i++)); do
    strace -f -t -o strace.$i.log time stat /mnt/nfs/data > out.$i.log 2>&1
    sleep 1
    if ((i == 30)); then
        dd if=/dev/zero of=/mnt/nfs/data bs=1M count=1000 &
    fi
done
wait
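To quantify the stall, a quick pass over the strace logs can print the gap between the lstat and the syscall that follows it. This is just a sketch (the awk script and glob are mine, not part of the test above), assuming the -t line format "PID HH:MM:SS syscall(...)":

```shell
# Sketch: for each strace log, print the seconds elapsed between the
# lstat of the NFS file and the next syscall in the same log.
# Assumes strace -t output ("PID HH:MM:SS syscall(...)"), same-day times.
for log in strace.*.log; do
    [ -f "$log" ] || continue
    awk '
        function secs(t) { split(t, p, ":"); return p[1]*3600 + p[2]*60 + p[3] }
        /lstat\(/ { t0 = secs($2); seen = 1; next }
        seen      { print FILENAME ": " secs($2) - t0 " s"; seen = 0 }
    ' "$log"
done
```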
together with a tcpdump capture running in parallel, tells me the following:
1) whenever the dd starts, the next stat that misses the cache takes 15 seconds:
23261 16:41:24 munmap(0x2ad024d0e000, 4096) = 0
23261 16:41:24 lstat("/mnt/fermat_emctest/data", {st_mode=S_IFREG|0600, st_size=1048576000, ...}) = 0
23261 16:41:40 open("/proc/filesystems", O_RDONLY) = 3
2) the tcpdump shows that while the dd is running and WRITE calls are being issued, not a single GETATTR is sent.
Given that RPC is asynchronous, I would have expected the GETATTR calls to be multiplexed with the WRITEs, but that's not the case.
It's not the GETATTR itself that is slow (it takes only a few microseconds once actually submitted); it's the kernel that queues it behind all the WRITEs.
That's why the stat takes ages: it is waiting for the kernel to submit the GETATTR call.
Am I right?
This looks like a bufferbloat-style issue: the kernel is starving my stat because the client-side operation queue for this mount (or for this server?) is full.
I think this is somehow related to my other question, How to achieve multiple NFS/TCP connections to the same server?
Is there a way to tune the kernel NFS ops queue ?
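The only client-side knob I've found so far is the sunrpc slot table size, which caps the number of in-flight RPC requests per transport; I'm not sure it addresses the queueing I'm seeing, and the value of 128 below is just an example. Something like:

```shell
# sunrpc.tcp_slot_table_entries caps the number of concurrent in-flight
# RPCs per transport (default 16 on this kernel).
sysctl sunrpc.tcp_slot_table_entries
# Raise it, then remount: the value only applies to mounts established
# after it is set.
sysctl -w sunrpc.tcp_slot_table_entries=128
umount /mnt/nfs && mount /mnt/nfs
```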