
I've got a NetApp as my NFS server and two Linux servers as the NFS clients. The problem is that the newer of the two servers shows wildly imbalanced read and write speeds whenever it is reading and writing to the NFS server simultaneously. Run separately, reads and writes both look great on the new server. The older server does not have this issue.

Old host: Carp

Sun Fire X4150 w/ 8 cores, 32 GB RAM

SLES 9 SP4

Network driver: e1000

me@carp:~> uname -a
Linux carp 2.6.5-7.308-smp #1 SMP Mon Dec 10 11:36:40 UTC 2007 x86_64 x86_64 x86_64 GNU/Linux

New host: Pepper

HP ProLiant DL360p Gen8 w/ 8 cores, 64 GB RAM

CentOS 6.3

Network driver: tg3

me@pepper:~> uname -a
Linux pepper 2.6.32-279.el6.x86_64 #1 SMP Fri Jun 22 12:19:21 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux

I'll jump straight to some graphs illustrating the read/write tests. Here's pepper and its unbalanced read/write:

pepper throughput

and here is carp, lookin' good:

carp throughput

The tests

Here are the read/write tests I'm running. I've run these separately and they look great on pepper, but when run together (backgrounded with &), write performance remains solid while read performance suffers greatly. The test file is twice the size of RAM (128 GB for pepper, 64 GB for carp).

# write
time dd if=/dev/zero of=/mnt/peppershare/testfile bs=65536 count=2100000 &
# read 
time dd if=/mnt/peppershare/testfile2 of=/dev/null bs=65536 &
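
For the combined run I background both and wait; the read target has to exist beforehand, so the full sequence is roughly the following (the setup step and the explicit wait are a sketch on my part, not shown in the commands above):

# one-time setup: create the read target (~128 GB, twice pepper's RAM) -- sketch
dd if=/dev/zero of=/mnt/peppershare/testfile2 bs=65536 count=2100000

# simultaneous read/write: background both, then wait for both timings to print
time dd if=/dev/zero of=/mnt/peppershare/testfile bs=65536 count=2100000 &
time dd if=/mnt/peppershare/testfile2 of=/dev/null bs=65536 &
wait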

The NFS server hostname is nfsc. The Linux clients have a dedicated NIC on a subnet that's separate from everything else (i.e. a different subnet than the primary IP). Each Linux client mounts an NFS share from server nfsc to /mnt/hostnameshare.
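
For illustration, each mount boils down to something like this (a sketch only; the authoritative options are in the /proc/mounts output further down):

# pepper's mount, roughly -- see /proc/mounts below for the real options
mount -t nfs -o rw,noatime,nodiratime,vers=3,rsize=65536,wsize=65536,actimeo=0,hard,proto=tcp nfsc:/vol/pg003 /mnt/peppershare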

nfsiostat

Here's a 1-minute sample during pepper's simultaneous r/w test:

me@pepper:~> nfsiostat 60

nfsc:/vol/pg003 mounted on /mnt/peppershare:

   op/s         rpc bklog
1742.37            0.00
read:             ops/s            kB/s           kB/op         retrans         avg RTT (ms)    avg exe (ms)
                 49.750         3196.632         64.254        0 (0.0%)           9.304          26.406
write:            ops/s            kB/s           kB/op         retrans         avg RTT (ms)    avg exe (ms)
                1642.933        105628.395       64.293        0 (0.0%)           3.189         86559.380

I don't have nfsiostat on the old host carp yet, but I'm working on it.

/proc/mounts

me@pepper:~> cat /proc/mounts | grep peppershare 
nfsc:/vol/pg003 /mnt/peppershare nfs rw,noatime,nodiratime,vers=3,rsize=65536,wsize=65536,namlen=255,acregmin=0,acregmax=0,acdirmin=0,acdirmax=0,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=172.x.x.x,mountvers=3,mountport=4046,mountproto=tcp,local_lock=none,addr=172.x.x.x 0 0

me@carp:~> cat /proc/mounts | grep carpshare 
nfsc:/vol/pg008 /mnt/carpshare nfs rw,v3,rsize=32768,wsize=32768,acregmin=0,acregmax=0,acdirmin=0,acdirmax=0,timeo=60000,retrans=3,hard,tcp,lock,addr=nfsc 0 0

Network card settings

me@pepper:~> sudo ethtool eth3
Settings for eth3:
        Supported ports: [ TP ]
        Supported link modes:   10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
                                1000baseT/Half 1000baseT/Full
        Supports auto-negotiation: Yes
        Advertised link modes:  10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
                                1000baseT/Half 1000baseT/Full
        Advertised pause frame use: Symmetric
        Advertised auto-negotiation: Yes
        Speed: 1000Mb/s
        Duplex: Full
        Port: Twisted Pair
        PHYAD: 4
        Transceiver: internal
        Auto-negotiation: on
        MDI-X: off
        Supports Wake-on: g
        Wake-on: g
        Current message level: 0x000000ff (255)
        Link detected: yes

me@carp:~> sudo ethtool eth1
Settings for eth1:
        Supported ports: [ TP ]
        Supported link modes:   10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
                                1000baseT/Full
        Supports auto-negotiation: Yes
        Advertised link modes:  10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
                                1000baseT/Full
        Advertised auto-negotiation: Yes
        Speed: 1000Mb/s
        Duplex: Full
        Port: Twisted Pair
        PHYAD: 1
        Transceiver: internal
        Auto-negotiation: on
        Supports Wake-on: umbg
        Wake-on: g
        Current message level: 0x00000007 (7)
        Link detected: yes

Offload settings:

me@pepper:~> sudo ethtool -k eth3
Offload parameters for eth3:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp-segmentation-offload: on
udp-fragmentation-offload: off
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off

me@carp:~> sudo ethtool -k eth1
Offload parameters for eth1:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: on
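
For reference, if offloads turn out to be the culprit, individual features can be toggled with ethtool for a test run (a generic sketch, not something from my tests above; eth3 is pepper's NFS-facing NIC):

# disable segmentation/receive offloads on pepper for one test run (sketch)
sudo ethtool -K eth3 tso off gso off gro off
# ...re-run the dd tests, then restore...
sudo ethtool -K eth3 tso on gso on gro on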

It's all on a LAN with a gigabit switch at full duplex between the NFS clients and the NFS server. On another note, I see quite a bit more I/O wait on the CPU for pepper than carp, as expected, since I suspect it's waiting on NFS operations.

I've captured packets with Wireshark/Ethereal, but I'm not strong in that area, so I'm not sure what to look for. I don't see a bunch of packets highlighted in red/black in Wireshark, so that's about all I looked for :). This poor NFS performance has manifested in our Postgres environments.

Any further thoughts or troubleshooting tips? Let me know if I can provide further information.

UPDATE

Per @ewwhite's comment, I tried two different tuned-adm profiles, but no change.
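
For reference, switching profiles is just one command per profile (these are the invocations from the comment thread below, run as root):

# switch tuned profile, then re-run the dd tests
tuned-adm profile throughput-performance
# ...dd tests...
tuned-adm profile enterprise-storage
# ...dd tests...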

To the right of my red mark are two more tests. The first hill is with the throughput-performance profile and the second is with enterprise-storage.

pepper adm tuned

nfsiostat 60 sample with the enterprise-storage profile:

nfsc:/vol/pg003 mounted on /mnt/peppershare:

   op/s         rpc bklog
1758.65            0.00
read:             ops/s            kB/s           kB/op         retrans         avg RTT (ms)    avg exe (ms)
                 51.750         3325.140         64.254        0 (0.0%)           8.645          24.816
write:            ops/s            kB/s           kB/op         retrans         avg RTT (ms)    avg exe (ms)
                1655.183        106416.517       64.293        0 (0.0%)           3.141         159500.441

Update 2

sysctl -a for pepper

Banjer
  • Are you using the `tuned-adm` framework at all? – ewwhite Feb 01 '13 at 15:18
  • Can you test using the guidelines I [posted here](http://serverfault.com/questions/430955/openfiler-iscsi-performance/431112#431112) and report back? – ewwhite Feb 01 '13 at 15:20
  • OK I switched profiles with `tuned-adm profile throughput-performance` and just fired the `dd` tests again. – Banjer Feb 01 '13 at 15:34
  • Enterprise Storage... Try it first. Look at the chart in the link or check some of my [other answers about it](http://serverfault.com/search?q=user%3A13325+tuned-adm). – ewwhite Feb 01 '13 at 15:34
  • OK trying `enterprise-storage` instead... – Banjer Feb 01 '13 at 15:39
  • Updated with results. The results look the same. – Banjer Feb 01 '13 at 16:00
  • Time to investigate `sysctl.conf`. – ewwhite Feb 01 '13 at 16:14
  • Have you tried setting the `rsize=32768,wsize=32768` options for pepper's nfs mount? You might consider optimizing those parameters. – Daniel t. Feb 01 '13 at 16:23
  • @Danielt. Yes, I've tried those exact settings, as well as dropping it down to 16384, but no change in results. Updated with `sysctl -a` output. Any particular settings I should focus on? – Banjer Feb 01 '13 at 16:29
  • I've tried playing with many different sysctl, ethtool, and cpu settings, and nothing has made even the tiniest change in the issue I'm seeing. I ran tests after each small change. That's strange to me, that no change shows up with all the various settings I've tried. I get the feeling that it's a hardware and/or driver issue. I've got SLES 11 SP2 running on the same model (HP DL360p Gen8), and its postgres performance has also suffered. I have not benchmarked it thoroughly though. I'm going to try totally different hardware with the same OS as pepper (CentOS 6.3) and settings to see what happens. – Banjer Feb 01 '13 at 20:05
  • I'm having some breakthrough results using the `noac` mount option, which is helping to totally balance out the read/writes, but still maintain ~ 100 MB/s throughput. I'll provide a full report once I'm done testing. – Banjer Feb 03 '13 at 13:49

1 Answer


Adding the noac NFS mount option in fstab was the silver bullet. The total throughput has not changed and is still around 100 MB/s, but my reads and writes are much more balanced now, which I have to imagine will bode well for Postgres and other applications.

pepper throughput with noac (various rsize/wsize values marked)

You can see I marked the various "block" sizes I used when testing, i.e. the rsize/wsize buffer size mount options. I found that an 8k size had the best throughput for the dd tests, surprisingly.

These are the NFS mount options I'm now using, per /proc/mounts:

nfsc:/vol/pg003 /mnt/peppershare nfs rw,sync,noatime,nodiratime,vers=3,rsize=8192,wsize=8192,namlen=255,acregmin=0,acregmax=0,acdirmin=0,acdirmax=0,hard,noac,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=172.x.x.x,mountvers=3,mountport=4046,mountproto=tcp,local_lock=none,addr=172.x.x.x 0 0
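
The actual fstab line isn't shown here, but reconstructed from the /proc/mounts output above it would look roughly like this (a sketch; actimeo=0 collapses the four ac* options reported by the kernel):

# /etc/fstab on pepper -- approximate reconstruction, not copied from the real file
nfsc:/vol/pg003  /mnt/peppershare  nfs  rw,noatime,nodiratime,vers=3,rsize=8192,wsize=8192,actimeo=0,noac,hard,proto=tcp,timeo=600,retrans=2  0 0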

FYI, the noac option man entry:

ac / noac

Selects whether the client may cache file attributes. If neither option is specified (or if ac is specified), the client caches file attributes.

To improve performance, NFS clients cache file attributes. Every few seconds, an NFS client checks the server's version of each file's attributes for updates. Changes that occur on the server in those small intervals remain undetected until the client checks the server again. The noac option prevents clients from caching file attributes so that applications can more quickly detect file changes on the server.

In addition to preventing the client from caching file attributes, the noac option forces application writes to become synchronous so that local changes to a file become visible on the server immediately. That way, other clients can quickly detect recent writes when they check the file's attributes.

Using the noac option provides greater cache coherence among NFS clients accessing the same files, but it extracts a significant performance penalty. As such, judicious use of file locking is encouraged instead. The DATA AND METADATA COHERENCE section contains a detailed discussion of these trade-offs.

I read mixed opinions on attribute caching around the web, so my only thought is that it's an option that is necessary, or at least plays well, with a NetApp NFS server and/or Linux clients with newer kernels (>2.6.5). We didn't see this issue on SLES 9, which has a 2.6.5 kernel.

I also read mixed opinions on rsize/wsize; usually you take the default, which for my systems is currently 65536, but 8192 gave me the best test results. We'll be doing some benchmarks with Postgres too, so we'll see how these various buffer sizes fare.
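
If anyone wants to repeat the buffer-size comparison, a fresh mount between runs is the reliable way to change rsize/wsize (a sketch; the 8192 values are from my tests, everything else mirrors the mount options above):

# change rsize/wsize between benchmark runs (sketch) -- a fresh mount is the safe way to apply them
umount /mnt/peppershare
mount -t nfs -o rw,noatime,nodiratime,vers=3,noac,actimeo=0,hard,proto=tcp,rsize=8192,wsize=8192 nfsc:/vol/pg003 /mnt/peppershare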

Banjer