
I am wondering why NFS v4 would be so much faster than NFS v3 and if there are any parameters on v3 that could be tweaked.

I mount a file system

sudo mount  -o  'rw,bg,hard,nointr,rsize=1048576,wsize=1048576,vers=4'  toto:/test /test

and then run

 dd if=/test/file  of=/dev/null bs=1024k
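For the v3 comparison I remount the same export with vers=3 (per the comments below, adding proto=tcp explicitly made no difference):

 sudo mount -o 'rw,bg,hard,nointr,rsize=1048576,wsize=1048576,vers=3,proto=tcp' toto:/test /test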

I can read at 200-400MB/s, but when I change the version to vers=3, remount, and rerun the dd, I only get 90MB/s. The file I'm reading from is an in-memory file on the NFS server. Both sides of the connection are Solaris with 10GbE NICs. I avoid any client-side caching by remounting between all tests, and I used dtrace on the server to measure how fast data is being served via NFS. For both v3 and v4 I changed:

 nfs4_bsize
 nfs3_bsize

from the default 32K to 1M (with 32K, v4 maxed out at 150MB/s). I've also tried tweaking

  • nfs3_max_threads
  • clnt_max_conns
  • nfs3_async_clusters

to improve the v3 performance, but no go.
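For reference, a sketch of where these knobs live on Solaris. The bsize values are the 1M settings used above; the other numbers are example values only, and /etc/system entries take effect at the next reboot (a value can also be poked into the running kernel with mdb -kw):

 # /etc/system entries (example values; reboot to apply)
 set nfs:nfs3_bsize=0x100000          # max NFSv3 client block size, default 32K
 set nfs:nfs4_bsize=0x100000          # max NFSv4 client block size, default 32K
 set nfs:nfs3_max_threads=32          # async worker threads per v3 mount
 set rpcmod:clnt_max_conns=8          # TCP connections per client/server pair, default 1
 set nfs:nfs3_async_clusters=4        # async operations clustered per mount

 # or change a value on the live system, e.g.
 echo "nfs3_bsize/W 0t1048576" | mdb -kw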

On v3, if I run four parallel dd's, the throughput drops from 90MB/s to 70-80MB/s, which leads me to believe the problem is some shared resource; if so, I'm wondering what that resource is and whether I can increase it.
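The parallel run is just multiple copies of the same dd against the v3 mount, e.g. (a sketch; whether each reader hits the same cached file or its own copy is a detail of the test):

 # four concurrent 1MB-block sequential reads over NFSv3
 for i in 1 2 3 4; do
     dd if=/test/file of=/dev/null bs=1024k &
 done
 wait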

dtrace code to get window sizes:

#!/usr/sbin/dtrace -s
#pragma D option quiet
#pragma D option defaultargs
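
/*
 * Optional first argument: a remote IP address to filter on (ADDR below);
 * with no argument, traffic to/from all peers is shown.
 * Prints one line per TCP send (and per receive on connections already
 * seen sending), with unacked bytes, send/receive window sizes,
 * congestion window, TCP flags and the retransmit indicator.
 */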

inline string ADDR=$$1;

dtrace:::BEGIN
{
       TITLE = 10;
       title = 0;
       printf("starting up ...\n");
       self->start = 0;
}

tcp:::send, tcp:::receive
/   self->start == 0  /
{
     walltime[args[1]->cs_cid]= timestamp;
     self->start = 1;
}

tcp:::send, tcp:::receive
/   title == 0  &&
     ( ADDR == NULL || args[3]->tcps_raddr == ADDR  ) /
{
      printf("%4s %15s %6s %6s %6s %8s %8s %8s %8s %8s  %8s %8s %8s  %8s %8s\n",
        "cid",
        "ip",
        "usend"    ,
        "urecd" ,
        "delta"  ,
        "send"  ,
        "recd"  ,
        "ssz"  ,
        "sscal"  ,
        "rsz",
        "rscal",
        "congw",
        "conthr",
        "flags",
        "retran"
      );
      title = TITLE ;
}

tcp:::send
/     ( ADDR == NULL || args[3]->tcps_raddr == ADDR ) /
{
    nfs[args[1]->cs_cid]=1; /* this is an NFS thread */
    this->delta= timestamp-walltime[args[1]->cs_cid];
    walltime[args[1]->cs_cid]=timestamp;
    this->flags="";
    this->flags= strjoin((( args[4]->tcp_flags & TH_FIN ) ? "FIN|" : ""),this->flags);
    this->flags= strjoin((( args[4]->tcp_flags & TH_SYN ) ? "SYN|" : ""),this->flags);
    this->flags= strjoin((( args[4]->tcp_flags & TH_RST ) ? "RST|" : ""),this->flags);
    this->flags= strjoin((( args[4]->tcp_flags & TH_PUSH ) ? "PUSH|" : ""),this->flags);
    this->flags= strjoin((( args[4]->tcp_flags & TH_ACK ) ? "ACK|" : ""),this->flags);
    this->flags= strjoin((( args[4]->tcp_flags & TH_URG ) ? "URG|" : ""),this->flags);
    this->flags= strjoin((( args[4]->tcp_flags & TH_ECE ) ? "ECE|" : ""),this->flags);
    this->flags= strjoin((( args[4]->tcp_flags & TH_CWR ) ? "CWR|" : ""),this->flags);
    this->flags= strjoin((( args[4]->tcp_flags == 0 ) ? "null " : ""),this->flags);
    printf("%5d %14s %6d %6d %6d %8d \ %-8s %8d %6d %8d  %8d %8d %12d %s %d  \n",
        args[1]->cs_cid%1000,
        args[3]->tcps_raddr  ,
        args[3]->tcps_snxt - args[3]->tcps_suna ,
        args[3]->tcps_rnxt - args[3]->tcps_rack,
        this->delta/1000,
        args[2]->ip_plength - args[4]->tcp_offset,
        "",
        args[3]->tcps_swnd,
        args[3]->tcps_snd_ws,
        args[3]->tcps_rwnd,
        args[3]->tcps_rcv_ws,
        args[3]->tcps_cwnd,
        args[3]->tcps_cwnd_ssthresh,
        this->flags,
        args[3]->tcps_retransmit
      );
    this->flags=0;
    title--;
    this->delta=0;
}

tcp:::receive
/ nfs[args[1]->cs_cid] &&  ( ADDR == NULL || args[3]->tcps_raddr == ADDR ) /
{
    this->delta= timestamp-walltime[args[1]->cs_cid];
    walltime[args[1]->cs_cid]=timestamp;
    this->flags="";
    this->flags= strjoin((( args[4]->tcp_flags & TH_FIN ) ? "FIN|" : ""),this->flags);
    this->flags= strjoin((( args[4]->tcp_flags & TH_SYN ) ? "SYN|" : ""),this->flags);
    this->flags= strjoin((( args[4]->tcp_flags & TH_RST ) ? "RST|" : ""),this->flags);
    this->flags= strjoin((( args[4]->tcp_flags & TH_PUSH ) ? "PUSH|" : ""),this->flags);
    this->flags= strjoin((( args[4]->tcp_flags & TH_ACK ) ? "ACK|" : ""),this->flags);
    this->flags= strjoin((( args[4]->tcp_flags & TH_URG ) ? "URG|" : ""),this->flags);
    this->flags= strjoin((( args[4]->tcp_flags & TH_ECE ) ? "ECE|" : ""),this->flags);
    this->flags= strjoin((( args[4]->tcp_flags & TH_CWR ) ? "CWR|" : ""),this->flags);
    this->flags= strjoin((( args[4]->tcp_flags == 0 ) ? "null " : ""),this->flags);
    printf("%5d %14s %6d %6d %6d %8s / %-8d %8d %6d %8d  %8d %8d %12d %s %d  \n",
        args[1]->cs_cid%1000,
        args[3]->tcps_raddr  ,
        args[3]->tcps_snxt - args[3]->tcps_suna ,
        args[3]->tcps_rnxt - args[3]->tcps_rack,
        this->delta/1000,
        "",
        args[2]->ip_plength - args[4]->tcp_offset,
        args[3]->tcps_swnd,
        args[3]->tcps_snd_ws,
        args[3]->tcps_rwnd,
        args[3]->tcps_rcv_ws,
        args[3]->tcps_cwnd,
        args[3]->tcps_cwnd_ssthresh,
        this->flags,
        args[3]->tcps_retransmit
      );
    this->flags=0;
    title--;
    this->delta=0;
}

Output looks like this (not from this particular situation):

cid              ip  usend  urecd  delta     send     recd      ssz    sscal      rsz     rscal    congw   conthr     flags   retran
  320 192.168.100.186    240      0    272      240 \             49232      0  1049800         5  1049800         2896 ACK|PUSH| 0
  320 192.168.100.186    240      0    196          / 68          49232      0  1049800         5  1049800         2896 ACK|PUSH| 0
  320 192.168.100.186      0      0  27445        0 \             49232      0  1049800         5  1049800         2896 ACK| 0
   24 192.168.100.177      0      0 255562          / 52          64060      0    64240         0    91980         2920 ACK|PUSH| 0
   24 192.168.100.177     52      0    301       52 \             64060      0    64240         0    91980         2920 ACK|PUSH| 0

Some of the column headers:

usend - unacknowledged send bytes
urecd - unacknowledged received bytes
ssz - send window
rsz - receive window
congw - congestion window

I'm planning on taking snoop captures of the dd's over v3 and v4 and comparing them. I've already done this once, but there was too much traffic, and I used a disk file instead of a cached file, which made comparing timings meaningless. I will run more snoop captures with cached data and no other traffic between the boxes. TBD
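Something along these lines, once per protocol version (ixgbe0 is a placeholder for the 10GbE interface name):

 # capture a single cached-file dd run, NFS traffic only
 snoop -d ixgbe0 -o /tmp/dd_v3.snoop host toto and port 2049

 # replay later with inter-packet delta times, for comparison against the v4 capture
 snoop -i /tmp/dd_v3.snoop -t d | head -50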

Additionally, the network guys say there are no traffic shapers or bandwidth limiters on the connections.

Kyle Hailey
  • Well, for one thing, NFSv4 runs on TCP by default instead of UDP. – Phil Hollenback Aug 29 '11 at 23:58
  • AFAIK Solaris, unlike Linux, mounts over TCP by default even on v3. For the v3 tests I also explicitly set "proto=tcp" in some of the tests, but v3 performance was the same with or without "proto=tcp". – Kyle Hailey Aug 30 '11 at 00:10
  • Have you already enabled jumbo frames on the switching infrastructure and server NICs? – polynomial Aug 30 '11 at 01:16
  • Yes, jumbo frames are set up and verified. With dtrace I can see the packet sizes. – Kyle Hailey Aug 30 '11 at 04:33
  • You might want to review documentation about the protocol differences between the two and see if anything jumps out at you. – mtinberg Aug 31 '11 at 17:43
  • Actually, Linux also defaults to mounting with TCP. – janneb Aug 31 '11 at 18:57
  • You need to provide the results of what you've measured with dtrace for anyone to make more sense of your problem. NFSv4 should not provide more throughput or anything performance related; if anything it should be marginally slower. In fact Sun used to recommend, as of 2010, that one should use NFSv3 if performance is the main goal. As a side note: what is the file system that is exported? – pfo Aug 31 '11 at 23:02
  • I've used both zfs and ufs for the tests; results were the same in both cases. – Kyle Hailey Sep 05 '11 at 23:15
  • dtrace code posted with the question now. – Kyle Hailey Sep 07 '11 at 00:44

1 Answer


NFS 4.1 (minor version 1) is designed to be a faster and more efficient protocol, and it is recommended over previous versions, especially 4.0.

This includes client-side caching and, although not relevant in this scenario, parallel NFS (pNFS). The major change is that the protocol is now stateful.

http://www.netapp.com/us/communities/tech-ontap/nfsv4-0408.html

I think it is the recommended protocol when using NetApps, judging by their performance documentation. The technology is similar to Windows Vista+ opportunistic locking.

NFSv4 differs from previous versions of NFS by allowing a server to delegate specific actions on a file to a client to enable more aggressive client caching of data and to allow caching of the locking state. A server cedes control of file updates and the locking state to a client via a delegation. This reduces latency by allowing the client to perform various operations and cache data locally. Two types of delegations currently exist: read and write. The server has the ability to call back a delegation from a client should there be contention for a file. Once a client holds a delegation, it can perform operations on files whose data has been cached locally to avoid network latency and optimize I/O. The more aggressive caching that results from delegations can be a big help in environments with the following characteristics:

  • Frequent opens and closes
  • Frequent GETATTRs
  • File locking
  • Read-only sharing
  • High latency
  • Fast clients
  • Heavily loaded server with many clients
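Two quick checks that go with this (a sketch; the sharectl syntax assumes Solaris 11, on Solaris 10 the equivalent switch is NFS_SERVER_DELEGATION in /etc/default/nfs):

 # on the client: show the version and options each mount actually negotiated
 nfsstat -m /test

 # on the server: is NFSv4 delegation enabled?
 sharectl get -p server_delegation nfs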
Steve-o
  • Thanks for the pointers on NFS 4.1, though AFAIK we are on 4.0. – Kyle Hailey Sep 07 '11 at 00:43
  • Actually, the client-side caching changes came in with 4.0 and may be the biggest performance difference for writing, as you can see from the v4 extract: "NFSv4 ... delegate ... to a client". I just noticed the question was about reading, so I'm not sure how relevant most of this is for that case. – Peter Jan 22 '15 at 21:50