Slow copying between NFS/CIFS directories on same server

Question

Please bear with me, I know it's a lot to read. This problem may be applicable to others, so it would be great to have an answer. I had to give away the bounty because it was going to expire.

When I copy to or from my NFS server (Debian) from a client (Ubuntu), it maxes out the gigabit. But, when I copy between two directories on the same server, it's speed bounces around between < 30MB/sec up to over 100MB/sec. Most of the time it's around 50MB/sec.

The same copy performed directly on the NFS server (local disks) I get 100-150 MB/sec, sometimes more. A file copy between this NFS export and a CIFS share exported from the same directory on the same server is just as slow and a copy between two directories over CIFS on the same server is slow. iperf shows bidirectional speed is 941Mb/940Mb between the client and server.

I made sure NFS is using async on the server. I also disabled sync on the ZFS dataset and tried removing the ZFS cache and log devices.

I've tested on a very fast ZFS striped mirror of 4x2TB disks, with an SSD for log and cache devices.

NFS server specs:

Debian 8.2 core 4Ghz AMD-FX
32GB ram
ZFS raid 10, SSD cache/log
17GB ARC
4x2GB WD red drives
Intel 82574L NIC

Test client:

Ubuntu 15.04, Core2Quad 2.4Ghz
8GB ram
SSD
Intel 82574L NIC

This is how things are currently set up. /pool2/Media is the share I've been testing with.

/etc/fstab on client:

UUID=575701cc-53b1-450c-9981-e1adeaa283f0 /               ext4        errors=remount-ro,discard,noatime,user_xattr 0       1
UUID=16e505ad-ab7d-4c92-b414-c6a90078c400 none            swap    sw              0       0 
/dev/fd0        /media/floppy0  auto    rw,user,noauto,exec,utf8 0       0
tmpfs    /tmp    tmpfs   mode=1777       0       0


igor:/pool2/other     /other        nfs         soft,bg,nfsvers=4,intr,rsize=65536,wsize=65536,timeo=50,nolock
igor:/pool2/Media       /Media          nfs     soft,bg,nfsvers=4,intr,rsize=65536,wsize=65536,timeo=50,nolock,noac
igor:/pool2/home        /nfshome        nfs     soft,bg,nfsvers=4,intr,rsize=65536,wsize=65536,timeo=50,nolock

/etc/exports on server (igor):

#LAN
/pool2/home 192.168.1.0/24(rw,sync,no_subtree_check,no_root_squash)
/pool2/other 192.168.1.0/24(rw,sync,no_subtree_check,no_root_squash)
/pool2/Media 192.168.1.0/24(rw,async,no_subtree_check,no_root_squash)
/test 192.168.1.0/24(rw,async,no_subtree_check,no_root_squash)

#OpenVPN
/pool2/home 10.0.1.0/24(rw,sync,no_subtree_check,no_root_squash)
/pool2/other 10.0.1.0/24(rw,sync,no_subtree_check,no_root_squash)
/pool2/Media 10.0.1.0/24(rw,sync,no_subtree_check,no_root_squash)

zpool status:

  pool: pool2
 state: ONLINE
  scan: scrub repaired 0 in 6h10m with 0 errors on Sat Oct  3 08:10:26 2015
config:

        NAME                                                 STATE     READ WRITE CKSUM
        pool2                                                ONLINE       0     0     0
          mirror-0                                           ONLINE       0     0     0
            ata-WDC_WD20EFRX-68AX9N0_WD-WMC300004469         ONLINE       0     0     0
            ata-WDC_WD20EFRX-68EUZN0_WD-WCC4MLK57MVX         ONLINE       0     0     0
          mirror-1                                           ONLINE       0     0     0
            ata-WDC_WD20EFRX-68AX9N0_WD-WCC1T0429536         ONLINE       0     0     0
            ata-WDC_WD20EFRX-68EUZN0_WD-WCC4M0VYKFCE         ONLINE       0     0     0
        logs
          ata-KINGSTON_SV300S37A120G_50026B7751153A9F-part1  ONLINE       0     0     0
        cache
          ata-KINGSTON_SV300S37A120G_50026B7751153A9F-part2  ONLINE       0     0     0

errors: No known data errors

  pool: pool3
 state: ONLINE
  scan: scrub repaired 0 in 3h13m with 0 errors on Sat Oct  3 05:13:33 2015
config:

        NAME                                        STATE     READ WRITE CKSUM
        pool3                                       ONLINE       0     0     0
          ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E5PSCNYV  ONLINE       0     0     0

errors: No known data errors

/pool2 bonnie++ on server:

Version  1.97       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
igor            63G   100  99 187367  44 97357  24   325  99 274882  27 367.1  27

Bonding

I tried bonding and with a direct connection, balance-rr bonding, I get 220MB/sec read and 117MB/sec write, 40-50MB/sec copy.

iperf with bonding

[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-10.00  sec  1.10 GBytes   942 Mbits/sec  707             sender
[  4]   0.00-10.00  sec  1.10 GBytes   941 Mbits/sec                  receiver
[  6]   0.00-10.00  sec  1.06 GBytes   909 Mbits/sec  672             sender
[  6]   0.00-10.00  sec  1.06 GBytes   908 Mbits/sec                  receiver
[SUM]   0.00-10.00  sec  2.15 GBytes  1.85 Gbits/sec  1379             sender
[SUM]   0.00-10.00  sec  2.15 GBytes  1.85 Gbits/sec                  receiver

Bonnie++ with bonding over NFS

Version  1.97       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
haze            16G  1442  99 192941  16 89157  15  3375  96 179716  13  6082  77

With the ssd cache/log removed, copying over NFS, iostat shows this

sdb               0.80     0.00   67.60  214.00  8561.60 23689.60   229.06     1.36    4.80   14.77    1.64   1.90  53.60
sdd               0.80     0.00   54.60  214.20  7016.00 23689.60   228.46     1.37    5.14   17.41    2.01   2.15  57.76
sdc               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sde               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sda               1.60     0.00  133.00  385.20 17011.20 45104.00   239.73     2.24    4.31   12.29    1.56   1.57  81.60
sdf               0.40     0.00  121.40  385.40 15387.20 45104.00   238.72     2.36    4.63   14.29    1.58   1.62  82.16
sdg               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
md0               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdh               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00

TMPFS

I exported a tmpfs over NFS and did a file copy - The speed was 108MB/sec. Local from the server, it is 410MB/sec.

zvol mounted over NFS

The speed bounces around between < 50MB/sec up to > 180MB/sec, but averages out to about 100MB/sec. This is about what I'm looking for. This zvol is on the same pool (pool2) as I've been testing on. This really makes me think this is more of a ZFS dataset/caching type issue.

Raw disk read test

Using this command

dd if=/dev/disk/by-id/ata-WDC_WD20EFRX-68AX9N0_WD-WMC300004469 of=/dev/null bs=1M count=2000

I get 146-148MB/sec for all 4 disks

Slow, uneven disk usage in pool

Thanks to a very helpful person on the ZFS mailing list, I know what to do to get more even usage of the disks.

The reason for ZFS to prefer mirror-1 is that it seems to be added after mirror-0 had been filled quite a bit, now ZFS is trying to rebalance the fill level.

In case you want to get rid of that and have some time: Iteratively zfs send the datasets of the pool to new datasets on itself, then destroy the source, repeat until pool is rebalanced.

I've fixed this, the data is level across all disks now This has resulted in a 75MB/sec copy speed over NFS. And 118MB/sec local.

The question

My question(s). If you can answer any one of the questions I will accept your answer:

How can my problem be solved? (slow copy over NFS, but not local)
If you can't answer #1, can you try this on your comparable NFS server with ZFS on Linux and tell me the results so I have something to compare it to?
If you can't answer #1 or #2, can you try the same testing on a similar but non-ZFS server over NFS?

I tried adding a second network card to the server, same problem with that one. Then mounted the same NFS share from both interfaces, can copied from one to the other. It didn't make a difference. — Ryan Babchishin, Oct 01 '15 at 05:13
I created an NFS mount for a second, raid-1 array root fs, capable of writing at 100MB/sec copying from the original array on the same server and mounted it from the client. I copied large files from the original share to this new share, and it's still slow. — Ryan Babchishin, Oct 01 '15 at 05:21
I tried from a laptop with 802.11ac at 866Mbps. The file copy was even slower, at 11MB/sec - 45MB/sec. — Ryan Babchishin, Oct 03 '15 at 16:14
For NFS: what are the export options on the server and what are the mount options on the client? — Mark Wagner, Oct 08 '15 at 19:53
I forgot to mention, I also tried bypassing the switch in in case it was the problem, with a direct cable. — Ryan Babchishin, Oct 08 '15 at 20:33
It could be that you are maxing out your nic. Try these if possible: 1) from client write to nfs server and another process to read from another dir from nfs server (simul copy), 2) if you have to nics (or more) on both server and client try bonding. This might also be you solution if nic is bottleneck. — Manwe, Oct 12 '15 at 06:17
I'm glad you asked... that test came out weird. 93.4MB/sec read, 44MB/sec write. Write speed without the concurrent read is 97.5MB/sec. This shouldn't be maxing out my cards, as they should be capable of 1Gb in both directions simultaneously, unless there's something wrong. I can try bonding, I had thought about it. I just thought it was pointless. — Ryan Babchishin, Oct 12 '15 at 07:09
That actually supports my hypotesis a bit. It could be that tcp/ip + nfs overhead when writing is causing the problem (as downstream is congested with read). Try actimeo=300 or something as ridiculous. Also are you writing small files or large one? jumbo frames might help (or slow down). I'd also try changing wsize and rsize (smaller and larger) to see if it impacts your workload — Manwe, Oct 12 '15 at 10:40
I've tried 32768, 65536 and default rsize and wsize and jumbo frames, no help. The files I'm testing with are large 1GB+ which were creating by combining videos together to avoid compression from affecting the speed. Switching from noac to actimeo=300 seems like it bumped up to a more consistent 70MB/sec, not bouncing around as much. atop shows about 650Mb/650Mb for in/out usage on the card during testing. According to atop, the card is still not maxed, though I don't know how accurate that is. — Ryan Babchishin, Oct 12 '15 at 14:04
Maybe the speed is not better, just inconsistent. I decided to move a bunch of large files around (for different reasons) and it is way slow again < 50MB/sec. — Ryan Babchishin, Oct 12 '15 at 15:13
I'm running out of ideas, but one more thing you could try. I assume that you cannot reformat the pool2, so create a large enough image file, format it with lets say ext2, mount and share with nfs. Altough loop image is slower than real mount it should still eliminate zfs index access as only blocks change in image. — Manwe, Oct 13 '15 at 05:46
You have many different questions here - please edit your question and number them clearly and ask the right question for each number. Else the answer will be just "42". — Nils, Oct 14 '15 at 20:15
@Nils I have 2 very clear questions that are numbered at the bottom of the question. — Ryan Babchishin, Oct 14 '15 at 20:24
@Manwe Thank you, that narrowed things down a lot. I've updated the question. — Ryan Babchishin, Oct 14 '15 at 20:25
This is getting to big and complicated. There seems to be a ZFS issue, that may be affecting the NFS performance. I'm thinking of opening a new question that is more specific to the ZFS issue, then returning to this question when I have that sorted out. — Ryan Babchishin, Oct 20 '15 at 22:13

Craig Estey · Answer 1 · 2015-10-15T06:39:37.960

Hmm ... I did notice a few issues and I think I found a smoking gun or two. But, first I'll ask a few questions and make assumptions about your probable answers. I will present some data that will seem irrelevant at first, but I promise, it will be worth the read. So, please, wait for it ... :-)

I'm assuming that by raid10, you have four drives total stripe+redundant.
And, that you're using Linux autoraid (vs. a hardware raid controller).
I'm also assuming that all SATA ports can transfer independently of one another at full transfer speed, bidirectionally, and that all SATA ports are of equally high speed. That is, if you've got a single SATA adapter/controller it is fully capable of running all disks connected to it at rated speed.
I'm also assuming you've got the latest SATA spec drives + controller. That is, 6.0Gb/s. That's 600MB/sec. To be conservative, let's assume we get half that, or 300MB/sec
The client-to-server is NIC limited (at 100MB/s), so it can't stress the drives enough.
In order to go faster than the NIC, when doing NFS-to-NFS, I'm assuming you're using localhost, so you can go beyond NIC limited speeds (which I think you said you did bonding to show that's not an issue)

ISSUE #1. Your reported transfer rates for even the fast local-to-local seem low. With disks that fast, I would expect better than 150MB/s. I have a 3-disk raid0 system that only does 3.0Gb/s [adapter limited] and I can get 450 MB/s striped. Your disks/controller are 2x the speed of mine, so, I would expect [because of the striping] for you to get 300MB/sec not just 150MB/sec for local-to-local. Or, maybe even 600MB/sec [minus FS overhead which might cut it in half for sake of discussion]

From your zpool information, I noticed that your disk configuration is Western Digital and it is:

mirror-0
  ata-WDC_WD20EFRX-68AX9N0
  ata-WDC_WD20EFRX-68EUZN0
mirror-1
  ata-WDC_WD20EFRX-68AX9N0
  ata-WDC_WD20EFRX-68EUZN0

Now let's compare this to your iostat information. It might be nice to have iostat info on all drives for all tests, but I believe I can diagnose the problem with just what you provided
sdb and sdd are maxed out
As you noted, this is strange. I would expect all drives to have balanced usage/stats in a raid10. This is the [my] smoking gun.
Combining the two. The maxed out drives are a slightly different model than the ones that are not. I presume the zpool order is sda/sdb sdc/sdd [but it might be reversed]
sda/sdc are 68AX9N0
sdb/sdd are 68EUZN0

ISSUE #2. From a google search on WD20EFRX + 68AX9N0 + 68EUZN0, I found this page: http://forums.whirlpool.net.au/archive/2197640

It seems that the 68EUZN0 drives can park their heads after about 8 seconds whereas the other is smarter about this [or vice versa].

So, given NFS caching + FS caching + SSD caching, the underlying drives may be going idle and parking their heads. My guess is that the extra layer of caching of NFS is what tips it over the edge.

You can test this by varying the FS sync options, maybe sync is better than async. Also, if you can, I'd rerun the tests with SSD caching off. The idea is to ensure that parking does not occur and see the results.

As mentioned on the web page, there are some utilities that can adjust the park delay interval. If that's the option, be sure to research it thoroughly.

UPDATE:

Your problem can be viewed as a throughput problem through a store-and-forward [with guaranteed delivery] network. Note, I'm not talking about the NIC or equiv.

Consider that an I/O operation is like a packet containing a request (e.g. read/write, buf_addr, buf_len) that gets stored in a struct. This request packet/struct gets passed between the various cache layers: NFS, ZFS, device driver, SATA controller, hard disk. At each point, you have an arrival time at the layer, and a departure time when the request is forwarded to the next layer.

In this context, the actual disk transfer speed, when the transfer actually happens is analogous to the link speed. When most people consider disks, they only consider transfer speed and not when the transfer was actually initiated.

In a network router, packets arrive, but they aren't always forwarded immediately, even if the outbound link is clear. Depending on router policy, the router may delay the packet for a bit, hoping that some more packets will arrive from other sources [or from the same source if UDP], so the router can aggregate the smaller packets into a large one that can be transmitted on the outbound link more efficiently.

For disks, this "delay" could be characterized by a given FS layer's cache policy. In other words, if a request arrives at a layer at time T, instead of it departing the layer at T+1 and arriving at the next layer at T+1, it could depart/arrive at T+n. An FS cache layer might do this, so that it can do seek order optimization/sorting.

The behavior you're seeing is very similar to a TCP socket that reduced its window because of congestion.

I think it's important to split up the testing. Right now, you're doing read and write. And, you don't know which is the limiting factor/bottleneck. I think it would be helpful to split up the tests into read or write. A decent benchmark program will probably do this. What I'm advocating is a more sophisticated version of [these are just rough examples, not the exact arguments to use]:

For write, time dd if=/dev/zero of=/whatever_file count=64g
For read, time dd if=/whatever of=/dev/null count=64g

The reason for 64GB is that's 2x your physical ram and eliminates block cache effects. Do the sync command between tests.

Apply this on local FS and repeat on NFS.

Also, do the read test on each of /dev/{sda,sdb,sdc,sdd}

Do iostat during these tests.

Note that doing the read test on the physical raw disk gives you a baseline/maximum for how fast the hardware can actually go. The raw device reads should approximate the maximum capabilities of the transfer specs of your drives. Expected write speed should be similar for a hard disk. If not, why not? All disks should test at about the same speed. What I'm going for here is the reason for why only two disks are maxed out in your previous tests.

Doing the math, with 32GB and assuming a maximum transfer speed of 600MB/sec, it would take a minimum of 50 seconds to fill/flush that. So, what is the park timeout set to?

Also, you can vary things a bit by reducing the amount of physical ram the kernel will allow via the mem= boot parameter. Try something like mem=8g to see what effect it has. There are also some /proc entries that can adjust the block layer cache flush policy.

Also, my FSes are ext4 and mounted with noatime. You may want to consider zfs set atime=off ...

Also, watch the system log. Sometimes, a drive reports a sense error and the system reconfigures it to use a lower transfer speed.

Also, take a look at the SMART data for the drives. Do you see anything unusual? Excessive soft retries on a given drive (e.g.).

Like I've said, the local disk performance is much less than I expect. I think that problem needs to solved first before tackling the entire system with NFS. If the raid disks all had balanced utilization and was in the ballpark, I'd be less concerned about it.

My system [which also has WDC disks] is not configured for NFS (I use rsync a lot). I've got some pressing stuff I have to do for next 1-2 days. After that, I'll have the time to try it [I'd be curious myself].

UPDATE #2:

Good catch on the ZFS unbalance issue. This helps explain my "issue #1". It might explain NFS's flakiness as well if the rebalance operations somehow confused NFS with regard to latency/timing, causing the "TCP window/backoff" behavior--not super high probability but a possibility none the less.

With rsync testing no need/desire to use NFS. If you can ssh into the server, rsync and NFS are redundant. With NFS, just use cp, etc. To do rsync, go directly to the underlying ZFS via ssh. This will work even without an NFS mount [here's the rsync config I use]:

export RSYNC_SSH="/usr/bin/ssh"
export SSH_NOCOMPRESS=1
rsync /wherever1 server:/zfsmount/whatever

Doing this localhost or bonded may get the performance into what you expect (sans the ZFS unbalance issue). If so, it clearly narrows the problem to NFS itself.

I've perused some of the kernel source for NFS. From what little I looked at, I didn't like what I saw regarding timeliness. NFS started back in 80's when links were slow, so it [still] has a lot of code to try to conserve NIC bandwidth. That is, only "commit" [to] an action when absolutely necessary. Not necessarily what we want. In my fanciful network router policy analogy, NFS's cache would seem to be the one with the "T+n" delay.

I'd recommend doing whatever you can to disable NFS's cache and have it pass its request to ZFS ASAP. Let ZFS be the smart one and NFS be the "dumb pipe". NFS caching can only be generic in nature (e.g. it won't even know that the backing store is a RAID or too much about the special characteristics of the base FS it's mounted on). ZFS has intimate knowledge of the RAID and the disks that compose it. Thus, ZFS's cache can be much more intelligent about the choices.

I'd say try to get NFS to do a sync mount--that might do the trick. Also, I saw something about noatime, so turn on that option as well. There may be other NFS tuning/mount options. Hopefully, if NFS is the usual suspect, it can be reconfigured to work well enough.

If, on the other hand, no option brings NFS to heel, would rsync over ssh be a viable alternative? What is the actual use case? It seems that you're using NFS as a conduit for large bulk transfers that need high performance (vs. [say] just as an automount point for user home directories). Is this for things like client backup to server, etc?

Ok, I'm going through all of that. But to start, I think I adjusted the idle3 thing a long time ago. Regardless, the drives Load_Cycle_Count are as follows: 24, 59, 35, 191. I have no idea why only 2 disks seem to be under load, so thanks for your comparison. That was a file copy on the same array correct? I really should have checked that out a long time ago with the zfs mailing list, however I have the same performance issue with a Linux MD raid array as well. I'm going to try and rule that issue out now. — Ryan Babchishin, Oct 13 '15 at 14:05
While trying to rule out the array as a performance problem, I formatted an SSD with ext4 and mounted over NFS. When testing it locally on the server, I was surprised to get < 1MB/sec and up to 150MB/sec doing a file copy. Total throughput worked out to be 62.88MB/sec. Over NFS, it was slightly less. I checked the client computer, which also has an SSD and the results are the same, 54MB/sec. Not really related, but do SSDs normally perform this way? I removed log/cache from pool2 and got 58MB/sec copy over NFS anyways. Local, I get 242MB/sec, which might actually be better than before. — Ryan Babchishin, Oct 13 '15 at 14:35
I never thought about connecting to NFS over loopback, so I did. Not much better. The speed bounced between 17-130MB/sec, averaging to 73MB/sec. Are you able to test yours over NFS to compare? — Ryan Babchishin, Oct 13 '15 at 15:03
Idle3 timer set to 138 (0x8a) is set on sdb and sdf, sda and sdd are disabled. So I disabled it on all drives. Retested with rsync over NFS and the performance is the same. — Ryan Babchishin, Oct 15 '15 at 01:42
212MB/sec write 64GB with dd, 297MB/sec read local on the server — Ryan Babchishin, Oct 15 '15 at 01:43
I've added the individual disk speeds to the original question — Ryan Babchishin, Oct 15 '15 at 04:25
@RyanBabchishin Thanks for the +50--I hope I've _truly_ earned it. I've just updated my answer, based on your latest data. suggestion on rsync testing and NFS config as well as a new question. I'd continue to keep SSD off [keeps things simpler during test]. How long to finish the rebalance? I'll continue to be around until your problem is solved. — Craig Estey, Oct 15 '15 at 06:47
Yes thanks. You put a ton of effort into helping me. I've rebalanced and verified the load is balanced now. I had to copy everything off and recreate the pool. It took a long time. New speeds: 76.25MB/s copy over NFS, 105MB/sec local. I still need to do lots more testing to make sure those numbers are accurate. — Ryan Babchishin, Oct 18 '15 at 01:22
That is confirmed, 75.19MB/sec rewrite/copy according to bonnie++, pv and rsync. I'm going to look some more into tweaking the zpool performance. I'm wondering how to tell if my SATA controller is fast enough for all these disks. You're right, they aren't as fast as I would expect. — Ryan Babchishin, Oct 20 '15 at 05:36
Use case: The server provides mostly large video files to clients over NFS and Samba, on a gigabit network (and now bonded for one system). Sometimes these files need to be copied by users from one directory to another. That is where it gets painful. The idea is this can be doing via the GUI of the client OS (Windows, Linux). — Ryan Babchishin, Oct 20 '15 at 05:42

Slow copying between NFS/CIFS directories on same server

1 Answers1

Linked