
I have a related question about this problem, but it got too complicated and too big, so I decided I should split up the issue into NFS and local issues. I have also tried asking about this on the zfs-discuss mailing list without much success.

Slow copying between NFS/CIFS directories on same server

Outline: How I'm set up and what I'm expecting

  1. I have a ZFS pool with 4 disks: 2TB WD REDs configured as 2 mirrors that are striped (RAID 10), running on Linux with zfsonlinux. There are no cache or log devices.
  2. Data is balanced across mirrors (important for ZFS)
  3. Each disk can read (raw, with dd) at 147MB/sec in parallel, giving a combined throughput of 588MB/sec (see the sketch after this list).
  4. I expect about 115MB/sec write, 138MB/sec read and 50MB/sec rewrite of sequential data from each disk, based on benchmarks of a similar 4TB RED disk. I expect no less than 100MB/sec read or write, since any disk can do that these days.
  5. I thought I'd see 100% IO utilization on all 4 disks under a sequential read or write load, and that each disk would be putting out over 100MB/sec at that utilization.
  6. I thought the pool would give me around 2x write, 2x rewrite, and 4x read performance over a single disk - am I wrong?
  7. NEW: I thought an ext4-formatted zvol on the same pool would be about the same speed as a ZFS dataset
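
For point 3, a raw parallel read test with dd looks roughly like the following. The device names are illustrative (they happen to match the iostat output further down); in my case each dd reports about 147MB/sec:

for d in sda sdb sdd sdf; do
    dd if=/dev/$d of=/dev/null bs=1M count=4096 &   # read 4GiB from each raw disk
done
wait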

What I actually get

I find the read performance of the pool is not nearly as high as I expected

bonnie++ benchmark on pool from a few days ago

Version  1.97       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
igor            63G    99  99 232132  47 118787  27   336  97 257072  22  92.7   6

bonnie++ on a single 4TB RED drive on its own in a zpool

Version  1.97       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
igor            63G   101  99 115288  30 49781  14   326  97 138250  13 111.6   8

Going by this, the write and rewrite speeds are about what I'd expect based on the results from the single 4TB RED drive (they are roughly double). However, I was expecting a read speed of about 550MB/sec (4x the speed of the 4TB drive) and would at least hope for around 400MB/sec. Instead I am seeing around 260MB/sec.
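
For anyone wanting to reproduce this, a plain bonnie++ run along these lines produces output in the format shown above (point -d at whatever dataset you are testing; -u is only needed when running as root):

bonnie++ -d /pool2/test -s 63g -u root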

bonnie++ on the pool from just now, while gathering the information below. The numbers are not quite the same as before, even though nothing has changed.

Version  1.97       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
igor            63G   103  99 207518  43 108810  24   342  98 302350  26 256.4  18

zpool iostat during write. Seems OK to me.

                                                 capacity     operations    bandwidth
pool                                          alloc   free   read  write   read  write
--------------------------------------------  -----  -----  -----  -----  -----  -----
pool2                                         1.23T  2.39T      0  1.89K  1.60K   238M
  mirror                                       631G  1.20T      0    979  1.60K   120M
    ata-WDC_WD20EFRX-68AX9N0_WD-WMC300004469      -      -      0   1007  1.60K   124M
    ata-WDC_WD20EFRX-68EUZN0_WD-WCC4MLK57MVX      -      -      0    975      0   120M
  mirror                                       631G  1.20T      0    953      0   117M
    ata-WDC_WD20EFRX-68AX9N0_WD-WCC1T0429536      -      -      0  1.01K      0   128M
    ata-WDC_WD20EFRX-68EUZN0_WD-WCC4M0VYKFCE      -      -      0    953      0   117M

zpool iostat during rewrite. Seems ok to me, I think.

                                                 capacity     operations    bandwidth
pool                                          alloc   free   read  write   read  write
--------------------------------------------  -----  -----  -----  -----  -----  -----
pool2                                         1.27T  2.35T   1015    923   125M   101M
  mirror                                       651G  1.18T    505    465  62.2M  51.8M
    ata-WDC_WD20EFRX-68AX9N0_WD-WMC300004469      -      -    198    438  24.4M  51.7M
    ata-WDC_WD20EFRX-68EUZN0_WD-WCC4MLK57MVX      -      -    306    384  37.8M  45.1M
  mirror                                       651G  1.18T    510    457  63.2M  49.6M
    ata-WDC_WD20EFRX-68AX9N0_WD-WCC1T0429536      -      -    304    371  37.8M  43.3M
    ata-WDC_WD20EFRX-68EUZN0_WD-WCC4M0VYKFCE      -      -    206    423  25.5M  49.6M

This is where I wonder what's going on

zpool iostat during read

                                                 capacity     operations    bandwidth
pool                                          alloc   free   read  write   read  write
--------------------------------------------  -----  -----  -----  -----  -----  -----
pool2                                         1.27T  2.35T  2.68K     32   339M   141K
  mirror                                       651G  1.18T  1.34K     20   169M  90.0K
    ata-WDC_WD20EFRX-68AX9N0_WD-WMC300004469      -      -    748      9  92.5M  96.8K
    ata-WDC_WD20EFRX-68EUZN0_WD-WCC4MLK57MVX      -      -    623     10  76.8M  96.8K
  mirror                                       651G  1.18T  1.34K     11   170M  50.8K
    ata-WDC_WD20EFRX-68AX9N0_WD-WCC1T0429536      -      -    774      5  95.7M  56.0K
    ata-WDC_WD20EFRX-68EUZN0_WD-WCC4M0VYKFCE      -      -    599      6  74.0M  56.0K

iostat -x during the same read operation. Note that %util is not at 100%.

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.60     0.00  661.30    6.00 83652.80    49.20   250.87     2.32    3.47    3.46    4.87   1.20  79.76
sdd               0.80     0.00  735.40    5.30 93273.20    49.20   251.98     2.60    3.51    3.51    4.15   1.20  89.04
sdf               0.50     0.00  656.70    3.80 83196.80    31.20   252.02     2.23    3.38    3.36    6.63   1.17  77.12
sda               0.70     0.00  738.30    3.30 93572.00    31.20   252.44     2.45    3.33    3.31    7.03   1.14  84.24
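
For reference, the per-vdev and per-device numbers above are the kind of thing you get by sampling in a second terminal while the benchmark runs, along these lines:

zpool iostat -v pool2 5
iostat -x 5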

zpool and test dataset settings:

  • atime is off
  • compression is off
  • ashift is 0 (autodetect - my understanding was that this was ok)
  • zdb says disks are all ashift=12 (see the checks sketched after this list)
  • module - options zfs zvol_threads=32 zfs_arc_max=17179869184
  • sync = standard
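
In case it helps, these are roughly the checks behind those bullets (the pool name matches mine; the paths are the usual zfsonlinux locations):

zdb | grep ashift                              # reports ashift: 12 for every vdev
zfs get atime,compression,sync pool2
cat /etc/modprobe.d/zfs.conf                   # options zfs zvol_threads=32 zfs_arc_max=17179869184
cat /sys/module/zfs/parameters/zfs_arc_max     # live value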

Edit - Oct 30, 2015

I did some more testing

  • dataset bonnie++ w/recordsize=1M = 226MB write, 392MB read - much better
  • dataset dd w/recordsize=1M = 260MB write, 392MB read - much better (see the sketch after this list)
  • zvol w/ext4 dd bs=1M = 128MB write, 107MB read - why so slow?
  • dataset, 2 processes in parallel = 227MB write, 396MB read
  • dd with direct IO makes no difference on the dataset or on the zvol
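
A minimal sketch of the dd tests those bullets refer to. The file names and the zvol mountpoint (/mnt/zvoltest) are made up; the file is deliberately larger than the 16GiB zfs_arc_max so the read-back mostly misses the ARC, and since compression is off, /dev/zero input is fine here:

# dataset with recordsize=1M
dd if=/dev/zero of=/pool2/test/ddfile bs=1M count=32768 conv=fdatasync
dd if=/pool2/test/ddfile of=/dev/null bs=1M

# same thing against the ext4-formatted zvol
dd if=/dev/zero of=/mnt/zvoltest/ddfile bs=1M count=32768 conv=fdatasync
dd if=/mnt/zvoltest/ddfile of=/dev/null bs=1M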

I am much happier with the performance with the increased recordsize. Almost every file on the pool is well over 1MB, so I'll leave it like that. The disks are still not reaching 100% utilization, which makes me wonder whether it could still be much faster. And now I'm wondering why the zvol performance is so lousy, since that is something I (lightly) use.

I am happy to provide any information requested in the comments/answers. There is also tons of information posted in my other question: Slow copying between NFS/CIFS directories on same server

I am fully aware that I may just not understand something and that this may not be a problem at all. Thanks in advance.

To make it clear, the question is: Why isn't the ZFS pool as fast as I expect? And perhaps is there anything else wrong?

Ryan Babchishin
  • No tuning, I'd suspect... Did you adjust ashift for your disks? Any zfs.conf settings? Is atime on/off? Any weird sync settings? – ewwhite Oct 24 '15 at 03:41
  • @ewwhite I've added some details to the question, thanks – Ryan Babchishin Oct 24 '15 at 03:54
  • See this: http://www.tomshardware.com/reviews/red-wd20efrx-wd30efrx-nas,3248-5.html WD Red drives have abysmal seek times. They stream fine, but under real-world usage they're going to have to seek, and your IO stats show enough IO operations/sec that the seek time is almost certainly impacting your performance. Create a zvol and use `dd` to see what kind of performance you get. You might also want to try direct IO as you are getting up into streaming speeds where the double buffering from caching can impact performance. FWIW, 3/4 of theoretical total raw 4-disk read performance is good. – Andrew Henle Oct 28 '15 at 02:23
  • (ran out of space) You also have enough disks that single-threaded IO operations may not be enough to keep your disks fully busy. That may explain your `%util` numbers. – Andrew Henle Oct 28 '15 at 02:26
  • @AndrewHenle Thank you. That all sounds very reasonable. I'll look into that now. – Ryan Babchishin Oct 28 '15 at 02:28
  • @AndrewHenle I've added some more info to the question. I still wonder if it can go faster. After looking at those benchmarks you found I am pretty disappointed, but not totally convinced that is what is going on. More like, concerned... – Ryan Babchishin Oct 30 '15 at 07:33

3 Answers


I managed to get speeds very close to the numbers I was expecting.

I was looking for 400MB/sec and managed 392MB/sec. So I say that is problem solved. With the later addition of a cache device, I managed 458MB/sec read (cached I believe).

1. At first this was achieved simply by increasing the ZFS dataset recordsize value to 1M:

zfs set recordsize=1M pool2/test

I believe this change just results in less disk activity, thus more efficient large synchronous reads and writes. Exactly what I was asking for.
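
Note that recordsize only applies to blocks written after the change; existing files keep their old record size until they are rewritten. The current setting can be confirmed with:

zfs get recordsize pool2/test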

Results after the change

  • bonnie++ = 226MB write, 392MB read
  • dd = 260MB write, 392MB read
  • 2 processes in parallel = 227MB write, 396MB read

2. I managed even better when I added a cache device (120GB SSD). The write speed is a tad slower; I'm not sure why.

Version  1.97       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
igor            63G           208325  48 129343  28           458513  35 326.8  16

The trick with the cache device was to set l2arc_noprefetch=0 in /etc/modprobe.d/zfs.conf. It allows ZFS to cache streaming/sequential data. Only do this if your cache device is faster than your array, like mine.
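
For completeness, the relevant commands look roughly like this (the SSD device path is just a placeholder):

zpool add pool2 cache /dev/disk/by-id/ata-EXAMPLE-SSD         # attach the 120GB SSD as L2ARC
echo "options zfs l2arc_noprefetch=0" >> /etc/modprobe.d/zfs.conf
echo 0 > /sys/module/zfs/parameters/l2arc_noprefetch          # apply without reloading the module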

After benefiting from the recordsize change on my dataset, I thought a similar approach might fix the poor zvol performance.

I came across several people mentioning that they got good performance using volblocksize=64k, so I tried it, with no luck:

zfs create -b 64k -V 120G pool/volume

But then I read that ext4 (the filesystem I was testing with) supports RAID options like stride and stripe-width, which I had never used before. So I used this site to calculate the settings needed (https://busybox.net/~aldot/mkfs_stride.html) and formatted the zvol again:

mkfs.ext3 -b 4096 -E stride=16,stripe-width=32 /dev/zvol/pool/volume
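
For what it's worth, I believe those numbers fall straight out of the volblocksize: treating 64KiB as the RAID chunk on top of 4KiB ext4 blocks gives stride = 64 / 4 = 16, and counting the two mirror vdevs as the data-bearing disks gives stripe-width = 16 * 2 = 32.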

I ran bonnie++ to do a simple benchmark and the results were excellent. I don't have the results with me unfortunately, but they were at least 5-6x faster for writes as I recall. I'll update this answer again if I benchmark again.

Ryan Babchishin
  • If I could give you an extra +1 for coming back almost a year later and writing such a detailed answer, I would. Thanks! – Jed Daniels Sep 05 '16 at 02:41

ZFS writes aren't really fast, but they're not bad. ZFS reads are extremely slow; see for yourself:

  1. Preparation: create ten 100GB files of random data, then pick a directory path with lots of subdirectories and files (roughly 50GB) and check its size:

     cd /mytestpool/mytestzfs; for f in urf{0..9}; do dd if=/dev/urandom of=$f bs=1M count=102400; done
     du -sh /mytestpool/mytestzfs/appsdir

  2. Reboot, so nothing is cached.

  3. Time a read of the first file, then of the remaining nine files in parallel, then of the directory tree:

     time cat /mytestpool/mytestzfs/urf0 >/dev/null
     date; for f in /mytestpool/mytestzfs/urf{1..9}; do cat $f >/dev/null & done; wait; date
     time tar cf - /mytestpool/mytestzfs/appsdir | cat - >/dev/null

  4. Watch iostat, iotop or zpool iostat while the reads run: there is far too much going on there!

  5. Once the reads are done, take a calculator and divide the single file size by its elapsed seconds, the combined size of the nine files by their elapsed seconds, and the directory size by its elapsed seconds. That is what you get out of ZFS once the disks fill up with more data than you have memory.

zfstester

Your results are perfectly reasonable, while your expectations are not: you overstate the read performance improvement given by RAID1 (and, by extension, by RAID10). The point is that a 2-way mirror gives at most 2x the read speed/IOPS of a single disk, but real-world performance can be anywhere between 1x and 2x.

Let's clarify with an example. Imagine a system with a 2-way mirror, where each disk is capable of 100 MB/s (sequential) and 200 IOPS. With a queue depth of 1 (at most one single outstanding request) this array has no advantage over a single disk: RAID1 splits IO requests across the two disks' queues, but it does not split a single request over two disks (at least, every implementation I have seen behaves this way). On the other hand, if your IO queue is deeper (e.g. 4/8 outstanding requests), total disk throughput will be significantly higher than a single disk's.

A similar point can be made for RAID0, but in this case the average improvement depends not only on queue depth but also on IO request size: if your average IO size is smaller than the chunk size, it will not be striped across two (or more) disks, but will be served by a single one. Your results with the increased bonnie++ recordsize show this exact behavior: striping greatly benefits from a bigger IO size.

It should now be clear that combining the two RAID levels in a RAID10 array will not lead to linear performance scaling, but it does set an upper limit for it. I'm pretty sure that if you run multiple dd/bonnie++ instances (or use fio to directly manipulate the IO queue) you will get results more in line with your original expectations, simply because you will tax your array in a more complete manner (multiple outstanding sequential/random IO requests), rather than loading it with single, sequential IO requests alone.
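
As a rough sketch of what I mean (the directory, sizes and runtime are only indicative), a fio job like this drives eight concurrent sequential read streams against the pool instead of a single one:

fio --name=seqread --directory=/pool2/test --numjobs=8 --size=4G --rw=read --bs=1M --ioengine=psync --runtime=60 --time_based --group_reporting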

shodanshok
  • My expectations were almost identical to what I got - 400MB/sec. I got 392MB/sec. Seems reasonable, very reasonable. I also ran ***multiple dd and bonnie++ processes*** in parallel and saw no performance improvement at all. You have not explained why the zvol performance is so poor, either. – Ryan Babchishin Oct 30 '15 at 09:07
  • You get 392 MB/s only when using bonnie++ with a large recordsize (>= 1MB), and I explained why. EXT4 over ZVOL is a configuration I have never tested, so I leave it for other people to comment on. – shodanshok Oct 30 '15 at 09:26