
I'm in charge of downloading and processing large amounts of financial data. Each trading day, we have to add around 100GB.

To handle this amount of data, we rent a virtual server (3 cores, 12 GB RAM) and a 30 TB block device from our university's data center.

On the virtual machine I installed Ubuntu 16.04 and ZFS on Linux, then created a ZFS pool on the 30 TB block device. The main reason for using ZFS is its compression, as the data compresses nicely (to roughly 10% of its original size). Please don't be too hard on me for not following the golden rule that ZFS wants to see bare metal; I am forced to use the infrastructure as it is.
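
For reference, the pool was originally created without an explicit ashift, roughly like this (a reconstruction; the exact options may have differed):

zpool create tank /dev/disk/by-id/scsi-3600140519581e55ec004cbb80c32784d
zfs set compression=on tank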

The reason for posting is that I am seeing poor write speeds. The server reads from the block device at about 50 MB/s, but writes are painfully slow at about 2-4 MB/s.
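
A simple sequential test along the following lines (paths and file names are placeholders, not the actual data) is enough to show the asymmetry:

# sequential read from the pool
dd if=/tank/data/somefile.zip of=/dev/null bs=1M
# sequential write to the pool
dd if=/staging/somefile.zip of=/tank/data/copy.zip bs=1M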

Here is some information on the pool and the dataset:

zdb

tank:
version: 5000
name: 'tank'
state: 0
txg: 872307
pool_guid: 8319810251081423408
errata: 0
hostname: 'TAQ-Server'
vdev_children: 1
vdev_tree:
    type: 'root'
    id: 0
    guid: 8319810251081423408
    children[0]:
        type: 'disk'
        id: 0
        guid: 13934768780705769781
        path: '/dev/disk/by-id/scsi-3600140519581e55ec004cbb80c32784d-part1'
        phys_path: '/iscsi/disk@0000iqn.2015-02.de.uni-konstanz.bigdisk%3Asn.606f4c46fd740001,0:a'
        whole_disk: 1
        metaslab_array: 30
        metaslab_shift: 38
        ashift: 9
        asize: 34909494181888
        is_log: 0
        DTL: 126
        create_txg: 4
features_for_read:
    com.delphix:hole_birth
    com.delphix:embedded_data

zpool get all

NAME  PROPERTY                    VALUE                       SOURCE
tank  size                        31,8T                       -
tank  capacity                    33%                         -
tank  altroot                     -                           default
tank  health                      ONLINE                      -
tank  guid                        8319810251081423408         default
tank  version                     -                           default
tank  bootfs                      -                           default
tank  delegation                  on                          default
tank  autoreplace                 off                         default
tank  cachefile                   -                           default
tank  failmode                    wait                        default
tank  listsnapshots               off                         default
tank  autoexpand                  off                         default
tank  dedupditto                  0                           default
tank  dedupratio                  1.00x                       -
tank  free                        21,1T                       -
tank  allocated                   10,6T                       -
tank  readonly                    off                         -
tank  ashift                      0                           default
tank  comment                     -                           default
tank  expandsize                  255G                        -
tank  freeing                     0                           default
tank  fragmentation               12%                         -
tank  leaked                      0                           default
tank  feature@async_destroy       enabled                     local
tank  feature@empty_bpobj         active                      local
tank  feature@lz4_compress        active                      local
tank  feature@spacemap_histogram  active                      local
tank  feature@enabled_txg         active                      local
tank  feature@hole_birth          active                      local
tank  feature@extensible_dataset  enabled                     local
tank  feature@embedded_data       active                      local
tank  feature@bookmarks           enabled                     local
tank  feature@filesystem_limits   enabled                     local
tank  feature@large_blocks        enabled                     local

zfs get all tank/test

NAME       PROPERTY               VALUE                  SOURCE
tank/test  type                   filesystem             -
tank/test  creation               Do Jul 21 10:04 2016   -
tank/test  used                   19K                    -
tank/test  available              17,0T                  -
tank/test  referenced             19K                    -
tank/test  compressratio          1.00x                  -
tank/test  mounted                yes                    -
tank/test  quota                  none                   default
tank/test  reservation            none                   default
tank/test  recordsize             128K                   default
tank/test  mountpoint             /tank/test             inherited from tank
tank/test  sharenfs               off                    default
tank/test  checksum               on                     default
tank/test  compression            off                    default
tank/test  atime                  off                    local
tank/test  devices                on                     default
tank/test  exec                   on                     default
tank/test  setuid                 on                     default
tank/test  readonly               off                    default
tank/test  zoned                  off                    default
tank/test  snapdir                hidden                 default
tank/test  aclinherit             restricted             default
tank/test  canmount               on                     default
tank/test  xattr                  on                     default
tank/test  copies                 1                      default
tank/test  version                5                      -
tank/test  utf8only               off                    -
tank/test  normalization          none                   -
tank/test  casesensitivity        mixed                  -
tank/test  vscan                  off                    default
tank/test  nbmand                 off                    default
tank/test  sharesmb               off                    default
tank/test  refquota               none                   default
tank/test  refreservation         none                   default
tank/test  primarycache           all                    default
tank/test  secondarycache         all                    default
tank/test  usedbysnapshots        0                      -
tank/test  usedbydataset          19K                    -
tank/test  usedbychildren         0                      -
tank/test  usedbyrefreservation   0                      -
tank/test  logbias                latency                default
tank/test  dedup                  off                    default
tank/test  mlslabel               none                   default
tank/test  sync                   disabled               local
tank/test  refcompressratio       1.00x                  -
tank/test  written                19K                    -
tank/test  logicalused            9,50K                  -
tank/test  logicalreferenced      9,50K                  -
tank/test  filesystem_limit       none                   default
tank/test  snapshot_limit         none                   default
tank/test  filesystem_count       none                   default
tank/test  snapshot_count         none                   default
tank/test  snapdev                hidden                 default
tank/test  acltype                off                    default
tank/test  context                none                   default
tank/test  fscontext              none                   default
tank/test  defcontext             none                   default
tank/test  rootcontext            none                   default
tank/test  relatime               off                    default
tank/test  redundant_metadata     all                    default
tank/test  overlay                off                    default
tank/test  com.sun:auto-snapshot  true                   inherited from tank

Can you give me a hint as to what I could do to improve the write speeds?

Update 1

After your comments about the storage system, I went to the IT department. The guy there told me that the logical block size the exported device reports is actually 512 B.

This is the output of dmesg:

[    8.948835] sd 3:0:0:0: [sdb] 68717412272 512-byte logical blocks: (35.2 TB/32.0 TiB)
[    8.948839] sd 3:0:0:0: [sdb] 4096-byte physical blocks
[    8.950145] sd 3:0:0:0: [sdb] Write Protect is off
[    8.950149] sd 3:0:0:0: [sdb] Mode Sense: 43 00 10 08
[    8.950731] sd 3:0:0:0: [sdb] Write cache: enabled, read cache: enabled, supports DPO and FUA
[    8.985168]  sdb: sdb1 sdb9
[    8.987957] sd 3:0:0:0: [sdb] Attached SCSI disk

So 512 B logical blocks but 4096 B physical block?!
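
The reported sizes can also be queried directly, for example:

# logical and physical sector size as seen by the kernel
blockdev --getss --getpbsz /dev/sdb
# or, as one table
lsblk -o NAME,LOG-SEC,PHY-SEC /dev/sdb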

They are providing me with a temporary file system to which I can back up the data. Then I will first test the speed on the raw device before setting up the pool from scratch. I will post an update.
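
The raw-device test will be something like the following (the write test destroys all data on the device, hence the backup first; sizes are just examples):

# sequential read from the raw block device
dd if=/dev/sdb of=/dev/null bs=1M count=10240
# sequential write to the raw block device (destructive!)
dd if=/dev/zero of=/dev/sdb bs=1M count=10240 oflag=direct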

Update 2

I destroyed the original pool. Then I ran some speed tests using dd; the results are OK, around 80 MB/s in both directions.

As a further check, I created an ext4 partition on the device. I copied a large zip file to this ext4 partition; the average write speed is around 40 MB/s. Not great, but sufficient for my purposes.
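
The ext4 check was nothing fancy, roughly along these lines (device and file names are examples):

mkfs.ext4 /dev/sdb1
mount /dev/sdb1 /mnt/ext4test
cp /backup/large_file.zip /mnt/ext4test/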

I continued by creating a new storage pool with the following commands:

zpool create -o ashift=12 tank /dev/disk/by-id/scsi-3600140519581e55ec004cbb80c32784d
zfs set compression=on tank
zfs set atime=off tank
zfs create tank/test

Then I again copied a zip file to the newly created test file system. The write speed is poor, just around 2-5 MB/s.
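
One way to watch what the pool is doing during such a copy (pool name as above):

# per-vdev throughput, refreshed every second
zpool iostat -v tank 1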

Any ideas?

Update 3

txg_sync is blocked when I copy the files. I opened a ticket on the ZoL GitHub repository.
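
For reference, the blockage shows up as a hung-task warning in the kernel log; something along these lines makes it visible (the kstat path may differ between ZoL versions):

# hung-task messages mentioning txg_sync
dmesg | grep -A 5 txg_sync
# transaction group history kept by ZFS on Linux
cat /proc/spl/kstat/zfs/tank/txgs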

BayerSe
  • Do we know anything about how the storage device is connected to the VM? Also, you don't appear to have compression enabled. – ewwhite Jul 21 '16 at 12:03
  • They say it is 10GbE. On the test file system I disabled compression on purpose so as not to be CPU-bound. The results are approximately the same, however, no matter whether compression is enabled or not. – BayerSe Jul 21 '16 at 12:07
  • Network throughput would only be of concern if you do not get more than 110 MB/s, which is far beyond your current speed. You need to ask them about the kind of storage subsystem, the maximum, average and minimum expected performance for random and sequential access, and the blocksize on which it is aligned. – user121391 Jul 21 '16 at 12:12
  • What's the raw disk write performance? Can you test that? Because if the raw disk can't meet your performance requirements, there's no file system in the universe that will save you. – Andrew Henle Jul 21 '16 at 13:28
  • @AndrewHenle In the IT department they tested the read speed of the raw disk using `dd`. It is about 90 MB/s (as opposed to about 40-50 MB/s on the file system). I'll add write speed results. – BayerSe Jul 21 '16 at 14:46
  • @BayerSe `dd` testing will test *sequential* read/write performance. Sequential operations like that are often coalesced into large blocks to/from the actual disk(s) via caching and the use of either read-ahead or write-behind. File system access can be extremely random and in small blocks, which doesn't lend itself to caching or read-ahead. A disk system can give good large-block sequential performance while still having abysmal random, small-block performance - especially *write* performance. `dd` testing is an easy *start*, because if it's poor, everything else will also be poor. – Andrew Henle Jul 21 '16 at 15:05
  • "So 512 B logical blocks but 4096 B physical block?!" That is (was) not that uncommon - newer disks used 4k-byte sectors internally, but presented 512 bytes to the operating system, known as "4k/512e" ("4k emulated") as opposed to the older 512/512 ("512 native") or the newer 4k/4k ("4k native"). – user121391 Jul 21 '16 at 16:07
  • Any progress on this? I have the same issue on arch/armv7. Somehow it seems neither CPU bound (frequency governor does not scale up) nor IO bound (the same crappy 4 M/s write speed for both an hdd as well as an emmc-backed loop device). Is your Ubuntu guest 32 bits or 64 bits? (What does `uname -a` say?) – not-a-user Sep 20 '17 at 08:54
  • @not-a-user `Linux TAQ-Server 4.4.0-92-generic #115-Ubuntu SMP Thu Aug 10 09:04:33 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux` is the output. I never directly solved the problem, but after re-creating the block device and applying some settings discussed here: http://list.zfsonlinux.org/pipermail/zfs-discuss/2016-July/025979.html, the problem vanished. – BayerSe Sep 20 '17 at 09:48
  • @BayerSe So you are experiencing acceptable write speeds in the same environment currently? Would you mind sharing your now-working settings? – not-a-user Sep 20 '17 at 14:49
  • @not-a-user This is what I used in the end: https://gist.github.com/BayerSe/393b4664d42b85ade63660fb1f357482 – BayerSe Sep 20 '17 at 15:53
  • For 30 TB, ZFS needs FAR more RAM than 12 GB. You should be at 48 GB minimum. The math: 8 GB baseline + 30 GB (1 GB per managed TB) = 38 GB, but 38 isn't a sensible size, so your next stop is 48 GB. https://serverfault.com/questions/569354/freenas-do-i-need-1gb-per-tb-of-usable-storage-or-1gb-of-memory-per-tb-of-phys – Rob Pearson Jun 19 '18 at 16:18

1 Answer


You have ashift=0 set on the pool, which causes slow write speeds when the underlying drives use 4096-byte sectors. Without a correct ashift, ZFS doesn't properly align writes to sector boundaries, so the hard disks need to read-modify-write whole 4096-byte sectors whenever ZFS writes 512-byte blocks.

Use ashift=12 to make ZFS align writes to 4096-byte sectors.

You also need to check that the alignment of your partition is correct with respect to the actual hard disk in use.
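
For example (device path taken from the zdb output in the question; the pool has to be destroyed and recreated, since ashift cannot be changed on an existing vdev):

# check what the device reports
lsblk -o NAME,LOG-SEC,PHY-SEC
# recreate the pool with 4096-byte alignment
zpool create -o ashift=12 tank /dev/disk/by-id/scsi-3600140519581e55ec004cbb80c32784d
# verify which ashift the vdev actually uses
zdb | grep ashift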

Tero Kilkanen
  • The storage is abstracted. It's probably an export from a SAN. The ashift may not make a difference here. – ewwhite Jul 21 '16 at 12:14
  • I'm confused. The `zdb` command says `ashift=0`, `zpool get all` says it's `9`. What is the correct value? And what could I ask the IT guys to figure out whether `ashift=12` would be the correct value? – BayerSe Jul 21 '16 at 12:30
  • Actually `zdb` tells that the iSCSI device has `ashift=9` and `zpool get all` says it is `0`. I don't actually know what the minimum write block used is when ashift is 0. You can try both `ashift=9` and `ashift=12`. You need to ask what the minimum block size is that the storage system can write without triggering read-modify-write. – Tero Kilkanen Jul 21 '16 at 12:58
  • No real answers are possible without detailed knowledge of what that iSCSI disk actually is. For all we know, it's a 37-disk RAID5 array of mixed 5400 and 7200 RPM SATA drives with a per-disk segment size of 1 MB that's been partitioned into 137 LUNs that are all utterly misaligned. If something like that is true (and I've seen incompetent SAN setups like that all too often), the OP's task is likely hopeless. If the system is still in test and the ZFS file system can be safely destroyed, raw disk write performance using something like Bonnie or even just `dd` would be a good data point to have. – Andrew Henle Jul 21 '16 at 13:25
  • @BayerSe `ashift=0` simply means "try to autodetect". Unfortunately, I think the default when autodetection fails is 9, whereas with modern disks, 12 would probably be better unless 9 is known to be a good value. I have gotten into the habit of always specifying ashift explicitly unless it's a pool where I *really* don't care about performance at all. – user Jul 22 '16 at 12:54