
Sorry for the title, but it's a very short summary of the BS that I'm looking into.

Situation

I'm using ZoL 2.1.5 (from jonathonf's PPA) on Ubuntu (tried both 20.04 and 22.04).

I have the following NVMe disks:

  • Kingston KC2500 1TB (/dev/nvme0n1), formatted with 512-byte LBAs (nvme format -l 0)
  • Samsung 983 DCT M.2 960GB (/dev/nvme6n1), formatted with 512-byte LBAs (nvme format -l 0); a quick check of the active LBA size is sketched below the list
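
For reference, a quick way to confirm which LBA size a namespace is actually using (assuming nvme-cli is installed; the output format can differ slightly between versions):

nvme id-ns -H /dev/nvme0n1 | grep "LBA Format"
# the entry marked "(in use)" shows the active data size, e.g. 512 or 4096 bytes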

The pastebin linked below contains all the commands; here is a short summary of the output:

RAW device:

fio -name=rndw8k32 -ioengine=libaio -direct=1 -buffered=0 -invalidate=1 -filesize=30G -numjobs=1 -bs=8k -iodepth=32 -rw=randwrite -filename=/dev/nvme0n1
WRITE: bw=1600MiB/s (1678MB/s), 1600MiB/s-1600MiB/s (1678MB/s-1678MB/s), io=30.0GiB (32.2GB), run=19202-19202msec

fio -name=rndw8k32 -ioengine=libaio -direct=1 -buffered=0 -invalidate=1 -filesize=30G -numjobs=1 -bs=8k -iodepth=32 -rw=randwrite -filename=/dev/nvme6n1
WRITE: bw=1180MiB/s (1237MB/s), 1180MiB/s-1180MiB/s (1237MB/s-1237MB/s), io=30.0GiB (32.2GB), run=26031-26031msec

Now let's create a single-disk stripe pool out of the first disk:

zpool create -o ashift=9 -O compression=lz4 -O atime=off -O recordsize=64k nvme /dev/nvme0n1
fio -name=rndw8k32 -ioengine=libaio -direct=1 -buffered=0 -invalidate=1 -filesize=30G -numjobs=1 -bs=8k -iodepth=32 -rw=randwrite -filename=/nvme/temp.tmp
WRITE: bw=147MiB/s (154MB/s), 147MiB/s-147MiB/s (154MB/s-154MB/s), io=30.0GiB (32.2GB), run=209618-209618msec
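
While a run like this is going, it may be worth watching the request sizes ZFS actually issues: with recordsize=64k, each random 8k write presumably has to read-modify-write a whole 64k record, so the on-disk I/O is much larger than what fio submits. A sketch of how to watch that (zpool iostat -r is the request-size histogram in OpenZFS 2.x; the exact column layout may vary):

zpool iostat -r nvme 1
# prints a request-size histogram every second; with bs=8k into 64k records,
# writes should mostly land in the 64K bucket rather than the 8K one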

OK, maybe the record size is to blame:

zpool create -o ashift=9 -O compression=lz4 -O atime=off -O recordsize=8k nvme /dev/nvme0n1
fio -name=rndw8k32 -ioengine=libaio -direct=1 -buffered=0 -invalidate=1 -filesize=30G -numjobs=1 -bs=8k -iodepth=32 -rw=randwrite -filename=/nvme/temp.tmp
WRITE: bw=349MiB/s (366MB/s), 349MiB/s-349MiB/s (366MB/s-366MB/s), io=30.0GiB (32.2GB), run=87922-87922msec
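
To rule out the pool simply not having the properties I think it has, a quick sanity check (standard OpenZFS commands; the zdb output depends on the pool cache file being present):

zfs get recordsize,compression,atime nvme
zdb -C nvme | grep ashift
# confirms the dataset properties took effect and shows the ashift the vdev was created with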

What the actual hell? The picture is the same on the 2nd NVMe. If I use recordsize=64k and fio bs=64k, I get normal speed. If I use recordsize=64k and fio bs=8k, I get bullshit speed. If I use recordsize=8k and fio bs=8k, I also get bullshit speed.

https://pastebin.com/0RH6gLM9

Maybe the problem is that I'm using a file and comparing a file against a raw device? Well, ext4 gives me:

For 8k blocks:

fio -name=rndw8k32 -ioengine=libaio -direct=1 -buffered=0 -invalidate=1 -filesize=30G -numjobs=1 -bs=8k -iodepth=32 -rw=randwrite -filename=/mnt/temp.tmp
WRITE: bw=569MiB/s (597MB/s), 569MiB/s-569MiB/s (597MB/s-597MB/s), io=30.0GiB (32.2GB), run=53989-53989msec

For 64k blocks:

fio -name=rndw8k32 -ioengine=libaio -direct=1 -buffered=0 -invalidate=1 -filesize=30G -numjobs=1 -bs=64k -iodepth=32 -rw=randwrite -filename=/mnt/tmp.tmp
WRITE: bw=2137MiB/s (2241MB/s), 2137MiB/s-2137MiB/s (2241MB/s-2241MB/s), io=30.0GiB (32.2GB), run=14373-14373msec
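
For reproducibility, a minimal sketch of setting up an ext4 comparison filesystem like the one above (default mkfs options; the mount point /mnt is assumed to match the fio path):

mkfs.ext4 -F /dev/nvme0n1   # default options; -F because the device was used before
mount /dev/nvme0n1 /mnt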

Just in case, I have also tested it after reformatting the NVMe with

nvme format /dev/nvme0n1 -l 1

and using ashift=12 with recordsize=8k (fio bs=8k) gives me:

zpool create -o ashift=12 -O compression=lz4 -O atime=off -O recordsize=8k nvme /dev/nvme0n1 -f
fio -name=rndw8k32 -ioengine=libaio -direct=1 -buffered=0 -invalidate=1 -filesize=30G -numjobs=1 -bs=8k -iodepth=32 -rw=randwrite -filename=/nvme/temp.tmp
WRITE: bw=192MiB/s (202MB/s), 192MiB/s-192MiB/s (202MB/s-202MB/s), io=30.0GiB (32.2GB), run=159853-159853msec

and using ashift=12 with recordsize=64k (fio bs still 8k) gives me:

zpool create -o ashift=12 -O compression=lz4 -O atime=off -O recordsize=64k nvme /dev/nvme0n1 -f
fio -name=rndw8k32 -ioengine=libaio -direct=1 -buffered=0 -invalidate=1 -filesize=30G -numjobs=1 -bs=8k -iodepth=32 -rw=randwrite -filename=/nvme/temp.tmp    
WRITE: bw=495MiB/s (519MB/s), 495MiB/s-495MiB/s (519MB/s-519MB/s), io=30.0GiB (32.2GB), run=62035-62035msec
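
For completeness, the matched case mentioned earlier (recordsize=64k with fio bs=64k, which gives normal speed) is the same command with only the block size changed:

fio -name=rndw8k32 -ioengine=libaio -direct=1 -buffered=0 -invalidate=1 -filesize=30G -numjobs=1 -bs=64k -iodepth=32 -rw=randwrite -filename=/nvme/temp.tmp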

details: https://pastebin.com/GDGgSMmR

So, what am I missing in my tests? How come ZFS makes my NVMe THAT much slower? Just in case: the whole NVMe was zeroed before the tests (about a day prior).

  • Zeroing an SSD? An SSD is not an HDD; zeroing is useless. Perform a full TRIM or a secure erase to reset it, but zeroing may have the opposite effect. AFAIK flash memory is internally organized in much larger sections than 4 or 8 KB, therefore the performance drop is not unexpected. Also, for smaller block sizes the command overhead increases. – Robert Aug 05 '22 at 21:24
  • @Robert "Zeroing an SSD?" Yes, before using nvme format -l 0; not like it's gonna hurt. "AFAIK flash memory is internally organized in much larger sections than 4 or 8 KB" Erm, the native block size of NVMe is 4k, so not really. As for "command overhead increases" – it still doesn't explain that big a difference between raw & ext4 vs. ZFS. – Vladislav Losev Aug 05 '22 at 22:26
  • With ZFS you have high CPU usage (usr=8.72%, sys=91.11%). Maybe the CPU is the limiting factor. Try tests with a zpool without compression and checksumming, then try with recordsize from 4k up to 1M, then with zfs set sync=disabled. ext4 has no data checksumming and has a 4k default block size. – gapsf Aug 06 '22 at 21:14
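
A minimal sketch of the isolation tests gapsf suggests, reusing the same pool and device names as above (standard OpenZFS properties; this is just the suggested matrix written out, not a run I've done yet):

zpool create -f -o ashift=12 -O atime=off nvme /dev/nvme0n1
zfs set compression=off nvme   # rule out lz4 CPU cost
zfs set checksum=off nvme      # rule out checksumming CPU cost
zfs set sync=disabled nvme     # rule out sync/ZIL overhead
zfs set recordsize=8k nvme     # then repeat with 16k, 64k, 128k, 1M
fio -name=rndw8k32 -ioengine=libaio -direct=1 -buffered=0 -invalidate=1 -filesize=30G -numjobs=1 -bs=8k -iodepth=32 -rw=randwrite -filename=/nvme/temp.tmp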
