
I am benchmarking a small server box based on the SuperMicro E300-8D. It's running CentOS 7.5 with all the latest updates, and has 64GB of DDR4-2100 RAM and a Samsung 970 EVO 1TB NVMe SSD. The OS is installed on a USB stick in the internal USB port, so the SSD is entirely unused except during my benchmarking.

The goal of my testing is to find an optimal concurrency level for this SSD, inspired by the benchmarking approach used by ScyllaDB. To that end I'm using diskplorer, which internally uses fio to explore the relationship between concurrency and both IOPS and latency. It produces handy graphs like the ones below. In all cases I'm using a 4K random read workload.
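As I understand it, each point on those graphs corresponds to an fio run at a fixed concurrency level. A roughly equivalent standalone invocation for a single point (a sketch only; diskplorer's exact fio parameters may differ, and --iodepth=20 is just one sample value) would be:

$ sudo fio --name=qd20 --filename=/dev/nvme0n1 --direct=1 --ioengine=libaio --rw=randread --bs=4k --iodepth=20 --runtime=30 --time_based --group_reporting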

The problem is I'm getting results that make no sense. Here's the first result:

Raw /dev/nvme0n1

$ sudo ./diskplorer.py --device=/dev/nvme0n1 --filesize=256G 

Winning!

This is fantastic! Samsung's own spec sheet claims 500K read IOPS, and with 20 concurrent reads I'm getting almost 600K. The axis on the right is read latency in nanoseconds; the red line is mean latency, and the error bars are the 5th and 95th percentile latency. So it looks like the ideal concurrency level for this SSD is about 20 concurrent reads, yielding awesome latency < 100us.

That's just the raw SSD. I'll put XFS on it, which is optimized for async I/O, and I'm sure it won't add any significant overhead...

With new XFS filesystem on /dev/nvme0n1

$ sudo mkfs.xfs /dev/nvme0n1
$ sudo mount /dev/nvme0n1 /mnt
$ sudo ./diskplorer.py --mountpoint=/mnt --filesize=256G 

Whiskey.  Tango.  Foxtrot.

What!? That's awful! It seems XFS has introduced some absurd amount of latency and dramatically reduced IOPS. What could be wrong?

Just in case, reboot the system to clear out the caches, not that caching should be a factor on a brand new filesystem:

XFS on /dev/nvme0n1 after reboot

$ sudo shutdown -r now
(reboot happens)
$ sudo mount /dev/nvme0n1 /mnt
$ sudo ./diskplorer.py --mountpoint=/mnt --filesize=256G 

So much for turning it off and then on again...

No change. It's not cache related.
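As an aside, a full reboot shouldn't be necessary just to rule out the page cache; assuming root privileges, something like this drops it on a running system:

$ sync
$ echo 3 | sudo tee /proc/sys/vm/drop_caches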

At this point there is a valid XFS filesystem on /dev/nvme0n1, and it is mounted at /mnt. I'm going to repeat the very first test against the raw block device, with the filesystem unmounted but its contents left in place.

Raw /dev/nvme0n1 again

$ sudo umount /mnt
$ sudo ./diskplorer.py --device=/dev/nvme0n1 --filesize=256G 

XFS ruined me SSD!!!111oneone

Oh no, XFS ruined my SSD performance! /sarcasm

Clearly it's not the case that XFS has diabolically ruined my SSD's performance, or that XFS is poorly suited for this workload. But what could it be? Even with the filesystem unmounted, so XFS isn't involved at all, performance is still dramatically reduced.

On a hunch, I tried DISCARDing the entire contents of the SSD, which should reset the allocation of cells within the disk to their original state...

Raw /dev/nvme0n1 after blkdiscard

$ sudo blkdiscard /dev/nvme0n1
$ sudo ./diskplorer.py --device=/dev/nvme0n1 --filesize=256G

Miraculously, the performance of my SSD is restored. Has the whole world gone mad?
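For anyone reproducing this: you can confirm the device advertises discard support (non-zero DISC-GRAN/DISC-MAX columns) before trying this, e.g.:

$ sudo lsblk --discard /dev/nvme0n1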

Based on a suggestion from @shodanshok, what happens if I dd zeroes onto the SSD after I have "fixed" it with blkdiscard?

Raw /dev/nvme0n1 after blkdiscard then zeroed with dd

$ sudo blkdiscard /dev/nvme0n1
$ sudo dd if=/dev/zero of=/dev/nvme0n1 bs=1M status=progress oflag=direct
$ sudo ./diskplorer.py --device=/dev/nvme0n1 --filesize=256G 

This is an interesting result, and it confirms my belief that XFS is not to blame here. Just by filling the SSD with zeroes, read latency and throughput have both significantly deteriorated. So it must be that the SSD itself has some optimized read path for unallocated sectors.
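A quick sanity check consistent with that idea (a hypothetical command, not one of the runs above): right after a blkdiscard, and before any writes, reading back a chunk of the device should show nothing but zeroes; xxd -a collapses the repeated all-zero lines:

$ sudo dd if=/dev/nvme0n1 bs=1M count=16 iflag=direct | xxd -a | head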

Hypothesis

Clearly XFS isn't killing my SSD, and even if it were, blkdiscard wouldn't be magically restoring it. I emphasize again that these are all read benchmarks, so issues with write journaling, write amplification, wear leveling, etc. are not applicable.

My theory is that this SSD, and perhaps SSDs in general, have an optimization in the read path which detects a read of an unallocated region of the disk and executes a highly optimized code path that sends all zeros back over the PCIe bus.

My question is, does anyone know if that is correct? If so, are benchmarks of new SSDs without filesystems generally suspect, and is this documented anywhere? If this is not correct, does anyone have any other explanation for these bizarre results?
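If anyone wants to probe this further, here is a sketch of a comparison that should isolate the effect (hypothetical job names, offsets, and sizes; adjust for your drive): discard the drive, write non-zero data to only the first 4GB, then run the same 4K random read test against the written region and against a never-written region:

$ sudo blkdiscard /dev/nvme0n1
$ sudo dd if=/dev/urandom of=/dev/nvme0n1 bs=1M count=4096 oflag=direct
$ sudo fio --name=written --filename=/dev/nvme0n1 --direct=1 --ioengine=libaio --rw=randread --bs=4k --iodepth=20 --offset=0 --size=4G --runtime=30 --time_based
$ sudo fio --name=never-written --filename=/dev/nvme0n1 --direct=1 --ioengine=libaio --rw=randread --bs=4k --iodepth=20 --offset=512G --size=4G --runtime=30 --time_based

If the hypothesis is right, the never-written region should show markedly lower latency and higher IOPS than the written one.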

  • Just a hunch but... I think this is normal. SSDs slow down over time due to wear leveling. Apparently blkdiscard resets it. Can you do the same test but this time just allocate no more than 50% of the drive? – Konrad Gajewski Aug 09 '18 at 02:43
  • Also, you haven't tuned your XFS filesystem... – ewwhite Aug 09 '18 at 04:48
  • Can you retry the benchmark with a full drive but without any filesystem? In other words, try issuing `dd if=/dev/zero of=/dev/nvme0n1 bs=1M oflag=direct` before rerunning the benchmark. If I/O speed degrades, we can rule XFS out. – shodanshok Aug 09 '18 at 07:25
  • @KonradGajewski in this test the drive's capacity was 1TB, and the test file written to the XFS filesystem was 256GB. Add in XFS metadata overhead and it's still less than 300GB used, so disk utilization was already under 50%. Also, I should add that this is a new SSD; other than these benchmarks it's never been used, so the flash cells should be fresh. – anelson Aug 09 '18 at 11:22
  • @ewwhite I perused the [XFS FAQ entry about performance](http://xfs.org/index.php/XFS_FAQ#Q:_I_want_to_tune_my_XFS_filesystems_for_.3Csomething.3E); it seemed to discourage tweaking except in some specific cases. I did read some articles suggesting `noatime`, though the XFS docs say `relatime` is the default now, and in any case when I'm operating on the disk directly, not the mounted XFS filesystem, I don't think any XFS optimizations would be a factor. Is there specific tuning you recommend? – anelson Aug 09 '18 at 11:25
  • @shodanshok great suggestion, I've updated the question with the results, which just confirm XFS is not the cause; any data writes will produce this result. – anelson Aug 09 '18 at 12:28

2 Answers


Most modern SSDs use a page-based mapping table. At first (or after a complete TRIM/UNMAP) the mapping table is empty - i.e. any LBA returns 0, even if the underlying flash page/block is not completely erased and so its actual value differs from a plain 0.

This means that, after a complete blkdiscard, you are not reading from the flash chips themselves; rather, the controller immediately returns 0 for all your reads. This easily explains your findings.

Some older SSDs use different, less efficient but simpler approaches which always read from the NAND chips themselves. On such drives the value of a trimmed page/block is sometimes undefined, because the controller does not simply mark it as "empty" but actually reads from the NAND each time.

Yes, SSDs are more complex beasts than "plain" HDDs: after all, they basically are small, self-contained, thinly provisioned RAID volumes with their own filesystem/volume management, called the FTL (flash translation layer).
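If you want to see what your particular drive claims about this behavior, the NVMe Identify Namespace data includes a DLFEAT field describing how deallocated (trimmed) blocks read back (assuming a reasonably recent nvme-cli that prints this field):

$ sudo nvme id-ns /dev/nvme0n1 | grep -i dlfeat

A DLFEAT value indicating that deallocated blocks read back as zeroes is consistent with the controller answering such reads without touching the NAND at all.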

shodanshok

Just to augment @shodanshok's correct answer:

are benchmarks of new SSDs without filesystems generally suspect, and is this documented anywhere?

Yes, benchmarks on SSDs that haven't been "pre-conditioned" (and benchmarks that only use zero data, and benchmarks that...) are generally suspect. This is documented in a few places.

In general, though, it's never explicitly mentioned that you need to fill SSDs before benchmarking just because reads of data that has "never" been written can be artificially fast, but you could argue that all of these sources assume pre-conditioning.

PS: On Linux, fio knows how to invalidate disk caches for "regular" I/O when run with root permissions, and does so by default (https://fio.readthedocs.io/en/latest/fio_man.html#cmdoption-arg-invalidate).
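For completeness, a minimal pre-conditioning pass before read benchmarking can be as simple as writing the whole device once with non-zero data, e.g. a sketch like the one below (with no size given, fio writes the full block device; formal test specifications typically call for multiple full-capacity passes and a steady-state check):

$ sudo fio --name=precondition --filename=/dev/nvme0n1 --direct=1 --ioengine=libaio --rw=write --bs=1M --iodepth=32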

Anon