
Context: I'm on a Toshiba 512 GB NVMe SSD (model: KXG50ZNV512G).

I'm seeing some weird behaviour while benchmarking Postgres on ZFS-on-Linux (via pgbench): the second and third runs of a benchmark are progressively slower than the first run.

Here is what is happening:

client=1  |  770 =>  697 | 10% reduction in TPS
client=4  | 2717 => 2180 | 24% reduction in TPS
client=8  | 4579 => 3339 | 37% reduction in TPS
client=12 | 4219 => 4175 | 01% reduction in TPS
client=48 | 5902 => 5623 | 05% reduction in TPS
client=96 | 7094 => 6739 | 05% reduction in TPS

I'm re-running these tests, and the early numbers indicate that the 3rd run is slower than the 1st and the 4th is slower than the 3rd.
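For reference, each run is roughly along these lines (the flags and database name below are illustrative, not the exact commands used):

pgbench -i -s 100 benchdb            # one-time initialisation (scale factor illustrative)
pgbench -c 8 -j 8 -T 300 benchdb     # run 1
pgbench -c 8 -j 8 -T 300 benchdb     # run 2, back to back on the same dataset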

Could the lack of TRIM support on ZFS-on-Linux be causing this? https://github.com/zfsonlinux/zfs/pull/8255

Saurabh Nanda

2 Answers


Rather than missing TRIM support (whose performance deficit you can often avoid simply by leaving ~10% of the disk unpartitioned at the end), what is probably hitting you is ZFS's copy-on-write (CoW) behavior.

Basically, when running on an empty dataset you can write without incurring read/modify/write cycles because, well, you have not written much yet. When you start rewriting data (as in the subsequent benchmark runs), you progressively hit more and more read/modify/write, leading to both read and write amplification (and slower performance).

To check whether this is the case, use zpool iostat to record total reads/writes during the first three runs: if the second and third runs transfer noticeably more bytes for the same workload, you have confirmation of the above.
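As a rough sketch (the pool name tank and the pgbench invocation are placeholders, not your actual setup), you could log pool throughput during each run and compare the totals afterwards:

zpool iostat tank 5 > run1_iostat.log &    # sample pool bandwidth every 5 seconds during run 1
pgbench -c 8 -j 8 -T 300 benchdb           # run 1 (placeholder flags and database name)
kill %1                                    # stop the background logger
# repeat with run2_iostat.log / run3_iostat.log and compare the read/write bandwidth columns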

shodanshok
  • Are you confusing CoW behavior with 512/4k sector misalignment? Because it looks that way. At the same time, sector misalignment cannot cause progressive degradation. – drookie Jan 30 '19 at 17:01
  • @drookie no, sector misalignment is avoided with proper `ashift` settings. I'm referring to r/m/w amplification, which occurs when using a `recordsize` larger than the actual write I/O – shodanshok Jan 30 '19 at 17:28
  • Well, this is simply avoided by setting the recordsize to match the db block size. But I doubt this can cause the progressive degradation either. – drookie Jan 30 '19 at 19:18
  • @drookie it is not always possible to match `recordsize` to the actual write I/O, especially when compression is enabled. Moreover, we have no information on the actual pool/dataset config, so I must assume the default `recordsize` (128K) – shodanshok Jan 30 '19 at 21:45
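For reference, a minimal sketch of the settings discussed in these comments (the pool and dataset names are placeholders; PostgreSQL's default block size is 8 kB):

zpool create -o ashift=12 tank /dev/nvme0n1   # ashift=12 aligns writes to 4 KiB sectors; can only be set at pool/vdev creation
zfs create -o recordsize=8K tank/pgdata       # match the dataset recordsize to PostgreSQL's 8 kB blocks (placeholder dataset name)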

You can verify whether autotrim is enabled on that pool:

zpool get autotrim [poolname]

Turning that on may help performance. If it is not enabled, you can enable it with:

zpool set autotrim=on [poolname]

Leaving 10% empty space can also help. However, if the SSD is not brand new, you have to shrink the existing partition to free up that 10% of space. After that, you also have to run blkdiscard on the freed space. Note that blkdiscard is a dangerous command that can wipe out existing data if you pass the wrong offset or range. It is not recommended on an SSD that already holds data.
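As a sketch only (the device name and offset below are hypothetical; verify the layout with lsblk first, because a wrong offset destroys data):

blkdiscard --offset 460GiB /dev/nvme0n1   # WARNING: destructive; discards from the hypothetical start of the freed tail region to the end of the device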

Xudong Jin
  • ZFS on Linux did not have TRIM support at the time the original post was written. – Michael Hampton Jul 20 '21 at 19:14
  • @MichaelHampton can you edit that answer and add `Starting with ZFS 0.8, ZFS has TRIM support; earlier versions did not`? I can't suggest an edit due to too many open edits :-( I mean, that answer is great, but it did not explain this specific point – djdomi Jul 21 '21 at 04:31