How to limit ZFS writes on NVME SSD in RAID1 to avoid rapid disk wear?

Question

I'm currently running Proxmox 5.3-7 on ZFS with a few idle Debian virtual machines. I'm using two SSDPE2MX450G7 NVMe drives in RAID 1. After 245 days of running this setup, the S.M.A.R.T. values are terrible.

SMART/Health Information (NVMe Log 0x02, NSID 0xffffffff)
Critical Warning:                   0x00
Temperature:                        27 Celsius
Available Spare:                    98%
Available Spare Threshold:          10%
Percentage Used:                    21%
Data Units Read:                    29,834,793 [15.2 TB]
Data Units Written:                 765,829,644 [392 TB]
Host Read Commands:                 341,748,298
Host Write Commands:                8,048,478,631
Controller Busy Time:               1
Power Cycles:                       27
Power On Hours:                     5,890
Unsafe Shutdowns:                   0
Media and Data Integrity Errors:    0
Error Information Log Entries:      0

I've been trying to find out what is issuing so many write commands, without success. iotop shows 400 kB/s average writes with 4 MB/s spikes.

I've also run zpool iostat, and it doesn't look bad either.

zpool iostat rpool 60
              capacity     operations     bandwidth
pool        alloc   free   read  write   read  write
rpool        342G  74.3G      0     91  10.0K  1.95M
rpool        342G  74.3G      0     90  7.80K  1.95M
rpool        342G  74.3G      0    107  7.60K  2.91M
rpool        342G  74.3G      0     85  22.1K  2.15M
rpool        342G  74.3G      0     92  8.47K  2.16M
rpool        342G  74.3G      0     90  6.67K  1.71M

I've decided to take a look into writes by echoing 1 into /proc/sys/vm/block_dump and looking into /var/log/syslog. Here's the result:

Jan 25 16:56:19 proxmox kernel: [505463.283056] z_wr_int_2(438): WRITE block 310505368 on nvme0n1p2 (16 sectors)
Jan 25 16:56:19 proxmox kernel: [505463.283058] z_wr_int_0(436): WRITE block 575539312 on nvme1n1p2 (16 sectors)
Jan 25 16:56:19 proxmox kernel: [505463.283075] z_wr_int_1(437): WRITE block 315902632 on nvme0n1p2 (32 sectors)
Jan 25 16:56:19 proxmox kernel: [505463.283096] z_wr_int_4(562): WRITE block 460141312 on nvme0n1p2 (8 sectors)
Jan 25 16:56:19 proxmox kernel: [505463.283108] z_wr_int_4(562): WRITE block 460141328 on nvme0n1p2 (16 sectors)
Jan 25 16:56:19 proxmox kernel: [505463.283271] z_null_iss(418): WRITE block 440 on nvme1n1p2 (8 sectors)
Jan 25 16:56:19 proxmox kernel: [505463.283315] z_null_iss(418): WRITE block 952 on nvme1n1p2 (8 sectors)
Jan 25 16:56:19 proxmox kernel: [505463.283348] z_null_iss(418): WRITE block 878030264 on nvme1n1p2 (8 sectors)
Jan 25 16:56:19 proxmox kernel: [505463.283378] z_null_iss(418): WRITE block 878030776 on nvme1n1p2 (8 sectors)
Jan 25 16:56:19 proxmox kernel: [505463.283409] z_null_iss(418): WRITE block 440 on nvme0n1p2 (8 sectors)
Jan 25 16:56:19 proxmox kernel: [505463.283442] z_null_iss(418): WRITE block 952 on nvme0n1p2 (8 sectors)
Jan 25 16:56:19 proxmox kernel: [505463.283472] z_null_iss(418): WRITE block 878030264 on nvme0n1p2 (8 sectors)
Jan 25 16:56:19 proxmox kernel: [505463.283502] z_null_iss(418): WRITE block 878030776 on nvme0n1p2 (8 sectors)
Jan 25 16:56:19 proxmox kernel: [505463.289562] z_wr_iss(434): WRITE block 460808488 on nvme1n1p2 (24 sectors)
Jan 25 16:56:19 proxmox kernel: [505463.289572] z_wr_iss(434): WRITE block 460808488 on nvme0n1p2 (24 sectors)
Jan 25 16:56:19 proxmox kernel: [505463.457366] z_wr_iss(430): WRITE block 460808744 on nvme1n1p2 (24 sectors)
Jan 25 16:56:19 proxmox kernel: [505463.457382] z_wr_iss(430): WRITE block 460808744 on nvme0n1p2 (24 sectors)
Jan 25 16:56:19 proxmox kernel: [505463.459003] z_wr_iss(431): WRITE block 460809000 on nvme1n1p2 (24 sectors)

and so on. Is there any way to limit the number of writes? As you can see, the data units written are outrageous, and I'm out of ideas for how to reduce them.
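To see which kernel threads and devices account for the writes, the block_dump lines can be aggregated rather than read one by one. The following is a minimal sketch, assuming the syslog line format shown in the question and 512-byte sectors; the function name `summarize_writes` is made up for illustration.

```python
import re
from collections import Counter

SECTOR_BYTES = 512  # assumption: block_dump counts 512-byte sectors

# Matches e.g. "z_wr_int_2(438): WRITE block 310505368 on nvme0n1p2 (16 sectors)"
LINE_RE = re.compile(
    r"(?P<task>\S+)\((?P<pid>\d+)\): WRITE block (?P<block>\d+) "
    r"on (?P<dev>\S+) \((?P<sectors>\d+) sectors\)"
)

def summarize_writes(lines):
    """Return total bytes written per (task name, device) pair."""
    totals = Counter()
    for line in lines:
        m = LINE_RE.search(line)
        if m:
            totals[(m["task"], m["dev"])] += int(m["sectors"]) * SECTOR_BYTES
    return totals

# Three lines copied from the log excerpt in the question.
sample = [
    "Jan 25 16:56:19 proxmox kernel: [505463.283056] z_wr_int_2(438): WRITE block 310505368 on nvme0n1p2 (16 sectors)",
    "Jan 25 16:56:19 proxmox kernel: [505463.283271] z_null_iss(418): WRITE block 440 on nvme1n1p2 (8 sectors)",
    "Jan 25 16:56:19 proxmox kernel: [505463.283315] z_null_iss(418): WRITE block 952 on nvme1n1p2 (8 sectors)",
]

for (task, dev), nbytes in sorted(summarize_writes(sample).items()):
    print(f"{task:12s} {dev:10s} {nbytes:8d} bytes")
```

In practice the input would be the relevant lines of /var/log/syslog collected over a fixed interval, so the totals can be compared with the zpool iostat figures for the same period.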

It's a cheap dedicated server with 2x NVME SSD 480GB, so it's all I've got. — Peter R., Jan 25 '19 at 17:51
How do you know you made all those writes, and not the person who previously leased that server? — Michael Hampton, Jan 25 '19 at 17:51
I've checked the SMART after the server was provisioned. It had 2 power-on hours on the clock after initial tests and OS install. — Peter R., Jan 25 '19 at 17:53
Well, that's you then. Time to take a look at what your virtual machines are doing. — Michael Hampton, Jan 25 '19 at 17:54
I ran `iostat -md 600` on all machines simultaneously. The numbers don't add up at all. If I sum it up VMs generate only 1/3 of all writes that appear on host. For example: 55MB+125MB+88MB+90MB = 358MB on 4 debian VMs appear as 990MB write on the host machine during the same time period. — Peter R., Jan 25 '19 at 17:56
`zpool get all | grep 'ashift' rpool ashift 12` so I assume it's set to 4K. I'm pretty sure I did not set it manually, it must be the default one from Proxmox install. — Peter R., Jan 25 '19 at 18:01
There's no such file. The only one with this name is `/lib/modules-load.d/zfs.conf` which contains `zfs`. — Peter R., Jan 26 '19 at 21:23
@PeterR. I've updated my answer with a clarification request. Can you show the output of `nvme intel smart-log-add /dev/nvme0` ? — shodanshok, Jan 28 '19 at 11:43
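The guest-versus-host comparison from the comments can be written out as a small calculation. This is only a sketch using the four per-VM figures and the host figure quoted above; the function name `amplification` is illustrative.

```python
# Per-VM write totals (MB) and the host total (MB) over the same 600 s window,
# as quoted in the comment above.
vm_writes_mb = [55, 125, 88, 90]
host_writes_mb = 990

def amplification(guest_mb, host_mb):
    """Ratio of host-level writes to the sum of guest-level writes."""
    return host_mb / sum(guest_mb)

ratio = amplification(vm_writes_mb, host_writes_mb)
print(f"guests: {sum(vm_writes_mb)} MB, host: {host_writes_mb} MB, ratio: {ratio:.2f}x")
```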

shodanshok · Accepted Answer · 2019-01-28T15:30:50.087

score 6

There are several reasons why your actual writes are so inflated. Let's establish some base points:

first, let's set a baseline: from your zpool iostat output, we can infer a continuous ~1.5 MB/s write stream to each mirror leg. Over 245 days, that adds up to 1.5 MB/s * 86400 s/day * 245 days ≈ 32 TB written;
this figure already accounts for both ZFS recordsize write amplification and the double write caused by writing first to the ZIL and then again at txg commit (for writes smaller than zfs_immediate_write_sz).
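The baseline arithmetic can be checked against the SMART counter from the question. The sketch below assumes the NVMe convention that one data unit is 1000 * 512 bytes, which matches the 392 TB that the SMART output itself prints; the helper names are illustrative.

```python
SECONDS_PER_DAY = 86_400
DATA_UNIT_BYTES = 512_000  # assumption: one NVMe SMART data unit = 1000 * 512 bytes

def tb_from_rate(mb_per_s, days):
    """Total TB written by a constant stream of mb_per_s over the given days."""
    return mb_per_s * SECONDS_PER_DAY * days / 1_000_000

def tb_from_data_units(units):
    """Convert the SMART 'Data Units Written' counter to TB."""
    return units * DATA_UNIT_BYTES / 1e12

estimated = tb_from_rate(1.5, 245)         # from the zpool iostat baseline
reported = tb_from_data_units(765_829_644)  # from the SMART log in the question

print(f"estimated from zpool iostat: {estimated:.1f} TB")
print(f"reported by SMART:           {reported:.1f} TB")
print(f"ratio:                       {reported / estimated:.1f}x")
```

The ratio between the two figures is the gap this answer is trying to account for.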

Given the above, to reduce ZFS-induced write amplification, you should:

set a small recordsize (e.g. 16K);
set logbias=throughput;
set compression=lz4 (as suggested by @poige).
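The three settings above map onto `zfs set` commands. The following is a dry-run sketch that only builds and prints the commands; the dataset name `rpool/data` is a hypothetical placeholder, not something stated in the question, and should be replaced with the dataset that actually holds the VM images.

```python
def zfs_tuning_commands(dataset, recordsize="16K"):
    """Build the `zfs set` commands for the three suggested properties."""
    props = [
        f"recordsize={recordsize}",
        "logbias=throughput",
        "compression=lz4",
    ]
    return [["zfs", "set", prop, dataset] for prop in props]

# "rpool/data" is a hypothetical dataset name used only for illustration.
for cmd in zfs_tuning_commands("rpool/data"):
    print(" ".join(cmd))
```

To apply them for real, each list could be passed to `subprocess.run(cmd, check=True)` as root. Note that recordsize and compression only affect blocks written after the change; existing data keeps its old layout until it is rewritten.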

EDIT: to estimate the write amplification more accurately, please show the output of nvme intel smart-log-add /dev/nvme0

edited Jan 28 '19 at 15:30

answered Jan 26 '19 at 13:18

shodanshok


First of all - thank you for help. I decided to wait a little to make sure your suggestions are working. I've set recordsize to 16K, updated the logbias, reduced the size of qcow2 files to make some room on the disks. And now the SMART still shows 372TB, so it's not growing as fast as it was before. Awesome reply, thank you for your help. – Peter R. Jan 28 '19 at 12:18
@PeterR. Glad to help. If you have the time to run `nvme intel smart-log-add /dev/nvme0`, we can check the actual write amplification. – shodanshok Jan 28 '19 at 14:19
The most important lines of this output: wear_leveling : 79% min: 1051, max: 1095, avg: 1070 timed_workload_media_wear : 100% 63.999% timed_workload_host_reads : 100% 65535% timed_workload_timer : 100% 65535 min nand_bytes_written : 100% sectors: 9692273 host_bytes_written : 100% sectors: 11702215 – Peter R. Jan 28 '19 at 17:31
@PeterR. can you show the *full* output, updating your initial question (to preserve text format, which is lost on comments)? – shodanshok Jan 28 '19 at 18:18

poige · Answer 2 · 2019-01-29T01:35:01.290

score 5

In addition to the advice already given to reduce recordsize, there's no reason not to enable LZ4 compression by default as well (zfs set compression=lz4 …), which reduces the amount of data written even further (sometimes very significantly).

edited Jan 29 '19 at 01:35

answered Jan 28 '19 at 14:41

poige


It is **not** 2% wear level, but 2% of the *available spare* consumed, which is an entirely different thing. The current SSD wear level is 21% (as shown by `Percentage Used`). Feel free to double-check the NVMe specification. That said, the drive is about 2/3 of the way through its warranted write endurance ([590 TBW for this model](https://ark.intel.com/products/93188/Intel-SSD-DC-P3520-Series-450GB-2-5in-PCIe-3-0-x4-3D1-MLC-)) – shodanshok Jan 28 '19 at 15:23
Thanks and feel free to provide a link (single is enough) where its "Spare" term is described. ;) – poige Jan 29 '19 at 01:33
Sure, you can check [here](https://nvmexpress.org/wp-content/uploads/NVM-Express-1_3c-2018.05.24-Ratified.pdf). On page 121, under "Namespace Capacity": `Spare LBAs are not reported as part of this field`. So spare space is, well, additional space not directly seen by the block layer, used to replace *failing* NAND cells. Eating 2% into this area does not mean the SSD is 2% worn; rather, a small but significant number of NAND cells have *already failed* and were replaced. An aging SSD can keep working while reporting 0% used spare and then, suddenly, use *all* its spare cells in a matter of weeks. – shodanshok Jan 29 '19 at 08:18
Yeah, already figured out since your first reply but anyways this explanation can be useful to other readers – poige Jan 29 '19 at 15:27
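The distinction between the two SMART fields, and the "2/3 of warranted endurance" figure, can be reproduced from the numbers in the question. This is a sketch under two assumptions: one data unit is 1000 * 512 bytes, and the rated endurance is the 590 TBW quoted in the comment above. The function name `parse_smart` is illustrative.

```python
def parse_smart(text):
    """Parse 'Key: value' lines from SMART output into a dict of strings."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields

# Three lines copied from the SMART output in the question.
smart = parse_smart("""\
Available Spare:                    98%
Percentage Used:                    21%
Data Units Written:                 765,829,644 [392 TB]
""")

RATED_TBW = 590  # endurance figure quoted in the comment above
units = int(smart["Data Units Written"].split()[0].replace(",", ""))
written_tb = units * 512_000 / 1e12  # assumption: 1 data unit = 1000 * 512 bytes

print(f"wear (Percentage Used):  {smart['Percentage Used']}")
print(f"spare remaining:         {smart['Available Spare']}")
print(f"share of rated TBW used: {written_tb / RATED_TBW:.0%}")
```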

score 4 · Answer 3 · answered Jan 26 '19 at 15:18

A few items...

If this is a leased server, isn't the provider responsible for the health of the equipment?

It may also make sense to review your pool's ashift value, the txg_timeout setting, and a few other ZFS parameters.

answered Jan 26 '19 at 15:18

ewwhite


First of all - you are right, if the drive fails the provider will get me a new one. I would like to have it sorted out anyway, because one day I could wear out my own hardware this way. Ashift is set to 12, so I assume it's 4K. txg_timeout is set to 5. – Peter R. Jan 26 '19 at 21:16
txg_timeout is too low. – ewwhite Jan 27 '19 at 01:18
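The current values of the tunables discussed in this thread can be read from sysfs on ZFS on Linux. The sketch below assumes the module parameters live under /sys/module/zfs/parameters and only reads them; it does not suggest a particular value, since none is given here. The function name `read_zfs_params` is illustrative.

```python
from pathlib import Path

def read_zfs_params(names, base="/sys/module/zfs/parameters"):
    """Return {name: value-or-None} for the given ZFS module parameters."""
    values = {}
    for name in names:
        path = Path(base) / name
        # None means the parameter file is absent (e.g. ZFS module not loaded).
        values[name] = path.read_text().strip() if path.exists() else None
    return values

print(read_zfs_params(["zfs_txg_timeout", "zfs_immediate_write_sz"]))
```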
