
I know that the performance of ZFS heavily depends on the amount of free space:

> Keep pool space under 80% utilization to maintain pool performance. Currently, pool performance can degrade when a pool is very full and file systems are updated frequently, such as on a busy mail server. Full pools might cause a performance penalty, but no other issues. [...] Keep in mind that even with mostly static content in the 95-96% range, write, read, and resilvering performance might suffer.
>
> ZFS Best Practices Guide, solarisinternals.com (archive.org)

Now suppose I have a raidz2 pool of 10T hosting a ZFS file system `volume`. I create a child file system `volume/test` and give it a reservation of 5T.

Then I mount both file systems via NFS on some host and perform some work. I understand that I can't write more than 5T to `volume`, because the remaining 5T are reserved for `volume/test`.
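
For concreteness, roughly the setup I have in mind - a minimal sketch, with the pool name and disk names made up (they are not part of my actual setup):

```sh
# hypothetical 10T raidz2 pool; pool and disk names are placeholders
zpool create tank raidz2 c0t0d0 c0t1d0 c0t2d0 c0t3d0 c0t4d0 c0t5d0

# parent file system and a child carrying the 5T reservation
zfs create tank/volume
zfs create -o reservation=5T tank/volume/test

# share both over NFS
zfs set sharenfs=on tank/volume
zfs set sharenfs=on tank/volume/test
```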

My first question is: how will performance behave if I fill the `volume` mount point with ~5T? Will it drop because there is no free space left in that file system for ZFS's copy-on-write and other metadata, or will it remain the same, since ZFS can use the free space within the space reserved for `volume/test`?

Now the second question: does it make a difference if I change the setup as follows? `volume` now has two child file systems, `volume/test1` and `volume/test2`, each given a 3T reservation (but no quotas). Now assume I write 7T to `test1`. Will the performance be the same for both file systems, or will it differ between them? Will it drop, or remain the same?
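
The variant for this second question, again with the made-up pool name from above:

```sh
# two children, each with a 3T reservation and no quota
zfs create -o reservation=3T tank/volume/test1
zfs create -o reservation=3T tank/volume/test2
```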

Thanks!

Pavel

2 Answers


The performance degradation occurs when your zpool is either very full or very fragmented. The reason for this is the mechanism of free block discovery employed by ZFS. Unlike other file systems such as NTFS or ext3, there is no block bitmap showing which blocks are occupied and which are free. Instead, ZFS divides each vdev of your zpool into (usually 200) larger areas called "metaslabs" and stores AVL trees[1] of free block information (the space map) for each metaslab. The balanced AVL tree allows for an efficient search for a block fitting the size of the request.
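
If you want to look at these structures on a live pool, `zdb` can dump them; a quick example (the pool name `tank` is just a placeholder):

```sh
# one -m lists the metaslabs of each vdev with their allocated/free space
zdb -m tank

# a second -m additionally dumps the space map segments of every metaslab
zdb -mm tank
```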

While this mechanism has been chosen for reasons of scale, unfortunately it also turned out to be a major pain when a high level of fragmentation and/or space utilization occurs. As soon as all metaslabs carry a significant amount of data, you get a large number of small areas of free blocks, as opposed to a small number of large areas when the pool is empty. If ZFS then needs to allocate 2 MB of space, it starts reading and evaluating all metaslabs' space maps to either find a suitable block or a way to break up the 2 MB into smaller blocks. This of course takes some time. What is worse is the fact that it costs a whole lot of I/O operations, as ZFS will indeed read space maps off the physical disks - and it does this for every write.
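
You can get a rough feeling for how close a pool is to this regime from its utilization and - on pools/releases that support the `spacemap_histogram` feature - the fragmentation metric reported by `zpool list`:

```sh
# pool-wide utilization; the FRAG column needs a reasonably recent pool version
zpool list -o name,size,allocated,free,capacity,fragmentation tank
```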

The drop in performance can be significant. If you fancy pretty pictures, take a look at the blog post over at Delphix which has some numbers taken from an (oversimplified but still valid) ZFS pool. I am shamelessly stealing one of the graphs - the blue, red, yellow, and green lines represent (respectively) pools at 10%, 50%, 75%, and 93% capacity, plotted as write throughput in KB/s while becoming fragmented over time:

[graph: zpool performance degradation - write throughput over time for pools at 10%, 50%, 75%, and 93% capacity]

A quick & dirty fix to this has traditionally been the metaslab debugging mode (just issue `echo metaslab_debug/W1 | mdb -kw` at run time to change the setting instantly). In this case, all space maps are kept in OS RAM, removing the requirement for excessive and expensive I/O on each write operation. Ultimately, this also means you need more memory, especially for large pools, so it is kind of a RAM-for-storage trade. Your 10 TB pool will probably cost you 2-4 GB of memory[2], but you will be able to drive it to 95% utilization without much hassle.
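
If you want the setting to survive a reboot on an illumos/Solaris box, the usual place is `/etc/system`. Treat the following as a sketch and check what your kernel actually exposes first - newer releases split the tunable into `metaslab_debug_load` and `metaslab_debug_unload`:

```sh
# check the current value of the tunable (printed as a decimal)
echo "metaslab_debug/D" | mdb -k

# persist it across reboots on releases that still have the single tunable
echo "set zfs:metaslab_debug = 1" >> /etc/system
```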


[1] It is a bit more complicated; if you are interested, look at Bonwick's post on space maps for details.

[2] If you need a way to calculate an upper limit for the memory, use `zdb -mm <pool>` to retrieve the number of segments currently in use in each metaslab, divide it by two to model a worst-case scenario (each occupied segment would be followed by a free one), and multiply the result by the record size of an AVL node (two memory pointers and a value, which given the 128-bit nature of ZFS and 64-bit addressing sums up to 32 bytes, although people generally seem to assume 64 bytes for some reason).

```sh
zdb -mm tank | awk '/segments/ {s+=$2} END {s*=32/2; printf("Space map size sum = %d\n",s)}'
```

Reference: the basic outline is contained in this posting by Markus Kovero on the zfs-discuss mailing list, although I believe he made some mistakes in his calculation which I hope to have corrected in mine.

– the-wabbit
  • syneticon-dj, thank you for this explanation! Increasing RAM seems to help indeed. – Pavel Nov 26 '13 at 18:45
  • What about BPR (block pointer rewrite)? Also this one http://blogs.kent.ac.uk/unseenit/2013/10/02/effects-of-zfs-fragmentation-on-underlying-storage/ mentions that using a SLOG for the ZIL helps too. And this guy http://nex7.blogspot.com.au/2013/03/readme1st.html says you just send and receive until it's all good. – CMCDragonkai Jun 27 '14 at 09:20
  • @CMCDragonkai I can assure you from experience that using a separate ZIL device does nothing towards the performance hit due to space map fragmentation. But **not** having a ZIL device will increase overall fragmentation and you will be more likely to hit the issue at lower percentages of space utilization. BPR is still vaporware - no demonstrable code exists, much less a stable implementation. A send-receive cycle is indeed likely to help in getting a defragmented pool, but this *will* mean downtime for the dataset sent/received. – the-wabbit Jun 27 '14 at 10:05
  • What if you replicated the dataset prior to send-receive onto another disk? And then just rotate a send-receive cycle for each disk? – CMCDragonkai Jun 28 '14 at 00:21
  • @CMCDragonkai you *can* keep downtime short by doing a full send first and working with incrementals after that. But downtime it stays. If you happen to use your datasets as backend storage for databases or virtualization, downtime hurts, even if it is short. Also, you will need a separate, empty pool for this to work. – the-wabbit Jun 29 '14 at 14:22

Yes. You need to keep free space in your pool. It's mainly for copy-on-write actions and snapshots. Performance declines at about 85% utilization. You can go higher, but there's a definite impact.

Don't mess with reservations. Especially with NFS. It's not necessary. Maybe for a zvol, but not NFS.

I don't see the confusion, though. If you have 10T, don't use more than 85% of it. Size your shares appropriately, using quotas to cap their use, or don't use any quotas and monitor your overall pool usage.
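
For example - a sketch using the made-up `tank/volume` names from the question, with numbers matching the 10T/85% figures above:

```sh
# option 1: cap the NFS share with a quota at roughly 85% of the 10T pool
zfs set quota=8.5T tank/volume

# option 2: skip quotas and just watch pool-wide utilization
zpool list -H -o capacity tank
```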

– ewwhite
  • Thanks! There's no fair way in our setting to use quotas, so everyone uses the same mount point and can fill up the space, leading to a drop in performance. My idea was to guarantee some free space with a reservation so that the overall system never gets too slow. But IIUC, I can have this guarantee by limiting `volume` to 8.5T and never think about it again. Is that correct? – Pavel May 27 '13 at 13:57
  • You could.. or just watch. I mean, it's NFS... not a zvol, so you can delete files to get back under 8.5TB. – ewwhite May 27 '13 at 13:58
  • Yeah, but it's a pain to have these "please clean up your sh.., the fileserver is awfully slow" discussions in the mailing lists every couple of weeks... – Pavel May 27 '13 at 14:02
  • Technical solution to a social/administrative problem :) Do you really anticipate that much data? – ewwhite May 27 '13 at 14:12
  • Hehe.. Yes, this is quite a common situation we face. So, are claims like this: ["On filesystems with many file creations and deletions, utilization should be kept under 80% to protect performance."](http://www.princeton.edu/~unix/Solaris/troubleshoot/zfs.html) imprecise, because it's really about the free space within a pool rather than the file system? – Pavel May 27 '13 at 14:16
  • Yes, it's a pool-level thing, since the quotas on filesystems can simply be expanded. – ewwhite May 27 '13 at 14:17
  • Personally I use 70-75% maximum, not 85%. Edge cases can even degrade lower than that (I won't scare you with how much lower). You can in fact set a reservation to limit space.. this is common at some sites I know of. All you do is create a dataset and set a reservation (or refreservation) on it that is equivalent to 25% of the total pool size, and poof, even if you use up every free block in your real datasets, you know the pool still has 25% free space, as you have a 25% reservation on there (see the sketch after this thread). – Nex7 May 28 '13 at 07:36
  • I'd worry more if it were a thin zvol expanding... but for NFS, it's easy enough to monitor and watch things. For NFS-only, I've gone to 85% and higher temporarily with no severely-negative effects, but if you can't control the growth/usage of a volume this size, there's another problem. – ewwhite May 28 '13 at 10:20
  • I grant you the reservation is no replacement for proper monitoring. I suppose it depends on the environment - one site was very worried because they'd previously had a user blow up a share massively over night due to a bug in some software they'd built, thus the reservation setting. Quotas would also theoretically stop that sort of thing, though, but they weren't interested in that. As for 85% and such; a lot of it is more about fragmentation.. if you can go 85% and back down without severe fragmentation in the process, yes, will be ok. :) – Nex7 May 29 '13 at 09:05
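
A minimal sketch of the spacer-reservation approach Nex7 describes above, assuming the made-up 10T pool `tank` from the question (the dataset name and the 2.5T value - 25% of 10T - are placeholders):

```sh
# empty dataset whose only purpose is to pin 25% of the pool as free space
zfs create tank/spacer
zfs set refreservation=2.5T tank/spacer
```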