ZFS disk with Postgresql fragmenting quickly with write-heavy application

Question

I am running a write-heavy application where I am storing my Postgresql on ZFS. Generally, it works well, but I am finding that my ZFS pool is fragmenting heavily. I even created a new ZFS pool and moved the data there using zfs send / recv, in order to defragment the full space and try out some new settings which might limit fragmentation, but to no avail.

I'm not sure what is going on. The application is mostly making UPDATEs, but Postgres should be able to reuse the space already on disk (as it makes dead tuples and autovacuums them, it will re-use those pages). I do see that the allocated space is mostly staying the same, so it's not increasing disk usage. However, the COW nature of ZFS seems to be making the filesystem use up the free space and fragment it heavily.

So, after about 15m of heavy writes, the ZFS pool fragmentation went from 0% (initial) to 14%. This can't be the way things are supposed to be:

NAME               SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
postgres_data       149G  89.0G  60.0G        -         -    14%    59%  1.00x    ONLINE  -
postgres_wal       9.50G   289M  9.22G        -         -     0%     2%  1.00x    ONLINE  -

(note: in the time that I've been writing this post, the fragmentation has gone up to 16%. There is something not right here.)

Some notes on my implementation:

I have the WAL on a separate pool, in the theory that constant writes to the WAL won't fragment the main data store. Apparently this is not working.
I also have tried to set it up with sync=disabled on the theory that the ZIL is making my disk into swiss cheese. Also not making a difference.
I currently using logbias=throughput. I've seen conflicting advice on this; official docs suggest that for postgres one should use throughput, while a very useful guide says that "logbias=throughput will fragment every. Single. Block. Written to your pool." So, in my initial pool (now no longer in use, because it fragmented up to 80%), I initially used logbias=latency but that clearly also fragmented the whole disk.
I am using async=9 because the physical disk (this is cloud storage) has a block size of 512.

My full zfs settings:

NAME                PROPERTY              VALUE                  SOURCE
postgres_data/data  type                  filesystem             -
postgres_data/data  creation              Tue May 17 23:36 2022  -
postgres_data/data  used                  88.9G                  -
postgres_data/data  available             55.4G                  -
postgres_data/data  referenced            88.9G                  -
postgres_data/data  compressratio         1.98x                  -
postgres_data/data  mounted               yes                    -
postgres_data/data  quota                 none                   default
postgres_data/data  reservation           none                   default
postgres_data/data  recordsize            8K                     inherited from postgres_data
postgres_data/data  mountpoint            /var/lib/postgresql    local
postgres_data/data  sharenfs              off                    default
postgres_data/data  checksum              on                     default
postgres_data/data  compression           zstd                   inherited from postgres_data
postgres_data/data  atime                 off                    inherited from postgres_data
postgres_data/data  devices               on                     default
postgres_data/data  exec                  on                     default
postgres_data/data  setuid                on                     default
postgres_data/data  readonly              off                    default
postgres_data/data  zoned                 off                    default
postgres_data/data  snapdir               hidden                 default
postgres_data/data  aclmode               discard                default
postgres_data/data  aclinherit            restricted             default
postgres_data/data  createtxg             619                    -
postgres_data/data  canmount              on                     default
postgres_data/data  xattr                 sa                     inherited from postgres_data
postgres_data/data  copies                1                      default
postgres_data/data  version               5                      -
postgres_data/data  utf8only              off                    -
postgres_data/data  normalization         none                   -
postgres_data/data  casesensitivity       sensitive              -
postgres_data/data  vscan                 off                    default
postgres_data/data  nbmand                off                    default
postgres_data/data  sharesmb              off                    default
postgres_data/data  refquota              none                   default
postgres_data/data  refreservation        none                   default
postgres_data/data  guid                  10551245409099064921   -
postgres_data/data  primarycache          all                    default
postgres_data/data  secondarycache        all                    default
postgres_data/data  usedbysnapshots       0B                     -
postgres_data/data  usedbydataset         88.9G                  -
postgres_data/data  usedbychildren        0B                     -
postgres_data/data  usedbyrefreservation  0B                     -
postgres_data/data  logbias               throughput             local
postgres_data/data  objsetid              595                    -
postgres_data/data  dedup                 off                    default
postgres_data/data  mlslabel              none                   default
postgres_data/data  sync                  disabled               local
postgres_data/data  dnodesize             legacy                 default
postgres_data/data  refcompressratio      1.98x                  -
postgres_data/data  written               88.9G                  -
postgres_data/data  logicalused           176G                   -
postgres_data/data  logicalreferenced     176G                   -
postgres_data/data  volmode               default                default
postgres_data/data  filesystem_limit      none                   default
postgres_data/data  snapshot_limit        none                   default
postgres_data/data  filesystem_count      none                   default
postgres_data/data  snapshot_count        none                   default
postgres_data/data  snapdev               hidden                 default
postgres_data/data  acltype               off                    default
postgres_data/data  context               none                   default
postgres_data/data  fscontext             none                   default
postgres_data/data  defcontext            none                   default
postgres_data/data  rootcontext           none                   default
postgres_data/data  relatime              off                    default
postgres_data/data  redundant_metadata    most                   inherited from postgres_data
postgres_data/data  overlay               on                     default
postgres_data/data  encryption            off                    default
postgres_data/data  keylocation           none                   default
postgres_data/data  keyformat             none                   default
postgres_data/data  pbkdf2iters           0                      default
postgres_data/data  special_small_blocks  0                      default

ewwhite · Answer 1 · 2022-05-18T07:16:39.833

1

FRAG doesn't mean what you think it does.

It's a representation of how fragmented your free space is. It is not a way to show how fragmented your actual data is.

In practice, don't worry about this value unless you're really tight on space or experiencing a performance problem.

You should back out some of your changes, since they may have a negative impact on your actual workload.

The small 8k recordsize is also contributing to this (consider the default 128k or something larger than 8k)

See: https://utcc.utoronto.ca/~cks/space/blog/solaris/ZFSZpoolFragmentationMeaning

See: https://bun.uptrace.dev/postgres/tuning-zfs-aws-ebs.html#zfs-recordsize

edited May 18 '22 at 07:16

answered May 18 '22 at 06:24

ewwhite

194,921
91
434
799

Right, I’m aware that it’s free space fragmentation. I’m surprised that it’s rising so high. I was having some speed issues before with updating my main table and I was convinced (maybe incorrectly?) that the free space fragmentation was contributing to it. RE: recordsize, it’s meant to match PostgreSQL’s page size. Is that a problem? – Jason May 18 '22 at 06:58
1

Please see the links. I don't recommend an 8k ZFS record size value for Postgres. – ewwhite May 18 '22 at 07:16

ZFS disk with Postgresql fragmenting quickly with write-heavy application

1 Answers1