29

I am trying to test a project that needs compressed storage while still using the ext4 file system, since the application I use relies on ext4 features.

Are there any production/stable solutions out there for transparent compression on ext4?

What I have tried:

Ext4 over a ZFS volume with compression enabled. This actually had an adverse effect. I created a ZFS volume with lz4 compression enabled and made an ext4 filesystem on /dev/zvol/..., but the ZFS volume showed double the actual usage and the compression did not seem to have any effect.

# du -hs /mnt/test
**1.1T**    /mnt/test
# zfs list
NAME        USED  AVAIL  REFER  MOUNTPOINT
pool       15.2T  2.70G   290K  /pool
pool/test  15.2T  13.1T  **2.14T**  -

ZFS Creation Commands

zpool create pool raidz2 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde2 /dev/sdf1 /dev/sdg1 /dev/sdh2 /dev/sdi1
zfs set recordsize=128k pool
zfs create -p -V15100GB pool/test
zfs set compression=lz4 pool/test
mkfs.ext4 -m1 -O 64bit,has_journal,extents,huge_file,flex_bg,uninit_bg,dir_nlink /dev/zvol/pool/test
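For reference (using the pool/test names above), the zvol's own accounting is the place to check whether lz4 is doing anything, since du inside the ext4 mount only ever sees uncompressed file sizes:

zfs get compressratio,volblocksize,referenced,volsize pool/test
zfs list pool/test

A compressratio close to 1.00x means the data is effectively not being compressed at the ZFS layer.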

Fusecompress: seemed to work, but it was not 100% stable, so I am looking for alternatives.

LessFS: is it possible to use LessFS in conjunction with ext4? I have not tried it yet, but I would be interested in users' insight.

One major problem: not true transparency

An issue I saw with fusecompress was quotas. For example, if I enable compression on the filesystem, I want my system to benefit from the compression, not necessarily the end user. If I give a user a 1GB quota and their data compresses at a ratio of 1.5, they can upload 1.5GB of data instead of 1GB, and the system sees no benefit from the compression. The compression also appeared to show up in df -h. Is there a solution to make compression transparent to quotas?
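To make the quota point concrete (the user name below is only illustrative), this is the kind of limit I mean. With a FUSE-level compressor, the 1GB hard limit effectively applies to the compressed on-disk size, so a user whose data compresses 1.5:1 can store roughly 1.5GB of logical data:

setquota -u someuser 0 1048576 0 0 /mnt/test # hard block limit of 1048576 1K blocks = 1GB

What I want is for the quota to keep counting logical (uncompressed) bytes while the savings stay with the system.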

user235918
  • Sure. Can you please list the OS/distro/version and details about the nature of the data you intend to store? – ewwhite Aug 04 '14 at 01:49
  • Also hardware details. – ewwhite Aug 04 '14 at 01:49
  • 1
    @ewwhite 8x3TB in a Software RAID6. Data will be rsynced backups from other servers so mixed data types and various end users, documents, etc. CentOS 6.5 x64. – user235918 Aug 04 '14 at 01:52
  • Are you sure you need this? Do you have many large, sparse files? Disk space is cheap these days. – Andrew Schulman Aug 04 '14 at 08:17
  • @AndrewSchulman: From my calculations, taking advantage of compression is the better approach. The cost of extra disks and of controllers that support them is more than the cost of CPU. – user235918 Aug 04 '14 at 18:51
  • @user235918 Also, to show the pool's compression ratio, can you give the output of `zfs get compressratio pool/test` ? – ewwhite Aug 04 '14 at 18:51
  • @ewwhite Yes I had already tested that. The compressratio is only 1.06X. – user235918 Aug 04 '14 at 18:52
  • @ewwhite I tested with a .tar file that was previously compressed from 633M to 216MB with gzip. It did not compress at all on the ext4 filesystem. It only compressed to 529M on a ZFS mount with lz4. So something with ext4 isn't working correctly, but I haven't tried volblocksize or anything from your post yet. – user235918 Aug 04 '14 at 18:57
  • @user235918 You're not going to be able to compress a file like a .gzip archive any further, assuming that's how you moved it. On ext4 and on top of a ZFS zvol, you won't *see* ANY compression. The file sizes will be their native sizes, but their space on disk will be smaller. It's transparent to the OS/applications. You'd only be able to see it in the zpool/zfs/compressratio figures. – ewwhite Aug 04 '14 at 19:00
  • @ewwhite Well I'm not sure what did it. Whether it was the sparse volume or the volblocksize but it appears to be working as expected now. I just created a new sparse volume and tested it by moving a bunch of files over. It is showing a compression ratio above 2 now and the size in zfs list is actually half the size of what du reports. I appreciate your help! – user235918 Aug 04 '14 at 19:10
  • @ewwhite Yeah I think the sparse volume fixed the ext4 issue and the volblocksize increased the compression because even before the tar file was only 529M on ZFS/LZ4. Now it is 334M which is about what I expected compared to the 216M of GZIP6. – user235918 Aug 04 '14 at 19:14
  • @user235918 Glad this helped. There are also some settings you'll want in `/etc/modprobe/zfs.conf` - Which version of ZFS on Linux did you download? 6.2 or 6.3? – ewwhite Aug 05 '14 at 12:36
  • @ewwhite I am using 6.3. With CentOS it is usually modprobe.d where the files are stored but I didn't have any zfs.conf there. – user235918 Aug 05 '14 at 23:35
  • Right. There are some values you'll want to put there. – ewwhite Aug 05 '14 at 23:54
  • @ewwhite. I did some searching online and came up with the zfs_arc_max. Is this what you're talking about? If so, I did some reading up on it, so thanks for pointing me in the right direction. – user235918 Aug 07 '14 at 00:14

2 Answers

31

I use ZFS on Linux as a volume manager and a means to provide additional protections and functionality to traditional filesystems. This includes bringing block-level snapshots, replication, deduplication, compression and advanced caching to the XFS or ext4 filesystems.

See: https://pthree.org/2012/12/21/zfs-administration-part-xiv-zvols/ for another explanation.

In my most common use case, I leverage the ZFS zvol feature to create a sparse volume on an existing zpool. That zvol's properties can be set just like a normal ZFS filesystem's. At this juncture, you can set properties like compression type, volume size, caching method, etc.

Creating this zvol presents a block device to Linux that can be formatted with the filesystem of your choice. Use fdisk or parted to create your partition and mkfs the finished volume.

Mount it and you essentially have a filesystem backed by a zvol, with all of the zvol's properties.


Here's my workflow...

Create a zpool comprised of four disks:
You'll want the ashift=12 directive for the type of disks you're using. The zpool name is "vol0" in this case.

zpool create -o ashift=12 -f vol0 mirror scsi-AccOW140403AS1322043 scsi-AccOW140403AS1322042 mirror scsi-AccOW140403AS1322013 scsi-AccOW140403AS1322044

Set initial zpool settings:
I set autoexpand=on at the zpool level in case I ever replace the disks with larger drives or expand the pool in a ZFS mirror setup. I typically don't use ZFS raidz1/2/3 because of the poor performance and the inability to expand the zpool.

zpool set autoexpand=on vol0

Set initial zfs filesystem properties:
Please use the lz4 compression algorithm for new ZFS installations. It's okay to leave it on all the time.

zfs set compression=lz4 vol0
zfs set atime=off vol0

Create ZFS zvol:
For ZFS on Linux, it's very important that you use a large block size. -o volblocksize=128k is absolutely essential here. The -s option creates a sparse zvol and doesn't consume pool space until it's needed. You can overcommit here, if you know your data well. In this case, I have about 444GB of usable disk space in the pool, but I'm presenting an 800GB volume to XFS.

zfs create -o volblocksize=128K -s -V 800G vol0/pprovol

Partition zvol device:
(should be /dev/zd0 for the first zvol; /dev/zd16, /dev/zd32, etc. for subsequent zvols)

fdisk /dev/zd0 # (create new aligned partition with the "c" and "u" parameters)
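If you prefer something scriptable, a rough non-interactive equivalent with parted (assuming a GPT label is acceptable and 1MiB alignment suits your disks) is:

parted -s /dev/zd0 mklabel gpt
parted -s /dev/zd0 mkpart primary 1MiB 100%

Either way the goal is the same: one aligned partition spanning the zvol.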

Create and mount the filesystem:
Run mkfs.xfs or mkfs.ext4 on the newly created partition, /dev/zd0p1.

mkfs.xfs -f -l size=256m,version=2 -s size=4096 /dev/zd0p1
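Since the original question is about ext4, a minimal ext4 equivalent for the same partition would look something like this (-m1 just drops the reserved blocks to 1%, matching the question's mkfs flags; everything else can stay at the mke2fs defaults):

mkfs.ext4 -m1 /dev/zd0p1

The fstab entry is then the same idea as the XFS line below, with ext4 as the type and the XFS-specific log options dropped; noatime is still worth keeping.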

Grab the UUID with blkid and modify /etc/fstab.

UUID=455cae52-89e0-4fb3-a896-8f597a1ea402 /ppro       xfs     noatime,logbufs=8,logbsize=256k 1 2

Mount the new filesystem.

mount /ppro/

Results...

[root@Testa ~]# df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/sde2        20G  8.9G  9.9G  48% /
tmpfs            32G     0   32G   0% /dev/shm
/dev/sde1       485M   63M  397M  14% /boot
/dev/sde7       2.0G   68M  1.9G   4% /tmp
/dev/sde3        12G  2.6G  8.7G  24% /usr
/dev/sde6       6.0G  907M  4.8G  16% /var
/dev/zd0p1      800G  398G  403G  50% /ppro  <-- Compressed ZFS-backed XFS filesystem.
vol0            110G  256K  110G   1% /vol0

ZFS filesystem listing.

[root@Testa ~]# zfs list
NAME           USED  AVAIL  REFER  MOUNTPOINT
vol0           328G   109G   272K  /vol0
vol0/pprovol   326G   109G   186G  -   <-- The actual zvol providing the backing for XFS.
vol1           183G   817G   136K  /vol1
vol1/images    183G   817G   183G  /images

ZFS zpool list.

[root@Testa ~]# zpool list -v
NAME   SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
vol0   444G   328G   116G    73%  1.00x  ONLINE  -
  mirror   222G   164G  58.1G         -
    scsi-AccOW140403AS1322043      -      -      -         -
    scsi-AccOW140403AS1322042      -      -      -         -
  mirror   222G   164G  58.1G         -
    scsi-AccOW140403AS1322013      -      -      -         -
    scsi-AccOW140403AS1322044      -      -      -         -

ZFS zvol properties (take note of referenced, compressratio and volsize).

[root@Testa ~]# zfs get all vol0/pprovol
NAME          PROPERTY               VALUE                  SOURCE
vol0/pprovol  type                   volume                 -
vol0/pprovol  creation               Sun May 11 15:27 2014  -
vol0/pprovol  used                   326G                   -
vol0/pprovol  available              109G                   -
vol0/pprovol  referenced             186G                   -
vol0/pprovol  compressratio          2.99x                  -
vol0/pprovol  reservation            none                   default
vol0/pprovol  volsize                800G                   local
vol0/pprovol  volblocksize           128K                   -
vol0/pprovol  checksum               on                     default
vol0/pprovol  compression            lz4                    inherited from vol0
vol0/pprovol  readonly               off                    default
vol0/pprovol  copies                 1                      default
vol0/pprovol  refreservation         none                   default
vol0/pprovol  primarycache           all                    default
vol0/pprovol  secondarycache         all                    default
vol0/pprovol  usedbysnapshots        140G                   -
vol0/pprovol  usedbydataset          186G                   -
vol0/pprovol  usedbychildren         0                      -
vol0/pprovol  usedbyrefreservation   0                      -
vol0/pprovol  logbias                latency                default
vol0/pprovol  dedup                  off                    default
vol0/pprovol  mlslabel               none                   default
vol0/pprovol  sync                   standard               default
vol0/pprovol  refcompressratio       3.32x                  -
vol0/pprovol  written                210M                   -
vol0/pprovol  snapdev                hidden                 default
ewwhite
  • Why partition the zvol? Can't it just be used directly? – Michael Hampton Aug 04 '14 at 14:46
  • 3
    @MichaelHampton Mainly for alignment and consistency. Also, I want flexibility if I expand the underlying volume. There are several layers of abstraction here. It's similar to the argument of using `/dev/sdb` versus `/dev/sdb1`. – ewwhite Aug 04 '14 at 14:58
  • 1
    Thanks for your information. A lot of good advice in here. I'm going to test it out. – user235918 Aug 04 '14 at 18:32
  • 2
    @MichaelHampton BTW, these days, I don't partition anymore... especially with virtual machines. – ewwhite Oct 26 '15 at 22:02
  • Do you continue to use a 128K volblocksize? With such a large block size, each small (e.g. 4K) write will trigger a read-modify-checksum-write operation, wasting I/O bandwidth. Are there any specific reasons to use such a large block size? – shodanshok Feb 21 '17 at 18:29
  • For my workloads, testing at various block sizes down to 8k showed no appreciable differences in performance. Compression rate was much better at 64k and above. – ewwhite Feb 21 '17 at 19:18
  • 1
    Can you please tell about the additional resource costs for the ZFS layer in this setup (RAM, CPU)? – Sz. Dec 16 '17 at 00:25
  • You forgot something about deduplication: it uses a lot of RAM to store information about the deduplicated blocks, because ZFS deduplicates online. It is not possible to write files first and then deduplicate them offline while the system is idle. The only option is to switch the deduplication flag off during the day, switch it back on when the system is idle, and completely rewrite all of the fresh files; when you switch deduplication off again, the already-deduplicated data stays deduplicated. You can use an SSD as a write cache to limit the RAM usage, but that is a workaround. – Znik Jan 04 '18 at 14:17
  • Many guides exist that say ZFS needs 8GB of RAM just for its base level requirements. I'd like to use this with VMs, but most only need 2-4GB of RAM. Can this be used in low RAM setups without a serious performance penalty? – jimp Nov 19 '18 at 17:37
4

You also need to enable discard on the ext4 filesystem. Without discard, ZFS does not reclaim space when files are removed, which can lead to large discrepancies between what the ext4 filesystem reports and what the ZFS volume reports.
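For example (reusing the /dev/zd0p1 and /ppro names from the answer above, and assuming ext4 on that partition instead of XFS), either mount with the discard option or trim periodically:

mount -o discard /dev/zd0p1 /ppro
fstrim -v /ppro

The fstrim route is what the comment below recommends; it is typically run from a cron job rather than on every delete.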

Devon
  • 4
    Red Hat doesn't recommend doing this online with the discard mount option (with ext4 or xfs), as there's a performance impact. It's cleaner to periodically run the `fstrim` command. – ewwhite Aug 29 '14 at 13:38
  • Regarding the comment about discard mounts impacting performance: this is true of old, low-quality SSDs, but not of newer ones. – Stoat Nov 28 '15 at 17:14