
If I do something like RAIDZ2 with 10 or so disks, and the guest operating systems use a filesystem like ext3/4, will the guest filesystems be as safe as if they were also using ZFS?

The reason I am asking is that, from what I have read, the recommendation when using ZFS is 1 GB of RAM per 1 TB of storage (I will end up with around 20-40 TB). If I ran ZFS on both the host and the guests, I would need double the RAM.

taylorjonl

2 Answers


Yes and no. If you have ZFS on the bottom and then either create zvols and offer them up over a block protocol, or offer up a file-level protocol (NFS, CIFS) and create files as disks (.vmdk's, .vhd's, etc.), you gain some safety. However, the client (in this case the VM operating system) isn't necessarily set up by default in a way that protects you as well as you might expect.
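For reference, here is a minimal sketch of those two layouts on the ZFS side; the pool and dataset names (tank/vm/...) are made up and the size is arbitrary:

    # Option 1: a zvol, to be exported over a block protocol (iSCSI, FC)
    zfs create -V 100G tank/vm/guest01-disk0

    # Option 2: a filesystem dataset shared over NFS, holding disk-image
    # files (.vmdk, .vhd, raw images) created by the hypervisor
    zfs create tank/vm/nfs-datastore
    zfs set sharenfs=on tank/vm/nfs-datastore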

The reason for this is that many default filesystem setups opt for performance, sometimes at the expense of safety. This is why even when using a local hard disk, a Windows machine might need to CHKDSK after a sudden power event (and similarly, Linux may need to fsck). You'll have to look around for the OS and filesystem in question to determine what, if any, modifications you need to make to it to get it to more regularly sync up with its underlying disk, and/or enable any journaling that the filesystem supports.
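As one illustration (ext4 on Linux, not a universal recommendation), the usual knobs are the write-barrier and journaling-mode mount options in the guest's /etc/fstab; the device name below is hypothetical:

    # Guest /etc/fstab entry: barrier=1 keeps write barriers on (the ext4
    # default), and data=journal journals file data as well as metadata,
    # trading performance for safety after a power loss.
    /dev/vda1  /  ext4  defaults,barrier=1,data=journal  0  1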

Furthermore, the options you use between the client (the VM OS) and the server (ZFS), on whatever protocol connects them, have some impact as well. One of the more infamous examples is iSCSI via COMSTAR on illumos/Solaris derivatives. The default on most of them is that COMSTAR sets up new LUs with 'write cache' enabled. With that setting, not all incoming I/O is passed into ZFS as sync; only I/O specifically flagged as sync by the client (the VM OS) is. Depending on other settings, it's fairly common that the VM OS is /not/ passing all its I/O down as sync, so it isn't going into ZFS as sync, so it isn't making use of the on-disk ZIL mechanics, and thus you're not safe from power-loss events.
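On an illumos COMSTAR box this is typically inspected and changed with stmfadm; a rough sketch, assuming a zvol-backed LU and the wcd (write-cache disable) property as found in current illumos releases, with made-up dataset and GUID values:

    # Show LU properties, including the write-cache setting
    stmfadm list-lu -v

    # Create a new LU with the write cache disabled (wcd=true)
    stmfadm create-lu -p wcd=true /dev/zvol/rdsk/tank/vm/guest01-disk0

    # Or disable it on an existing LU (GUID is illustrative)
    stmfadm modify-lu -p wcd=true 600144F0C8E0AA000000527D00200001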

The primary issue here is that while ZFS itself is meant to be corruption-free, and literally does not even have an 'fsck'-style utility because it is not generally possible for ZFS to 'corrupt' itself, that logic does not necessarily hold for the filesystems inside .vmdk's or zvols if those upper-level filesystems are not syncing every single I/O to disk the second they issue it (and the default options rarely do so). It also won't hold if your ZFS setup is not utilizing the ZIL (for instance, because you disabled it) or the underlying storage ZFS sits on is ignoring, or not being sent, cache flush commands (for instance, because you told ZFS not to send them).

In those scenarios ZFS should still always be consistent on disk, but after a power event it will come back up consistent as of 'X' seconds ago: it'll boot without corruption in ZFS, but it will be how things were 5-30 seconds before the failure. That is fine for ZFS, but possibly not so fine for the filesystem inside those zvols/disks-as-files (.vmdk's). That 5-second 'rollback', if you will, might have contained critical metadata for the filesystem inside the .vmdk, which then becomes broken or even unbootable at the higher level. Meanwhile, ZFS still thinks everything is fine.

For your upper-level OSes to be as safe as humanly possible, all of the following should be true:

  • your ZFS needs to utilize the ZIL; do not set sync=disabled, for example
  • your ZFS needs to be writing to disks that are either battery/NVRAM-backed themselves or obey cache flush commands; do not change zfs_nocacheflush to 1
  • your block protocol provider, likely COMSTAR (it is COMSTAR on SmartOS), needs to NOT be using the 'write cache' setting. This is true even if you think your clients are sending sync: if you want everything sync anyway, why leave write cache enabled and risk something not being treated as such? Similarly, if using a file-level protocol (which, IMHO, you should prefer over block whenever possible), make sure it is set to sync; this is the default on NFS (and make sure your NFS mount options on the client do not say 'async')
  • you probably want to set sync=always on the datasets/zvols in ZFS that are offering VM disks (see the sketch after this list)
  • your hypervisor should be configured to send sync, or at the very least not to discard sync requests from VMs (not an issue, AFAIK, on SmartOS, but it can be on VMware)
  • your VM OSes need to be set up not to do any write caching: they need, at the very least, to journal immediately, and preferably to sync on every write rather than cache writes. How this is done for each individual filesystem and OS is too much for this post, but look for information on making your OS 'safe from power failure'; key terms: write barriers, write cache, sync/fsync, journal
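A rough sketch of the ZFS- and client-side pieces from the list above, using the same made-up dataset names as earlier (the mdb check is illumos-specific, and the NFS mount line assumes a Linux client):

    # Force every write to the VM-backing datasets through the ZIL
    zfs set sync=always tank/vm/nfs-datastore
    zfs set sync=always tank/vm/guest01-disk0

    # Verify cache flushes are not being suppressed (should print 0)
    echo "zfs_nocacheflush/D" | mdb -k

    # On a Linux NFS client, mount the datastore with sync, never async
    mount -t nfs -o vers=3,hard,sync server:/tank/vm/nfs-datastore /mnt/datastore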

If your client OS filesystem is not holding back writes, there's no intermediary eating your sync requests, and the ZFS at the back end is set up properly to sync and is itself sitting on proper storage, then you should at that point have every expectation that any sort of power-loss event will have minimal impact on the VM filesystems. At worst, they'll need a quick chkdsk/fsck on boot, if even that, and should never be significantly corrupted or flat out lost. The more of the above list that is not true, the higher the probability of significant corruption.

As always, I have to also mention: KEEP BACKUPS. Even if you do all of the above and build a fully power-loss-safe environment, none of that is going to protect you from viruses on the VMs, hackers, a mistake made after 36 hours of being awake, a disgruntled employee, severe hardware problems, or (un)natural disasters.

Nex7

RAID only protects against data loss from physical disk failure. RAIDZ2 provides RAID6-like capability, except that it is implemented inside ZFS at the block level rather than by a separate controller operating on whole disks. As such, any protection offered by RAIDZ2 automatically extends to the guests, since their virtual disks are just files (or zvols) stored in the pool.
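For completeness, a RAIDZ2 pool of ten disks like the one described in the question would be created roughly like this (pool and device names are illustrative; any two of the ten disks can fail without data loss):

    zpool create tank raidz2 c0t0d0 c0t1d0 c0t2d0 c0t3d0 c0t4d0 \
                             c0t5d0 c0t6d0 c0t7d0 c0t8d0 c0t9d0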

Lawrence