ZFS is an incredible filesystem and solves many of my local and shared data storage needs.
While I like the idea of clustered ZFS wherever possible, sometimes it's not practical, or I need geographical separation of storage nodes.
One of my use cases is high-performance replicated storage on Linux application servers. For example, I support a legacy software product that benefits from low-latency NVMe SSDs for its data. The application has an application-level mirroring option that can replicate to a secondary server, but it's often inaccurate and only provides a 10-minute RPO.
I've solved this problem by having a secondary server (also running ZFS on similar or dissimilar hardware) that can be local, remote or both. By combining the three utilities detailed below, I've crafted a replication solution that gives me continuous replication, deep snapshot retention and flexible failover options.
zfs-auto-snapshot - https://github.com/zfsonlinux/zfs-auto-snapshot
A handy tool that enables periodic ZFS filesystem-level snapshots. I typically run the following schedule on production volumes:
# /etc/cron.d/zfs-auto-snapshot
PATH="/usr/bin:/bin:/usr/sbin:/sbin"
*/5 * * * * root /sbin/zfs-auto-snapshot -q -g --label=frequent --keep=24 //
00 * * * * root /sbin/zfs-auto-snapshot -q -g --label=hourly --keep=24 //
59 23 * * * root /sbin/zfs-auto-snapshot -q -g --label=daily --keep=14 //
59 23 * * 0 root /sbin/zfs-auto-snapshot -q -g --label=weekly --keep=4 //
00 00 1 * * root /sbin/zfs-auto-snapshot -q -g --label=monthly --keep=4 //
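The // argument tells zfs-auto-snapshot to walk every pool, and individual datasets can opt out via the com.sun:auto-snapshot property, so scratch or temporary filesystems don't accumulate snapshots. For example (the dataset name here is just illustrative):

# exclude a scratch dataset from every schedule
zfs set com.sun:auto-snapshot=false vol1/scratch

# or exclude it from only the frequent schedule
zfs set com.sun:auto-snapshot:frequent=false vol1/scratch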
Syncoid (Sanoid) - https://github.com/jimsalterjrs/sanoid
This tool runs ad-hoc snapshot/replication of a ZFS filesystem to a secondary target. I only use the syncoid portion of the project.
Assuming server1 and server2, here's a simple command run from server2 to pull data from server1:
#!/bin/bash
/usr/local/bin/syncoid root@server1:vol1/data vol2/data
exit $?
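Because Monit (below) kicks this off every 15 seconds, it's worth wrapping the syncoid call so overlapping runs can't pile up behind a slow transfer. A minimal sketch of what the /usr/local/bin/run_storagesync.sh wrapper referenced later could look like, assuming flock from util-linux (the lock file path is arbitrary):

#!/bin/bash
# /usr/local/bin/run_storagesync.sh
# Pull replication from server1; flock -n skips the run if a previous
# transfer is still in flight, and -E 0 reports that skip as success so
# Monit doesn't alert on a harmless overlap.
exec /usr/bin/flock -n -E 0 /var/run/storagesync.lock \
    /usr/local/bin/syncoid root@server1:vol1/data vol2/data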
Monit - https://mmonit.com/monit/
Monit is an extremely flexible job scheduler and execution manager. By default, it works on a 30-second interval, but I modify the config to use a 15-second base time cycle.
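The cycle length is the set daemon value in monitrc, so the 15-second base cycle is a one-line change (the path shown is the usual Debian/Ubuntu location):

# /etc/monit/monitrc
set daemon 15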
An example config that runs the replication script above every 15 seconds (1 cycle):
check program storagesync with path /usr/local/bin/run_storagesync.sh
every 1 cycles
if status != 0 then alert
This is simple to automate and add via configuration management. By wrapping the execution of the snapshot/replication in Monit, you get centralized status, job control and alerting (email, SNMP, custom script).
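For the email side, Monit just needs a mail server and a default recipient defined globally; the alert action in the check above then fires on any nonzero exit. The relay and address here are placeholders:

# /etc/monit/monitrc - global mail settings used by the "alert" action
set mailserver localhost
set alert storage-admins@example.com

Swapping alert for exec "/path/to/handler.sh" in the check hands failures to a custom script instead.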
The result is servers with multiple months of monthly snapshots and many rollback points in between (https://pastebin.com/zuNzgi0G), plus a continuous rolling 15-second atomic replica:
# monit status
Program 'storagesync'
status Status ok
monitoring status Monitored
last started Wed, 05 Apr 2017 05:37:59
last exit value 0
data collected Wed, 05 Apr 2017 05:37:59
.
.
.
Program 'storagesync'
status Status ok
monitoring status Monitored
last started Wed, 05 Apr 2017 05:38:59
last exit value 0
data collected Wed, 05 Apr 2017 05:38:59
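To eyeball the retention on either side, a plain ZFS listing is enough (the dataset name matches the example above):

# twenty most recent snapshots on the replicated dataset
zfs list -t snapshot -o name,creation -s creation vol2/data | tail -20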