
I want to replicate in the region of 10 TB of data (lots of smallish files, low level of churn) across a WAN with minimal impact on the available infrastructure.

While I could simply use rsync, that means scanning for changes and comparing the local and remote data (disk I/O, network bandwidth and CPU costs). Although rsync does this efficiently, I wonder if there is a more efficient solution which can track changes as they happen and propagate them (preferably bidirectionally).

The storage itself is iSCSI on HP NAS devices. We have looked previously at using its built-in replication capabilities but found them to be slow and unreliable.

DRBD mirrors would require additional hardware at both ends, which would be rather expensive. I've also been bitten by DRBD replication failures in the past.

Would glusterfs be more efficient? Would it be really dumb to go with a 2 node setup? Is there a better solution?

symcbean

4 Answers


On the block level, the synchronization can be done with StarWind, which creates a mirrored disk on both ends. It can run on top of iSCSI LUNs, providing active-active storage, and requires no additional hardware. https://www.starwindsoftware.com/blog/storage-ha-on-the-cheap-fixing-synology-diskstation-flaky-performance-with-starwind-free-part-3-failover-duration

On the file level, lsyncd and rsync can mirror files between servers. These tools may require tweaking of their configuration files to ensure the file locking mechanism works as expected and no split-brain occurs. https://linoxide.com/tools/setup-lsyncd-sync-directories/
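For illustration, a minimal lsyncd configuration sketch along those lines — the paths, target host, and delay are assumptions for the example, not values from this thread:

```lua
-- Hypothetical lsyncd config; adjust paths and target to your environment.
settings {
    logfile    = "/var/log/lsyncd.log",
    statusFile = "/var/log/lsyncd.status",
}

sync {
    default.rsync,                    -- use the bundled rsync sync method
    source = "/srv/data",             -- local directory to watch
    target = "backup-host:/srv/data", -- remote rsync destination
    delay  = 15,                      -- batch changes for 15s before syncing
    rsync  = {
        archive  = true,              -- rsync -a
        compress = true,              -- rsync -z, useful over a WAN
    },
}
```

Note this gives one-directional mirroring per sync block; bidirectional setups need a block in each direction and care to avoid update loops.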

A.Newgate
  • You provided a link to a document that describes high availability relying on a very fast link between the nodes. Not what I was asking about. – symcbean Jul 27 '20 at 15:35
  • You can replicate with close-to-WAN latencies using StarWind, and lsyncd is 100% WAN friendly. – NISMO1968 Jul 31 '20 at 19:12

You could use lsyncd to keep files constantly in sync between systems. lsyncd installs inotify watches on the synced directories; whenever files change, it transfers the changes to the remote server using rsync.

Tero Kilkanen
  • Oooh - excitement. lsyncd looks like it fits the bill very well! – symcbean Jul 06 '20 at 19:29
  • 1
    For the record, after sizing this, I would have needed gigabytes of memory just to track the files involved. The impact on file I/O would have been horrendous. While lsyncd works very well at the low Gb level it does not scale much beyond that. – symcbean Jul 06 '21 at 20:49

You could use ionice to limit the I/O load and rsync's --bwlimit argument to limit network bandwidth. There are also some other methods: Rsync huge dataset of small files 5TB, +M small files

  • That's just going to slow it down, but thanks for the link. Sadly ZFS on top of iSCSI is even more difficult than on native disks. – symcbean Jul 06 '20 at 19:27

If you're willing to try something new, IPFS might be a great tool to experiment with.

https://ipfs.io/

Using a private IPFS Cluster might give you great results, depending on your file replication needs.

https://cluster.ipfs.io/

However, bear in mind that this is pretty new technology, although it is maturing quickly.

The Unix Janitor
  • Ceph has been around a lot longer and is far better documented - but does not fit my use case. – symcbean Jul 27 '20 at 15:37
  • As far as I know, IPFS has nothing to do with Ceph... But I hope you find something to fit your use case. Given your constraints, it may be that you need to change the network architecture: bite the bullet and invest in some 10G switches and fibre, or change to a provider that can give you these things. Good luck. – The Unix Janitor Jul 28 '20 at 11:43