
I have a collection of about 1.1 million files, totaling 2 TB, currently on a NAS on our LAN. We need to replicate this up to AWS to begin processing it, but changes made on the cloud side also need to sync back to our LAN.

So far, the lowest sync latency we've been able to get is about an hour or two. Mounting the local NAS on our EC2 instance and simply enumerating all files with `find [path] &> /dev/null` takes over an hour.

However, the files are organized into directories by order number, and once an order is complete the files are rarely, if ever, modified. Since the directory names contain the order numbers, they could potentially be used to find the most recent orders. I feel this could be used to our advantage, but I'm not sure how.

Bandwidth is not an issue (around 100 Mbps both ways), and latency from the office to our AWS region of choice is about 35 ms.

Is there a better way to handle this? We have the ability to run VMs locally on our LAN if need be.

Ben Yanke

2 Answers


Syncing across WAN links can be severely impacted by latency, especially when remote directory walking is involved. With a large number of files, it already makes a huge difference whether you enumerate on a local volume or on a network share.

Your best bet for two-way sync is a client-server approach that walks each side locally, as e.g. rsync can.
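
As a minimal sketch, rsync can run as a daemon on the LAN side so the EC2 instance pulls over the native rsync protocol and each end walks its own tree locally; the module name, paths, and hostname below are placeholders:

```
# On a Linux box next to the NAS (module name and paths are placeholders):
# /etc/rsyncd.conf
#   [orders]
#       path = /srv/orders
#       read only = false
rsync --daemon

# On the EC2 instance: pull over the native rsync protocol, so the source
# directory walk happens locally on the LAN side rather than over a WAN mount.
rsync -a rsync://nas.example.lan/orders/ /data/orders/
```

Note that rsync itself only syncs one direction per run; two-way use would still need a second pass in the opposite direction and care to avoid conflicting edits.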

If you're able to reduce the sync to one-way for certain folders and just replicate, you have a lot more options, e.g. copying based on the archive flag (Windows) or using tar through a remote pipe (Linux).
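
For the one-way case, a tar-over-SSH pipe is one option. A rough sketch, assuming SSH access from a Linux box on the LAN to the EC2 instance, with hypothetical paths and hostname:

```
# Stream the tree to EC2 without any remote directory walk.
# Destination directory must already exist; paths and host are placeholders.
tar -C /srv/orders -cf - . | ssh ec2-host 'tar -C /data/orders -xf -'
```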

Either way, you can also go by local timestamps ("what's new since last sync?").
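
A sketch of the timestamp idea, using a marker file to record the last sync; the file names and paths are assumptions:

```
# Record a reference point before scanning so changes made during the scan
# aren't missed on the next run. Paths are placeholders.
touch /var/tmp/last-sync.next
find /srv/orders -type f -newer /var/tmp/last-sync > /tmp/changed-files.txt
# ... transfer the listed files ...
mv /var/tmp/last-sync.next /var/tmp/last-sync
```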

Zac67
  • rsync still has to stat all the files, the latency of which adds up over a million of them. Note the hour long directory traversal mentioned. – John Mahowald Aug 09 '17 at 06:57
  • Are you sure you're doing it locally? On our slowest NAS (single-core Atom), `ll -R` for a folder with 1.4 million files just took less than 10 min. – Zac67 Aug 09 '17 at 08:50
  • I can do it a bit faster from a box on the same LAN as the fileserver, but that doesn't solve the problem, as I need to sync them to the EC2 box, where it does take 60+ minutes. Unless I'm misunderstanding your point. – Ben Yanke Aug 09 '17 at 14:11
  • The point is to _not_ remotely access the source directory but to access it directly on the file server and select files to copy from there - bypassing WAN, LAN and share access latencies increases the tree walking speed by several orders of magnitude. If you need to actually _pull_ the data to the EC2 instance, build a to-do list _locally_ where it takes mere minutes, send it over and process there. – Zac67 Aug 09 '17 at 14:19
  • Yes, I understand the structure of what needs to be done there. How do I concretely make that happen without learning C and writing my own tool? – Ben Yanke Aug 09 '17 at 14:30
  • I think rsync running in daemon mode on one side and using native rsync protocol can do it. Writing a small batch tool isn't really that hard. – Zac67 Aug 09 '17 at 14:43
  • It's my understanding that two way sync with rsync isn't well supported? – Ben Yanke Aug 09 '17 at 16:13
  • I was hoping you could narrow down the directories with changes somehow; you're not providing much information. Are both sides Linux, then? How about using the modification time to build a transfer list? `find (dir) -mmin -60` will produce a list of files changed within the last 60 minutes. – Zac67 Aug 09 '17 at 16:41
  • Currently the LAN side is on a Windows box, but I'll have to see what we can do. I do have a few ideas from our discussion. Honestly, it might mean just moving to a linux box on the LAN side and a lot of custom scripting. Not an easy problem to solve here! – Ben Yanke Aug 09 '17 at 18:23
  • With Windows, you could leverage the archive flag and `robocopy /m /xo`, which will look for changed files, reset the archive flag, and copy only when newer. – Zac67 Aug 09 '17 at 18:31
  • With a Linux destination you might need the `/fft` option as well. – Zac67 Aug 09 '17 at 18:38
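
Putting the approach from the comments above together (build the change list locally on the LAN side, then feed it to rsync so only those files cross the WAN) might look roughly like this; the hosts, paths, and the 60-minute window are assumptions:

```
# On a Linux box next to the NAS: list files modified in the last 60 minutes.
cd /srv/orders
find . -type f -mmin -60 > /tmp/to-send.txt

# Push only the listed files to the EC2 instance; rsync never has to walk
# the whole remote tree, so the per-file stat latency over the WAN is avoided.
rsync -a --files-from=/tmp/to-send.txt /srv/orders/ ec2-host:/data/orders/
```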

Perhaps snapshot the volumes and copy the entire block device in. Not incremental, but a 2 TB sequential copy should be faster than iterating over a million files.
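
A rough sketch of that block-level copy, assuming an LVM-backed volume on the LAN side and a spare EBS volume attached to the instance; device names, snapshot size, and host are placeholders:

```
# Take a consistent snapshot of the source volume, then stream it to a raw
# EBS device on the EC2 instance. Requires sufficient privileges on both ends;
# all device names and the host are placeholders.
lvcreate -s -n orders_snap -L 20G /dev/vg0/orders
dd if=/dev/vg0/orders_snap bs=64M status=progress | ssh ec2-host 'dd of=/dev/xvdf bs=64M'
lvremove -f /dev/vg0/orders_snap
```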

Or use a file system with built-in snapshot send and receive, like btrfs or ZFS.
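
For example, with ZFS on both ends, incremental snapshot replication could look like this; the pool/dataset names and host are assumptions:

```
# Initial full send of a snapshot, then incremental sends of only the
# blocks changed since the previous snapshot. Names and host are placeholders.
zfs snapshot tank/orders@base
zfs send tank/orders@base | ssh ec2-host 'zfs receive tank/orders'

# Later: send only the delta since @base.
zfs snapshot tank/orders@sync1
zfs send -i tank/orders@base tank/orders@sync1 | ssh ec2-host 'zfs receive tank/orders'
```

The incremental stream carries only changed blocks, which sidesteps the per-file stat problem entirely.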

John Mahowald
  • Unfortunately, I need two-way replication, not one. If I had only needed one, ZFS snapshots would be a slam-dunk. – Ben Yanke Aug 09 '17 at 13:46
  • Two-way is going to be tricky: either a fairly smart and efficient sync script, or a distributed object store of some kind that both sides can access. Actually, consider storing files to be processed in a temporary location, perhaps cloud object storage in S3, then downloading the final results in batches. – John Mahowald Aug 10 '17 at 12:06
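
A sketch of that S3 staging idea using the AWS CLI; the bucket name and paths are assumptions:

```
# Cloud side: write processing results into a staging prefix in S3.
aws s3 sync /data/orders/results/ s3://example-orders-staging/results/

# LAN side: periodically pull completed results back down in batches.
aws s3 sync s3://example-orders-staging/results/ /srv/orders/results/
```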