15

We have a folder structure on our intranet which contains around 800,000 files divvied up into around 4,000 folders. We need to synchronize this to a small cluster of machines in our DMZs. The structure is very shallow (it never exceeds two levels deep).

Most of the files never change; each day there are a few thousand updated files and 1-2 thousand new files. The data is historical reporting data that we maintain after the source data has been purged (i.e. these are finalized reports whose source data is old enough that we archive and delete it). Synchronizing once per day is sufficient, provided it can happen in a reasonable time frame. Reports are generated overnight, and we sync first thing in the morning as a scheduled task.

Obviously, since so few of the files change on a regular basis, we can benefit greatly from incremental copying. We have tried rsync, but it can take as long as eight to twelve hours just to complete the "building file list" operation. It's clear that we are rapidly outgrowing what rsync is capable of (a 12-hour time frame is much too long).

We had been using another tool called RepliWeb to synchronize the structures, and it can do an incremental transfer in around 45 minutes. However, it seems we've exceeded its limits: it has started reporting files as deleted when they are not (perhaps some internal memory structure has been exhausted; we're not sure).

Has anyone else run into a large scale synchronization project of this sort? Is there something designed to handle massive file structures like this for synchronization?

Dave Cheney
MightyE
  • Have you tried splitting up the work over several instances of rsync running at the same time? I don't have a real good picture of the directory structure but you could split it up by directory name or file name. – Clutch Feb 23 '10 at 20:18
  • We had thought about that, but with such a flat structure, it's hard to find good dividing lines on which to split up the work. It's complicated by the fact that the folders are for the most part very similarly named (there is a naming convention which makes most of the folders start with the same initial set of 6 characters). – MightyE Feb 23 '10 at 21:09
  • Did you ever find a good solution, Dave? I'm considering lsyncd for a dir with 65535 sub-dirs, each of which *could* have 65^16 files. – Mike Diehn Sep 24 '14 at 21:46
  • @MikeDiehn I never did find a tool I was totally happy with here. We got that proprietary RepliWeb tool to fix the bug where it saw files as deletes which were not; it was an overflowed internal structure. I left that job years ago; I assume they're still using it. For your purposes, if your directories are reasonably distributed, you could go with something like Ryan's solution. It won't notice top-level deletes, but 65535 subdirs suggests to me that you probably don't have those. – MightyE Oct 26 '14 at 15:21

5 Answers

9

If you can trust the filesystem last-modified timestamps, you can speed things up by combining Rsync with the UNIX/Linux 'find' utility. 'find' can assemble a list of all files that show last-modified times within the past day, and then pipe ONLY that shortened list of files/directories to Rsync. This is much faster than having Rsync compare the metadata of every single file on the sender against the remote server.

In short, the following command will execute Rsync ONLY on the list of files and directories that have changed in the last 24 hours: (Rsync will NOT bother to check any other files/directories.)

find /local/data/path/ -mindepth 1 -ctime -1 -print0 | xargs -0 -n 1 -I {} -- rsync -a {} remote.host:/remote/data/path/.
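
A possible variant (my own sketch, not part of the original answer): feed the whole changed-file list to a single rsync via --files-from, which preserves the relative directory structure and avoids launching one rsync process and connection per file. This assumes GNU find and an rsync new enough to support --files-from/--from0, and it reuses the same placeholder paths:

cd /local/data/path/ &&
find . -mindepth 1 -ctime -1 -print0 |
  rsync -a --files-from=- --from0 . remote.host:/remote/data/path/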

In case you're not familiar with the 'find' command, it recurses through a specific directory subtree, looking for files and/or directories that meet whatever criteria you specify. For example, this command:

find . -name '.svn' -type d -ctime -1 -print

will start in the current directory (".") and recurse through all sub-directories, looking for:

  • any directories ("-type d"),
  • named ".svn" ("-name '.svn'"),
  • with metadata modified in the last 24 hours ("-ctime -1").

It prints the full path name ("-print") of anything matching those criteria on the standard output. The options '-name', '-type', and '-ctime' are called "tests", and the option '-print' is called an "action". The man page for 'find' has a complete list of tests and actions.

If you want to be really clever, you can use the 'find' command's '-cnewer' test, instead of '-ctime', to make this process more fault-tolerant and flexible. '-cnewer' tests whether each file/directory in the tree has had its metadata modified more recently than some reference file. Use 'touch' to create the NEXT run's reference file at the beginning of each run, right before the 'find... | rsync...' command executes. Here's the basic implementation:

#!/bin/sh
# Reference file left by the previous run; its timestamp marks the last sync.
curr_ref_file=`ls /var/run/last_rsync_run.*`
# Create the NEXT run's reference file before scanning, so anything modified
# while this sync is running gets picked up on the next run.
next_ref_file="/var/run/last_rsync_run.$RANDOM"
touch "$next_ref_file"
# Transfer only files/directories whose metadata changed since the last run.
find /local/data/path/ -mindepth 1 -cnewer "$curr_ref_file" -print0 | xargs -0 -n 1 -I {} -- rsync -a {} remote.host:/remote/data/path/.
rm -f "$curr_ref_file"

This script automatically knows when it was last run, and it only transfers files modified since the last run. While this is more complicated, it protects you against situations where you might have missed running the job for more than 24 hours, due to downtime or some other error.
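
One caveat (my note, not part of the original answer): on the very first run there is no /var/run/last_rsync_run.* file yet, so the 'ls' fails and 'find -cnewer' has no reference point. Seeding an initial reference file once, before the first scheduled run, avoids that:

touch /var/run/last_rsync_run.0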

Ryan B. Lynch
  • This is an extremely clever solution! I'm thinking you mean to `touch $next_ref_file` at the end? It does leave us without the ability to cope with deleted paths though (even these static archival reports eventually get old enough that they are archived and deleted). That might not be a show stopper though. – MightyE Feb 24 '10 at 15:15
  • I am finding though that even just `find . -ctime 0` is pretty slow on this directory structure (still waiting on it to complete to report its time). That actually disheartens me a bit because it seems like this might be a pretty low-level operation which probably sets the bar for the fastest we could expect this job to complete. It may be the case that disk I/O is the limiting factor here. – MightyE Feb 24 '10 at 15:22
  • As for that scriptlet, yes, I made a mistake. I meant run 'touch' on 'next_ref_file' (NOT 'curr_ref_file') right before running the 'find... | rsync...' command. (I'll fix my answer.) – Ryan B. Lynch Feb 24 '10 at 18:01
  • As for the slow 'find' command: What kind of filesystem are you using? If you're using Ext3, you might want to consider two FS tweaks: 1) Run 'tune2fs -O dir_index <device>' to enable Ext3's 'dir_index' feature, which speeds up access to dirs with large file counts. 2) Run 'mount -o remount,noatime,nodiratime <mount point>' to turn off access-time updates, which speeds up reading generally. 'dumpe2fs -h <device> | grep dir_index' tells you whether 'dir_index' is already enabled (on some distros it's the default), and 'mount | grep <mount point>' tells you about access-time updates. – Ryan B. Lynch Feb 24 '10 at 18:17
  • Sadly it's NTFS - Windows 2003 Server using Cygwin for the find command. I will remember those tuning options (excellent advice) for ext3 in case we ever run into something similar on one of our Debian clusters. – MightyE Feb 25 '10 at 23:20
8

Try unison. It was specifically designed to solve this problem by keeping the change lists ("building file list") locally on each server, which speeds up the time needed to calculate the delta and reduces the amount that is sent across the wire afterwards.
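
As a rough illustration (my sketch, not from this answer; the host and paths are placeholders), a unison profile might look like the following:

# Hypothetical ~/.unison/reports.prf
root = /local/data/path
root = ssh://remote.host//remote/data/path
# run without prompts and propagate modification times
batch = true
times = true

With that in place, the scheduled nightly run is a single non-interactive command:

unison reports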

Dave Cheney
  • I'm giving Unison a try. It's been running for about 2 hours now on the "Looking for changes" stage, and based on the files it's currently working on, it looks like it's about half way done (so maybe 4 hours total before transfer starts). It's looking like it will be better than rsync, but still outside of our desired operational window. – MightyE Feb 24 '10 at 14:20
  • The first time you create an index on both sides, the rebuild times are similar to rsync, as it has to hash each file. Once this is done, unison uses the last modified time of the directory to identify when a file has changed, and only has to scan that file for changes. – Dave Cheney Feb 24 '10 at 14:31
  • Sadly I was the victim of an over-zealous Operations administrator who force-ended my session before the catalog was done being built (we limit the number of simultaneous log-ons to production servers). I lost the progress it had made on building the initial catalog, so I have to start over again. I'll let you know how it goes. – MightyE Feb 24 '10 at 17:41
  • It takes about 2 hours now that the initial catalog is built to scan for changes. I'm pretty surprised how much RAM Unison is using for this. For our file collection, the source server is using 635M, and the remote client is using 366M. To synchronize several machines in a cluster would be a pretty hefty footprint, particularly for the source server! – MightyE Feb 25 '10 at 23:15
  • Are you able to structure your data in a way that makes it easy to identify the data that has changed recently? I.e., storing it in year/month/day/... format? – Dave Cheney Feb 26 '10 at 02:19
3

csync2 (http://oss.linbit.com/csync2/) is designed for this sort of thing; I'd give that a try.
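
csync2 tracks file state in a local database on each host, so after the initial run it only transfers what has changed since the last pass. As a hypothetical sketch (host names, key path, and include path are placeholders, not from this answer), the configuration might look like:

# /etc/csync2.cfg
group reports
{
    host intranet-host dmz-host1 dmz-host2;
    key /etc/csync2.key_reports;
    include /local/data/path;
}

Generate the shared key once with 'csync2 -k /etc/csync2.key_reports', distribute the config and key to every host in the group, then run 'csync2 -xv' on the source host to check for and push changes.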

Justin
2

If you're using the -z switch on rsync, try running without it. For some reason I've seen this speed up even the initial enumeration of files.
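
For illustration (my sketch, with placeholder paths; --whole-file is an extra suggestion for fast internal links, not part of this answer), the same copy without compression might look like:

rsync -a --whole-file /local/data/path/ remote.host:/remote/data/path/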

Chris Thorpe
  • We have tried with and without the -z flag. It did not seem to have an impact on the "building file list" execution duration. – MightyE Feb 24 '10 at 14:22
2

Taking the -z switch (compression) out of the rsync command made the "receiving file list" stage go much faster; we had about 500 GB to transfer. Before, with the -z switch, it took a day.

ryand32