47

I understand how rsync works at a high level, but it has two sides. With S3 there is no daemon to speak of — well, there is, but it's basically just HTTP.

There look to be a few approaches.

s3rsync (but this just bolts rsync onto S3). Straightforward, but I'm not sure I want to depend on something 3rd-party. I wish S3 just supported rsync.

There are also some rsync 'clones' like duplicity that claim to support S3 without said bolt-on. But how can they do this? Are they keeping an index file locally? I'm not sure how that can be as efficient.

I obviously want to use s3 because it's cheap and reliable, but there are things that rsync is the tool for, like backing up a giant directory of images.

What are the options here? What do I lose by using duplicity + s3 instead of rsync + s3rsync + s3?

Jaimie Sirovich
  • S3 is cheap? That's news to me. Reliable? For sure, but not cheap. – EEAA Aug 18 '12 at 23:55
  • Well, s3 is $0.13/gb or less as you store more or want less redundancy. A quick search reveals http://www.evbackup.com/ for rsync storage. Far more expensive. What's cheaper and has some level of redundancy? – Jaimie Sirovich Aug 19 '12 at 02:11
  • If *I* were to design rsync, it would support plugins so that new protocols (e.g. s3://) could be added. However, at present, rsync doesn't support this, so I don't believe rsync can be used directly for backing up to S3. – Edward Falk Jul 01 '19 at 02:56
  • The next issue is that I don't think S3 stores metadata such as ownership or permissions, so using e.g. "aws s3 sync" to do backups will work but probably isn't suitable for a full-blown backup of a Unix filesystem, since too much data would be lost on restore. I also think symlinks, hardlinks, and other special files would be lost. – Edward Falk Jul 01 '19 at 02:58

7 Answers

44

Since this question was last answered, there is a new AWS command line tool, aws.

It can sync, rsync-like, between local storage and s3. Example usage:

aws s3 sync s3://mybucket /some/local/dir/
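
Sync works in the other direction too; a hedged sketch (the bucket name, paths, and exclude pattern are placeholders, and --delete should be used with care):

# Push local changes up to S3, removing remote objects that no longer exist
# locally and skipping temp files; --delete and --exclude are standard flags.
aws s3 sync /some/local/dir/ s3://mybucket --delete --exclude "*.tmp"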

If your system's Python environment is set up properly, you can install the AWS client using pip:

pip install awscli
Dan Pritts
  • In my experience, this uploads everything, not just a delta of changes. For example, I was pushing a static site to a dev server with `rsync`, and it took an average of 1 second, with just the changes going out over my slow connection. `aws s3 sync`, on the other hand, took about 5 minutes, retransferring each and every file. – ryebread Mar 16 '16 at 16:43
  • I believe you that it doesn't work, but the docs say "A local file will require uploading if the size of the local file is different than the size of the s3 object, the last modified time of the local file is newer than the last modified time of the s3 object, or the local file does not exist under the specified bucket and prefix." Make sure you have the latest version of aws-cli - if you can reproduce this, file a bug with them on github. They were responsive when I filed a bug a while ago. – Dan Pritts Mar 16 '16 at 18:52
  • The command should be: aws s3 sync /some/local/dir/ s3://mybucket – Carlo S Nov 07 '17 at 23:30
  • Carlos, I'm not sure what your point is. If you mean to suggest that my example command is wrong, we are both right. The s3 sync can work in either direction. – Dan Pritts Nov 16 '17 at 07:16
  • Late to the party, but here's what's happening: When *uploading* to S3, the quick-check rules apply (upload if size or date has changed). When *downloading*, there are no quick-check rules, and everything is downloaded unconditionally. – Edward Falk Jul 01 '19 at 02:53
17

The s3cmd tool has a great sync option. I use it to sync local backups, using something like:

s3cmd sync --skip-existing $BACKUPDIR/weekly/ s3://MYBACKUP/backup/mysql/

The --skip-existing option means it doesn't try to checksum-compare existing files; if a file with that name is already there, it just quickly skips it and moves on. There is also a --delete-removed option, which removes files that no longer exist locally, but I want to keep files on S3 even after I've cleaned them up locally, so I don't use it.
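
If you did want deletions to propagate, a hedged sketch of the same command with --delete-removed (use with care; bucket and paths are the same placeholders as above):

# Mirror local weekly backups to S3, also deleting remote copies of files
# that no longer exist locally.
s3cmd sync --delete-removed $BACKUPDIR/weekly/ s3://MYBACKUP/backup/mysql/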

Nic Cottrell
6

Don't want to tell anyone what to do, but may I wave a flag for duplicity, or another incremental backup solution? Syncing is all very well, but if you back up nightly, what happens if you don't notice the problem for two days? Answer: it's too late; your local files and your backup are a mirror of each other, and neither has the data you need. You really should consider incremental backups or snapshots so you can recover to a particular moment in time, and to do this efficiently you need incremental backups. And if losing your data would be an end-of-the-world scenario, keep copies at different providers too, because you never know: they could get lost, hacked, who knows.

I use duplicity with S3; it's fine but CPU-intensive, and it does incremental backups. In an emergency, when you want to restore a directory or a particular file as it was last Wednesday or last January, without restoring the other files on the same partition, you need incremental backups and a tool where you can request just the files you need.
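
As a rough sketch of that kind of point-in-time restore (the file path, restore target, and bucket URL are placeholders; --file-to-restore and --time are standard duplicity options, and the passphrase and AWS credentials are assumed to be in the environment):

# Restore a single file as it existed three days ago, without touching anything else.
duplicity restore --file-to-restore images/logo.png --time 3D s3://s3-eu-west-1.amazonaws.com/mybucket /tmp/logo.png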

I have a cron job that does a full backup every x months and an incremental otherwise, deletes anything older than x months to keep the S3 storage totals down, and finally runs a collection-status so I get mailed the status each morning. You need to keep an eye on it regularly so you notice when your backup isn't working.

It requires significant local temp space to keep the local signatures, so set up the temp dir carefully. The script below backs up /mnt, excluding various dirs inside /mnt. This is good for backing up data; for system partitions, use Amazon imaging or snapshot tools.

PHP script:

<?php

# Duplicity Backups

$exclude  = "--exclude /mnt/ephemeral ".
            "--exclude /mnt/logs ".
            "--exclude /mnt/service ".
            "--exclude /mnt/mail ".
            "--exclude /mnt/mysql ";

$key = "PASSPHRASE=securegpgpassphrase";

$tmp = "/mnt/mytempdir";

system("mkdir -p $tmp");

# Amazon

$aws = "AWS_ACCESS_KEY_ID=xxxxxx ".
       "AWS_SECRET_ACCESS_KEY=xxxxxx ";

$ops = "-v5 --tempdir=$tmp --archive-dir=$tmp --allow-source-mismatch --s3-european-buckets --s3-use-new-style --s3-use-rrs";
$target = " s3://s3-eu-west-1.amazonaws.com/mybucket";

# Clean + Backup

system("$key $aws /usr/bin/duplicity $ops --full-if-older-than 2M $exclude /mnt $target");
system("$key $aws /usr/bin/duplicity $ops remove-older-than 6M --force $target");
system("$key $aws /usr/bin/duplicity $ops cleanup --force --extra-clean $target");
system("$key $aws /usr/bin/duplicity $ops collection-status $target")
Jack
6

You can alternatively use the MinIO client, aka mc. The 'mc mirror' command will do the job.

$ mc mirror share/sharegain/ s3/MyS3Bucket/share/sharegain 
  • mc: minio client
  • share/sharegain: local directory
  • s3: Alias for https://s3.amazonaws.com
  • MyS3Bucket: My remote S3 bucket
  • share/sharegain: My object on s3

You can write a simple script as a cron job that keeps things in sync at a periodic interval.
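
A minimal crontab sketch, assuming mc is installed at /usr/local/bin/mc and a 15-minute interval (the path, schedule, and log file are all placeholders):

# Mirror the local directory to S3 every 15 minutes; --overwrite replaces
# remote objects that differ from the local copy.
*/15 * * * * /usr/local/bin/mc mirror --overwrite share/sharegain/ s3/MyS3Bucket/share/sharegain >> /var/log/mc-mirror.log 2>&1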

Hope it helps.

Atul
  • There's also a `-w` flag now, which will use `fsnotify` to watch for changes. It can easily be set up as a system service or similar. – alkar Oct 20 '16 at 23:56
3

S3 is a general purpose object storage system that provides enough flexibility for you to design how you want to use it.

I'm not sure from your question what issues with rsync (other than indexing), or with the '3rd party' tool, you've run into.

If you have a large, well-structured set of files, you can run multiple s3 syncs on your sub-folders.

The nice folks at Amazon also allow you to do an import/export from your portable hard drive for large file transfers to S3 or EBS -- http://aws.amazon.com/importexport/ -- which you can use for the first upload.

See Amazon s3 best practices here -- http://aws.amazon.com/articles/1904

As for different tools, try them and see what works best for you. Regarding pricing, there is reduced redundancy pricing if it suits your needs -- http://aws.amazon.com/s3/pricing/

General recommendation -- have a fast multicore CPU and good network pipe.

UPDATE: a note about checksumming on S3

S3 stores data as key-value pairs, and there is no concept of directories. s3sync verifies checksums (S3 has a mechanism to send a checksum as a header for verification -- the Content-MD5 header). The Data Integrity section of the best-practices link above covers this in detail. S3 allows you to send, verify, and retrieve checksums. There are plenty of folks doing incremental backups with duplicity. Even though there is no rsync running on S3, you can do checksums as I mentioned here.
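
As a hedged illustration of that kind of verification with the AWS CLI (bucket, key, and filename are placeholders; the returned ETag only matches the MD5 for single-part, unencrypted uploads):

# Compare a local MD5 against the object's ETag after upload.
md5sum backup.tar.gz
aws s3api put-object --bucket mybucket --key backups/backup.tar.gz --body backup.tar.gz
aws s3api head-object --bucket mybucket --key backups/backup.tar.gz   # inspect the ETag field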

rsync is a proven tool, and most modern tools either use the same algorithm, use the rsync library, or call rsync externally.

Chida
  • I don't see how this answers the question. I was asking how duplicity manages to do what rsync does without a daemon on the other side. It has no ability to even get a checksum, or maybe it does, but then how would it incrementally update the files? – Jaimie Sirovich Aug 19 '12 at 06:19
  • OK. So you're saying that Duplicity uses this hash from S3, but it also claims to work over FTP. FTP has no hashing mechanism. I tend to err on the safe side and use the 'proven' tools. Rsync is proven yes, but it won't do s3 backups without the s3 add-on service s3rsync. I'm a bit scared of duplicity, but it has wider protocol appeal if I can get some level of rsync-like functionality with s3 without said accessory service. I just don't get how _well_ it works (and possibly differently with various protocols). How the heck does it do FTP syncing ? :) – Jaimie Sirovich Aug 20 '12 at 16:16
  • @JaimieSirovich Test it and see. If you had, you'd have known Duplicity builds "manifest" files in less time than it took you to type all these comments about what it *might* be doing. – ceejayoz Dec 02 '14 at 15:47
3

I'm not sure if true rsync is a good fit for Amazon.

As I understand it, in the standard rsync algorithm the client computes hashes for each block of a file, the server computes hashes for its copy and sends those hashes to the client, and the client can then determine which blocks have changed and need uploading.

That causes two problems for Amazon: a lot of hashes have to be sent down over the internet, and it takes processing power to calculate all those hashes, which would increase Amazon's costs - which is probably why they leave it to third-party providers who can charge extra for that feature.

As for the clones, they are obviously storing the hashes somewhere, and where may vary depending on the clone. They could store the hashes as a separate object per file on Amazon, as a database stored on Amazon, or they may store them locally and remotely.

There are advantages and disadvantages either way. If the hashes are stored remotely in individual files, it can be costly to continually retrieve them. If the hashes are stored in a remote database, that database can become large and it can be costly to continually retrieve and update it. If the hashes are stored locally, this helps reduce costs but introduces other complications and problems.

(Of course, Amazon has other services, so it would be possible to keep such a database in one of Amazon's database offerings.)

As an example, I tried out one early rsync clone many years ago. It was not written to take account of Amazon's pricing structure and issued lots of HTTP GETs to retrieve the hash of each block; since Amazon charges for each GET, it meant that while the storage part of my bill fell sharply, the transfer part ballooned.

What do I lose by using duplicity + s3 instead of rsync + s3rsync + s3?

You lose the certainty that with rsync you know you are comparing source files against your actual backup files. With duplicity and the other clones, you are comparing your source files against a hash that was taken when the backup was performed. For example, it may be possible to access S3 directly and replace one of its files without recomputing the hash or updating the hash database.

sgmoore
0

After comparing the options mentioned in this thread, I decided to go for s3fs. It allows you to mount S3 as a local filesystem. You can then proceed and use rsync the way you already know it.

This is a good tutorial to get started: Amazon S3 with Rsync

The author previously used the mentioned s3sync, but then switched to s3fs. I like it because I also have other backup folders mounted locally via SSHFS.
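
A minimal sketch of that setup (the mount point, bucket name, and credentials file are assumptions; note the bandwidth caveat in the comments below):

# Mount the bucket (credentials in ~/.passwd-s3fs, chmod 600), then rsync into it.
s3fs mybucket /mnt/s3backup -o passwd_file=${HOME}/.passwd-s3fs
rsync -av --delete /home/me/images/ /mnt/s3backup/images/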

Hendrik
  • Danger, Will Robinson! This is really expensive as you're not getting any benefits of the rsync low-bandwidth communication --- s3fs will end up reading (and then writing, if it changes) the entire file, which means Amazon will bill you twice. Instead consider using an EC2 instance and using rsync remotely to that via ssh. Transfers to S3 from an EC2 instance are free, so all you pay for is rsync's low-bandwidth communication from your local machine to the EC2 instance. Running an EC2 micro instance on demand costs practically nothing. – David Given Oct 10 '13 at 11:05
  • This! There's a lot of bad advice out there for those that do not understand rsync and S3... – Mark Dec 17 '13 at 18:29
  • The one downside of this is that now you have a micro instance to manage. Trivial if you know how, but a barrier to entry for many. On the plus side, EC2-attached EBS storage is about half the price per byte of S3. – Dan Pritts Jul 29 '15 at 13:47
  • @DavidGiven What if I wrote directly to the mounted s3fs without using rsync and then managed longevity via lifecycle? – Forethinker Jul 11 '18 at 06:54