
I'm looking to back up various directories and files from a Linux server to AWS Glacier. I'm trying to work out the details of how to manage this.

Incremental Backups

I want to upload files incrementally. So essentially, if a file hasn't changed, I don't want to upload it again to Glacier if it already exists on there. I think I have this part figured out. Because you can't get instant lists of the archives in your Glacier vault, I'll keep a local database of uploaded files, in order to be able to tell what exists in the vault and what doesn't. This will allow me to do incremental backups (only uploading missing or changed files).
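
For illustration, the rough shape I have in mind with the PHP SDK is something like the following. The vault name, paths, and table layout here are just placeholders I'm sketching with, not a finished script:

    <?php
    require 'vendor/autoload.php';

    use Aws\Glacier\GlacierClient;

    // Local inventory: one row per archive actually uploaded to the vault.
    $db = new PDO('sqlite:/var/backups/glacier-inventory.db');
    $db->exec('CREATE TABLE IF NOT EXISTS archives (
        archive_id  TEXT PRIMARY KEY,
        path        TEXT NOT NULL,
        sha256      TEXT NOT NULL,
        uploaded_at INTEGER NOT NULL
    )');

    $glacier = new GlacierClient(['region' => 'us-east-1', 'version' => '2012-06-01']);

    $latest = $db->prepare(
        'SELECT sha256 FROM archives WHERE path = ? ORDER BY uploaded_at DESC LIMIT 1'
    );
    $record = $db->prepare(
        'INSERT INTO archives (archive_id, path, sha256, uploaded_at) VALUES (?, ?, ?, ?)'
    );

    $files = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator('/srv/data', FilesystemIterator::SKIP_DOTS)
    );

    foreach ($files as $file) {
        if (!$file->isFile()) {
            continue;
        }
        $path = $file->getPathname();
        $hash = hash_file('sha256', $path);

        // Compare against the most recent upload of this path.
        $latest->execute([$path]);
        if ($latest->fetchColumn() === $hash) {
            continue; // unchanged since the last backup, skip it
        }

        // New or changed file: upload it as its own archive and record the ID.
        $result = $glacier->uploadArchive([
            'vaultName'          => 'server-backups',
            'archiveDescription' => $path,
            'body'               => fopen($path, 'r'),
        ]);
        $record->execute([$result['archiveId'], $path, $hash, time()]);
    }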

Can't Overwrite Files?

According to the Amazon Glacier FAQ (http://aws.amazon.com/glacier/faqs/):

Archives stored in Amazon Glacier are immutable, i.e. archives can be uploaded and deleted but cannot be edited or overwritten.

So if I upload a file/archive and the file later changes locally, how does Glacier deal with this the next time I do a backup, since it can't overwrite the file with a new version?

Deleting Old Data

AWS charges $0.03 per GB to delete archives that are less than 3 months old. Since I am doing a backup of a local server, I want to delete archives that no longer exist locally. What is the best way to organize this? Use the locally stored archive inventory to determine what data doesn't exist anymore and if it's > 3 months old, delete it from Glacier? That seems straightforward but is there a better approach to this?

Individual files vs. TAR/ZIP files

You can upload either individual files as archives or be more efficient by grouping your files into TAR or ZIP files before uploading. The idea of TAR/ZIP files is appealing because it keeps things simpler and incurs smaller storage fees, but I'm wondering how I would deal with incremental uploads. If a 20 MB zip file is uploaded that contains 10,000 files, and one of those files is changed locally, do I need to upload another 20 MB zip file? Now I'm required to eat the cost of storing 2 copies of almost everything in those zip files... Also, how would I deal with deleting things in a ZIP file that don't exist locally anymore? Since I don't want to delete the whole zip file, now I'm incurring fees to store files that don't exist anymore.

Maybe I'm overthinking all of this. What are the most straightforward ways to approach these questions?

I don't know if it matters or not, but I'm using the PHP SDK for this backup script. Also, I don't want to upload to an S3 bucket first and then back up the bucket to Glacier, since I would then have to pay for S3 storage and transfer fees as well.

Jake Wilson
  • Sounds like you want to have your cake and eat it too :) Glacier does not sound like the right tool at all. You want to store files that will mutate, and pay a super cheap storage price. Why not just take EBS volume snapshots? They are incremental by default. – alexfvolk Dec 14 '19 at 03:25

3 Answers


So if I upload a file/archive and the file later changes locally, how does Glacier deal with this the next time I do a backup, since it can't overwrite the file with a new version?

Per the Glacier FAQ:

You store data in Amazon Glacier as an archive. Each archive is assigned a unique archive ID that can later be used to retrieve the data. An archive can represent a single file or you may choose to combine several files to be uploaded as a single archive. You upload archives into vaults. Vaults are collections of archives that you use to organize your data.

So what this means is that each archive you upload is assigned a unique ID. Upload the same file twice and each copy of the file gets its own ID. This gives you the ability to restore previous versions of the file if desired.
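
As a rough illustration with the PHP SDK (the vault name and file path here are placeholders, not something from your setup), uploading the same file on two different runs simply produces two independent archives:

    <?php
    require 'vendor/autoload.php';

    use Aws\Glacier\GlacierClient;

    $glacier = new GlacierClient(['region' => 'us-east-1', 'version' => '2012-06-01']);

    // First backup run: upload the file as a new archive.
    $v1 = $glacier->uploadArchive([
        'vaultName'          => 'server-backups',
        'archiveDescription' => '/etc/nginx/nginx.conf @ first run',
        'body'               => fopen('/etc/nginx/nginx.conf', 'r'),
    ]);

    // A later run, after the file has changed: this is simply another archive.
    $v2 = $glacier->uploadArchive([
        'vaultName'          => 'server-backups',
        'archiveDescription' => '/etc/nginx/nginx.conf @ second run',
        'body'               => fopen('/etc/nginx/nginx.conf', 'r'),
    ]);

    // Two different archive IDs for the same logical file; your local inventory
    // is what maps path + version to whichever ID you later restore or delete.
    echo $v1['archiveId'], PHP_EOL;
    echo $v2['archiveId'], PHP_EOL;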

Use the locally stored archive inventory to determine what data doesn't exist anymore and if it's > 3 months old, delete it from Glacier? That seems straightforward but is there a better approach to this?

To avoid the surcharge for deleting data less than 3 months old, this is likely the best approach. But it won't just be the data that no longer exists that you need to track and delete. As mentioned above, any time a file changes and you re-upload it to Glacier, you'll get a new ID for the file. You'll eventually want to delete the older versions of the file as well, assuming you don't want the ability to restore to those older versions.
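
A minimal pruning sketch along those lines, assuming your local inventory keeps one row per uploaded archive (archive ID, path, and upload time; the names and schema here are placeholders):

    <?php
    require 'vendor/autoload.php';

    use Aws\Glacier\GlacierClient;

    $glacier = new GlacierClient(['region' => 'us-east-1', 'version' => '2012-06-01']);
    $db      = new PDO('sqlite:/var/backups/glacier-inventory.db');

    // Roughly the 3-month window during which deletion incurs the surcharge.
    $cutoff = time() - 90 * 24 * 3600;

    $rows = $db->query(
        'SELECT archive_id, path, uploaded_at FROM archives ORDER BY path, uploaded_at DESC'
    )->fetchAll(PDO::FETCH_ASSOC);

    $drop = $db->prepare('DELETE FROM archives WHERE archive_id = ?');
    $seen = [];

    foreach ($rows as $row) {
        $isNewest = !isset($seen[$row['path']]); // first row per path = newest upload
        $seen[$row['path']] = true;

        // Keep the newest version of any file that still exists locally.
        if ($isNewest && file_exists($row['path'])) {
            continue;
        }
        // Anything younger than ~3 months would trigger the early-deletion fee; wait.
        if ($row['uploaded_at'] > $cutoff) {
            continue;
        }
        $glacier->deleteArchive([
            'vaultName' => 'server-backups',
            'archiveId' => $row['archive_id'],
        ]);
        $drop->execute([$row['archive_id']]);
    }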

If a 20 MB zip file is uploaded that contains 10,000 files, and one of those files is changed locally, do I need to upload another 20 MB zip file? Now I'm required to eat the cost of storing 2 copies of almost everything in those zip files... Also, how would I deal with deleting things in a ZIP file that don't exist locally anymore? Since I don't want to delete the whole zip file, now I'm incurring fees to store files that don't exist anymore.

That's the tradeoff you really have to decide for yourself. Do you tar/zip everything and then have to track those files and everything in them, or is it worth it to you to upload files individually so you can purge them individually as they're no longer needed?

A couple other approaches you might consider:

  • Have two or more tar/zip archives, one that contains files that are highly unlikely to change (like system files) and the other(s) containing configuration files and other things that are more likely to change over time.
  • Don't bother with tracking individual files and back everything up in a single tar/zip archive that gets uploaded to Glacier. As each archive reaches the 3-month point (or possibly even later) just delete it. That gives you a very easy way to track & restore from a given point in time.

Having said all that, however, Glacier just may not be the best approach for your needs. Glacier is really meant for data archiving, which is different from just backing up servers. If you just want to do incremental backups of a server then using S3 instead of Glacier might be a better approach. Using a tool like Duplicity or rdiff-backup (in conjunction with something like s3fs) would give you the ability to take incremental backups to an S3 bucket and manage them very easily. I've used rdiff-backup on a few Linux systems over the years and found it worked quite nicely.

Bruce P

Here is a command-line tool for *nix which supports uploading only modified files, replacing locally modified files, and deleting locally removed files from Glacier: https://github.com/vsespb/mt-aws-glacier

vsespb

As an alternative, you could use something like Duplicity, then upload the archives it produces.

This has a few benefits:

  • Duplicity does incremental backups, so only the changed files are captured in the backup set
  • Duplicity can deal with file changes, so if only a small part of a file is modified, in theory, only the change is uploaded
  • Your backups are encrypted, if you're the paranoid type

The easiest way to use Duplicity with Glacier is:

  • Back up to a local directory somewhere (and keep this backup). Duplicity needs access to its "manifest" file each time a backup is run so it can tell which files have changed.
  • Upload any new archives created by Duplicity to Glacier from your local backup. Use something like glacier-cmd for this; a rough sketch of this step is shown below.
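
If you would rather do that upload step with the PHP SDK the question mentions, a rough sketch might look like the following. The vault name, backup directory, and state file are assumptions for illustration, and it assumes encrypted Duplicity volumes ending in .gpg:

    <?php
    require 'vendor/autoload.php';

    use Aws\Glacier\GlacierClient;

    $backupDir = '/var/backups/duplicity';
    $stateFile = $backupDir . '/.glacier-uploaded.json';

    // Remember which Duplicity volumes have already been pushed to Glacier.
    $uploaded = file_exists($stateFile)
        ? json_decode(file_get_contents($stateFile), true)
        : [];

    $glacier = new GlacierClient(['region' => 'us-east-1', 'version' => '2012-06-01']);

    foreach (glob($backupDir . '/*.gpg') as $volume) {
        $name = basename($volume);
        if (isset($uploaded[$name])) {
            continue; // this volume was already sent to Glacier
        }
        $result = $glacier->uploadArchive([
            'vaultName'          => 'server-backups',
            'archiveDescription' => $name,
            'body'               => fopen($volume, 'r'),
        ]);
        // Record the archive ID right away so an interrupted run doesn't re-upload.
        $uploaded[$name] = $result['archiveId'];
        file_put_contents($stateFile, json_encode($uploaded, JSON_PRETTY_PRINT));
    }
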
pdey
  • Since you can't index or easily access Glacier files, how would Duplicity ever know if files have changed or not? – Jake Wilson Aug 03 '14 at 06:13
  • Duplicity uses its manifest file to tell when a file has changed. This is why it's easier if you keep a local copy of your backup too. You could also keep only the manifest files local, if you're willing to script in the extra complexity. – pdey Aug 03 '14 at 06:23
  • @JakeWilson The pricing of Glacier makes it important you have your own online copy of the files. You only restore from Glacier in case you lose all local hardware (e.g. in case of fire). – Mikko Rantalainen Jun 29 '22 at 15:47