I have 3.5TB of data (file backups) in AWS Glacier. I would like AWS to ship me a retrieval drive/appliance with this data because I don't think the download would successfully complete. I attempted to do this with Snowball but couldn't because Snowball would only let me select S3 buckets.

Is there a way to select a Glacier archive with Snowball, is there another AWS disk retrieval service I should use, or is there another process that is commonly used in this situation such as a client that can reliably download the Glacier archive over several days?

Michael

1 Answer

You can't export data directly from Glacier to a disk.

S3 Glacier Storage Class

If your data was uploaded to S3 and then moved to the Glacier storage class, you initiate a restore request, which makes the objects temporarily available in S3 again. You can then use Import/Export or Snowball to export them to a drive.

Once it's in S3 you can use any of the available S3 tools to download the files. If you have a 100Mbps internet connection and can sustain 80Mbps, the download takes around 4 to 4.5 days, which is probably faster than a Snowball. You can potentially use S3 Transfer Acceleration if your S3 region is distant from your location, but it costs more than a standard S3 transfer.
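As a rough check of that estimate, the transfer time can be worked out directly. This is only a sketch: the function name is made up for illustration, and whether "3.5TB" means decimal terabytes or binary tebibytes is an assumption that shifts the result slightly.

```python
def transfer_days(size_bytes: float, mbps: float) -> float:
    """Days needed to move size_bytes at a sustained rate of mbps (megabits/s)."""
    bits = size_bytes * 8
    seconds = bits / (mbps * 1_000_000)
    return seconds / 86_400


# Assumption: the question does not say whether 3.5TB is decimal or binary,
# so both readings are shown.
decimal_tb = 3.5 * 1000**4
binary_tib = 3.5 * 1024**4

print(f"decimal TB at 80Mbps: {transfer_days(decimal_tb, 80):.2f} days")
print(f"binary TiB at 80Mbps: {transfer_days(binary_tib, 80):.2f} days")
```

Both readings land between four and four and a half days, assuming the link really does sustain 80Mbps for the whole transfer.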

Retrieving directly from Glacier

I can't find any simple way of getting data from Glacier to a disk without downloading it manually. The Glacier documentation says you initiate a retrieval request and, once it finishes, you have at least 24 hours to download the data.

The only way I can think of to get the data onto a Snowball is to:

  • Initiate the Glacier retrieval request and wait for the completion notification.
  • Spin up an EC2 instance. Larger instances have higher network bandwidth, and enhanced networking will help. An st1 throughput-optimized volume might be cheaper than a gp2 SSD, but if it's only for a day or two it doesn't matter much. You'd probably still want to boot from an SSD, though having two volumes is a bit more work.
  • Download the data from Glacier to the EC2 file system.
  • Upload the data from the EC2 file system to S3.
  • Request a Snowball export from that S3 bucket.
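The first step in that list could be sketched with boto3 roughly as follows. The helper names, the default tier, and the optional SNS topic are assumptions for illustration; `initiate_job` is the boto3 Glacier client call, and the vault name and archive ID have to come from your own inventory.

```python
def archive_retrieval_params(archive_id, tier="Bulk", sns_topic_arn=None):
    """Build the jobParameters dict for a Glacier archive-retrieval job."""
    if tier not in ("Expedited", "Standard", "Bulk"):
        raise ValueError(f"unknown retrieval tier: {tier}")
    params = {"Type": "archive-retrieval", "ArchiveId": archive_id, "Tier": tier}
    if sns_topic_arn:
        # Glacier publishes to this topic when the job completes.
        params["SNSTopic"] = sns_topic_arn
    return params


def start_retrieval(vault_name, archive_id, **kwargs):
    """Start the retrieval job and return its job ID.

    Needs boto3 and AWS credentials; not exercised in this sketch.
    """
    import boto3  # imported here so the helper above works without boto3

    glacier = boto3.client("glacier")
    response = glacier.initiate_job(
        vaultName=vault_name,
        jobParameters=archive_retrieval_params(archive_id, **kwargs),
    )
    return response["jobId"]
```

The returned job ID is what the later download step needs, so it is worth writing down somewhere that outlives the EC2 instance.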

This might mean paying double the bandwidth charges. You could also use the new file-based S3 Storage Gateway, but you'd have to set it up. You could also use EFS, but it's expensive. Finally, it might be possible to mount S3 as a drive using something like s3fs, but I have no experience with that.

If your download from Glacier fails for any reason, you have to start it again. For a single large 3.5TB archive this could be a problem. Range retrievals can help, but if it's one large file you'd have to stitch the pieces back together.
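One way to soften the restart problem is to download the staged job output in byte ranges and append each piece to a local file, so an interrupted run resumes where it stopped. The sketch below is an illustration under assumptions, not a tested Glacier client: `fetch_range` is a hypothetical callable you supply, and the chunk size and retry count are arbitrary.

```python
import os

CHUNK = 64 * 1024 * 1024  # 64 MiB per request; an arbitrary choice


def byte_ranges(total_size, start=0, chunk=CHUNK):
    """Yield inclusive (first, last) byte offsets covering start..total_size-1."""
    while start < total_size:
        last = min(start + chunk, total_size) - 1
        yield start, last
        start = last + 1


def resumable_download(fetch_range, total_size, path, chunk=CHUNK, retries=5):
    """Append missing chunks to path, resuming from what is already on disk.

    fetch_range(first, last) must return exactly the bytes first..last inclusive.
    """
    done = os.path.getsize(path) if os.path.exists(path) else 0
    with open(path, "ab") as out:
        for first, last in byte_ranges(total_size, done, chunk):
            for attempt in range(1, retries + 1):
                try:
                    data = fetch_range(first, last)
                    if len(data) != last - first + 1:
                        raise IOError("short read")
                    break
                except Exception:
                    # Retry this chunk only; give up after the last attempt.
                    if attempt == retries:
                        raise
            out.write(data)
            out.flush()
```

With boto3, `fetch_range` could plausibly wrap `get_job_output(vaultName=..., jobId=..., range=f"bytes={first}-{last}")` and read the returned `body`. Because chunks are appended in order, no separate stitching step is needed, though you would still want to verify a checksum of the finished file.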

I assume that downloading from Glacier to EC2 will be much faster and more reliable than directly to your PC.

Recommendation

It's difficult to make a single recommendation without more information, particularly around connection speed, reliability, and whether the Glacier download is one file or many.

To get the files quickly, you're probably best off just downloading them from Glacier, ideally with range retrievals.

To be safe, download to EC2, upload to S3, then download from S3. S3 supports parallel downloads, so it should use all of your available bandwidth.

Retrieval pricing has been simplified from the previous model. It's between $0.0025/GB (bulk) and $0.01/GB (standard), plus data transfer charges.
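For a sense of scale, here is the retrieval fee for the 3.5TB in question at those two rates. It is a back-of-the-envelope sketch: mapping the two figures to the Bulk and Standard tiers, and counting 3.5TB as 3584GB, are assumptions, and data transfer out is not included.

```python
# Per-GB retrieval prices quoted above (USD). The tier labels are an
# assumption about which figure belongs to which tier.
PRICE_PER_GB = {"bulk": 0.0025, "standard": 0.01}


def retrieval_cost(size_gb, tier):
    """Retrieval fee in USD, excluding data transfer charges."""
    return size_gb * PRICE_PER_GB[tier]


size_gb = 3.5 * 1024  # assumption: 3.5TB counted as 3584GB
for tier in PRICE_PER_GB:
    print(f"{tier}: ${retrieval_cost(size_gb, tier):.2f}")
```

Either way the retrieval fee is tens of dollars at most, so the data transfer charges are likely to be the larger part of the bill.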

Tim
  • It looks like you are assuming that the data is stored in S3 using the `GLACIER` storage class. The mention in the question of an "archive" -- a native Glacier term, not one used by S3 -- suggests that this data was stored directly in Glacier, the S3 tag on the question notwithstanding. – Michael - sqlbot Apr 27 '17 at 01:16
  • Thanks @Michael-sqlbot. I had the mistaken impression that all glacier retrievals went via S3. I've done some reading and updated my answer. – Tim Apr 27 '17 at 01:36
  • Glacier retrieval prices [were completely restructured in late 2016](https://aws.amazon.com/blogs/aws/aws-storage-update-s3-glacier-price-reductions/). The cost, particularly for bulk retrievals, is now almost negligible. Also, a range retrieval isn't what it sounds like. Glacier has a staging area from which you download (optionally by byte ranges, which is *not* a range retrieval). A range retrieval means only retrieving a specific byte range from an archive into the staging area. In the old structure, it wasn't downloading that cost so much, it was the retrieval into staging for download. – Michael - sqlbot Apr 27 '17 at 02:15
  • Thanks again. I know what the range retrieval is but wasn't very clear, hopefully clarified now. – Tim Apr 27 '17 at 04:24