
I have an S3 bucket with millions of files, and I want to download all of them. Since I don't have enough storage, I would like to download them, compress them on the fly and only then save them. How do I do this?

To illustrate what I mean: `aws s3 cp --recursive s3://bucket | gzip > file`

– jorge
  • Instead of `> file` you probably can use netcat (pipe through `nc`). – Hennes Feb 27 '22 at 10:40
  • A couple of ideas: 1) mount S3 as a drive (google it) and zip it from there; 2) get a spot instance, download, and zip there. Make sure you're using an S3 gateway endpoint in your VPC to reduce costs. – Tim Feb 27 '22 at 16:55
  • You could also write a lambda that takes a path from S3 and gzips the contents then returns the gzipped file. Then you could use the `aws` CLI to list the files and send requests to the lambda. – shearn89 Feb 28 '22 at 09:39
  • "Download" to where? To an Amazon EC2 instance, or your own computer? – John Rotenstein Mar 11 '22 at 23:39

1 Answer


It's not clear whether you want to keep the uncompressed objects in S3, or whether the bucket contents are still changing.

One option you have is to use S3 Inventory. It's not instant, but it will automatically generate a list of the objects in the bucket and write it to an S3 bucket (the same bucket or another). You could read that list in a small script (whatever language you are comfortable with) and have it work through one object at a time: use the S3 CLI to pull down the object, then compress it with your OS/scripting tools.
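
A minimal sketch of that loop in Python with boto3 (the bucket name, the local `inventory.csv`, and the output naming are all placeholders; the real S3 Inventory data files are gzipped CSVs whose keys may be URL-encoded, so expect a little extra plumbing):

```python
import csv
import gzip
import boto3

s3 = boto3.client("s3")
SRC_BUCKET = "my-bucket"  # hypothetical source bucket

# "inventory.csv" stands in for a decompressed S3 Inventory data file;
# the bucket name and object key are the first two columns.
with open("inventory.csv", newline="") as listing:
    for row in csv.reader(listing):
        key = row[1]
        body = s3.get_object(Bucket=SRC_BUCKET, Key=key)["Body"]
        # Compress while streaming so the uncompressed object
        # never touches local disk.
        with gzip.open(key.replace("/", "_") + ".gz", "wb") as gz:
            for chunk in iter(lambda: body.read(1024 * 1024), b""):
                gz.write(chunk)
```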

I strongly recommend building in a check for whether the compressed object already exists, so that if the process fails or new objects are added, you can restart it without reprocessing everything.
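
For that restart check, a `head_object` call on the would-be compressed key is enough (the 404 handling below is the standard boto3 pattern):

```python
import botocore.exceptions

def already_done(s3, bucket, compressed_key):
    """Return True if the compressed copy already exists in S3."""
    try:
        s3.head_object(Bucket=bucket, Key=compressed_key)
        return True
    except botocore.exceptions.ClientError as err:
        # HEAD requests surface a missing key as a plain 404.
        if err.response["Error"]["Code"] == "404":
            return False
        raise
```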

If you are writing the compressed objects back to S3, consider using an EC2 instance or Lambda. With Lambda, since local storage is limited, you may need to compress the object as a stream rather than pulling the whole file down first. You should be able to find examples of this for at least Python, if not the other supported languages.
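
A rough sketch of that variant, buffering the compressed bytes in memory (so it only suits objects whose compressed size fits in the Lambda's memory; a true multipart streaming upload is more work):

```python
import gzip
import io
import boto3

s3 = boto3.client("s3")

def compress_to_s3(src_bucket, key, dest_bucket):
    # Read the source object as a stream instead of saving it to /tmp.
    body = s3.get_object(Bucket=src_bucket, Key=key)["Body"]
    buf = io.BytesIO()
    with gzip.GzipFile(fileobj=buf, mode="wb") as gz:
        for chunk in iter(lambda: body.read(1024 * 1024), b""):
            gz.write(chunk)
    buf.seek(0)
    s3.upload_fileobj(buf, dest_bucket, key + ".gz")
```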

One word of caution: do a rough calculation of how much this is going to cost. GET requests are fairly cheap, but data transfer out can be expensive. Also, if you are using any storage class other than Standard, there is probably a retrieval cost on top of that.
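
For a sense of scale, a back-of-the-envelope calculation (the object count, average size, and prices below are illustrative assumptions; check the current S3 pricing page for your region):

```python
objects = 2_000_000    # hypothetical object count
avg_size_mb = 5        # hypothetical average object size
get_per_1k = 0.0004    # assumed GET price per 1,000 requests (USD)
egress_per_gb = 0.09   # assumed data-transfer-out price per GB (USD)

get_cost = objects / 1000 * get_per_1k
egress_cost = objects * avg_size_mb / 1024 * egress_per_gb
print(f"GET requests: ${get_cost:,.2f}")          # ~$0.80
print(f"Data transfer out: ${egress_cost:,.2f}")  # ~$878.91
```

Transfer from S3 to an EC2 instance in the same region avoids that egress charge entirely, which is another point in favor of doing the compression inside AWS.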

– Tim P