I have a directory on an Ubuntu server with 340K images, 45 GB in total. Is there an efficient way to transfer them all to S3 storage on DigitalOcean?

I thought of using `s3cmd put` or `s3cmd sync`, but I'm guessing that would perform the put operation on every single file individually.

Any thoughts would be much appreciated!

Sotiris Kaniras
  • This may be worth reading: https://serverfault.com/questions/73959/using-rsync-with-amazon-s3 – lainatnavi Dec 13 '19 at 18:54
  • The AWS CLI uploads files in parallel, and you can configure the number of threads. 45 GB is fairly trivial: just start it with 50 threads and let it run until it's done. – Tim Dec 14 '19 at 07:16

1 Answer

I don't believe the S3 API lets you submit multiple files in a single API call, but you could look into concurrency options for the client you are using.
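To illustrate what client-side concurrency means here, the following is a minimal sketch rather than any particular tool's implementation: each file still gets its own upload call, but a pool of worker threads keeps several calls in flight at once. The names `upload_all` and `upload_one` are hypothetical; `upload_one` stands in for whatever single-file upload call your client library provides.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path
from typing import Callable, Iterable


def upload_all(
    paths: Iterable[Path],
    upload_one: Callable[[Path], None],  # hypothetical single-file uploader
    max_workers: int = 10,
) -> list[tuple[Path, BaseException]]:
    """Call upload_one for every path using a pool of worker threads.

    Returns (path, error) pairs for uploads that raised, so they can be retried.
    """
    failures = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # One task per file; at most max_workers uploads run at the same time.
        futures = {pool.submit(upload_one, path): path for path in paths}
        for future in as_completed(futures):
            error = future.exception()
            if error is not None:
                failures.append((futures[future], error))
    return failures
```

Collecting failures instead of stopping at the first error matters with hundreds of thousands of files, since a few transient errors are likely and can be retried in a second pass.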

A good starting point would be the official AWS Command Line Interface (CLI), which has S3 configuration values that let you adjust concurrency for the `aws s3` transfer commands, including cp, sync, mv, and rm:

max_concurrent_requests - The maximum number of concurrent requests (default: 10)
max_queue_size          - The maximum number of tasks in the task queue (default: 1000)
multipart_threshold     - The size threshold the CLI uses for multipart transfers of
                          individual files (default: 8MB)
multipart_chunksize     - When using multipart transfers, this is the chunk size
                          that the CLI uses for multipart transfers of individual
                          files (default: 8MB)
max_bandwidth           - The maximum bandwidth that will be consumed for uploading
                          and downloading data to and from Amazon S3 (default: None)

The AWS S3 configuration guide linked above also includes recommendations around adjusting these values for different scenarios.

For faster transfers, you should also create your S3 bucket in the region with the lowest latency from your DigitalOcean instance, or consider enabling S3 Transfer Acceleration. There are additional CLI options (and costs) if you use Transfer Acceleration.

Once your configuration options are set, you can then use a command line like `aws s3 sync /path/to/files s3://mybucket` to recursively sync the image directory from your DigitalOcean server to an S3 bucket. The sync process only copies new or updated files, so you can run the same command again if a sync is interrupted or the source directory has been updated.
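As a rough illustration of why re-running a sync is cheap, here is a simplified sketch of the "new or updated" decision. It is not the CLI's actual code: the `FileInfo` type and `files_to_upload` function are hypothetical, and the rule used (upload when the object is missing, differs in size, or is older than the local file) is an assumption about the default comparison.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileInfo:
    size: int     # bytes
    mtime: float  # last-modified time, seconds since the epoch


def files_to_upload(local: dict[str, FileInfo], remote: dict[str, FileInfo]) -> list[str]:
    """Return keys that are missing remotely, differ in size, or are newer locally."""
    pending = []
    for key, info in local.items():
        existing = remote.get(key)
        if existing is None or existing.size != info.size or info.mtime > existing.mtime:
            pending.append(key)
    return sorted(pending)
```

After an interrupted run, files that already reached the bucket match on size and are not newer locally, so only the remainder is transferred.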

Stennie
  • Thanks for your very well explained answer! Do you think it's a feasible tool for 45 GB of data? Also, is the `AWS CLI` the same as `s3cmd`? – Sotiris Kaniras Dec 14 '19 at 13:31
  • [`S3cmd`](https://github.com/s3tools/s3cmd) is a third-party tool using the AWS API, but it will have different configuration options. `S3cmd` supports recursive file uploads, but I don't see any options for adjusting concurrency settings. I tend to use the official AWS CLI, which has a more complete range of options across AWS services (including some of the relevant S3 options I've highlighted here). You can always research alternatives if there are other features/options you are looking for. – Stennie Dec 14 '19 at 22:04
  • As far as feasibility goes, I don't foresee any obvious issues as long as you choose appropriate settings. I would test with a smaller set of files to find the best concurrency options since these could be limited by resources on your source instance. Given your earlier question about Parse, I'm also not sure if you've fully described what you are trying to solve. This question only mentions uploading images, but if this is one step of a migration from GridFS to S3 storage you probably want to rewrite the image paths in MongoDB as well. However, I'm answering the question as posed here :). – Stennie Dec 14 '19 at 22:05
  • You really helped me solve it! I decided to use a tool called `s3-parallel-put` instead... `but if this is one step of a migration from GridFS to S3 storage you probably want to rewrite the image paths in MongoDB as well.` Thankfully Parse's file adapter takes care of that! – Sotiris Kaniras Dec 14 '19 at 22:26