Upload very many files to S3

0

I have about 1 million images (organized in directories) that I need to get into S3. I started with s3sync.rb, but since it is built for syncing, it creates tons of extra bookkeeping files. I don't need or want any of that; I just need to upload everything once.

Next I tried s3cmd (the Python version), which has a --recursive option for a simple put. The problem is that it tries to process the full list of files up front (at least, that is what it looks like in debug mode), which doesn't work for the number of files I have.

I'm thinking of trying something like Bucket Explorer or S3Fox, but I'm afraid of wasting a bunch of time and only getting halfway.

Any recommendations please?

Edit: I am aware of some of the options for mounting S3, but I haven't had good experiences with s3fs. Would JungleDisk work well with large numbers of files? Also, those programs tend to create the extra files that I would rather not have.
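To be concrete about what I'm after: a plain one-pass loop that walks the tree lazily and uploads each file as it is found, rather than building the whole file list first. A minimal sketch (the client library and bucket name here are just placeholders for illustration; boto3's upload_file is used as an example API):

```python
import os

def iter_uploads(root):
    """Lazily yield (local_path, s3_key) pairs; never builds the full list."""
    root = os.path.abspath(root)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # S3 keys use forward slashes regardless of the local OS separator
            key = os.path.relpath(path, root).replace(os.sep, "/")
            yield path, key

def upload_all(root, bucket_name):
    """Upload one file at a time, so memory stays flat even for ~1M files."""
    import boto3  # example client only; any S3 client with a put call would do
    bucket = boto3.resource("s3").Bucket(bucket_name)
    for path, key in iter_uploads(root):
        bucket.upload_file(path, key)
```

Since iter_uploads is a generator, nothing is held in memory beyond the current directory listing, which is exactly the behavior the up-front-scanning tools lack.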

therealsix

Posted 2012-01-20T20:47:51.813

Reputation: 101

s3fox (organizer) crashes pretty much immediately – therealsix – 2012-01-20T21:25:23.647

Cloudberry does not seem to create extra files, but their support told me not to expect good things with this number of files. Said they are working on it. – therealsix – 2012-01-22T03:44:08.317

Answers

2

I haven't tried that particular storage option, but Cyberduck supports S3, has a sync option, and has generally been quite robust for me.

Journeyman Geek

Posted 2012-01-20T20:47:51.813

Reputation: 119 122

1

Could you send them a portable storage device with your data on it?

G-wizard

Posted 2012-01-20T20:47:51.813

Reputation: 111

1

You could try running s3sync.rb with the --no-md5 option. With that option, only the modified date is compared.

I have used JungleDisk to back up a fairly large number of files (~20k) and it performed very well. It does create a separate database to keep track of the files that were uploaded (and to perform deduplication), but from what I have seen, the size of that backup database is trivial compared to the size of the files being backed up.

No matter how you upload things to S3 there will be "extra files", because S3 does not store directories; it only stores keys and objects, so any directory information has to be saved separately.
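To make that point concrete, here is a small sketch (the key names are made up) of how a directory tree flattens into plain keys. The "folders" are nothing but shared key prefixes, which is why tools that want real directory semantics have to record that metadata somewhere else:

```python
# Hypothetical bucket contents: S3 holds a flat set of keys, nothing more.
objects = {
    "photos/2012/cat.jpg": b"...",
    "photos/2012/dog.jpg": b"...",
    "photos/readme.txt": b"...",
}

def list_prefix(store, prefix):
    """Emulate an S3 prefix listing: a 'directory' is just a key prefix."""
    return sorted(key for key in store if key.startswith(prefix))

# Listing "photos/2012/" just filters keys by prefix; there is no directory
# object at all, so an empty "folder" can only be remembered via an extra
# placeholder object -- one source of the "extra files" mentioned above.
print(list_prefix(objects, "photos/2012/"))
```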

cmorse

Posted 2012-01-20T20:47:51.813

Reputation: 1 010