Gzip huge directory into separate .gz files for ssh transfer

I have a directory of ~200,000 .npy files with a total size of ~100 GB. All files are stored directly below the main directory (i.e. there are no sub-directories). I need to transfer the directory and would like to do it by first compressing it into a smaller number of gzip files that I can then transfer over ssh. I naïvely tried to gzip the whole directory at once, which made my server freeze, requiring a hard reboot.

How can I gzip the directory of files into, say, 1000 .gz files that I can then easily transfer and unzip again?

I'd preferably like to do this in a manner where the maximum resource consumption on the server at any one point (primarily RAM/IO) is independent of the directory's characteristics (total size / number of files). I'm hoping to find a method that I can use with even larger directories without making my server freeze. The solution should preferably use bash or Python. Thanks!

pir

Posted 2016-12-03T07:57:21.553

Reputation: 221

When you tried to gzip the entire directory, what exactly did you do? – Daniel B – 2016-12-03T09:19:25.610

Answers

This appears to be a good match for rsync. It will transparently compress the contents, and it can be told to limit the bandwidth usage, which serves both to avoid clogging the network and to prevent high IO load on the originating server:

rsync -az --bwlimit=1m directory server:/destination/

-a tells rsync to copy file metadata such as modification times, ownership, and permissions, -z means use compression, and --bwlimit limits the bandwidth used over the network.

As an additional bonus, if you interrupt the operation for any reason and re-run it, rsync will automatically pick up where it left off. If you also need to delete extra files at the destination, add the --delete option.
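For instance, a variant of the command above that also mirrors deletions at the destination and prints each file as it is transferred (the source directory and destination path are just placeholders) could look like this:

rsync -azv --bwlimit=1m --delete directory server:/destination/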

user4815162342

Posted 2016-12-03T07:57:21.553

Reputation: 293

This is a good suggestion, but what if you don't have rsync installed on the destination server? – Alessandro Dotti Contra – 2016-12-03T13:25:37.817

@adc rsync is normally installed on Linux servers. If you somehow stumble on one that doesn't have it, I would suggest something like tar czf - directory | ssh remote 'cd destination && tar xf -'. If that runs too fast and causes high IO load on the origin server, add throttle -m 1 between the first tar and ssh. (You'll need to install the throttle utility, but only on the client.) – user4815162342 – 2016-12-03T13:32:33.127
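Spelled out, that throttled fallback would look roughly like the pipeline below; the remote host name and destination directory are placeholders, and tar's z flag is made explicit on the receiving end for clarity:

tar czf - directory | throttle -m 1 | ssh remote 'cd destination && tar xzf -'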

I agree rsync is part of nearly all default Linux server installations, but you never know for sure beforehand, as some system administrators like to remove everything not strictly needed. But this is just for the sake of discussion; we're drifting away from the original question. – Alessandro Dotti Contra – 2016-12-03T15:13:33.533

@adc True enough. Without rsync at my disposal, I'd go with the tar-based solution. If you want, I can post that as a separate answer. – user4815162342 – 2016-12-03T15:46:52.070

You can edit and expand your answer if you like; I second both your solutions. – Alessandro Dotti Contra – 2016-12-03T15:51:08.177

Looks good! Makes sense to use this approach instead of gzipping. However, I've tried running this and so far it's just stalled at the console. Do you know what a reasonable time would be for it to initialize and start the synchronization? – pir – 2016-12-04T07:09:22.300

@pir 200k is quite a lot of files; if unsure, add -v to see what rsync is doing. – user4815162342 – 2016-12-04T08:17:34.413