Transferring about 300 GB of files from one server to another

20

14

I have about 200,000 files that I am transferring to a new server today. I haven't done anything on such a large scale before and wanted to get some advice on how I should go about this. I am moving them between two CentOS 6 servers in different parts of the country. I don't have enough HDD space on the original server to tar up all of the directories and files into one massive tarball, so my question is: how should I transfer all of these files? rsync? Some special way of using rsync? Any input/suggestions on how to do it would be amazing.

Thanks

EDIT: For those wondering, I highly suggest using screen when running a large rsync command like this, especially since something silly may occur and you could lose the connection to server A, which you are running the rsync command from. With screen you can simply detach the session and reattach it later.
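
A minimal sketch of that (the session name, paths, and host are placeholders; the rsync flags are explained in the answers below):

    # start a named screen session on server A
    screen -S transfer

    # inside it, kick off the long-running copy
    rsync -hrtplu --progress /path/to/local/foo/ user@remote.server.com:/path/to/remote/bar

    # detach with Ctrl-A d; reattach later with:
    screen -r transfer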

MasterGberry

Posted 2013-03-26T18:12:40.100

Reputation: 399

Have you tried rsync yet? Maybe on a small set of files or so? Should be the ideal tool for that. – slhck – 2013-03-26T18:14:51.173

It's almost certainly not the best tool for this job, but you may be interested in the fact that you can stream tar through an ssh connection rather than having to compress to a file before moving the file: tar cz | ssh user@example.com tar xz – Aesin – 2013-03-27T01:00:36.447
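
A slightly fuller sketch of that stream (host and paths are examples; it assumes GNU tar on both ends and that the parent of the target directory exists):

    # pack the contents of the local directory, compress, and unpack on the far side
    tar czf - -C /path/to/local/foo . | \
      ssh user@example.com 'mkdir -p /path/to/remote/bar && tar xzf - -C /path/to/remote/bar'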

It could be off topic, but (especially for an initial load, and then using rsync for subsequent updates): "Never underestimate the bandwidth of a station wagon full of tapes" (i.e. have you considered adding a second HDD (or plugging in a USB 2/USB 3 disk), backing up onto it, and sending that one via FedEx to the remote location?). It could be MUCH faster than anything else, and save bandwidth for other uses. – Olivier Dulac – 2013-03-27T09:10:59.983

I don't have any BW limits on one provider, and the other I won't reach this month. So I don't really have an issue wasting it :P – MasterGberry – 2013-03-27T15:54:21.660

@OlivierDulac http://what-if.xkcd.com/31/ – Bob – 2013-03-28T10:07:43.570

Answers

24

Just to flesh out Simon's answer, rsync is the perfect tool for the job:

   Rsync is a fast and extraordinarily versatile file copying tool. It can
   copy locally, to/from another host over any remote shell, or to/from a
   remote rsync daemon. It offers a large number of options that control
   every aspect of its behavior and permit very flexible specification of
   the set of files to be copied. It is famous for its delta-transfer
   algorithm, which reduces the amount of data sent over the network by
   sending only the differences between the source files and the existing
   files in the destination. Rsync is widely used for backups and mirroring
   and as an improved copy command for everyday use.

Assuming you have ssh access to the remote machine, you would want to do something like this:

rsync -hrtplu path/to/local/foo user@remote.server.com:/path/to/remote/bar

This will copy the directory path/to/local/foo to /path/to/remote/bar on the remote server. A new subdirectory named bar/foo will be created. If you only want to copy the contents of a directory, without creating a directory of that name on the target, add a trailing slash:

rsync -hrtplu path/to/local/foo/ user@remote.server.com:/path/to/remote/bar

This will copy the contents of foo/ into the remote directory bar/.

A few relevant options:

 -h, --human-readable        output numbers in a human-readable format
 -r                          recurse into directories
 -t, --times                 preserve modification times
 -p, --perms                 preserve permissions
 -l, --links                 copy symlinks as symlinks
 -u, --update                skip files that are newer on the receiver
 --delete                    delete extraneous files from dest dirs
 -z, --compress              compress file data during the transfer
 -C, --cvs-exclude           auto-ignore files in the same way CVS does
 --progress                  show progress during transfer
 --stats                     give some file-transfer stats
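
For example, a run combining several of the options above (same placeholder paths as before); adding --dry-run first shows what would be transferred without actually copying anything:

    # preview the transfer
    rsync -hrtplu --progress --stats --dry-run path/to/local/foo/ user@remote.server.com:/path/to/remote/bar

    # then run it for real by dropping --dry-run
    rsync -hrtplu --progress --stats path/to/local/foo/ user@remote.server.com:/path/to/remote/bar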

terdon

Posted 2013-03-26T18:12:40.100

Reputation: 45 216

So could I do rsync -hrtplu --progress if I wanted to see the progress as it is going along? – MasterGberry – 2013-03-26T18:54:49.880

@MasterGberry yup, exactly. I have a backup script that runs rsync --progress --stats -hrtl --update source destination. – terdon – 2013-03-26T18:57:39.823

I seem to be having issues getting it to run. rsync -hrtplu --progress --rsh='ssh -p2202' is what I am using and it can't connect. I keep getting a 255 error. But I am ssh'd into it, so I know it's not the firewall... do I need to provide the password via the cmd also? Or wouldn't it just ask me for it afterwards? – MasterGberry – 2013-03-26T19:08:43.137

Derp, nvm. I forgot about outbound traffic on my firewall. Thanks – MasterGberry – 2013-03-26T19:17:40.197

Important note: with rsync, be extra careful when using --delete: read a lot about it, test on scratch (/tmp/...) folders first, and beware of how the behavior changes when you add or omit a trailing "/" at the end of the source dir(s) or destination dir. – Olivier Dulac – 2013-03-28T10:26:31.807
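
One reasonably safe way to do that (using the same placeholder paths as the answer above) is to preview with --dry-run before letting --delete touch anything:

    # shows what would be deleted/transferred, but changes nothing
    rsync -hrtplu --delete --dry-run path/to/local/foo/ user@remote.server.com:/path/to/remote/bar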

14

It depends on how fast it needs to be copied, and how much bandwidth is available.

For a poor network connection, consider the bandwidth of a truck filled with tapes. (Read: mail a 2.5 inch HDD, or just drive it there yourself. 300 GB drives should be easy to find.)

If it is less time critical or you have plenty of bandwidth, then rsync is great. If there is an error you can just continue without re-copying the earlier files.

[Edit] I forgot to add that you can run rsync several times if your data is being used during the copy.

Example:
1) Data in use. Rsync -> All data gets copied. This may take some time.
2) Run rsync again, only the changed files get copied. This should be fast.

You can do this several times until there are no changes, or you can do it the smart/safe way by making the data read-only during the copy (e.g. if it is on a share that is in use, set that share to read-only; or rsync the data first, then at night set the share read-only while you run it a second time).
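
A sketch of that multi-pass approach (paths and host are placeholders):

    # first pass while the data is still in use; this one takes the longest
    rsync -hrtplu --progress /data/ user@remote.server.com:/data/

    # later passes only re-send what changed since the previous run
    rsync -hrtplu --progress /data/ user@remote.server.com:/data/

    # final pass with the source set read-only; add --delete only if the target
    # should exactly mirror the source
    rsync -hrtplu --delete /data/ user@remote.server.com:/data/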

Hennes

Posted 2013-03-26T18:12:40.100

Reputation: 60 739

No server should be living somewhere where bandwidth can't handle 300G in a reasonable amount of time... – Dan – 2013-03-27T07:59:05.600

That depends on what is 'reasonable'. Say the weakest link is 100 mbit (I do not care if that is the upload limit from one office or the download at the other). That roughly allows for 10MB/sec. (Dividing by 10 seems reasonable; I know you can get slightly more if all goes perfectly well, e.g. nobody else is using the line for anything at all.) 10MB/sec ~~ 600MB/min ~~ 36000MB/hour ~~ 36 GB/hour ~~ 300GB is 8h20min. That is doable overnight. That also makes a lot of assumptions. E.g. if the upload is only 2 mbit (we have offices with those speeds) it takes 50 times as long (415h, or 17.3 days). – Hennes – 2013-03-27T11:56:53.237
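
A quick way to reproduce that back-of-the-envelope figure (purely illustrative, assuming a sustained ~10 MB/s):

    # 300 GB at ~10 MB/s effective throughput
    echo "scale=1; 300 * 1024 / 10 / 3600" | bc    # ~8.5 hours, same ballpark as the 8h20min above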

Yikes! Yes, 8-10 is reasonable, but I was indeed making a number of assumptions. – Dan – 2013-03-27T15:44:03.270

@Dan If it is a requirement that the server is up and serving requests, saturating the upstream bandwidth is probably a bad idea. So you would have to artificially throttle the transfer speed to account for that. – Bob – 2013-03-28T10:10:04.453
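
rsync can do that throttling itself with --bwlimit (the value is in KBytes per second on the rsync versions shipped with CentOS 6; newer releases also accept unit suffixes). For example, using the same placeholder paths, to cap the copy at roughly 5 MB/s while the server keeps serving:

    rsync -hrtplu --bwlimit=5000 path/to/local/foo/ user@remote.server.com:/path/to/remote/bar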

6

I would go for rsync! I am using it to back up my server to an offsite server and it works fine. Usually there are only a few MB to copy, but some days it goes up to 20-30 GB, and it has always worked without a problem.

Simon

Posted 2013-03-26T18:12:40.100

Reputation: 3 831

0

rsync over NFS using a Gigabit connection will take nearly 10 h. It may be better to copy the data onto an HDD and move that between the servers. If you need to make a one-to-one copy of the actual disk, use dd or something like that to create a raw image of the disk. Using ssh (scp) causes a huge overhead; empirically tested on a Gigabit connection. rsync is good at making incremental synchronizations between two servers used in HA or in backup mode, I guess.
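
If you do go the raw-image route, here is a heavily hedged sketch of piping dd through ssh (device names and host are placeholders; double-check them, since dd will happily overwrite the wrong disk, and the target device must be at least as large as the source):

    # read the whole source disk, compress it, and write it out on the far side
    dd if=/dev/sdX bs=64K | gzip -c | \
      ssh user@remote.server.com 'gunzip -c | dd of=/dev/sdY bs=64K'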

Pawel

Posted 2013-03-26T18:12:40.100

Reputation: 107

The language and style of this answer need to be improved. – FSMaxB – 2013-03-27T07:09:34.717

Rsync is especially great if the files can change during the copy. Just run it a few times. The first time all data gets copied; the second time only what got changed during the first (long) copy. A third time would be done at night or with the shares read-only. – Hennes – 2013-03-27T12:00:13.400

"will took nearly about 10h. It will be better to copy data on HDD and move them between server": except that it's across the country, so it'd take longer. – Rob – 2013-03-27T13:07:29.497

@FSMaxB: I will do this later, thx. – Pawel – 2013-03-28T09:45:32.203

@Rob: I have read this ;) I know the servers are in two different locations, so you need to calculate what will be better for you: taking a journey across the country (checking the cost of fuel, etc.) or using the network connection, whichever is more beneficial. – Pawel – 2013-03-28T09:48:46.420

0

For the first transfer, use NFS and tar/untar (NFS is the fastest protocol in this case; tar saves network bandwidth at the cost of more CPU utilization):

tar cf - * | ( cd /target; tar xfp -)

For the next time(s), use rsync.

jet

Posted 2013-03-26T18:12:40.100

Reputation: 2 675

If you have enough CPU power you can improve on this by adding gzip to the pipe. And without NFS you can use netcat. (Or even both: tar -cf - * | gzip | nc -p 4567 and nc -l 4567 | gunzip | tar xf -.) – Hennes – 2013-03-30T11:33:33.603

Thanks Hennes, that was my idea, but I forgot gzip in the pipes. – jet – 2013-03-30T14:30:50.193
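
For reference, a cleaned-up sketch of that netcat variant (host, port, and paths are placeholders; note that the sender needs the receiver's hostname, and some netcat builds want nc -l -p 4567 instead of nc -l 4567):

    # on the receiving server: listen, decompress, unpack
    nc -l 4567 | gunzip | tar xf - -C /path/to/remote/bar

    # on the sending server: pack, compress, send to the receiver's host/port
    tar cf - -C /path/to/local/foo . | gzip | nc remote.server.com 4567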