Transferring a large amount of data between continents

12


Possible Duplicate:
Free way to share large files over the Internet?
What are some options for transferring large files without using the Internet?

My wife's lab is doing a project here in the US with collaborators in Singapore. They occasionally need to transfer a large amount of high-dimensional image data (~10GB compressed) across continents. With current technologies, what would be a good solution for this usage scenario?

I can think of a few but none of them seems ideal:

  • Direct connection over the Internet: the transfer rate is only about 500KB/s, and we lack a tool to handle errors and retransmissions.
  • Uploading to a common server or service such as Dropbox: uploading is painful for the non-US collaborators.
  • Burning discs or copying to hard drives and shipping them by courier: the latency is significant, plus there is the extra work of making a local copy.

Any suggestions?

Update: neither party in the collaboration is made up of tech-savvy users.

Frank

Posted 2011-12-02T19:19:40.917

Reputation: 730

Question was closed 2011-12-05T02:17:49.693

Image as in pictures, or image as in a file representing a DVD? – Daniel Beck – 2011-12-02T19:48:49.340

High dimensional images, as generated by microscopes. – Frank – 2011-12-02T20:28:35.860

So it's several very large files? Could you give us more information regarding file count, individual file size, and how many of those change between transfers? Is it all of them, some of them, etc.? – Daniel Beck – 2011-12-02T20:30:19.243

Some DNA sequencers have decided that FedEx is the fastest way to send their prohibitively large amounts of data around the world. – joshuahedlund – 2011-12-02T22:03:20.573

Sounds like a job for Sneakernet or IPoAC. – Naftuli Kay – 2011-12-02T23:01:55.900

This comes up a lot in high-energy physics. There was a time when the only cost-effective thing to do was write tapes and air-freight them. Those days seem to be gone (for now; sometimes these things cycle) and a variety of internet-based solutions are used. – dmckee --- ex-moderator kitten – 2011-12-03T20:29:12.903

Answers

20

I suggest you use rsync. Rsync uses a delta-transfer algorithm, so if your files have only partially changed, or if a previous transfer was terminated abnormally, rsync is smart enough to sync only what is new or changed.

There are several ports of the original rsync to Windows and other non-Unix systems, both free and non-free. Please see the Rsync Wikipedia article for details.

Rsync over SSH is very widely used and works well. 10GB is a relatively small amount of data nowadays, and you didn't specify what "occasionally" means. Weekly? Daily? Hourly? At a 500KB/s transfer rate it will take around 6 hours, which is not really a long time. If you need to transfer the data frequently, it is probably better to create a cron task to start rsync automatically.
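A minimal sketch of what such a transfer could look like, assuming SSH access to the remote machine (the host and paths here are made up):

    # -a preserves permissions/timestamps and recurses into directories
    # -v is verbose; -z compresses in transit (little gain on already-compressed data)
    # --partial keeps partially transferred files so an interrupted run can resume
    rsync -avz --partial --progress -e ssh \
        /data/microscope/run42/ \
        labuser@lab.sg.example.edu:/data/incoming/run42/

If the connection drops, re-running the same command picks up roughly where it left off, since the delta algorithm skips everything that already matches on the other side.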

haimg

Posted 2011-12-02T19:19:40.917

Reputation: 19 503

Doesn't rsync require its own protocol for deltas, requiring a capable counterpart system on the other end? – Daniel Beck – 2011-12-02T19:47:30.607

@DanielBeck: There is nothing in the docs that says that rsync over SSH cannot use deltacopy... Basically rsync client executes another rsync copy on the server via ssh, so I don't see why it wouldn't work. – haimg – 2011-12-02T19:55:51.153

+1 You have a point there. That leaves the Linux requirement on the server though? – Daniel Beck – 2011-12-02T20:01:53.983

Does rsync's delta-algorithm work when transferring binary compressed data (.zip or .jpg)? – Aditya – 2011-12-02T20:46:22.210

@DanielBeck: I've added a link to Wikipedia article with several Windows rsync ports. Apparently at least some of them work as a server, including ssh. I've never used any of them though. – haimg – 2011-12-02T21:36:21.473

@Aditya: Yes. rsync's delta algorithm works with binary data too. So, if there are some common sections between the source and the target file, they will be skipped. However, re-compressing usually changes the archive too much, so delta algorithm is not that effective in this case. – haimg – 2011-12-02T21:40:13.030

Rsync is probably the best option in terms of reliability and minimizing the amount of data transferred, but getting any of the Windows ports to work properly takes a fair bit of technical knowledge in my experience. Last time I tried, I gave up and wrote some scripts that used BitTorrent to transfer the files automatically instead. – stoj – 2011-12-02T22:21:41.250

@haimg: There's a patch available for gzip to make it rsync-friendly. Link – afrazier – 2011-12-03T04:49:43.947

12

A connection across the internet can be a viable option, and a program such as BitTorrent is well suited to this purpose: it breaks the files up into logical pieces that are sent over the internet and reconstructed at the other end.

BitTorrent also verifies each piece against a checksum and re-downloads damaged pieces automatically, and if more people need the files, they can be supplied from every source that already has (parts of) the file.

Granted, people see it as a handy way to download films and such, but it has many legitimate uses as well.

A lot of BitTorrent clients also have built-in trackers, so you don't have to run a dedicated server to host the files.
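As a rough sketch of the workflow, using the Transmission command-line tools as one example (the tracker URL and file names are made up; any client with similar features would do):

    # Create a .torrent describing the data; -t sets the (hypothetical) tracker announce URL
    transmission-create -o images.torrent \
        -t http://tracker.example.edu/announce \
        images-dec2011.tar.gz

    # Seed on the sending side; -w points at the directory that already holds the data
    transmission-cli -w /data/outgoing images.torrent

The receiving side just opens the same .torrent in any client; each downloaded piece is checked against its checksum automatically, so damaged pieces are re-fetched without manual intervention.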

Mokubai

Posted 2011-12-02T19:19:40.917

Reputation: 64 434

Thanks for the input. Use of BitTorrent within academic networks may make their administrators nervous. Also, the setup and maintenance of a tracker server may not be that easy for an average computer user. – Frank – 2011-12-02T20:35:31.230

That is a good point; BitTorrent is actively prohibited in many corporate and academic networks. With proper administration, though, you can set up a whitelist of users or machines within the networks that are allowed to use BitTorrent, though this would mean working very closely with the respective IT departments. As I mentioned, you do not necessarily need a dedicated server, since a tracker can be built into many client programs. If it is not a good fit for your situation, then no worries; it just seemed reasonable to me considering your requirements. – Mokubai – 2011-12-02T20:46:25.347

If you were using BitTorrent, also using a webseed sounds like a clever idea – Journeyman Geek – 2011-12-02T23:51:45.207

(As an example of one of ‘more legal uses’ mentioned in the answer, Facebook utilizes bittorrent to deploy their site, 1GB binary, to thousands of production servers. How unfortunate that a technology is discarded mostly because of one of its uses.) – Anton Strogonoff – 2011-12-03T09:05:27.983

6

Split the file up into chunks of e.g. 50MB (using e.g. split). Compute checksums for all of them (e.g. with md5sum). Upload them directly using FTP and an error-tolerant FTP client, such as lftp on Linux. Transfer all of the chunks along with a file containing all the checksums.

On the remote site, verify that all the chunks have the expected checksums, re-upload those that failed, and reassemble them into the original file (e.g. using cat).

Swap the location of the server as needed (I posted under the assumption that the destination site provides the server and you start the transfer locally once the files are ready). Your FTP client shouldn't care.


I have had similar issues in the past, and using an error-tolerant FTP client worked. No bits were ever flipped, just regular connection aborts, so I could skip creating chunks and simply upload the file. We still provided a checksum for the complete file, just in case.
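A minimal sketch of both ends, assuming a Unix-like shell on each side and a made-up FTP host (lftp retries on its own, and the checksums catch anything it misses):

    # Sending side: split into 50MB chunks, checksum, upload
    split -b 50M images.tar.gz chunk_          # produces chunk_aa, chunk_ab, ...
    md5sum chunk_* > checksums.md5
    lftp -u labuser -e "mput chunk_* checksums.md5; bye" ftp.sg.example.edu

    # Receiving side: verify, re-upload whatever fails, then reassemble
    md5sum -c checksums.md5                    # flags any chunk that needs to be re-sent
    cat chunk_* > images.tar.gz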

Daniel Beck

Posted 2011-12-02T19:19:40.917

Reputation: 98 421

You need to be aware though that lftp does not abort a transfer in progress for any reason. Make sure that you always have enough free disk space on the destination site. – Daniel Beck – 2011-12-02T19:50:53.660

3

A variation on Daniel Beck's answer is to split the files into chunks on the order of 50MB to 200MB and create parity files for the whole set.

Now you can transfer the files (including the parity files) with FTP, SCP or something else to the remote site and check the whole set after it arrives. If any parts are damaged, they can be repaired from the parity files, provided there are enough recovery blocks; how much damage can be repaired depends on how many parity files you created.

Parity files are used a lot on Usenet to send large files; most of the time the data is split into RAR archives first. It is not uncommon to send 50 to 60GB of data this way.

You should definitely check out the first link, and you could also take a look at QuickPar, a tool that creates parity files, verifies your downloaded files, and can even restore damaged files from the provided parity files.
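On the command line, the same idea could look roughly like this using the par2 tool (file names are made up; QuickPar offers the equivalent through a GUI on Windows):

    # Sender: create parity files with ~10% redundancy for the whole chunk set
    par2 create -r10 images.par2 chunk_*

    # Receiver: verify the set after transfer...
    par2 verify images.par2

    # ...and repair from the parity blocks if anything arrived damaged
    par2 repair images.par2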

Martijn B

Posted 2011-12-02T19:19:40.917

Reputation: 241

+1 - This approach works well on Usenet, and the parity files can repair an astonishing amount of missing data. The downside is the processing time required to split the files and generate the parity files, and to verify and extract the files after receipt. – deizel – 2011-12-03T05:29:36.083

1

Is it one big 10GB file? Could it be easily split up?

I haven't played with this much, but it struck me as an interesting and relatively simple concept that might work in this situation:

http://sendoid.com/

Craig H

Posted 2011-12-02T19:19:40.917

Reputation: 1 172

Sendoid is pretty cool, but unfortunately uploading is still going to be painful. Then again, that problem persists for all of these options, I believe, unless you are going to mail an HDD. +1 as it's easy to use. – DMan – 2011-12-03T04:08:42.330

0

Make the data available via FTP/HTTP/HTTPS/SFTP/FTPS (requiring logon credentials) and use any download manager on the client side.

Download managers are specifically designed to retrieve data regardless of any errors that may occur, so they are an ideal fit for your task.

As for the server, an FTP server is typically the easiest to set up; you can consult the list on Wikipedia. HTTPS, SFTP and FTPS allow encryption (in plain FTP/HTTP, the password is sent in clear text), but SFTP/FTPS are less commonly supported by client software, and HTTP/HTTPS server setup is tricky.
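Even a plain command-line downloader illustrates the idea; a rough sketch with a made-up URL and credentials (a graphical download manager does the same with less typing):

    # -c resumes a partial download instead of starting over;
    # --tries=0 keeps retrying indefinitely after connection drops
    wget -c --tries=0 --user=labuser --ask-password \
        ftp://ftp.example.edu/incoming/images.tar.gz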

ivan_pozdeev

Posted 2011-12-02T19:19:40.917

Reputation: 1 468

The problem with using HTTP or FTP is that if there are any transmission errors, you have to send the whole thing again. rsync, BitTorrent, and other protocols can verify that the files match and only retransmit the damaged pieces. Parity data, like QuickPar generates, can help too. – afrazier – 2011-12-03T01:00:13.003

Both FTP and HTTP include a transfer resumption capability as an optional extension which is supported by the majority of servers and virtually all download managers. – ivan_pozdeev – 2011-12-20T03:28:07.673

They may resume, and theoretically TCP makes sure that data arrives in order and with a valid checksum. However, anyone who's had a large HTTP or FTP transfer corrupted has learned the value of more robust protocols or some kind of ECC. – afrazier – 2011-12-20T03:57:35.037