
In Fintech, the following scenario seems fairly common:

You've paid for access to a huge collection of data, but it is made available to you as thousands of little files, each around 300 kB, altogether amounting to roughly 1 TB. Some of the files are stored zip-compressed on the remote machine; some aren't. Furthermore, all of these files can only be accessed via FTP, and you are limited to one connection to the server at a time.

What is the fastest way to get copies of these files?

StudentsTea
  • Scripted (S)FTP: using find to traverse the remote file tree and piping its output to (s)ftp. Mounting the remote collection on the local file system with curlftpfs, then copying files with standard Linux commands. Mounting the remote file system with curlftpfs, zipping entire directories, then copying those. All of these methods eventually cause the local Linux box to freeze, and checking resource allocation shows no memory leaks and no shortage of RAM. – StudentsTea Sep 12 '16 at 00:40
  • But I'm not interested in what I've tried; I'm interested in what other people *would* try. – StudentsTea Sep 12 '16 at 00:41
  • FTP != sftp. Please edit your question to include precise details on the protocols that are available, in addition to what you tried, what worked, what didn't work, etc. – EEAA Sep 12 '16 at 00:42
  • Well, if you want help, you need to write a good question. The amount of effort and detail you put into your question has a direct bearing on the quality of answers received. – EEAA Sep 12 '16 at 00:43
  • I see my level of effort has drawn you. ;) – StudentsTea Sep 12 '16 at 00:44
  • In this question, I'm interested *specifically* in FTP. – StudentsTea Sep 12 '16 at 00:45
  • I know *you* are not interested in what you've tried, but *we* are, for two reasons. 1) We don't take well to freeloaders. We want people to provide evidence that they have done their due diligence in resolving the problem on their own before coming here. 2) Past work provides context for how the system is built, what it's capable of, and whether there are performance bottlenecks, all of which are relevant to an answer. So again, please edit your question and include some more details. – EEAA Sep 12 '16 at 00:58
  • The fastest way is to FedEx a hard drive. – Michael Hampton Sep 12 '16 at 01:53
  • @MichaelHampton - Yeah. :( Unfortunately, a lot of data providers don't offer that service; I wish they did. Thank you for your feedback. – StudentsTea Sep 12 '16 at 01:58
  • You're talking about moving about three million files. With plain old FTP the amount of time you need is measured in _months_. The protocol is _that bad_. – Michael Hampton Sep 12 '16 at 02:05
  • Sigh... I had a feeling that was the case. *slumps into desk* – StudentsTea Sep 12 '16 at 02:15
  • The protocol you actually need for getting an entire dataset of this size is rsync. Without that, or overnight hard drive delivery, the dataset is practically unusable and may as well not exist. – Michael Hampton Sep 12 '16 at 02:25
  • FTP isn't known for dropping packets or corrupting data in transit, is it? :/ **please say no, please say no** – StudentsTea Sep 12 '16 at 02:30
  • ftp has a subcommand called mget. With proper scripting of the directory tree and mget of the files in each dir, and a reasonable connection, it should take no more than a few days. ftp does not verify copies, but part of scripting the directory tree will be to collect file sizes in bytes on the remote host. Use binary transfer mode, compare the local sizes with the catalog, and re-retrieve discrepancies at the end. (A minimal scripted sketch of this follows below.) This whole process used to be very common; I'm sure one can find code lying around that does it. – Jonah Benton Sep 12 '16 at 03:46
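Following the approach in that last comment, here is a minimal, hedged sketch of how one might script the transfer over a single FTP connection using Python's ftplib. The host, credentials, and directory paths are placeholders; the directory-listing parser assumes Unix-style `LIST` output (MLSD would be more robust where the server supports it), and the size check only catches truncated or missing files, not silent bit corruption.

```python
#!/usr/bin/env python3
"""Sketch: mirror a remote FTP tree over one connection, verifying by file size.

All connection details below (HOST, USER, PASS, REMOTE_ROOT, LOCAL_ROOT) are
placeholders, and the LIST parser assumes Unix-style listings.
"""
import os
from ftplib import FTP

HOST, USER, PASS = "ftp.example.com", "user", "password"   # placeholders
REMOTE_ROOT, LOCAL_ROOT = "/data", "local_copy"


def walk(ftp, path, catalog):
    """Recursively list the remote tree, recording {remote_path: size_in_bytes}."""
    lines = []
    ftp.retrlines("LIST " + path, lines.append)
    for line in lines:
        name = line.split(None, 8)[-1]          # last field of a Unix-style listing
        full = path.rstrip("/") + "/" + name
        if line.startswith("d"):
            walk(ftp, full, catalog)
        else:
            ftp.voidcmd("TYPE I")               # SIZE is only reliable in binary mode
            catalog[full] = ftp.size(full)


def fetch(ftp, remote, local):
    """Download one file in binary mode."""
    os.makedirs(os.path.dirname(local), exist_ok=True)
    with open(local, "wb") as fh:
        ftp.retrbinary("RETR " + remote, fh.write)


def local_path(remote):
    return os.path.join(LOCAL_ROOT, remote.lstrip("/"))


def main():
    ftp = FTP(HOST)
    ftp.login(USER, PASS)

    catalog = {}                                # remote path -> size in bytes
    walk(ftp, REMOTE_ROOT, catalog)

    # First pass: fetch anything missing or with the wrong byte count, so the
    # script can simply be re-run after an interrupted transfer.
    for remote, size in catalog.items():
        local = local_path(remote)
        if not (os.path.exists(local) and os.path.getsize(local) == size):
            fetch(ftp, remote, local)

    # Second pass: re-retrieve any remaining discrepancies against the catalog.
    for remote, size in catalog.items():
        local = local_path(remote)
        if os.path.getsize(local) != size:
            fetch(ftp, remote, local)

    ftp.quit()


if __name__ == "__main__":
    main()
```

Note that a single control connection held open across millions of files is likely to hit server timeouts in practice, so real-world use would need reconnect-and-resume logic; the size check is what makes re-running the script after a drop cheap. It also doesn't change the underlying problem raised in the comments: per-file round-trip overhead is what makes plain FTP so slow for millions of small files, which is why rsync or a shipped hard drive were suggested instead.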

0 Answers