SFTP - Recursively fetch new files


I have a remote file system structure like this:

+ /measure
    + /2013-09
        + /2013-09-04
            + /fooinstrument
                + result03343445845.csv
                + result03343445846.csv
            + /barinstrument
                + result03343445847.csv
                + result03343445848.csv

It contains a lot of files in a hierarchical structure, and I have read-only access to it via SFTP (no other protocols such as CIFS are available, and no special software is running on the server, so I can't install anything on the source host).

I want to import these files into my database every night using a cron job (the cron job itself is no problem). To do that, I'd like to recursively download all new files to my local file system and then pass the path of each downloaded file to my application as a command-line parameter, e.g.:

/usr/local/bin/myapp -import /srv/localstorage/result03343445845.csv

The invocation of myapp isn't the crucial point: if I can get a list of all downloaded paths, e.g. by piping the downloader's output to a file, I can later read that list line by line and invoke the app. That's no problem.
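For example, something like this would do (assuming the downloader wrote one path per line to a hypothetical downloaded.txt):

while read -r path; do
    /usr/local/bin/myapp -import "$path"
done < downloaded.txt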

I don't care about the original directory hierarchy. My main objective is to get the files onto my local file system so that my command-line tool can be fed each filename as input. So whether I keep a duplicate of the deep hierarchy given by the server, or whether all files go into the same directory, is not that important. The latter might even be preferable, as the file names are unique serial numbers, so all files could safely be moved into a single directory (see the one-liner after this listing):

+ /localstorage
    + result03343445845.csv
    + result03343445846.csv
    + result03343445847.csv
    + result03343445848.csv
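The flattening itself could be a one-liner such as this (a sketch; it assumes the downloaded tree ended up under /srv/localstorage/measure and relies on the filenames being unique):

find /srv/localstorage/measure -type f -name '*.csv' -exec mv {} /srv/localstorage/ \;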

One of my problems is that the source files stay on the server forever; the server doesn't delete old files I've already downloaded, because I'm not the only one who collects this data. So the script must "remember" which files are old (and not download them again), e.g. by keeping local copies of all files ever retrieved. (If two files have the same filename, they can safely be considered equal, as the filename is made of a serial number, so no content comparison is necessary.)

Another important point: after a year there will probably be 30,000 files or even more, so it wouldn't be reasonable to download all files every night, including the old ones I already have. It is really necessary to download only the new files (new = no file of that name in the local copy).

What's the easiest and best way to do this on Linux (Debian)? I thought of a shell script that uses sftp, scp, or maybe even curl. Thanks a lot for your advice and your ideas on such a script!
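To illustrate what I have in mind, here is a rough, untested sketch built only on OpenSSH's sftp client (USER, HOST, and the seen.txt bookkeeping file are placeholders; key-based authentication and the fixed four-level hierarchy shown above are assumed):

#!/bin/sh
# Untested sketch: fetch only files that were never downloaded before.
REMOTE='USER@HOST'
DEST='/srv/localstorage'
SEEN="$DEST/seen.txt"          # one remote path per line, kept sorted

touch "$SEEN"

# 1. List all remote CSV files; sftp expands the glob server-side,
#    matching the fixed hierarchy /measure/<month>/<day>/<instrument>/.
echo 'ls -1 /measure/*/*/*/*.csv' | sftp -b - "$REMOTE" \
    | grep '^/measure/.*\.csv$' | sort > /tmp/remote.txt

# 2. Remote paths missing from the "seen" list are new.
comm -23 /tmp/remote.txt "$SEEN" > /tmp/new.txt

# 3. Download the new files into one flat directory.
if [ -s /tmp/new.txt ]; then
    sed "s|.*|get & $DEST/|" /tmp/new.txt | sftp -b - "$REMOTE"
fi

# 4. Remember them and hand each local copy to the importer.
sort -o "$SEEN" "$SEEN" /tmp/new.txt
while read -r p; do
    /usr/local/bin/myapp -import "$DEST/$(basename "$p")"
done < /tmp/new.txt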

// edit: By the way, what do you think? Would the question fit better on Server Fault?

MrSnrub

Posted 2013-09-03T23:20:43.347

Reputation: 125

Answers


rsync is a great utility for synchronizing directory hierarchies. Note that when it transfers over SSH it needs an rsync binary on the server as well as on the client, because the client starts a remote rsync process. These commands transfer the files that don't already exist on the local machine and run myapp on them:

cd DESTINATION_DIR
rsync -rv --ignore-existing --log-format='%o %f' USER@HOST:/PATH_TO_measure_DIR . \
    | grep recv | sed "s,recv ,," \
    | xargs -I{} sh -c "[ -f {} ] && /usr/local/bin/myapp -import {}"

rsync brings over the files (preserving the directory structure); then we parse out the list of received files, make sure they are regular files (we don't want to run myapp on newly created directories), and invoke myapp on each of them.
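If you prefer the two-step approach mentioned in the question (save the list first, import later), the same pipeline can be split; a sketch, with new.log as a placeholder name (it relies on the serial filenames containing no spaces):

cd DESTINATION_DIR
rsync -rv --ignore-existing --log-format='%o %f' USER@HOST:/PATH_TO_measure_DIR . \
    | awk '$1 == "recv" { print $2 }' > new.log
while read -r f; do
    [ -f "$f" ] && /usr/local/bin/myapp -import "$f"
done < new.log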

wingedsubmariner

Posted 2013-09-03T23:20:43.347

Reputation: 1 432

How to use this with a custom remote port? – Chaminda Bandara – 2019-03-19T23:09:13.880


Mount the server directory locally:

sshfs  username@servername:/path/ /mount

or

curlftpfs username@servername:/path/ /mount

then

rsync -av /mount /data/ > /data/rsync.log

It copies only files that are new (or have changed), and you have the filenames in the log file.
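To feed those files to myapp afterwards, you can filter the transferred paths out of the log, for example with a sketch like this (it assumes the layout above, where rsync copies into /data/mount/...):

grep '\.csv$' /data/rsync.log | while read -r f; do
    /usr/local/bin/myapp -import "/data/$f"
done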

jet

Posted 2013-09-03T23:20:43.347

Reputation: 2 675

I would recommend against using sshfs; it is much, much slower than using sftp directly. Also, rsync is able to connect over SSH on its own, so there is no need for the mount. – wingedsubmariner – 2013-09-04T01:16:24.083