Mirroring with wget - Exclude old files

2

0

I am using wget to mirror an ftp file-archive.

This is my command:
wget -m $mirror
(I have stripped all unnecessary parameters.)

So far, everything is okay, all differences to the online archive will be synced.

However, a script now removes files from my local copy once they are no longer needed. When I start wget again, it re-downloads those files too (several gigabytes!).

Is there an option to exclude files that are older than a certain timestamp from the download?

I already looked at the -A -R -I -X parameters, but they only seem to work with filenames...

Nippey

Posted 2013-01-30T07:51:06.420

Reputation: 161

Does -c (continue) help? – vonbrand – 2013-01-30T17:47:04.213

No, it didn't. I just noticed that ncftp would be a solution, if it weren't for the corporate firewall :/ – Nippey – 2013-01-31T07:12:12.243

Answers

0

My first thought when reading your question was "This looks like a job for rsync!". Unfortunately, while rsync can indeed leap tall buildings in a single bound, it cannot deal with FTP. If you have SSH access to the mirror, things will be much easier.

Assuming you don't, you can try mounting the remote FTP directory locally, and then you can use simple cp (inspired by this):

  1. Install curlftpfs. If you are on a Debian-based distro (I assume you are using Linux since you mention wget), run:

    apt-get install curlftpfs 
    
  2. Create local mount path

    mkdir -p /mnt/myftp
    
  3. Mount the destination ftp site using curlftpfs

    curlftpfs -o allow_other ftp://user:pass@ftp.mirror.com /mnt/myftp
    
  4. Use find to select the recent files and have it run cp on each of them:

    cd /mnt/myftp && \
    find . -type f -mtime -20 -exec cp -v -u --parents {} ~/foo/ \;
    

Explanation:

  • The find command will find all files (-type f) on the remote FTP server that were modified less than 20 days ago (-mtime -20).
  • The cp command will copy those files
    • If they are newer than the corresponding file in the target directory (-u)
    • Preserving their parent directories (--parents)
  • The cd /mnt/myftp bit is necessary to make cp create the correct parent directories in the destination folder. If you do not cd to the ftp directory first, cp will create folders like this:

    ~/foo/mnt/myftp/bar
    

    Instead of this:

    ~/foo/bar
    

Combined, these commands/options should have the desired effect of mirroring the remote server while ignoring older files.
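As a rough sketch, the find and cp steps could be wrapped in a small shell function. The name copy_recent and its three arguments are made up for illustration, and the sketch assumes GNU cp (for --parents and -u) and that the FTP site is already mounted:

```shell
# copy_recent SRC DEST DAYS  (hypothetical helper, not a standard tool)
# Copy every regular file under SRC that was modified less than DAYS days ago
# into DEST, recreating its directory path relative to SRC.
# The body runs in a subshell, so the cd does not affect the caller.
copy_recent() (
    src=$1 dest=$2 days=$3
    mkdir -p "$dest" || exit 1
    # Resolve DEST to an absolute path before changing directory.
    dest_abs=$(cd "$dest" && pwd) || exit 1
    cd "$src" || exit 1
    # -u skips files whose copy in DEST is already up to date.
    find . -type f -mtime "-$days" -exec cp -u --parents {} "$dest_abs"/ \;
)
```

With the mount point from step 3, this would be called as copy_recent /mnt/myftp ~/foo 20. Letting find run cp through -exec avoids trouble with file names that contain spaces.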

Caveats:

This is a relatively simple scenario. If you have more advanced requirements (all those wget options you left out), you may want to have a look at man cp or, for more advanced options, man rsync. You can do essentially the same thing with rsync by passing the output of the find command to rsync's --files-from option.
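A minimal sketch of the rsync variant, again assuming the FTP site is mounted locally. It uses rsync's --files-from option, which reads a plain list of paths, so the output of find can be piped in directly. The function name sync_recent and its arguments are hypothetical:

```shell
# sync_recent SRC DEST DAYS  (hypothetical helper, not a standard tool)
# Hand rsync the list of files under SRC that were modified less than
# DAYS days ago, and let it copy them into DEST.
sync_recent() (
    src=$1 dest=$2 days=$3
    mkdir -p "$dest" || exit 1
    dest_abs=$(cd "$dest" && pwd) || exit 1
    cd "$src" || exit 1
    # -print0 together with --from0 keeps unusual file names intact;
    # --files-from=- makes rsync read the file list from standard input.
    find . -type f -mtime "-$days" -print0 |
        rsync -a --update --from0 --files-from=- . "$dest_abs"/
)
```

It would be called the same way, for example sync_recent /mnt/myftp ~/foo 20. Paths in the list are taken relative to the source directory, so the directory structure is recreated under the destination.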

If you update your question with more specific requirements (preserving links, hard links, timestamps, user privileges, directory recursion and the like) I should be able to modify my answer to suit them.

terdon

Posted 2013-01-30T07:51:06.420

Reputation: 45 216

To be honest, it is worse than Linux but better than Windows: I am in a working environment where Linux is not allowed (network policy etc.), so I have to use Cygwin. I will see if I can mount things there and post back next week – Nippey – 2013-01-31T07:07:40.147

No mounting of file systems other than NTFS in Cygwin... :( – Nippey – 2013-02-04T06:46:20.943