Mirror server and ignore already-processed files


Before I start writing my own app for this, maybe there is already a better solution to the problem:

I need to check an HTTP server every day for new files to download and process. Basically these are zip files that need to be extracted.

Old files are eventually deleted and new files are uploaded multiple times a day. I do not want to process a file twice.

My current solution is to save all the files locally and use wget with the -nc option, called by a cronjob twice a day:

wget -nc -t 10 -o wget.log -r -l 1 --no-parent --reject "index.html*" http://myserver/
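
For reference, the crontab entry looks roughly like this (the run times 6:00 and 18:00 are only an example):

0 6,18 * * * wget -nc -t 10 -o wget.log -r -l 1 --no-parent --reject "index.html*" http://myserver/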

Now I can parse the log file, get all newly downloaded files, and process them:

grep saved wget.log | awk '{ print $6}' # generate a list of downloaded files
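
Since the downloads are zip archives, that list can be fed straight into the extraction step, for example (assuming unzip is installed, the printed paths come out unquoted, and extracted/ is used as the target directory):

grep saved wget.log | awk '{ print $6}' | while read f; do unzip -o "$f" -d extracted/; done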

But this way I will accumulate a bunch of files on my disk that I don't need. So, do I need a database to store the already-downloaded files and check for each file whether it was already processed?

reox

Posted 2014-01-14T15:41:10.283

Reputation: 915

Do you have enough access to the HTTP server to know whether it provides rsync access as well, as the various repositories for the Linux distributions do? An rsync-based approach may be easier if the web server's setup supports it, and if all you are doing is grabbing "records" out of a "database", in a general sense. I was coding up a script to mirror a subset of Ubuntu, and reviewing this document on rsync'ing Ubuntu gave me some ideas along the lines of what you are thinking of doing, if that helps at all.

– Billy McCloskey – 2014-01-14T16:11:21.747

No, I don't. But also in that case I would need to keep all the files from the server, because otherwise rsync would not know whether I had already downloaded them. – reox – 2014-01-15T11:58:06.763

Yes, but rsync has filtering options to limit what is downloaded recursively off the site, e.g. all records matching a wildcard, and only what has changed. – Billy McCloskey – 2014-01-15T13:11:51.163

Answers


I have now written a short script that mirrors the server and saves the filenames in a database.

You can also query by MD5 hash, for example if the same filename can show up again with different content.

import urllib.request as urll
import re
import shelve
import hashlib
import time
import os

url = "http://myserver/"   # the directory listing to mirror

res = urll.urlopen(url)
html = res.read().decode('utf-8', errors='replace')

# all links on the index page; the first one is the parent-directory link, so skip it
files = re.findall('<a href="([^"]+)">', html)[1:]

# shelve acts as a small persistent database of already-downloaded files
db = shelve.open('dl.shelve')

print(files)

os.makedirs("dl", exist_ok=True)   # target directory for the downloads

for file in files:
    if file not in db:
        print("Downloading %s..." % file)
        res = urll.urlopen(url + file)
        data = res.read()
        md5 = hashlib.md5(data).hexdigest()

        with open(os.path.join("dl", file), 'wb') as f:
            f.write(data)

        print((time.time(), len(data), md5))
        # remember timestamp, size and MD5 hash for every downloaded file
        db[file] = (time.time(), len(data), md5)

db.close()
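
As a follow-up, since the files are zip archives and the shelve stores an MD5 per file, a rough sketch for looking up duplicates by hash and extracting everything could look like this (the find_by_md5 helper and the extracted/ directory are my own additions, not part of the script above):

import os
import shelve
import zipfile

db = shelve.open('dl.shelve')

def find_by_md5(md5sum):
    # return all filenames whose stored MD5 matches the given hash
    return [name for name, (ts, size, md5) in db.items() if md5 == md5sum]

# extract every downloaded archive into a separate directory
os.makedirs("extracted", exist_ok=True)
for name in db:
    path = os.path.join("dl", name)
    if zipfile.is_zipfile(path):
        with zipfile.ZipFile(path) as zf:
            zf.extractall("extracted")

db.close()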

reox

Posted 2014-01-14T15:41:10.283

Reputation: 915