How to automate downloading files?

4

1

I got a book which included a pass to access digital versions of hi-res scans of much of the artwork in the book. Amazing! Unfortunately, the presentation of all of these is 177 pages of 8 images each, with links to zip files of jpgs. It is extremely tedious to browse, and I would love to get all the files at once rather than sitting and clicking through each one separately.

The pages are archive_bookname/index.1.htm through archive_bookname/index.177.htm, and each of those pages has 8 links to the files.

The links point to files such as <snip>/downloads/_Q6Q9265.jpg.zip, <snip>/downloads/_Q6Q7069.jpg.zip, and <snip>/downloads/_Q6Q5354.jpg.zip, which don't quite go in order. I cannot get a directory listing of the parent /downloads/ folder.

Also, the files are behind a login wall, so using a non-browser tool might be difficult without knowing how to recreate the session info.

I've looked into wget a little but I'm pretty confused and have no idea if it will help me with this. Any advice on how to tackle this? Can wget do this for me automatically?

Damon

Posted 2012-04-30T02:51:22.907

Reputation: 2 119

Answers

2

Using Python might be easier, so this is a solution using Python. If Python is not an option for you, then ignore this. :)

I'm assuming scraping the website is legal.

Write a simple Python script to loop through archive_bookname/index.1-177.htm, scrape them using BeautifulSoup, locate the links with either CSS class selectors or simple regex matching, then use urllib.urlretrieve() to get the files. That's how I'd do it.
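For concreteness, here is a minimal sketch of that approach in Python 3 (the urllib.urlretrieve() mentioned above becomes urllib.request.urlretrieve() there). The base URL, output folder, and the .jpg.zip link pattern are assumptions for illustration, not values from the actual site, and the sketch assumes the index pages and zip files are reachable without the login cookie; if they are not, you would need to attach your session cookie first (for example via urllib.request.build_opener() or a requests.Session).

import os
import re
import urllib.request
from urllib.parse import urljoin
from bs4 import BeautifulSoup  # pip install beautifulsoup4

BASE = "http://example.com/archive_bookname"  # placeholder base URL, not the real site
OUT = "downloads"
os.makedirs(OUT, exist_ok=True)

for page in range(1, 178):  # index.1.htm .. index.177.htm
    index_url = "%s/index.%d.htm" % (BASE, page)
    soup = BeautifulSoup(urllib.request.urlopen(index_url).read(), "html.parser")

    # Collect every link that ends in .jpg.zip; swap in a CSS class selector
    # (soup.select) if the page markup makes that easier.
    for a in soup.find_all("a", href=re.compile(r"\.jpg\.zip$")):
        file_url = urljoin(index_url, a["href"])
        dest = os.path.join(OUT, os.path.basename(a["href"]))
        if not os.path.exists(dest):  # skip anything already fetched
            urllib.request.urlretrieve(file_url, dest)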

Bibhas

Posted 2012-04-30T02:51:22.907

Reputation: 2 490

I definitely have legal access to all the files on it; I know that much. I contacted them to mention that I wished there was an easier way to access the files, and never got a response – Damon – 2012-04-30T15:17:45.920

why code-format that?! – Chris2048 – 2012-05-01T21:52:03.510

Cause I'm much more accustomed to python than wget. Was waiting for someone to post a wget solution. :-) – Bibhas – 2012-05-02T04:08:57.897

@Bibhas sorry, I didn't mean there is anything wrong with your answer, just why did you put "I'm assuming scraping the website is legal" in code formatting? – Chris2048 – 2012-05-02T08:21:14.683

@Chris2048 Oh! That's not a code tag; that's a blockquote. I wanted to highlight that line, that's why. – Bibhas – 2012-05-02T19:50:35.120

I have to log in to access the files. Will that affect this method? (Yeah, ages later... these solutions were all pretty confusing and I haven't bothered yet.) – Damon – 2012-09-06T19:59:12.407

Then you're out of luck. If there is no absolute URL for these files that you can auto-generate, then it's not possible. If the files are behind some authentication check, this won't work. – Bibhas – 2012-09-07T10:34:13.880

2

You can specify an input HTML file with

wget -F -i <file>

so you could just dump the HTML files and loop through them
(I've added a base URL for relative links):

for i in <whereYouDumpedFiles>/*.html
do
  wget -F -B <base-url> -i "$i"
done

alternatively

you could just dump the links to a file (separated by newlines) by whatever method and do this:

wget -i <url-file>

a good way to get at those links would be:

lynx -hiddenlinks=ignore -nonumbers -listonly -dump <relevant-url> \
 | sort | uniq | sed '/<regexp-pattern-of-urls-you-want>/!d'

possibly in a for loop that appends to 'url-file'

Chris2048

Posted 2012-04-30T02:51:22.907

Reputation: 543

0

Or you can simply use Perl and its brilliant module called WWW::Mechanize. It's really simple to put something together, and there are tons of examples in the official CPAN documentation.

milosgajdos

Posted 2012-04-30T02:51:22.907

Reputation: 218

'simply' use PERL is not accurate for me :p I do some programming, but not familiar at all on how to start looking into that.. – Damon – 2012-09-06T19:42:38.737