Wget having trouble saving just the files I want - excluding directories doesn't seem to work


I want to download all government spending over £500 by the Department of Energy and Climate Change. These are .xls and .xlsx files, generated once per month. They are stored at locations like this:

https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/209425/20130627_April_2013_PUS_.xls

where the number after /file/ is unique and the filenames follow no consistent naming scheme. These files are linked from individual monthly pages of the form:

https://www.gov.uk/government/publications/departmental-spend-over-500-april-2013

which in turn are linked from an index page:

https://www.gov.uk/government/collections/departmental-spend-over-500

This command works:

wget -r --force-html -e robots=off -A xls,xlsx,"" -l 2 \
  https://www.gov.uk/government/collections/departmental-spend-over-500

but as well as the .xls and .xlsx files I also get the complete directory tree of the .gov.uk site (to a depth of two links from where I started), which downloads roughly 100MB of text/html files besides the spreadsheets. That is a bit excessive. So my question is:

How can I make wget download only from the directories above, or alternatively exclude the obvious ones that I don't want?

I've tried the obvious -I, -X, and -D switches, but with no luck. NB: I had to include "" as well as xls,xlsx in the -A list, otherwise wget would ignore the linking HTML pages...
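One other idea I came across but haven't managed to test: wget 1.14 and newer have an --accept-regex switch that matches against the whole URL rather than just the filename suffix, which might sidestep the -A quirk entirely. Something like this, going by the URL shapes above (the regex is my guess, not tested):

wget -r -l 2 -e robots=off \
  --accept-regex 'attachment_data/file/.*\.xlsx?$|/government/(publications|collections)/' \
  https://www.gov.uk/government/collections/departmental-spend-over-500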

Any advice gratefully received! This is on a Mac, by the way.

baronmax

Posted 2015-05-29T20:11:03.707

Reputation: 21

Answers


Ha! Finally worked it out. In the -I (include) list you have to give the full path of each directory, but NOT the hostname part of the URL:

wget -r -A xls,xlsx,"" -l 2 -I /government/uploads/system/uploads/attachment_data/file/,/government/publications/,/government/collections/departmental-spend-over-500 https://www.gov.uk/government/collections/departmental-spend-over-500

Not obvious - well not to me anyway...

(scroll right in the code box to see it all)
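One optional tweak on top of this: if you'd rather have all the spreadsheets land in a single flat folder instead of mirroring gov.uk's directory tree, wget's -nd (--no-directories) and -P (--directory-prefix) switches can be added. A sketch of the same command (the spend-500 folder name is just an example):

wget -r -A xls,xlsx,"" -l 2 -nd -P spend-500 \
  -I /government/uploads/system/uploads/attachment_data/file/,/government/publications/,/government/collections/departmental-spend-over-500 \
  https://www.gov.uk/government/collections/departmental-spend-over-500

Note that with -nd, any clashing filenames get numeric suffixes (.1, .2, ...) rather than overwriting each other.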

Edit: Actually better - I've split the original command out here:

wget -r -A xls,xlsx,"" -l 2 \
  -I /government/uploads/system/uploads/attachment_data/file/,\
/government/publications/,\
/government/collections/departmental-spend-over-500 \
  https://www.gov.uk/government/collections/departmental-spend-over-500

Line 1: recursive; accept xls and xlsx files plus files without an extension (the linking HTML pages, in this case); and go two levels deep from the start URL given on line 5. The trailing backslashes are just shell line continuations so the command can be read here as one line per option.

Lines 2-4: include only these paths/directories under the top-level URL (i.e. exclude everything else).

Line 5: the URL to start from.
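As a quick sanity check after the run, you can list and count what actually landed - wget mirrors into a www.gov.uk folder by default, since we haven't passed -nd or -nH:

find www.gov.uk -type f \( -name '*.xls' -o -name '*.xlsx' \)
find www.gov.uk -type f \( -name '*.xls' -o -name '*.xlsx' \) | wc -l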

baronmax

Posted 2015-05-29T20:11:03.707

Reputation: 21