Wget having trouble saving just the files I want - excluding directories doesn't seem to work


I want to download all government spending over £500 by the Department of Energy and Climate Change. These are .xls and .xlsx files, generated once per month. They are stored at locations like this:

https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/209425/20130627_April_2013_PUS_.xls

where the number after /file/ is unique and the filenames follow no consistent naming scheme. These files are linked from individual monthly pages of the form:

https://www.gov.uk/government/publications/departmental-spend-over-500-april-2013

which in turn are linked from an index page:

https://www.gov.uk/government/collections/departmental-spend-over-500

This command works:

wget -r --force-html -e robots=off -A xls,xlsx,"" -l 2 \
  https://www.gov.uk/government/collections/departmental-spend-over-500

but as well as the .xls and .xlsx files I also get the complete directory tree of the .gov.uk site (to a depth of two links from where I started), which downloads roughly 100MB of text/html files besides the spreadsheets. That is a bit excessive. So my question is:

How can I make wget download only from the directories above, or alternatively exclude the obvious ones that I don't want?

I've tried the obvious -I, -X, and -D switches, but with no luck. NB: I had to include "" as well as xls,xlsx in the -A list, otherwise wget would ignore the linking HTML pages...
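One other idea I came across but haven't managed to test: wget 1.14 and newer have an --accept-regex switch that matches against the whole URL rather than just the filename suffix, which might sidestep the -A quirk entirely. Something like this, going by the URL shapes above (the regex is my guess, not tested):

wget -r -l 2 -e robots=off \
  --accept-regex 'attachment_data/file/.*\.xlsx?$|/government/(publications|collections)/' \
  https://www.gov.uk/government/collections/departmental-spend-over-500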

Any advice gratefully received! This is on a Mac, by the way.

baronmax

Posted 2015-05-29T20:11:03.707

Reputation: 21

Answers


Ha! Finally worked it out. In the -I (include) list you have to give the full path of each directory, but NOT the hostname part of the URL:

wget -r -A xls,xlsx,"" -l 2 -I /government/uploads/system/uploads/attachment_data/file/,/government/publications/,/government/collections/departmental-spend-over-500 https://www.gov.uk/government/collections/departmental-spend-over-500

Not obvious - well not to me anyway...

(scroll right in the code box to see it all)
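One optional tweak on top of this: if you'd rather have all the spreadsheets land in a single flat folder instead of mirroring gov.uk's directory tree, wget's -nd (--no-directories) and -P (--directory-prefix) switches can be added. A sketch of the same command (the spend-500 folder name is just an example):

wget -r -A xls,xlsx,"" -l 2 -nd -P spend-500 \
  -I /government/uploads/system/uploads/attachment_data/file/,/government/publications/,/government/collections/departmental-spend-over-500 \
  https://www.gov.uk/government/collections/departmental-spend-over-500

Note that with -nd, any clashing filenames get numeric suffixes (.1, .2, ...) rather than overwriting each other.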

Edit: Actually better - I've split the original command out here:

wget -r -A xls,xlsx,"" -l 2 \
  -I /government/uploads/system/uploads/attachment_data/file/,\
/government/publications/,\
/government/collections/departmental-spend-over-500 \
  https://www.gov.uk/government/collections/departmental-spend-over-500

Line 1: recursive; accept xls and xlsx files plus files without an extension (the linking HTML pages, in this case); and go two levels deep from the start URL given on line 5. The trailing backslashes are just shell line continuations so the command can be read here as one line per option.

Lines 2-4: include only these paths/directories under the top-level URL (i.e. exclude everything else).

Line 5: the URL to start from.
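As a quick sanity check after the run, you can list and count what actually landed - wget mirrors into a www.gov.uk folder by default, since we haven't passed -nd or -nH:

find www.gov.uk -type f \( -name '*.xls' -o -name '*.xlsx' \)
find www.gov.uk -type f \( -name '*.xls' -o -name '*.xlsx' \) | wc -l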

baronmax

Posted 2015-05-29T20:11:03.707

Reputation: 21