I want to download all government spending over £500 by the Department of Energy and Climate Change. These are .xls and .xlsx files, generated once per month. They are stored at locations like this:
https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/209425/20130627_April_2013_PUS_.xls
where the number after `file` is unique and the filename itself has no naming consistency. These files are linked from individual monthly pages of the form:
https://www.gov.uk/government/publications/departmental-spend-over-500-april-2013
which in turn are linked from an index page:
https://www.gov.uk/government/collections/departmental-spend-over-500
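For what it's worth, the three URL levels above are regular enough to pin down with simple patterns. A small Python sketch (`is_spreadsheet` and `is_monthly_page` are hypothetical helper names, and the patterns are inferred only from the example URLs quoted here):

```python
import re

# Spreadsheet attachments live under /attachment_data/file/<number>/<name>.xls(x)
FILE_RE = re.compile(
    r"https://www\.gov\.uk/government/uploads/system/uploads/"
    r"attachment_data/file/\d+/[^/]+\.xlsx?$"
)

# Monthly publication pages share a fixed slug prefix followed by month-year
PAGE_RE = re.compile(
    r"https://www\.gov\.uk/government/publications/"
    r"departmental-spend-over-500-[a-z]+-\d{4}$"
)

def is_spreadsheet(url):
    """True for a direct .xls/.xlsx attachment URL."""
    return bool(FILE_RE.match(url))

def is_monthly_page(url):
    """True for one of the monthly 'spend over 500' pages."""
    return bool(PAGE_RE.match(url))
```

Patterns like these could feed either a script or a regex-capable download filter, so only the two interesting URL shapes survive.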
This command works:
wget -r --force-html -e robots=off -A xls,xlsx,"" -l 2 https://www.gov.uk/government/collections/departmental-spend-over-500
but as well as the .xls and .xlsx files it also pulls down the complete directory tree of the gov.uk site (to a depth of two links from the starting page), which amounts to roughly 100 MB of HTML files I don't want. So my question is:
How can I make wget download only from the directories above, or alternatively exclude the obvious ones that I don't want?
I've tried the obvious -I, -X and -D switches, but with no luck. NB: I had to include "" as well as xls and xlsx in the -A list, otherwise wget would ignore the linking HTML files and never follow them.
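If wget's filters can't be coaxed into it, another option is a short script that walks exactly the three levels described above instead of crawling blindly: fetch the index, extract the monthly publication links, fetch each of those, and download only the spreadsheet attachments. A minimal, network-free sketch of the link-extraction half, using only the Python standard library (`LinkCollector` and `extract_links` are names made up for this sketch):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkCollector(HTMLParser):
    """Collect absolute href targets from one HTML page."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative hrefs against the page's own URL
                    self.links.append(urljoin(self.base_url, value))

def extract_links(html, base_url):
    """Return every link on the page as an absolute URL."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links
```

The rest would be plumbing: `urllib.request.urlopen` the index page, keep the links containing `publications/departmental-spend-over-500`, fetch each of those, and `urllib.request.urlretrieve` any link ending in .xls or .xlsx. That avoids downloading anything outside the pages you actually care about.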
Any advice gratefully received! This is on a Mac, by the way.