Recursive download of subfolder with wget - --no-parent apparently not working

3

1

I need some documentation about XUL but I do not have Internet access most of the time. So, I've tried to download the Mozilla Tutorial with the following command:

wget --no-parent -r -l 2 -p -k https://developer.mozilla.org/en/XUL_Tutorial

My intention was to download both the https://developer.mozilla.org/en/XUL_Tutorial page and its subpages (for example, https://developer.mozilla.org/en/XUL_Tutorial/Install_Scripts). However, even though I passed the --no-parent flag, it keeps getting pages such as https://developer.mozilla.org/index.php?title=Special:Userlogin&returntotitle=en%2FXUL+Tutorial%2FInstall+Scripts.

I do not understand why it happens. How could I achieve the behavior I intended?

brandizzi

Posted 2011-05-30T15:08:40.197

Reputation: 145

Answers

1

I had to disable gzip compression to make it work. I also changed the user-agent because some pages forbid wget. So this is what I've put into my .wgetrc:

header = Accept-Encoding: none

user_agent = Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6

Works great here.
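For a one-off download, the same two settings can probably be passed on the command line instead of being stored in .wgetrc. This is a minimal sketch, assuming wget's standard --header and --user-agent options and reusing the command from the question. It only prints the command (a dry run), so nothing is fetched until "echo" is removed:

```shell
# Same settings as the .wgetrc lines above, passed per invocation.
UA='Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6'

# Collect the arguments so they can be reviewed before running.
set -- --header='Accept-Encoding: none' --user-agent="$UA" \
       --no-parent -r -l 2 -p -k \
       'https://developer.mozilla.org/en/XUL_Tutorial'

# Dry run: print the command. Remove "echo" to actually run it.
echo wget "$@"
```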

Julian Ziegler

Posted 2011-05-30T15:08:40.197

Reputation: 26

11

You need the trailing slash at the end of the URL.
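For background: the wget manual notes that the trailing slash matters to --no-parent. HTTP has no notion of a directory, so wget treats the last path component as a file name unless the URL ends in a slash. Here is a rough sketch of that rule using a hypothetical helper (parent_prefix is not part of wget, and it assumes a URL that has a path and no query string):

```shell
# Hypothetical helper: prints the directory prefix that --no-parent
# would treat as the top of the hierarchy for a given start URL.
parent_prefix() {
  case "$1" in
    */) printf '%s\n' "$1" ;;          # ends in a slash: the URL itself is the directory
    *)  printf '%s\n' "${1%/*}/" ;;    # otherwise: drop the last component (treated as a file)
  esac
}

parent_prefix 'https://developer.mozilla.org/en/XUL_Tutorial'
# prints https://developer.mozilla.org/en/
parent_prefix 'https://developer.mozilla.org/en/XUL_Tutorial/'
# prints https://developer.mozilla.org/en/XUL_Tutorial/
```

With the trailing slash added, the command from the question becomes: wget --no-parent -r -l 2 -p -k https://developer.mozilla.org/en/XUL_Tutorial/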

Dyax

Posted 2011-05-30T15:08:40.197

Reputation: 111

"it only downloads the index.html file" That was because you were using -l 2. Since you did not change the accepted answer, I guess you never increased the recursion level to realize this is the best answer for the question as it had been asked. – Synetech – 2014-12-13T02:42:19.133

This is the correct answer to the question. Confirmed. – Johannes Overmann – 2019-06-28T09:31:20.933

I tried wget --no-parent -r -l 2 -p -k https://developer.mozilla.org/en/XUL_Tutorial/ but it only downloads the index.html file... – brandizzi – 2011-09-15T13:13:58.310

that was my issue! probably common among wget beginners – John Berryman – 2014-01-21T23:35:07.680

1

Was having a similar issue:

wget -r -l1 --no-parent -nH "https://www.website.com/parent/directory/"

I believe there was an issue with https vs. http. I updated $HOME/.wgetrc to:

header = Accept-Encoding: none
header = Accept-Language: en-us,en;q=0.5
header = Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
header = Connection: keep-alive
user_agent = Mozilla/5.0 (Windows NT 5.1; rv:10.0.2) Gecko/20100101 Firefox/10.0.2
referer = http://www.google.com/
robots = off

Then changed https to http:

wget -r -l1 --no-parent -nH "http://www.website.com/parent/directory/"

The wget program no longer created folders (or retrieved files) from outside the specified directory hierarchy.

Dave Jarvis

Posted 2011-05-30T15:08:40.197

Reputation: 2 126

I tried it and it seems to work perfectly. Waiting for the end of the download (which I do not need anymore, actually) to be sure :) However, I did not change to HTTP - I mean, I changed it, but it kept redirecting to HTTPS. Do you know why your .wgetrc seems to be changing the behavior? – brandizzi – 2012-08-24T16:59:10.267