How do you use wget to download an entire site (domain A) when its resources are on another domain, (domain B)?
I've tried:
wget -r --level=inf -p -k -E --domains=domainA,domainB http://www.domainA
wget --recursive --level=inf --page-requisites --convert-links --html-extension \
--span-hosts=domainA,domainB url-on-domainA
UPDATE: I remember the command above worked for me in the past (that was 2010 and I was using GNU Tools for Windows back then); however I had to change it to the following when I wanted to use it today:
wget --recursive --level=inf --page-requisites --convert-links \
--adjust-extension --span-hosts --domains=domainA,domainB domainA
The shorthand for that would be: wget -rEpkH -l inf -D domainA,domainB domainA
-r = --recursive
-l <depth> = --level=<depth>
-E = --adjust-extension
-p = --page-requisites
-K = --backup-converted
-k = --convert-links
-D <domain-list> = --domains=<domain-list>
-H = --span-hosts
-np = --no-parent
-U <agent-string> = --user-agent=<agent-string>
GNU Wget Manual: https://www.gnu.org/software/wget/manual/wget.html
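Putting the long options together, here is a minimal sketch (example.com and static.example.com are placeholder domains standing in for domain A and domain B). The script only assembles and echoes the mirroring command so the flags can be reviewed; remove the echo to actually run it.

```shell
#!/bin/sh
# Placeholder domains -- substitute the real site (domain A) and its
# asset host (domain B) before running.
SITE="example.com"
ASSETS="static.example.com"

# Assemble the full mirroring command from the long options above.
CMD="wget --recursive --level=inf --page-requisites --convert-links \
--adjust-extension --span-hosts --domains=$SITE,$ASSETS http://$SITE/"

# Echo instead of executing so the command can be inspected first.
echo "$CMD"
```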
Try --span-hosts --domains=example.org,iana.org - I think --span-hosts needs to be a boolean, and then you use --domains to specify which hosts to span. – Eric Mill – 2014-10-18T20:47:38.590
Konklone, --span-hosts is a boolean from 1.12 and later, I didn't know that. @MatthewFlaschen, I updated the answer. By the way, that will still work on 1.11 and earlier, if you're using GNU Tools for Windows. – Parsa – 2014-10-19T01:11:21.513
Amazing answer, clear, with short version, everything explained, before and after, maintained. Wow! Thank you – Tomáš Votruba – 2019-07-09T13:43:57.920
I get: wget: --span-hosts: Invalid boolean `domainA,domainB'; use `on' or `off'. After changing it to on, it does not work. – Matthew Flaschen – 2014-02-14T01:26:33.363
@MatthewFlaschen What I've written here worked for me. Could you provide the arguments you've used? – Parsa – 2014-02-26T02:04:34.300
I don't have the exact command I ran before. However, I have the same problem with:
wget --recursive --level=inf --page-requisites --convert-links --html-extension --span-hosts=example.org,iana.org example.org
I'm using GNU Wget 1.13.4 on Debian. – Matthew Flaschen – 2014-02-28T05:42:05.993
wget --recursive --level=inf --page-requisites --convert-links --html-extension -rH -DdomainA,domainB domainA
This partly works. However, for some reason, it doesn't seem to work if the URL (at the end) is a redirect. Also, it downloads links too, not just page requisites. Also, -r and --recursive are the same. – Matthew Flaschen – 2014-02-14T01:44:21.103
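If the goal is a single page plus its requisites rather than a full crawl, dropping --recursive avoids following links entirely; --page-requisites alone fetches just the images, stylesheets, and scripts the page needs. A sketch with the same placeholder domains (the command is echoed for review, not executed):

```shell
#!/bin/sh
# Placeholder domains -- adjust before running.
SITE="example.com"
ASSETS="static.example.com"

# Without --recursive, wget fetches one page plus its requisites only;
# --span-hosts/--domains still allow assets hosted on domain B.
CMD="wget --page-requisites --convert-links --adjust-extension \
--span-hosts --domains=$SITE,$ASSETS http://$SITE/page.html"
echo "$CMD"
```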
wget --page-requisites --convert-links --adjust-extension --span-hosts --domains domainA,domainB domainA
You might need to ignore robots.txt (note, this may be a violation of some terms of service, and you should download the minimum required). See https://www.gnu.org/software/wget/manual/wget.html#Robot-Exclusion .
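Ignoring robots.txt is done with wget's -e robots=off switch. A sketch with placeholder domains (again the command is only printed so it can be reviewed first):

```shell
#!/bin/sh
# Placeholder domains -- adjust before running.
SITE="example.com"
ASSETS="cdn.example.com"

# -e robots=off makes wget ignore robots.txt exclusions; use it
# sparingly and respect the site's terms of service.
CMD="wget -e robots=off --page-requisites --convert-links \
--adjust-extension --span-hosts --domains=$SITE,$ASSETS http://$SITE/"
echo "$CMD"
```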
Consider using HTTrack. It has more options for crawling content on other domains than wget does. Using wget with --span-hosts, --domains, and --accept was insufficient for my needs, but HTTrack did the job. I remember that setting a limit on redirections to other domains helped a lot.
The reason that command doesn't work is that using --domains by itself doesn't turn --span-hosts on. Adding --span-hosts would've solved the problem. :| – Parsa – 2014-10-19T01:19:08.417
Wow! No one after all this time? – Parsa – 2010-10-01T08:03:50.447