Make wget download page resources on a different domain

16

7

How do you use wget to download an entire site (domain A) when its resources are on another domain, (domain B)?
I've tried:
wget -r --level=inf -p -k -E --domains=domainA,domainB http://www.domainA

Parsa

Posted 2010-04-09T05:47:38.643

Reputation: 425

That command doesn't work because --domains by itself doesn't turn on --span-hosts. Adding --span-hosts would have solved the problem. :| – Parsa – 2014-10-19T01:19:08.417

Wow! No one after all this time? – Parsa – 2010-10-01T08:03:50.447

Answers

14

wget --recursive --level=inf --page-requisites --convert-links --html-extension \
     --span-hosts=domainA,domainB url-on-domainA

UPDATE: The command above worked for me back in 2010, when I was using GNU Tools for Windows. When I tried it again more recently, I had to change it to the following:

wget --recursive --level=inf --page-requisites --convert-links \
     --adjust-extension --span-hosts --domains=domainA,domainB domainA

The shorthand for that would be: wget -rEpkH -l inf -D domainA,domainB domainA

  • -r = --recursive
  • -l <depth> = --level=<depth>
  • -E = --adjust-extension
  • -p = --page-requisites
  • -K = --backup-converted
  • -k = --convert-links
  • -D <domain-list> = --domains=<domain-list>
  • -H = --span-hosts
  • -np = --no-parent
  • -U <agent-string> = --user-agent=<agent-string>

GNU Wget Manual: https://www.gnu.org/software/wget/manual/wget.html
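
For reference, here is a minimal sketch of the long-form command wrapped in a POSIX shell function, so the domain list and start URL sit in one place. The names mirror_cmd, start_url and domains are made up for illustration, and domainA/domainB are the placeholders from the question. It only prints the command (a dry run); nothing is downloaded.

```shell
#!/bin/sh
# Placeholders from the question; substitute real host names.
start_url="http://www.domainA/"
domains="domainA,domainB"

# mirror_cmd (hypothetical helper) prints the wget command line
# instead of running it, so the options can be checked first.
mirror_cmd() {
  # $1 = comma-separated domain list, $2 = start URL
  # --span-hosts allows leaving the starting host;
  # --domains then restricts which hosts may be followed.
  echo wget \
    --recursive --level=inf \
    --page-requisites --convert-links --adjust-extension \
    --span-hosts --domains="$1" \
    "$2"
}

cmd=$(mirror_cmd "$domains" "$start_url")
echo "$cmd"
```

Copy the printed line into a terminal to run the actual download.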

Parsa

Posted 2010-04-09T05:47:38.643

Reputation: 425

Try --span-hosts --domains=example.org,iana.org - I think --span-hosts needs to be a boolean, and then you use --domains to specify which hosts to span. – Eric Mill – 2014-10-18T20:47:38.590

Konklone, --span-hosts is a boolean from 1.12 and later, I didn't know that. @MatthewFlaschen, I updated the answer. By the way, that will still work on 1.11 and earlier, if you're using GNU Tools for Windows. – Parsa – 2014-10-19T01:11:21.513

Amazing answer, clear, with short version, everything explained, before and after, maintained. Wow! Thank you – Tomáš Votruba – 2019-07-09T13:43:57.920

I get: wget: --span-hosts: Invalid boolean `domainA,domainB'; use `on' or `off'. After changing it to on, it does not work. – Matthew Flaschen – 2014-02-14T01:26:33.363

@MatthewFlaschen What I've written here worked for me. Could you provide the arguments you've used? – Parsa – 2014-02-26T02:04:34.300

I don't have the exact command I ran before. However, I have the same problem with:

wget --recursive --level=inf --page-requisites --convert-links --html-extension --span-hosts=example.org,iana.org example.org

I'm using GNU Wget 1.13.4 on Debian. – Matthew Flaschen – 2014-02-28T05:42:05.993

1

wget --recursive --level=inf --page-requisites --convert-links --html-extension -rH -DdomainA,domainB domainA

mnml

Posted 2010-04-09T05:47:38.643

Reputation: 1 391

This partly works. However, for some reason, it doesn't seem to work if the URL (at the end) is a redirect. Also, it downloads links too, not just page requisites. Also, -r and --recursive are the same. – Matthew Flaschen – 2014-02-14T01:44:21.103

0

wget --page-requisites --convert-links --adjust-extension --span-hosts --domains domainA,domainB domainA

You might need to ignore robots.txt (note that this may violate some sites' terms of service, and you should download only the minimum required). See https://www.gnu.org/software/wget/manual/wget.html#Robot-Exclusion
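
A rough sketch of what that could look like, assuming you have checked that ignoring robots.txt is acceptable for the site. The manual's Robot Exclusion section describes -e robots=off; the --wait=1 pause between requests is my own addition to keep the load down, not part of the command above. polite_cmd, start_url and domains are hypothetical names, and the function only prints the command rather than running it.

```shell
#!/bin/sh
# Placeholders from the question; substitute real host names.
start_url="http://www.domainA/"
domains="domainA,domainB"

# polite_cmd (hypothetical helper) prints the command as a dry run.
polite_cmd() {
  # $1 = comma-separated domain list, $2 = start URL
  # -e robots=off : ignore robots.txt for this run
  # --wait=1      : assumption, pause one second between retrievals
  echo wget \
    -e robots=off --wait=1 \
    --page-requisites --convert-links --adjust-extension \
    --span-hosts --domains "$1" \
    "$2"
}

polite=$(polite_cmd "$domains" "$start_url")
echo "$polite"
```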

Matthew Flaschen

Posted 2010-04-09T05:47:38.643

Reputation: 2 370

-1

Consider using HTTrack. It has more options than wget for crawling content on other domains. Using wget with --span-hosts, --domains and --accept was insufficient for my needs, but HTTrack did the job. I remember that setting a limit on redirections to other domains helped a lot.

watbywbarif

Posted 2010-04-09T05:47:38.643

Reputation: 590