Mirror a blog with wget

9

2

I am trying to mirror a blog, e.g. www.example.com, with wget.

I use wget with the following options (shell variables are substituted correctly):

wget -m -p -H -k -E -np \
    -w 1 \
    --random-wait \
    --restrict-file-names=windows \
    -P "$folder" \
    -Q"${quota}m" \
    -t 3 \
    --referer="$url" \
    -U 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.6) Gecko/20070802 SeaMonkey/1.1.4' \
    -e robots=off \
    -D "$domains" \
    -- "$url"

The blog contains images that reside on other domains.

Even though I have specified the -p option (download linked page assets), these images are not downloaded unless I list each domain explicitly in the -D option.

If I omit the -D option then wget will follow every link outside www.example.com and download the whole internet.

Is it possible for wget to follow only the links under www.example.com and download each page’s required assets, whether they reside on the same domain or not, without me having to specify each domain explicitly?

Kostas Andrianopoulos

Posted 2014-10-16T03:17:06.630

Reputation: 91

I'd love to find a good answer to this one also. I've run into the same situation and couldn't find a single wget invocation that did it. I ended up using wget -N -E -H -k -K -p first, and came up with a script to fetch missing linked images. – lemonsqueeze – 2014-10-16T16:52:10.773
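The script mentioned in this comment is not shown, so the following is only a guess at what a second pass might look like. It relies on the documented behaviour of -k, which rewrites links to files that were not downloaded as absolute URLs, so any absolute img URL left in the mirrored HTML points at a missing image. list_image_urls is a hypothetical helper name, and grep -o is assumed to be available (GNU and BSD grep both have it).

```shell
# list_image_urls: hypothetical helper, not part of wget.
# Reads HTML on stdin and prints each absolute http(s) URL found in an
# <img ... src="..."> attribute, one per line, without duplicates.
list_image_urls() {
    grep -o '<img[^>]*src="https\{0,1\}://[^"]*"' \
        | sed 's/.*src="\([^"]*\)"/\1/' \
        | sort -u
}
```

One possible way to use it after the first wget pass, assuming $folder holds the mirror: find "$folder" -name '*.html' -exec cat {} + | list_image_urls | wget -i - -P "$folder/assets". The saved pages would still point at the remote image URLs unless the links are rewritten separately.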

5

According to this one, httrack is a killer for this. I'll give it a shot next time instead of wget.

– lemonsqueeze – 2014-10-16T16:58:00.070

Assuming your blog (minus the page assets) is not spanning multiple domains, try removing both the -D $domains as well as -H. Without -H it should stay within your domain but still retrieve the direct page assets, even when they are on a different domain. – blubberdiblub – 2015-12-19T16:06:22.737

Answers

1

No. The only way is to list the domains that you want wget to follow, using -D or --domains=[domain list], where the list is comma-separated.
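As a rough illustration of the comma-separated form, the sketch below builds the -D value from a list of hosts. join_domains and mirror_blog are hypothetical helper names, and cdn.example.net stands in for whatever host actually serves the images; the wget flags are a subset of those in the question.

```shell
# join_domains: hypothetical helper. Joins its arguments with commas,
# which is the list format -D / --domains expects.
# The body runs in a subshell so the IFS change does not leak out.
join_domains() (
    IFS=,
    printf '%s\n' "$*"
)

# mirror_blog: hypothetical wrapper. First argument is the start URL,
# the remaining arguments are the hosts wget is allowed to visit.
mirror_blog() {
    url=$1
    shift
    wget -m -p -H -k -E -np -w 1 --random-wait -e robots=off \
        -D "$(join_domains "$@")" \
        -- "$url"
}
```

For example, mirror_blog http://www.example.com/ www.example.com cdn.example.net would run wget with -D www.example.com,cdn.example.net. Every asset host still has to be named by hand, which is the limitation this answer describes.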

sparks

Posted 2014-10-16T03:17:06.630

Reputation: 133