I am trying to mirror a blog, e.g. www.example.com, with wget.
I use wget with the following options (shell variables are substituted correctly):
wget -m -p -H -k -E -np \
-w 1 \
--random-wait \
--restrict-file-names=windows \
-P $folder \
-Q${quota}m \
-t 3 \
--referer=$url \
-U 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.6) Gecko/20070802 SeaMonkey/1.1.4' \
-e robots=off \
-D $domains \
-- $url
The blog contains images that reside on other domains.
Even though I have specified the -p option (download page requisites), these images are not downloaded unless I list each domain explicitly in the -D option.
If I omit the -D option, wget follows every link outside www.example.com and downloads the whole internet.
Is it possible for wget to just follow every link under www.example.com and download each page's required assets, whether they reside on the same domain or not, without me having to specify each domain explicitly?
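For illustration, the explicit workaround looks something like the following; the extra domains are made-up placeholders for wherever the images actually live:
-D www.example.com,img.example-cdn.net,photos.example-host.org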
I'd love to find a good answer to this one also. I've run into the same situation and couldn't find a single wget invocation that did it. I ended up using wget -N -E -H -k -K -p first, and came up with a script to fetch the missing linked images. – lemonsqueeze – 2014-10-16T16:52:10.773
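(For reference, a rough sketch of the kind of fix-up script that comment describes; this is a guess at the approach rather than lemonsqueeze's actual script, and it assumes the mirror lives under $folder and that only common image extensions matter:)
# Collect absolute image URLs from the mirrored HTML and fetch
# whichever ones wget skipped because they live on other hosts.
grep -rhoE 'src="https?://[^"]+\.(png|jpe?g|gif)"' "$folder" \
  | sed 's/^src="//; s/"$//' \
  | sort -u \
  | wget -N -P "$folder/external-assets" -i -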
According to this one, httrack is a killer for this. I'll give it a shot next time instead of wget.
– lemonsqueeze – 2014-10-16T16:58:00.070

Assuming your blog (minus the page assets) is not spanning multiple domains, try removing both the -D $domains option and -H. Without -H it should stay within your domain but still retrieve the direct page assets, even when they are on a different domain. – blubberdiblub – 2015-12-19T16:06:22.737
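(For concreteness, a sketch of what that suggestion amounts to when applied to the invocation from the question; untested, with the same shell variables the asker defines:)
wget -m -p -k -E -np \
-w 1 \
--random-wait \
--restrict-file-names=windows \
-P $folder \
-Q${quota}m \
-t 3 \
--referer=$url \
-U 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.6) Gecko/20070802 SeaMonkey/1.1.4' \
-e robots=off \
-- $url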