I'm trying to download an entire site with wget
like this:
wget -r http://whatever/
wget -m http://whatever/
But it only downloads the pages with text, no images. How can I download the pages with text and images? What am I missing here?
The wget command you'll need is much lengthier, as explained below. As such, you may wish to commit it to a file like wholesite.sh, make it executable, and run it. It'll create a directory named after the URL, with subdirectories for the site's assets, including images, JS, CSS, etc.
wget \
--recursive \
--level 5 \
--no-clobber \
--page-requisites \
--adjust-extension \
--span-hosts \
--convert-links \
--restrict-file-names=windows \
--domains yoursite.com \
--no-parent \
yoursite.com
--recursive
This turns on recursive retrieval, so wget follows links into the site's subdirectories (since assets like images are often kept in subdirectories of the site). The default maximum recursion depth is 5; you can modify this with the level flag just below.
--level 5
Recurse at most 5 levels deep when searching for assets. I'd recommend increasing or decreasing this if the target site is larger or smaller, respectively.
--no-clobber
Don't overwrite existing files.
--page-requisites
causes wget
to download all the files that are necessary to properly display a given HTML page which includes images, css, js, etc.
--adjust-extension
Adds the proper file extension (.html, .css, etc.) to downloaded files that would otherwise lack one, so they open correctly locally.
--span-hosts
Include necessary assets from offsite as well.
--convert-links
Update the site's links so they work as local files within the downloaded directory tree (for viewing locally).
--restrict-file-names=windows
Modify filenames to work in Windows as well, in case you're using this command on a Windows system.
--domains yoursite.com
Do not follow links outside this domain.
--no-parent
Don't ascend to the parent directory; stay within the directory you pass in.
yoursite.com
# The URL to download
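Putting the command into a script file, as the answer suggests, might look like the following. This is just a sketch: the wholesite.sh filename comes from the answer, and taking the domain as a positional argument (instead of hard-coding yoursite.com) is one possible way to wire it up.

```shell
# Write the wget invocation into wholesite.sh and make it executable.
# The positional $1 argument replaces the hard-coded yoursite.com.
cat > wholesite.sh <<'EOF'
#!/bin/sh
# Usage: ./wholesite.sh <domain>   e.g. ./wholesite.sh yoursite.com
site="$1"
wget \
    --recursive \
    --level 5 \
    --no-clobber \
    --page-requisites \
    --adjust-extension \
    --span-hosts \
    --convert-links \
    --restrict-file-names=windows \
    --domains "$site" \
    --no-parent \
    "$site"
EOF
chmod +x wholesite.sh
```

You'd then run it as ./wholesite.sh yoursite.com, and the mirrored pages and assets land in a directory named after the domain.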
You're writing your own explanations, but in a format that looks like it might be copy/pasted from something official. You should be clearer. This is the official description for -r:
-r, --recursive specify recursive download
And if it's infinitely recursive, then probably nobody would use it without -l. I know that when I've used -r, it has ALWAYS been used with -l, and then it won't just download infinitely. You seem to be suggesting -r alone, to download "a whole site". But if that's infinite, then perhaps it might not stop until your hard drive is stuffed with much of the internet? – barlop – 2019-03-20T20:11:29.927
Your assessment of recursive retrieving is fair. I figured the default max depth (for modern versions of GNU Wget) of 5 would be adequate for this user's needs. I've personally used versions of this command with more or fewer flags, and in this case I copied it from an online source to get a faster answer. I'll rewrite it to include more in-depth explanations. – baelx – 2019-03-20T20:11:35.283
When you say --span-hosts Include necessary assets from offsite as well, are you sure? Are you sure that, when combined with your -r -l 5, it won't actually lead to all sorts of links that aren't relevant and are from other sites? "Necessary assets" sounds like just images from elsewhere, but spanning hosts at a depth of -r -l 5 could go way further. I'm not sure how one would ensure wget gets pages recursively when local, but for necessary assets like images, when going off-site, how would one make it not go recursive? – barlop – 2019-03-20T20:44:15.773
And if you don't follow links outside your domain, then will it still get necessary assets like images? Maybe it will work, but have you tested it? I might try it some time; it sounds interesting. The wget --help just says -D, --domains=LIST comma-separated list of accepted domains, so it doesn't specifically say following links and may be more general, so I'm not sure it will still get images off-site; maybe it will. – barlop – 2019-03-20T20:44:44.533
Technically, http is part of the URL, so it'd be more accurate to replace http://url.url.url with http://whatever
– barlop – 2019-03-20T20:34:34.483