How to download an entire site with wget including its images


I'm trying to download an entire site with wget like this:

wget -r http://whatever/

wget -m http://whatever/

But it only downloads the pages with text, no images. How can I download the pages with text and images? What am I missing here?

assembler

Posted 2019-03-20T17:55:11.827

Reputation: 133

Technically, http:// is part of the URL, so it'd be more accurate to replace http://url.url.url with http://whatever – barlop – 2019-03-20T20:34:34.483

Answers


The wget command you'll need is much lengthier, as explained below. You may wish to save it to a file such as wholesite.sh, make it executable, and run it (an example follows the command below). It will create a directory named after the URL, with subdirectories holding the site's assets, including images, JS, CSS, etc.

wget \
     --recursive \
     --level 5 \
     --no-clobber \
     --page-requisites \
     --adjust-extension \
     --span-hosts \
     --convert-links \
     --restrict-file-names=windows \
     --domains yoursite.com \
     --no-parent \
         yoursite.com
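
For example, assuming you've pasted the command above into wholesite.sh in your current directory (with #!/bin/sh as its first line), you could make it executable and run it like this:

chmod +x wholesite.sh      # make the script executable
./wholesite.sh             # start the download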

Explanation

--recursive Turns on recursive retrieval, so wget follows links down into the site's subdirectories (assets such as images are often kept in subdirectories). The default maximum recursion depth is 5 levels; you can change it with the --level flag just below.

--level 5 Search up to 5 levels of subdirectories for assets. Increase or decrease this if the target site is larger or smaller, respectively.

--no-clobber Don't overwrite existing files.

--page-requisites Causes wget to download all the files needed to properly display a given HTML page, including images, CSS, JS, etc.

--adjust-extension Ensures downloaded files get correct extensions, appending .html, .css, etc. where the original URLs lack them.

--span-hosts Allows wget to fetch from hosts other than the one you started on, so required assets hosted offsite can be downloaded too.

--convert-links Rewrites links in the downloaded pages so they point at your local copies, letting the site be browsed offline.

--restrict-file-names=windows Modify filenames to work in Windows as well, in case you're using this command on a Windows system.

--domains yoursite.com Do not follow links to hosts outside this comma-separated list of domains (see the note after this list for how it interacts with --span-hosts).

--no-parent Don't follow links outside the directory you pass in.

yoursite.com # The URL to download
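
Note on --span-hosts and --domains: --domains restricts which hosts wget will accept while spanning, so assets served from a host that isn't listed (a CDN, an images subdomain, etc.) may be skipped even with --span-hosts enabled. Here is a sketch using the short forms of the same flags, assuming a hypothetical CDN host cdn.yoursite.com alongside yoursite.com; adjust the host list to wherever the site actually serves its assets:

# Same options as above in short form, but also accepting the CDN host
wget -r -l 5 -nc -p -E -H -k --restrict-file-names=windows -np \
     -D yoursite.com,cdn.yoursite.com \
     yoursite.com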


Example adapted from: https://gist.github.com/christiangenco/8531418

baelx

Posted 2019-03-20T17:55:11.827

Reputation: 2 083

You're writing your own explanations, but in a format that looks like it might be copy/pasted from something official. You should be clearer. This is the official description for -r: -r, --recursive specify recursive download. And if it's infinitely recursive then probably nobody would use it without -l; when I've used -r it has ALWAYS been with -l, so it won't just download infinitely. You seem to be suggesting -r alone to download "a whole site", but if that's infinite then perhaps it might not stop until your hard drive is stuffed with much of the internet? – barlop – 2019-03-20T20:11:29.927

Your assessment of recursive retrieving is fair. I figured the default max depth (for modern versions of GNU Wget) of 5 would be adequate for this user's needs. I've personally used versions of this command with more or fewer flags; in this case I copied it from an online source to get a faster answer. I'll rewrite it to include more in-depth explanations. – baelx – 2019-03-20T20:11:35.283

When you say --span-hosts "Include necessary assets from offsite as well", are you sure? Are you sure that, combined with your -r -l 5, it won't lead to all sorts of links that aren't relevant and are from other sites? "Necessary assets" sounds like just images from elsewhere, but spanning hosts at a depth of -r -l 5 could go far beyond that. I'm not sure how one would make wget fetch pages recursively on the local site, but fetch only necessary assets like images, without recursing, when it goes off-site. – barlop – 2019-03-20T20:44:15.773

And if you don't follow links outside your domain, will it still get necessary assets like images? Maybe it will work, but have you tested it? I might try it some time; it sounds interesting. The wget --help just says -D, --domains=LIST comma-separated list of accepted domains, so it doesn't specifically say "following links" and may be more general, so I'm not sure it will still get images from off-site. Maybe it will. – barlop – 2019-03-20T20:44:44.533
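
One way to settle that question is to test a single page before committing to a full crawl: the GNU Wget manual suggests -E -H -k -K -p for downloading one page together with everything needed to display it, including offsite requisites. A sketch with a placeholder page URL (somepage.html is hypothetical):

# Fetch a single page plus its requisites, spanning hosts, and check
# whether the offsite images actually arrive before running the full command
wget -E -H -k -K -p http://whatever/somepage.html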