Save a single web page (with background images) with Wget

I want to use Wget to save single web pages (not recursively, not whole sites) for reference. Much like Firefox's "Web Page, complete".

My first problem is: I can't get Wget to save background images specified in the CSS. Even if it did save the background image files, I don't think --convert-links would convert the background-image URLs in the CSS file to point to the locally saved copies. Firefox has the same problem.

My second problem is: if there are images on the page that are hosted on another server (like ads), they won't be included. --span-hosts doesn't seem to solve that problem with the command line below.

I'm using: wget --no-parent --timestamping --convert-links --page-requisites --no-directories --no-host-directories -erobots=off http://domain.tld/webpage.html

user14124

Posted 2009-10-13T23:23:58.830

Reputation: 991

Exactly the same line (wget --no-parent --timestamping --convert-links --page-requisites --no-directories --no-host-directories -erobots=off domain.tld) actually saves background images referenced from the CSS after updating to 1.12. The manual says: "With HTTP URLs, Wget retrieves and parses the HTML or CSS from the given URL, retrieving the files the document refers to, through markup like href or src, or CSS URI values specified using the ‘url()’ functional notation."

The second problem still needs to be solved, though. – user14124 – 2009-10-14T00:23:10.797

Answers

Score: 108

From the Wget man page:

Actually, to download a single page and all its requisites (even if they exist on separate websites), and make sure the lot displays properly locally, this author likes to use a few options in addition to ‘-p’:

wget -E -H -k -K -p http://www.example.com/

Also, in case robots.txt is disallowing you, add -e robots=off.
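
Combined, that looks something like this (the URL is just a placeholder):

wget -E -H -k -K -p -e robots=off http://www.example.com/

Here -E is --adjust-extension, -H is --span-hosts, -k is --convert-links, -K is --backup-converted, and -p is --page-requisites.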

vvo

Reputation: 1,291

Or better yet: wget --recursive --no-clobber --page-requisites --html-extension --convert-links --restrict-file-names=windows [url] – Petah – 2014-10-22T03:34:28.920

@Petah: I tried your command with your arguments; it downloads other web pages besides the one specified. – Tim – 2015-06-18T17:53:25.507

For a page I was trying to get, it worked 100% with the command wget -E -H -k -K -p -e robots=off URL. Thank you. – lowtechsun – 2019-02-23T11:31:22.137

It seems that it is just rewriting JS and CSS to absolute URLs. – Greg Dean – 2012-09-30T06:53:50.293

Never mind, it was robots.txt disallowing me. Please update the answer with the workaround. – Greg Dean – 2012-09-30T07:12:59.127

Expanded: wget --adjust-extension --span-hosts --convert-links --backup-converted --page-requisites [url] – sam – 2013-08-16T16:25:15.433

Score: 7

The wget command offers the option --mirror, which does the same thing as:

$ wget -r -N -l inf --no-remove-listing

You can also throw in -x to create a whole directory hierarchy for the site, including the hostname.
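
For instance, a mirroring run that also forces the full directory hierarchy might look like this (the URL is just a placeholder):

$ wget --mirror -x http://www.example.com/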

You might not have been able to find this if you aren't using the newest version of wget, however.

Ernie Dunbar

Reputation: 679

This will likely crawl the whole website with its sub-URLs. – 4253wyerg4e – 2018-09-13T02:14:27.737

Score: 2

I made Webtography for a similar purpose: https://webjay.github.io/webtography/

It uses Wget and pushes the site to a repository on your GitHub account.

I use these arguments:

--user-agent=Webtography
--no-cookies
--timestamping
--recursive
--level=1
--convert-links
--no-parent
--page-requisites
--adjust-extension
--max-redirect=0
--exclude-directories=blog

https://github.com/webjay/webtography/blob/master/lib/wget.js#L15-L26
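
Put together on a single command line, that would be roughly the following (the target URL is just a placeholder):

wget --user-agent=Webtography --no-cookies --timestamping --recursive --level=1 --convert-links --no-parent --page-requisites --adjust-extension --max-redirect=0 --exclude-directories=blog http://www.example.com/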

webjay

Reputation: 131

Score: 2

It sounds like wget and Firefox are not parsing the CSS for links to include those files in the download. You could work around those limitations by wget'ing what you can and scripting the link extraction from any CSS or JavaScript in the downloaded files to generate a list of files you missed. Then a second run of wget on that list of links could grab whatever was missed (use the -i flag to specify a file listing URLs).
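
As a rough sketch of that approach in shell (the file names here are hypothetical, and relative url() values would still need to be resolved against the page's base URL, e.g. with wget's --base option, before the second pass):

grep -ohE 'url\([^)]*\)' *.css | sed -e 's/^url(//' -e 's/)$//' | tr -d "'\"" > missing.txt
wget -i missing.txt

The first line pulls every url(...) reference out of the downloaded CSS files and strips the wrapper and quotes; the second feeds that list back to wget.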

If you like Perl, there's a CSS::Parser module on CPAN that may give you an easy means to extract links in this fashion.

Note that wget only parses certain HTML markup (href/src) and CSS URIs (url()) to determine what page requisites to get. You might try using Firefox add-ons like DOM Inspector or Firebug to figure out whether the third-party images you aren't getting are being added through JavaScript; if so, you'll need to resort to a script or a Firefox plugin to get them too.

quack quixote

Reputation: 37,382

Try adding the option -H to the list. It stands for --span-hosts and allows downloading of content from external hosts. – Michael - Where's Clay Shirky – 2009-12-30T21:47:16.927

Like I said in the comment on my first post, it seems this has been fixed in v1.12. I still don't know how to include images that are on other servers, though. – user14124 – 2009-10-14T00:32:13.320

Yep, parsing the CSS is new in wget v1.12; it's at the top of the changelog: http://freshmeat.net/urls/376000c9c7a02f7a3592180c2390ff04 – quack quixote – 2009-10-14T00:41:32.050