How do you use wget to mirror a site one level deep, recovering JS and CSS resources, including CSS images?

Pretend I want a simple copy of a page downloaded to my HD for permanent keeping. I'm not looking for a deep recursive get, just a single page, but I also want any resources loaded by that page to be downloaded.

Example: https://www.tumblr.com/

Expect:

  • The index.html
  • Any loaded images
  • Any loaded JS files
  • Any loaded CSS files
  • Any images loaded in the CSS file
  • Links to the page resources rewritten to point at the downloaded copies (no web dependency)

I'm interested to know if you can help me find the best wget syntax, or another tool, that will do this. The tools I have tried usually fail to get the images loaded by CSS, so the page never looks right when loaded locally. Thank you!

Tangent Solution

I found a way to do this using Firefox. The default save is broken, and there is an add-on called "Save Complete" which apparently can do a good job with this. However, you can't download it, because it says it is not supported in the current Firefox version. The reason is that it was rolled into the add-on "Mozilla Archive Format". Install that, and when you use File > "Save Page As..." there is a new option called "Web Page, complete", which is essentially the old add-on and which fixes the stock implementation Firefox uses (which is terrible). This isn't a wget solution, but it does provide a workable one.

EDIT: Another ridiculous issue for anyone who might be following this question in the future and trying to do this. To get the add-on to work properly, you need to go to Tools > Mozilla Archive Format and change the (terrible) default setting of "take a faithful snapshot of the page" to "preserve scripts and source using Save Complete"; otherwise the add-on will empty all your script files and replace them with the text "/* Script removed by snapshot save */".

Lana Miller

Posted 2011-10-01T02:26:42.047

Reputation: 371

File > Save As in Firefox or another browser will download all images, JS and CSS files – user31113 – 2011-10-01T02:34:31.263

Do you actually want the files, or do you just want a correctly rendered version of the page? – None – 2011-10-01T02:36:32.070

I want the files; they would be required to correctly render the page anyway. If you didn't have them, it would look different. File > Save As does not work in Firefox: if you do this, you don't get the CSS images. Try it at https://www.tumblr.com/login. The background image is missing, and the bg images for the input fields are missing.

– None – 2011-10-01T02:43:41.463

None of the wget solutions worked for me. My Tangent Solution is the best method I have found to achieve this kind of site saving. However, I have seen it fail on very complicated pages like http://www.apple.com, presumably because a lot of the resource paths are dynamically generated by executing JavaScript, some not right away but during some kind of AJAX execution.

– Lana Miller – 2011-12-16T11:13:13.323

Answers

12

wget -p -k http://ExampleSite.com

The -p will get you all the elements required to view the site correctly (CSS, images, etc.). The -k will convert all links (including those to CSS and images) so that you can view the page offline as it appeared online.
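
A hedged variant, not from the original answer: if the server delivers pages or stylesheets under extension-less URLs, the -E flag (--adjust-extension, called --html-extension in older wget releases) is a common companion to -p and -k, as the man page excerpt later in this thread also suggests.

# Sketch only: same page-requisites download, with file extensions adjusted so local links resolve
wget -p -k -E http://ExampleSite.com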

Update: This is specific to your example site, tumblr.com:

wget -H -N -k -p --exclude-domains quantserve.com --no-check-certificate -U "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0a2) Gecko/20110613 Firefox/6.0a2" https://www.tumblr.com

The Breakdown:

-H = allows wget to span foreign hosts. Required since tumblr does not serve the images on its front page from the same address; they use secure.assets.tumblr.com. See the note on excluding domains.

-N = will grab only files that are newer than what you currently have, in case you are downloading the same page again over time

-k = converts your links so the page can be viewed offline properly

-p = grabs all required elements to view it correctly (css, images, etc)

--exclude-domains = since the tumblr.com homepage has a link to quantserve.com, and I'm guessing you don't want this stuff, you need to exclude it from your wget download. Note: this is a pretty important one to use with -H, because if you go to a site that has multiple links to outside hosts (think advertisers and analytics stuff) then you are going to grab that stuff too! (A whitelist alternative is sketched after this breakdown.)

--no-check-certificate = required since tumblr is using https

-U = changes the user agent. Not really necessary in this instance, since tumblr allows the default wget user agent, but I know some sites will block it. I just threw it in here in case you run into any problems on other sites. In the example snippet I gave, it identifies as Mozilla Firefox 6.0a2.

Finally, you have the site: https://www.tumblr.com
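
If you would rather whitelist trusted hosts than blacklist every tracker, wget's -D/--domains option restricts -H to the listed domains (it uses suffix matching, so secure.assets.tumblr.com is still covered by tumblr.com). A hedged sketch of that alternative, assuming the page's assets still live under tumblr.com:

# Sketch: span hosts, but only those whose domain ends in tumblr.com
wget -H -D tumblr.com -N -k -p --no-check-certificate https://www.tumblr.com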

serk

Posted 2011-10-01T02:26:42.047

Reputation:

I tried this, didn't get any JS or CSS or image files. Did you? – None – 2011-10-01T03:53:10.540

If you are using it on tumblr (your example above), you may have to specify --no-check-certificate. – None – 2011-10-01T03:57:05.247

I think you're right, it probably does need that option. Still nothing except index.html however. Something is missing... – None – 2011-10-01T04:09:23.247

@LanaMiller I updated my answer. Let me know if there are any issues. – None – 2011-10-01T12:00:15.277

Could you not do something like -exclude-domains != tumblr.com? – alpha1 – 2011-10-02T04:34:43.920

3

For the specific site you mentioned, and many others coded like it, wget (and curl) just won't work. The issue is that some of the asset links required to render the page in a browser are themselves created through JavaScript. Wget has a feature request pending to run JavaScript:

http://wget.addictivecode.org/FeatureSpecifications/JavaScript

However, until that is complete, sites that build asset links using JavaScript won't be cloneable using wget. The easiest solution is to find a tool that actually builds a DOM and parses JavaScript like a browser engine (i.e. the Firefox method you mentioned).
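
A rough sketch of that idea, not part of the original answer: let a browser engine execute the JavaScript first, then hand the asset URLs it resolved to wget. This assumes a headless Chromium/Chrome binary is available and that a simple URL pattern is enough to catch the assets you care about.

# 1. Render the page with a real browser engine and dump the resulting DOM
google-chrome --headless --dump-dom https://www.tumblr.com/ > rendered.html
# 2. Pull asset URLs out of the rendered markup and fetch them with wget
#    (-x keeps the host/path directory structure, -i - reads URLs from stdin)
grep -Eo 'https?://[^" ]+\.(css|js|png|jpg|jpeg|gif)' rendered.html | sort -u | wget -x -i -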

polynomial

Posted 2011-10-01T02:26:42.047

Reputation: 1 424

1

You can also do this automatically (or programmatically, if you do coding) by issuing a command via the shell using wget:

wget --convert-links -r http://www.yourdomain.com

It will download the page and its internal files and make the links local.
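
As the comment below points out, a bare -r fetches far more than one page. A hedged refinement, not the answerer's command, that stays closer to the question by limiting recursion to one level and pulling page requisites:

# Sketch: one level of recursion plus page requisites, with links rewritten for local viewing
wget --convert-links --page-requisites --level=1 -r http://www.yourdomain.com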

Jhourlad Estrella

Posted 2011-10-01T02:26:42.047

Reputation: 129

This will get everything. Read the question. – evgeny – 2011-10-01T03:13:35.140

-1

wget -r http://www.example.com

I think that will grab everything, but give it a shot and find out.

Seth

Posted 2011-10-01T02:26:42.047

Reputation:

It gets everything, which is way too much. So far the Firefox solution I found is the best working solution. It gets what you need and nothing more. – None – 2011-10-01T03:33:34.183

-1

From man wget:

-p

--page-requisites

This option causes Wget to download all the files that are necessary to properly display a given HTML page. This includes such things as inlined images, sounds, and referenced stylesheets.

Ordinarily, when downloading a single HTML page, any requisite documents that may be needed to display it properly are not downloaded. Using -r together with -l can help, but since Wget does not ordinarily distinguish between external and inlined documents, one is generally left with ''leaf documents'' that are missing their requisites.

For instance, say document 1.html contains an "<IMG>" tag referencing 1.gif and an "<A>" tag pointing to external document 2.html. Say that 2.html is similar but that its image is 2.gif and it links to 3.html. Say this continues up to some arbitrarily high number.

If one executes the command:

wget -r -l 2 http://<site>/1.html

then 1.html, 1.gif, 2.html, 2.gif, and 3.html will be downloaded. As you can see, 3.html is without its requisite 3.gif because Wget is simply counting the number of hops (up to 2) away from 1.html in order to determine where to stop the recursion. However, with this command:

wget -r -l 2 -p http://<site>/1.html

all the above files and 3.html's requisite 3.gif will be downloaded. Similarly,

wget -r -l 1 -p http://<site>/1.html

will cause 1.html, 1.gif, 2.html, and 2.gif to be downloaded. One might think that:

wget -r -l 0 -p http://<site>/1.html

would download just 1.html and 1.gif, but unfortunately this is not the case, because -l 0 is equivalent to -l inf---that is, infinite recursion. To download a single HTML page (or a handful of them, all specified on the command-line or in a -i URL input file) and its (or their) requisites, simply leave off -r and -l:

wget -p http://<site>/1.html

Note that Wget will behave as if -r had been specified, but only that single page and its requisites will be downloaded. Links from that page to external documents will not be followed. Actually, to download a single page and all its requisites (even if they exist on separate websites), and make sure the lot displays properly locally, this author likes to use a few options in addition to -p:

wget -E -H -k -K -p http://<site>/<document>

To finish off this topic, it's worth knowing that Wget's idea of an external document link is any URL specified in an "<A>" tag, an "<AREA>" tag, or a "<LINK>" tag other than "<LINK REL="stylesheet">".
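
Applied to the page in the question, the man page's recommended combination would look something like the line below. This is a sketch rather than something verified in this thread; --no-check-certificate is carried over from the earlier answer because the site is served over https.

# Sketch: single page plus its requisites from any host, with links converted for local viewing
wget -E -H -k -K -p --no-check-certificate https://www.tumblr.com/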

James Sumners

Posted 2011-10-01T02:26:42.047

Reputation: 139

Which part of this do you think resembles the solution? I tried reading the contents of the man page and I don't see the correct solution here. Did you try any of this yourself? What do you think the command is that addresses the question specifically? – None – 2011-10-01T03:47:45.800