How to save all the webpages linked from one


I would like to save this webpage and all the pages it links to, and I hope to keep the same linking between the saved webpages.

Is there a way to do this other than opening and saving each linked page by hand?

Tim

Posted 2011-04-23T04:28:25.927

Reputation: 12 647

Answers


You can do what you'd like with the wget command line utility. If you provide it with the -r option, it will recursively download web pages. For example:

wget -r http://mat.gsia.cmu.edu/orclass/integer/integer.html

This will download that webpage and anything it links to. You can also limit the recursion to a certain number of levels; to do this, add the -l option with a number, like so:

wget -r -l 5 http://mat.gsia.cmu.edu/orclass/integer/integer.html

Wuffers

Posted 2011-04-23T04:28:25.927

Reputation: 16 645

@Mark: Thanks! I am now trying to download http://mat.gsia.cmu.edu/orclass/ and the pages it links to, using the command wget -r http://mat.gsia.cmu.edu/orclass. wget creates a directory mat.gsia.cmu.edu under the one I specified and downloads the pages under it. But the links between the downloaded pages do not have mat.gsia.cmu.edu in their paths, so they are broken and I cannot go from one page to another by clicking the links. I was wondering why that happens and how to solve the problem? Thanks!

– Tim – 2011-04-23T14:33:14.340

I don't think that you can recursively download external links, @Tim. – Wuffers – 2011-04-23T16:11:28.350

Does "external links" mean those not under the current path? – Tim – 2011-04-23T16:39:23.237

@Tim: By external links I mean links that refer outside of mat.gsi.cmu.edu – Wuffers – 2011-04-23T16:40:09.523

Thanks! Is http://mat.gsia.cmu.edu/classes/integer/integer.html outside of mat.gsi.cmu.edu or of http://mat.gsia.cmu.edu/orclass/ ? http://mat.gsia.cmu.edu/orclass/ is what I used in the wget command, and the downloaded webpage of http://mat.gsia.cmu.edu/classes/integer/integer.html is not reachable from the downloaded page of http://mat.gsia.cmu.edu/orclass/

– Tim – 2011-04-23T16:51:21.953

If you download mat.gsia.cmu.edu/orclass, it will not recurse into the directory mat.gsia.cmu.edu/class/integer/. So you will have to do wget -r mat.gsia.cmu.edu if you want them both (note that this will download everything on that site). – Wuffers – 2011-04-23T16:54:16.320

With the command wget -r mat.gsia.cmu.edu/orclass, it did recurse into mat.gsia.cmu.edu/class/integer/. The only problem is that in the downloaded webpages, the links from the downloaded mat.gsia.cmu.edu/orclass to the downloaded mat.gsia.cmu.edu/class/integer/ pages do not work, because the local link addresses are not correct, as pointed out in my first comment. – Tim – 2011-04-23T17:05:37.220

@Tim: Oh, OK. Sorry for the misunderstanding. I think that you could try editing the HTML files yourself to check and try to make them work. – Wuffers – 2011-04-23T17:12:46.640
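For anyone trying that manual route, a rough sketch of what it could look like is below. It assumes the mirror lives in the mat.gsia.cmu.edu directory that wget created and that you serve it from a local web server rooted at that directory, so that root-relative links resolve; the sed pattern is only a guess at what needs rewriting, so back the files up first. (The -k option described in the next answer automates this link rewriting at download time.)

# Rough sketch, GNU sed syntax: turn absolute links to the original host into
# root-relative links, then browse the mirror through a local web server whose
# document root is the mat.gsia.cmu.edu directory.
find mat.gsia.cmu.edu -name '*.html' -exec sed -i 's|http://mat\.gsia\.cmu\.edu/|/|g' {} +
cd mat.gsia.cmu.edu && python3 -m http.server 8000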


This thread is old now, but others might look at it. Thank you, Wuffers, for pointing me in the right direction. To expand on Wuffers's answer: a modern version of wget has a number of useful options for recursing links and patching them to be local relative links so that you can navigate a local copy of a web site. Use the -r option to recurse, the -k option to patch links for local viewing, the -H option to traverse into domains other than the original one, the -D option to limit which domains you traverse into, the -l option to limit the depth of recursion, and the -p option to make sure that the leaves of your traversal have everything they need to display correctly. For example, the following will download a page and everything it immediately links to, making it locally browsable; the -p option ensures that if the linked-to pages contain images, those are downloaded too:

wget -r -l 1 -p -k -H -D domain.com,relateddomain.com http://domain.com/page/in/domain

Using a command similar to the one above, I was able to download a chunk of a wiki page, with external links, onto my local disk without downloading megabytes of extraneous data. Now, when I open the root page in my browser, I can navigate the tree without an Internet connection. The only irritant was that the root page was buried in subdirectories and I had to create a top-level redirect page in order to make it convenient to display. It may take some trial-and-error to get it right. Read the wget man page and experiment.
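If the buried root page is an irritant for you too, one possible workaround is a tiny hand-written redirect file at the top of the download directory. The path below is just a placeholder matching the example command above; point it at wherever wget actually put your starting page.

# Placeholder sketch: create a top-level entry point that jumps to the mirrored start page.
cat > index.html <<'EOF'
<!DOCTYPE html>
<html>
  <head><meta http-equiv="refresh" content="0; url=domain.com/page/in/domain/index.html"></head>
  <body><a href="domain.com/page/in/domain/index.html">Open the mirrored site</a></body>
</html>
EOF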

Pablo Halpern

Posted 2011-04-23T04:28:25.927

Reputation: 211


You can use a website crawler like httrack, which is free.

From the website:

[httrack] allows you to download a World Wide Web site from the Internet to a local directory, building recursively all directories, getting HTML, images, and other files from the server to your computer. HTTrack arranges the original site's relative link-structure. Simply open a page of the "mirrored" website in your browser, and you can browse the site from link to link, as if you were viewing it online.
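A minimal invocation of the command-line version might look something like the one below; the output directory is a placeholder, and the "+*mat.gsia.cmu.edu/*" scan rule (HTTrack's filter syntax) is an assumption meant to keep the crawl on the original site.

httrack "http://mat.gsia.cmu.edu/orclass/" -O ./orclass-mirror "+*mat.gsia.cmu.edu/*" -v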

RJFalconer

Posted 2011-04-23T04:28:25.927

Reputation: 9 791

+1 Excellent application! But it grabbed all the linked zip files as well, which I didn't want. But then I should probably have read the instructions first! – finlaybob – 2014-03-28T16:29:14.683

Yup, it can/will follow all links, so it will download those files too. (@Finlaybob, are you aware the homepage listed on your profile has been hacked?) – RJFalconer – 2014-03-30T20:04:59.307

I was not! I'll look into it - thanks for letting me know! – finlaybob – 2014-03-31T16:19:22.433
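Regarding the zip files mentioned above: HTTrack's scan rules can also exclude file types, so an invocation along these lines (site and output path are placeholders) should skip them:

httrack "http://www.example.com/" -O ./mirror "+*www.example.com/*" "-*.zip" -v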