Using wget to mirror a website and everything from the first level of external sites

5

3

I need to mirror a particular website (all the pages under that particular domain), plus any pages (but not whole sites) that the website links to.

I'm confused about how to do this.

wget -r --level=inf (or some other variant) will mirror the site.

wget -r -H --level=1 will get all the links (from all domains) to the first level.

Does anyone have any ideas on how I could combine these, to get the entire main site and go one level deep into external sites? I've been banging my head against the manual all afternoon.

Thanks

lobsterboy

Posted 2010-09-14T14:51:57.280

Answers

6

This is unfortunately impossible with wget. The attempt at solving this with -H -l 1 does not do what you expect, because -l limits the recursion depth for every host, including the main site. What you want is HTTrack.

httrack --ext-depth=1 http://example.com

This can also be abbreviated as httrack -%e1 http://example.com. Note that HTTrack counts levels starting at 1, not 0, so it won't follow links found on external pages unless you increase the depth.
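As a rough sketch of how the command above might be wrapped for repeated use: the function name httrack_mirror, the DRY_RUN variable and the default ./mirror directory are made up for illustration. -O is HTTrack's option for the output path; --ext-depth=1 is taken from the command above.

```shell
# Hypothetical helper: mirror a site with HTTrack, one level into external sites.
# With DRY_RUN=1 it only prints the command instead of running it.
httrack_mirror() {
    url=$1
    dest=${2:-./mirror}   # assumed default output directory
    # -O sets where the mirror, logs and cache are written.
    set -- httrack "$url" -O "$dest" --ext-depth=1
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}
```

Calling DRY_RUN=1 httrack_mirror http://example.com first lets you check the command line before starting a long download.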

bug

Posted 2010-09-14T14:51:57.280

Reputation: 161

4

I would use a combination of two commands: wget -m -k -K -p http://example.com && wget -r -k -K -H -N -l 1 http://example.com.

About the two commands: wget -m -k -K -p http://example.com mirrors the site (-m is equivalent to -r -N -l inf --no-remove-listing), converts the links to point at your local mirror (-k), backs up each original file before it is converted (-K), and downloads all the prerequisites needed to view the mirror properly (-p).

After that, the second command, wget -r -k -K -H -N -l 1 http://example.com, does essentially the same thing but only one level deep, spanning all hosts (-H). It checks timestamps with -N, so you won't download the same files again. I didn't include the -p option here because it could pull in a very large number of files.
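The two passes could be sketched as a small shell function. The names run, mirror_with_externals and DRY_RUN are made up for illustration; only the wget flags come from the commands above. One deliberate difference from the && form: this sketch runs the second pass even if the first exits non-zero, on the assumption that a few failed requests during the mirror should not block the external pass.

```shell
# Hypothetical helper: print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Hypothetical wrapper around the two wget passes described above.
mirror_with_externals() {
    url=$1
    # Pass 1: full mirror of the start host, with converted links (-k),
    # backups of the originals (-K) and page requisites (-p).
    run wget -m -k -K -p "$url"
    # Pass 2: one level deep (-l 1) across all hosts (-H); -N skips files
    # that are already up to date locally.
    run wget -r -l 1 -H -N -k -K "$url"
}
```

Running DRY_RUN=1 mirror_with_externals http://example.com prints both command lines without touching the network, which is a cheap way to check the flags first.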

p.vitzliputzli

Posted 2010-09-14T14:51:57.280

Reputation: 587