Recursively mirroring a hosted blog


I am not asking how to download a standard webpage, or website tree, as I know how to do that.

The problem I am having is that wget/DownThemAll/HTTrack/FDM/IDM, etc., do not seem to work with the blog format.

They should in theory, as it is still a standard webpage with links, yet they don't.

I have tried wget with both -m and -r -l 3 to no avail, as well as DownThemAll.
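Roughly what I ran, for reference (someblog.com stands in for the real blog address):

    # full mirror: recursive, infinite depth, timestamping
    wget -m http://someblog.com/

    # recursive download, following links at most three levels deep
    wget -r -l 3 http://someblog.com/

Neither got me the whole blog.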

Specifically, these downloader programs do not seem to follow the blog's tag links, and are not aware that most of the content sits behind the "older posts" style links.

Is there a way to configure one of these downloader programs to follow a specific path through a website, without scripting?
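For example, wget's --include-directories looks like it should let me restrict recursion to particular URL paths; something along these lines (the directory names here are made up, and would have to match whatever path the "older posts" links actually use) is the sort of thing I am after:

    # only follow links into the listed URL directories (placeholders),
    # convert links for offline viewing and fetch inline images/CSS
    wget -r -l inf -k -p --no-parent \
         --include-directories=/search,/2010 \
         http://someblog.com/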

user48869

Posted 2010-09-19T19:22:47.213

Reputation:

It would be nice to know what blog, specifically, you are talking about. – digitxp – 2010-09-19T19:51:03.250

I can't make the question specific to any site or service as per the FAQ, but let's use Blogger as an example. – None – 2010-09-19T20:12:18.277

Answers


Are the blog links pointing to another domain, or something that looks like another domain? For example, you might be telling wget to fetch everything from "someblog.com", but the links point to "www.someblog.com", which resolves to the same site but might still confuse wget.
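If that is what is happening, one thing to try is letting wget span hosts but limiting it to the blog's own domain, so that both someblog.com and www.someblog.com are followed. A rough sketch, reusing the example domain from above:

    # -H allows other hosts; -D restricts the acceptable domains, so
    # www.someblog.com is followed but unrelated sites are not
    wget -m -k -p -H -D someblog.com http://someblog.com/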

LawrenceC

Posted 2010-09-19T19:22:47.213

Reputation: 63 487

No, the links all seem to be on the same domain, but for whatever reason the image tags are not parsed and the images are not saved. The CGI stuff also seems to cause a problem, as I often get the same page multiple times under many different filenames, depending on what wget (or whichever tool) requested... – None – 2010-09-20T12:08:48.557

The image directories might be protected against hotlinking, and one common way to do that is to refuse downloads when the HTTP Referer header is not what the server expects. Investigate wget's --referer option. – LawrenceC – 2010-09-21T02:39:18.080
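For example, a hedged sketch of sending a Referer header with wget (the URL is a placeholder); on wget 1.14 or later, --reject-regex can also skip the query-string URLs that were producing the duplicate filenames mentioned above:

    # present the blog's front page as the Referer so hotlink-protected
    # image directories accept the requests
    wget -m -k -p --referer=http://someblog.com/ http://someblog.com/

    # wget 1.14+: skip any URL containing a query string, so the same
    # CGI page is not saved repeatedly under different filenames
    wget -m -k -p --reject-regex='[?]' http://someblog.com/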