wget: Turn Off Forced .html Retrieval

0

When performing a recursive download, I use the -R parameter to give wget a pattern to reject. However, if a matching file is an HTML file, wget downloads it anyway, whether or not it matches the pattern.

e.g.

wget -r -R "*dynamicfile*" example.com

still retrieves files such as example.com/dynamicfile1.html

Is there a way to prevent this?

Mike B

Posted 2010-04-20T17:13:26.963

Reputation: 1

Answers

0

wget does this because it needs the HTML files to find the links to follow as it crawls the site. I would just let wget finish and then run rm *.html afterwards, or something similar.

EDIT: Running rsync *dynamicfile* /foo/bar to copy the matching files to a second directory might be a better way to filter, keeping only the files with the right name (assuming you want to keep some of the HTML files if their names match).
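Note that rm *.html only touches the current directory, while a recursive download usually creates nested directories. A minimal sketch of a cleanup that reaches every level, using find; the scratch directory and file names below are made up for illustration and stand in for a finished wget -r download:

```shell
# Scratch tree standing in for a finished `wget -r` download (hypothetical names).
SITE_DIR="$(mktemp -d)"
mkdir -p "$SITE_DIR/sub"
touch "$SITE_DIR/index.html" "$SITE_DIR/dynamicfile1.html" "$SITE_DIR/sub/dynamicfile2.html"

# Delete every regular file whose name matches the reject pattern, at any depth.
find "$SITE_DIR" -type f -name '*dynamicfile*' -exec rm -f -- {} +

# Count what is left; only index.html should survive in this scratch tree.
REMAINING="$(find "$SITE_DIR" -type f | wc -l | tr -d ' ')"
echo "$REMAINING"
```

On a real download, point SITE_DIR at the directory wget created and run the find with -print in place of the -exec clause first to check what would be removed.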

Jarvin

Posted 2010-04-20T17:13:26.963

Reputation: 6 712

I'm trying to filter the file because it causes wget to get stuck in an infinite loop, so this won't work. – Mike B – 2010-04-20T18:05:12.377

Sounds like the infinite loop is the real issue you're trying to deal with. That is different enough that you should probably post a new question asking how to prevent infinite loops with wget instead. – Jarvin – 2010-04-20T18:36:23.607

You should add a depth limit to wget. This will make sure it isn't an infinite loop. – Jarvin – 2010-04-20T18:42:26.697
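A depth limit is set with wget's -l (--level) option; when it is not given, recursive retrieval uses wget's default maximum depth of 5. A minimal sketch, assuming a depth of 2 is enough for the site (the depth value and the dry-run echo are illustrative and not from the thread):

```shell
# Hypothetical values: adjust the depth and host for the site being crawled.
DEPTH=2
URL="example.com"

# -r recurses, -l caps how many links deep the crawl may go,
# -R keeps the reject pattern from the question.
set -- wget -r -l "$DEPTH" -R "*dynamicfile*" "$URL"

# Dry run: print the command for review. Run "$@" instead to start the download.
CMD="$*"
echo "$CMD"
```

A smaller depth bounds how far the crawl can wander, but it may also skip pages that sit deeper in the site, so the right value depends on the site's layout.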