wget recursively download from pages with lots of links

1

When using wget with the recursive option turned on, I get an error message when it tries to download a file. It treats the link as a downloadable file when in reality it should just follow it to reach the page that actually contains the files (or more links to follow) that I want.

wget -r -l 16 --accept=jpg website.com

The error message is the ".... since it should be rejected." line shown below. This usually occurs when the link it is trying to fetch ends with a query string. The problem doesn't occur, however, when I run the very same wget command on that link directly. I want to know exactly how it tries to fetch the pages. I guess I could always poke around the source, although I don't know how messy the project is. I might also be missing exactly what "recursive" means in the context of wget. I thought it would run through each link and grab the files with the extension I requested.

I posted this over at Stack Overflow, but they sent me over here. :) Hoping you guys can help.

EDIT: Output of error message

2010-04-13 16:54:47 (128 KB/s) - `somewebsite.com/index.php?id=917218' saved [10789]

Removing somewebsite.com/index.php?id=917218 since it should be rejected.

I'd rather not reveal which website it is. :)

Shadow

Posted 2010-04-13T22:18:28.340

Reputation: 113

Please post the actual error message you get, or even better the full output of wget. – sleske – 2010-04-13T22:21:34.387

That message happens a lot with the various websites it is traversing. – Shadow – 2010-04-13T22:58:40.013

Answers

2

As pointed out by Hugh Allen, using just --accept=jpg makes wget keep only files with the extension .jpg (plus .htm and .html, which are always fetched). That's why wget tells you it will remove the .php file. So try --accept=jpg,php or similar.
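For example, adapting the command from the question (website.com is just the placeholder used there):

wget -r -l 16 --accept=jpg,php website.com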

See the wget manual, which I recommend reading; it explains the whole accept/reject mechanism in detail.

sleske

Posted 2010-04-13T22:18:28.340

Reputation: 19 887

I have read that thing like the Bible, and I still can't get it to do what I want. If I include the .php files, won't it just download them rather than fetch the files they link to? And if I run that rejected address through wget by itself with my --accept=jpg option, it gets all the files from the page, so I know wget can read links from .php files. – Shadow – 2010-04-14T04:55:11.730

@Shadow: No, if wget downloads .php files, it should detect that they contain HTML and follow the links inside. If it does not, then there is another problem. But that is impossible to debug without knowing the exact site and page you are downloading (it might be a problem with the server configuration). – sleske – 2010-04-14T11:05:15.707

Ah ha, that could be it. When I added php to the list of accepted files, it then gave errors on HTML files, all of which had query strings at the end. That probably has nothing to do with it, but it irks me a bit. I'll keep poking at it and trying a few different things. – Shadow – 2010-04-14T14:27:10.167

1

Maybe --accept=jpg means reject everything else.
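One way to verify this (a suggestion, not something tested in this thread; it assumes your wget was built with debug support) is to rerun the command with --debug and watch the accept/reject decision it makes for each URL:

wget -r -l 16 --accept=jpg --debug website.com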

Hugh Allen

Posted 2010-04-13T22:18:28.340

Reputation: 8 620

Yes, it does. But wget should still download HTML files to extract any links for recursive retrieval; it will just delete the files afterwards. – sleske – 2010-04-14T15:36:33.783
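If you wanted wget to keep those HTML pages instead of deleting them, one untested variant (extensions guessed from the messages earlier in the thread) is to add them to the accept list:

wget -r -l 16 --accept=jpg,php,html website.com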