Using Wget to Recursively Crawl a Site and Download Images

13

10

How do you instruct wget to recursively crawl a website and only download certain types of images?

I tried using this to crawl a site and only download Jpeg images:

wget --no-parent --wait=10 --limit-rate=100K --recursive --accept=jpg,jpeg --no-directories http://somedomain/images/page1.html

However, even though page1.html contains hundreds of links to subpages, which themselves have direct links to images, wget reports things like "Removing subpage13.html since it should be rejected", and never downloads any images, since none are directly linked to from the starting page.

I'm assuming this is because my --accept is being used to both direct the crawl and filter content to download, whereas I want it used only to direct the download of content. How can I make wget crawl all links, but only download files with certain extensions like *.jpeg?

EDIT: Also, some pages are dynamic, and are generated via a CGI script (e.g. img.cgi?fo9s0f989wefw90e). Even if I add cgi to my accept list (e.g. --accept=jpg,jpeg,html,cgi) these still always get rejected. Is there a way around this?

Cerin

Posted 2011-03-29T15:23:27.987

Reputation: 6 081

Answers

5

Why don't you try using wget -A jpg,jpeg -r http://example.com?

meoninterwebz

Posted 2011-03-29T15:23:27.987

Reputation: 59

The question states that some of the images are of the form /url/path.cgi?query, so your suggestion will not fetch those. – Charles Stewart – 2012-11-11T00:38:31.170

1

How do you expect wget to know the contents of subpage13.html (and so the JPEGs it links to) if it is not allowed to download it? I suggest you allow html, get what you want, then remove what you don't want afterwards, as in the sketch below.
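A minimal, untested sketch of that approach, reusing the options and domain from the question (adjust the accept list and the cleanup pattern as needed):

wget --recursive --no-parent --wait=10 --limit-rate=100K --no-directories --accept=jpg,jpeg,html http://somedomain/images/page1.html

find . -name '*.html' -delete

The first command lets wget fetch the HTML pages so it can follow their links; the second removes the HTML files once the images are downloaded.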


I'm not quite sure why your CGI pages are getting rejected... does wget print any error output? Perhaps run wget in verbose mode (-v) and see. This might be best asked as a separate question.

That said, if you don't care about bandwidth, just download everything and remove what you don't want afterwards; it doesn't matter.


Also check out --html-extension

From the man page:

-E

--html-extension

If a file of type application/xhtml+xml or text/html is downloaded and the URL does not end with the regexp \.[Hh][Tt][Mm][Ll]?, this option will cause the suffix .html to be appended to the local filename. This is useful, for instance, when you're mirroring a remote site that uses .asp pages, but you want the mirrored pages to be viewable on your stock Apache server. Another good use for this is when you're downloading CGI-generated materials. A URL like http://site.com/article.cgi?25 will be saved as article.cgi?25.html.

Note that filenames changed in this way will be re-downloaded every time you re-mirror a site, because Wget can't tell that the local X.html file corresponds to remote URL X (since it doesn't yet know that the URL produces output of type text/html or application/xhtml+xml). To prevent this re-downloading, you must use -k and -K so that the original version of the file will be saved as X.orig.


--restrict-file-names=unix might also be useful because of those CGI URLs...
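Putting those suggestions together, a hedged, untested sketch using the domain from the question (-E is the short form of --html-extension; -k/-K convert links and keep the originals so re-mirroring works, as the man page excerpt above describes):

wget --recursive --accept=jpg,jpeg,html -E -k -K --restrict-file-names=unix http://somedomain/images/page1.html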

Pricey

Posted 2011-03-29T15:23:27.987

Reputation: 4 262

I should stop linking wget options... I was about to point out --no-parent, but I will stop there. – Pricey – 2011-03-30T15:41:01.883

0

You can also use MetaProducts Offline Explorer without programming

TiansHUo

Posted 2011-03-29T15:23:27.987

Reputation: 119

-1

Try adding the --page-requisites option
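For reference, a minimal, untested sketch of that suggestion using the domain from the question:

wget --recursive --page-requisites http://somedomain/images/page1.html

Note that --page-requisites fetches everything a page needs to render (images, CSS, scripts), not just JPEGs, which is what the comment below objects to.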

ggiroux

Posted 2011-03-29T15:23:27.987

Reputation:

That downloads all linked media. The only way to use wget to download images is to download ALL content on a page?! – Cerin – 2011-03-29T15:48:13.157