You will hardly get good results with the brute-force approach most one-liner commands give you (although I do use wget's whole-site option a lot).
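For the record, that whole-site grab is something along these lines; a sketch only, where the URL and the `-A` extension list are placeholders to adjust for the site at hand:

```bash
# Recursively fetch a site, keeping only image files.
# example.com and the -A extension list are placeholders.
# -r: recurse; -l 3: depth limit; -np: don't ascend to parent dirs;
# -nd: save everything flat; -w 1: wait 1s between requests.
wget -r -l 3 -np -nd -A jpg,jpeg,png,gif -w 1 http://example.com/gallery/
```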
I would suggest writing a script that uses some form of conditional selection and looping to match and follow only the kind of links that lead to the images you want.
The strategy I usually follow:
- In the browser, go to the first page of interest and view its source code;
- Right-click an image -> "Image properties" to locate the `src=` attribute and the surrounding image tag;
- Work out the overall pattern of these tags/links/hrefs, and use a regex (e.g. with `grep -o`) to parse the links out;
- Feed these links to some command that downloads the images;
- Also collect the links on the page that lead to other pages;
- Repeat (a minimal sketch of this loop follows the list).
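Here is a minimal bash sketch of that loop. Everything in it is an assumption to be adapted: the starting URL, the `src="...jpg"` pattern, and the `rel="next"` marker for the next-page link are placeholders standing in for whatever patterns you actually find in the page source.

```bash
#!/usr/bin/env bash
# Sketch only: the start URL and both regex patterns are placeholders.
url="http://example.com/gallery/page1.html"

while [ -n "$url" ]; do
    page=$(wget -qO- "$url")          # fetch the page's HTML

    # Extract image URLs matching the pattern and download each one.
    # Assumes the src attributes hold absolute URLs.
    echo "$page" \
      | grep -o 'src="[^"]*\.jpg"' \
      | sed 's/^src="//; s/"$//' \
      | while read -r img; do
            wget -nc "$img"           # -nc: skip files already downloaded
        done

    # Find the next page (assumed to be marked rel="next" here).
    # The loop ends when no such link is found.
    url=$(echo "$page" \
      | grep -o 'href="[^"]*"[[:space:]]*rel="next"' \
      | sed 's/^href="//; s/"[[:space:]]*rel="next"$//' \
      | head -n1)
done
```

A real site will of course need its own patterns, and once the markup gets messy a proper HTML parser (Python's BeautifulSoup, for instance) is more robust than regexes.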
This is indeed much more involved than a one-liner that grabs everything, but the experience is enlightening; web scraping is an art in itself.
For this I would also recommend Python, although it is perfectly possible in shell script (bash) if you prefer, or in any scripting language for that matter (Ruby, PHP, Perl, etc.).
Hope this helps.
`wget` respects robots.txt, so it may not recurse if robots.txt disallows it. I think the flag `-e robots=off` should be added to the answer. – jojman – 2019-12-17T00:22:21.327

your command is not working – Zignd – 2012-06-08T14:06:15.737
please check out the post again, i edited it – Zignd – 2012-06-08T14:12:33.980
Maybe you've been banned anyway. – Vic Abreu – 2012-06-08T14:17:16.860
tumblr is the kind of site that would very likely ban these scraping scripts. – heltonbiker – 2012-06-08T14:18:58.137