How to download all images from a website (not just a single webpage) using the terminal?


I want a command where I type a URL, for example photos.tumblr.com, and it downloads all the photos on that site into a folder, not only the images on the site's homepage. The command needs to download images from all parts of the site, such as photos.tumblr.com/ph1/1.png or photos.tumblr.com/ph3/4.jpg.

Please show me an example using this URL: http://neverending-fairytale.tumblr.com/ and test it before answering the question.

Zignd

Posted 2012-06-08T13:36:18.567

Reputation: 481

Answers

5

You can use:

wget -r -A jpg,png http://website.com

With this command you will get all the JPG and PNG files, but you may get banned from the site.

So, if you use:

wget --random-wait --limit-rate=100k -r -A jpg,png http://website.com

This way you'll get your images while waiting a random time between downloads and limiting the download speed.
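For reference, a combined sketch of the above (assuming GNU wget; website.com is a placeholder, and -e robots=off is the flag mentioned in the comments below for sites whose robots.txt would otherwise block the recursive crawl):

    wget -r -l inf -np -e robots=off \
         --random-wait --limit-rate=100k \
         -A jpg,jpeg,png http://website.com

Here -l inf removes the recursion depth limit and -np (--no-parent) keeps wget from ascending above the starting directory.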

Vic Abreu

Posted 2012-06-08T13:36:18.567

Reputation: 399

wget respects robots.txt, so it may not recurse if robots.txt disallows it. I think the flag -e robots=off should be added to the answer. – jojman – 2019-12-17T00:22:21.327

Your command is not working – Zignd – 2012-06-08T14:06:15.737

Please check the post again, I edited it – Zignd – 2012-06-08T14:12:33.980

Maybe you've been banned anyway. – Vic Abreu – 2012-06-08T14:17:16.860

tumblr is the kind of site that would very likely ban these scraping scripts. – heltonbiker – 2012-06-08T14:18:58.137

1

You can download the entire website (I would use wget -r -p -l inf -np), then (or simultaneously) run a shell script to delete all non-image files (the file command can be used to check if a file is an image).

(The -A/-R options of wget are not fully reliable for this: they only match against the filename part of the URL (so you can filter by .jpg, .jpeg, .png, etc.), but URLs are not required to carry an extension at all.)
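A rough sketch of this approach (assuming GNU wget and the standard file utility; website.com and the resulting website.com/ directory are placeholders):

    # mirror the site, with no depth limit and without ascending above it
    wget -r -p -l inf -np http://website.com

    # afterwards, delete everything whose content is not actually an image
    find website.com -type f | while read -r f; do
        case "$(file --brief --mime-type "$f")" in
            image/*) ;;        # real image, keep it
            *) rm -- "$f" ;;   # anything else, delete it
        esac
    done

Checking the MIME type with file avoids relying on the URL extension at all, which is exactly the weakness of -A/-R described above.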

Ankur

Posted 2012-06-08T13:36:18.567

Reputation: 709

1

You are unlikely to get good results with the brute-force approach that most one-liner commands give (although I do use the wget option to fetch a whole site quite often).

I would suggest creating a script that uses some form of conditional selection and loops to actually match and follow the kind of links that lead to the images you want.

The strategy I usually follow:

  • In the browser, go to the first page of interest and show the source code;
  • Right click an image -> "Image properties" -> locate the 'src=' attributes and the image tags.
  • Get the overall pattern of these tags/links/hrefs, and use some regex (grep -o) to parse the links;
  • Use these links with some command to download the images;
  • Also collect the links on that page that lead to other pages;
  • Repeat.

This is indeed much more complicated than a one-liner that grabs everything, but the experience is enlightening. Web scraping is an art in itself.

For this I would also recommend Python, although it is perfectly possible to do it in shell script (bash) if you prefer, or in any other scripting language (Ruby, PHP, Perl, etc.).
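As an illustration, here is a minimal bash sketch of one iteration of that loop, using the URL from the question. The regular expressions for the src= attributes and the pager links, as well as the file names (page.html, image-urls.txt, images/), are assumptions that will need adjusting to the real markup:

    #!/bin/bash
    page_url="http://neverending-fairytale.tumblr.com/"

    # 1. fetch the page source
    curl -s "$page_url" -o page.html

    # 2. parse the image links out of the src= attributes with grep -o
    grep -oE 'src="[^"]*\.(jpg|jpeg|png|gif)"' page.html |
        sed 's/^src="//; s/"$//' > image-urls.txt

    # 3. download the images
    mkdir -p images
    wget --directory-prefix=images --input-file=image-urls.txt

    # 4. collect the links that lead to other pages, then repeat for each of them
    grep -oE 'href="[^"]*/page/[0-9]+"' page.html | sed 's/^href="//; s/"$//'

Wrapping steps 1–4 in a loop (or a recursive function) over the page links gives the full crawler described above.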

Hope this helps.

heltonbiker

Posted 2012-06-08T13:36:18.567

Reputation: 129

0

You can use a git repo such as this one:

https://github.com/nixterrimus/tumbld

There are also other repos which provide similar functionality.

Mark Anderson

Posted 2012-06-08T13:36:18.567

Reputation: 1