Code Golf Image Downloader

20

1

WARNING: Answers may be useful to some code golfers.

In many challenges, the post contains images, which must be saved to a file in order to be able to work on the problem. This is an especially tedious manual task. We programmers should not have to be subjected to such drudgery. Your task is to automatically download all the images contained in a Code Golf.SE question.

Rules

  • Your program may connect to any part of stackexchange.com, but may not connect to any other domains, excepting the locations of the images (i.e., don't bother with a URL shortener).
  • An integer N is given as input, on the command line or stdin.
  • The URL http://codegolf.stackexchange.com/questions/N is guaranteed to be a valid link to a Code Golf question.
  • Each image displayed in the body of question N must be saved to a file on the local computer. Either of the following locations is acceptable:
    • The current directory
    • A directory input by the user
  • Your program must not save files other than the images in the question body (e.g. user avatars, or images contained in answers).
  • Images must be saved with the same file extension as the original.

This is code golf: write the shortest program you can.

Validity criterion for answers

There are various possible edge cases with multiple images of the same name, text with the same name as HTML elements, etc. An answer will be invalidated only if it can be shown to fail on some revision of a question posted before January 10, 2015.

feersum

Posted 2015-01-12T15:19:26.883

Reputation: 29 566

Should the image names be kept the same or can we do like 0.png, 1.png etc – stokastic – 2015-01-12T15:57:34.687

@stokastic You can name the part before the extension to whatever you want (as long as you don't use the same name twice, overwriting a previous file). – feersum – 2015-01-12T16:06:21.847

Answers

10

Mathematica, 211 210 bytes

i=Import;FileNameTake@#~Export~i@#&/@ImportString["body"/.("items"/.i["http://api.stackexchange.com/2.2/questions/"<>InputString[]<>"?site=codegolf&filter=!*Lgp.gEWHA6BNP.l","JSON"])[[1]],{"HTML","ImageLinks"}]

Ungolfed:

i = Import;
FileNameTake@#~Export~i@# & /@ 
 ImportString[
  "body" /. (
    "items" /. 
      i["http://api.stackexchange.com/2.2/questions/" <> 
        InputString[] <> "?site=codegolf&filter=!*Lgp.gEWHA6BNP.l", 
       "JSON"]
  )[[1]], 
  {"HTML", "ImageLinks"}
 ]

It's pretty straightforward. I've set up a filter for the StackExchange API, which returns only the body of a question. The code retrieves the question information with that filter and parses it as JSON. I select the correct element (the body), and use ImportString to parse the HTML and filter out all image URLs. FileNameTake@#~Export~Import@# then downloads each of the images and stores it in the current working directory with the same file name as that in the URL.

You can find out the current working directory with Directory[].

In principle, there's a much shorter version, because ImportString can actually download all the files right away, instead of just giving me the URLs. But then I lose information about the original file type (since they are converted to Image objects upon download), so I can only save them all as the same type (PNG, say).
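For readers who do not use Mathematica, the same pipeline (unwrap the JSON, take the body of the first item, collect the image links from the HTML) can be sketched in Python. This is only an illustrative sketch, not a golfed entry: the names ImgCollector, image_urls_from_api_json and fetch_question_json are hypothetical, the URL and filter string are copied from the answer above, and the gzip handling assumes the API compresses its responses.

```python
import gzip
import json
import urllib.request
from html.parser import HTMLParser

# URL and filter string are copied from the answer above; {} is the question number.
API = ("http://api.stackexchange.com/2.2/questions/{}"
       "?site=codegolf&filter=!*Lgp.gEWHA6BNP.l")


class ImgCollector(HTMLParser):
    """Collect the src attribute of every <img> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.urls.append(src)


def image_urls_from_api_json(text):
    """Return the image URLs in the body of the first item of an API response."""
    body = json.loads(text)["items"][0]["body"]
    parser = ImgCollector()
    parser.feed(body)
    return parser.urls


def fetch_question_json(n):
    """Network step: download the API response for question n as text."""
    with urllib.request.urlopen(API.format(n)) as resp:
        raw = resp.read()
    # Assumption: responses are usually gzip-compressed; check the magic bytes
    # and fall back to the raw bytes otherwise.
    if raw[:2] == b"\x1f\x8b":
        raw = gzip.decompress(raw)
    return raw.decode("utf-8")
```

Calling image_urls_from_api_json(fetch_question_json(N)) would then give the list of URLs to download; saving each one under its own file name keeps the original extension.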

Martin Ender

Posted 2015-01-12T15:19:26.883

Reputation: 184 808

8

JavaScript - 149 161 bytes

$.get("http://codegolf.stackexchange.com/q/"+prompt(),function(e){$(".post-text:first img",e).each(function(e,t){$('<a href="'+t.src+'"download>')[0].click()})})

with whitespace

$.get('http://codegolf.stackexchange.com/q/' + prompt(), function(d) {
  $('.post-text:first img',d).each(function(i,e){
   $('<a href="' + e.src + '"download>')[0].click();
  })
})

The script has to be run from a Stack Exchange site to work. It will default to the current page if no question number is specified in the prompt.

Professor Allman

Posted 2015-01-12T15:19:26.883

Reputation: 261

1As @doorknob mentioned above, you can save a bit by swapping q for question. And if you don't mind getting all the images in the posts on the page, you can do $('[src*="imgur"]',d) I believe. I like that this can be run in the console - instant gratification. – Josiah – 2015-01-13T02:09:34.653

1questions can be shortened to q, but it should include the codegolf.stackexchange.com part instead of relying on being at that page. @Josiah it is possible to include images from other domains in posts. – feersum – 2015-01-13T02:48:51.790

1The selector #question .post-text img can be shortened to .post-text:first img or .post-text:eq(0) img. – c.P.u1 – 2015-01-13T07:44:05.700

5

Python 2 - 241 bytes

Pretty straightforward, can probably be golfed further. I search the page for all occurrences of img src= between the first occurrence of post-text and the /div immediately following it. Each image URL is then read and saved to the working directory.

import string,sys,urllib,re;o=string.find;u=urllib.urlopen
r=u("http://codegolf.stackexchange.com/q/"+sys.argv[1]).read()
i=o(r,"post-text")
for p in re.findall(r'img src="([^"]*)',r[i:o(r,"/div",i)]):f=open(p[-9:],"wb");f.write(u(p).read())

stokastic

Posted 2015-01-12T15:19:26.883

Reputation: 981

Filenames are kept as is - the name is taken as the last 9 bytes ([-9:]) of the image url, which should keep its 5 character name and a .png or .jpg etc. It will chop off bytes of the filename if the extension is longer than 3 characters. – stokastic – 2015-01-12T16:08:39.637

What if the file name is shorter than 9 bytes? Wouldn't that include a slash in the file name? – Martin Ender – 2015-01-12T16:09:40.260

You can save 2 bytes by making the for loop one line: for p in re.findall(...):f=open(...);f.write(...) – undergroundmonorail – 2015-01-12T16:12:37.010

@mar I don't think the file name can be less than 9 bytes, but I might be mistaken – undergroundmonorail – 2015-01-12T16:13:14.513

@MartinBüttner I think 9 bytes is a reasonable assumption, but I can change it if you think I should. For what it's worth - using only 6 or 7 bytes is probably enough and will still pretty much guarantee distinct file names. – stokastic – 2015-01-12T16:16:34.147

You can save 8 chars by replacing questions with q (in the URL). – Doorknob – 2015-01-12T16:43:46.763
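The naming question raised in the comments above (what happens when the file name is shorter or longer than 9 bytes) can be checked with a small ungolfed sketch. The helper names last_nine and safe_name are hypothetical, and the example URLs are made up for illustration; safe_name shows one non-golfed alternative, taking the last path component of the URL.

```python
import posixpath
from urllib.parse import urlsplit


def last_nine(url):
    """What the golfed p[-9:] yields: the last 9 characters of the URL."""
    return url[-9:]


def safe_name(url):
    """Last path component of the URL, ignoring any query string."""
    return posixpath.basename(urlsplit(url).path)


# A 5-character name plus a 3-letter extension fits exactly.
print(last_nine("http://i.stack.imgur.com/abcde.png"))   # abcde.png
# A shorter name drags part of the host and a slash into the result.
print(last_nine("http://example.com/a.png"))             # com/a.png
# A longer extension chops the front of the name.
print(last_nine("http://i.stack.imgur.com/abcde.jpeg"))  # bcde.jpeg
print(safe_name("http://example.com/a.png"))             # a.png
```

The slash case matters because open() would then treat the leading part as a directory that most likely does not exist.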

2

Mathematica, 195

x=XMLElement;c=Cases;i=Import;l=Infinity;FileNameTake@#~Export~i@#&/@(((c[#,x["img",{"src"->e_,_},___]:>e,l]&)@*(c[#,x[_,{__,"id"->"question",__},e_]:>e,l]&)@*(i[#,"XMLObject"] &))@InputString[])

This exports images in the same way that Martin did in his Mathematica solution; read his answer for more information about that. The approach is otherwise very different from his: instead of parsing the result from the API, I parse the HTML page directly. Or rather, I parse the symbolic XML that Mathematica can generate from HTML.

user11030

Posted 2015-01-12T15:19:26.883

Reputation:

1

Python 2 - 398 342 334 bytes

The program downloads the SE page, extracts the post part (the post-text div element), finds URLs that end in an image extension and downloads them. The images are saved as img<n>.<ext> in the current directory.

import urllib2 as u,re,sys
z=u.urlopen;i=1
p=z('http://codegolf.stackexchange.com/q/'+sys.argv[1]).read()
s=re.search(r'ss="po(.+?)/di',p,16).group(1)
for L in re.findall('"(h.+?://.*?)"',s):
 b=L.rsplit('.',1)
 if len(b)==2 and b[1].lower() in 'jpg jpeg png gif bmp'.split():
  open('img%u.%s'%(i,b[1]),'wb').write(z(L).read());i+=1

This program will also download images that are supplied as a link, not only embedded images. By giving each image a unique filename, name clashes are also avoided.
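An ungolfed Python 3 sketch of the two steps described above may make the regexes easier to follow. It is an approximation rather than a byte-for-byte translation: the function names are hypothetical, the patterns are written out in full instead of abbreviated, and the flag 16 in the golfed code is spelled re.DOTALL (so that . also matches newlines).

```python
import re

IMAGE_EXTS = {"jpg", "jpeg", "png", "gif", "bmp"}


def first_post_region(page):
    """Text between the first post-text class attribute and the next </div."""
    m = re.search(r'class="post-text(.+?)</div', page, re.DOTALL)
    return m.group(1) if m else ""


def image_like_urls(region):
    """Every quoted http(s) URL whose last dot-separated part is an image extension."""
    urls = []
    for url in re.findall(r'"(https?://[^"]*)"', region):
        ext = url.rsplit(".", 1)[-1].lower()
        if ext in IMAGE_EXTS:
            urls.append(url)
    return urls
```

Because the second step looks at every quoted URL, it picks up href targets as well as src attributes, which is why linked images are downloaded too.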

Logic Knight

Posted 2015-01-12T15:19:26.883

Reputation: 6 622

2You can save 8 chars by replacing questions with q (in the URL). – Doorknob – 2015-01-12T16:44:05.640

In question 43274, I see only 11 images, but 21 are downloaded. – feersum – 2015-01-12T16:51:23.043

My program downloads the 10 high resolution images as well as the 10 thumbnails. I am not sure the other entries fetch the high resolution versions. – Logic Knight – 2015-01-12T17:03:18.350

@Doorknob - thanks. I missed that. I will need much more though to catch the other guys. – Logic Knight – 2015-01-12T17:09:54.983

1@CarpetPython although that's arguably more useful...the intention of the spec was to download only images which are visible. – feersum – 2015-01-12T17:18:29.160
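The distinction drawn in these comments, images that are displayed versus images that are merely linked, can be illustrated with a short sketch. It assumes the common case where a thumbnail img is wrapped in an a element pointing at the full-size file; the class and function names are hypothetical and the sample URLs are invented.

```python
from html.parser import HTMLParser


class ImageSplitter(HTMLParser):
    """Separate displayed images (img src) from link targets (a href)."""

    def __init__(self):
        super().__init__()
        self.embedded = []
        self.linked = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img" and attrs.get("src"):
            self.embedded.append(attrs["src"])
        elif tag == "a" and attrs.get("href"):
            self.linked.append(attrs["href"])


def split_images(html):
    """Return (embedded, linked) URL lists for a fragment of question HTML."""
    parser = ImageSplitter()
    parser.feed(html)
    return parser.embedded, parser.linked


html = ('<a href="http://i.stack.imgur.com/full.png">'
        '<img src="http://i.stack.imgur.com/fullm.png"></a>')
embedded, linked = split_images(html)
print(embedded)  # ['http://i.stack.imgur.com/fullm.png']
print(linked)    # ['http://i.stack.imgur.com/full.png']
```

Under the spec as clarified here, only the embedded list should be saved.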

1

Bash - 86 bytes

wget -r -l1 -np -Ajpg,jpeg,png,bmp,gif http://codegolf.stackexchange.com/questions/$1

Nothing wget won't fix. -np prevents wget from entering upper directories (user images), -A only grabs files whose extension matches the given list, -r makes the download recursive, -l1 prevents wget from going too deep, and $1 is the question to grab.

HSchmale

Posted 2015-01-12T15:19:26.883

Reputation: 181

1Is there something specific I need to do for this to work? I tried it on a couple questions, but no good. Output here. – Geobits – 2015-01-13T03:17:19.920

1I think you can save 8 chars by replacing questions with q in the URL. – Timtech – 2015-01-14T12:13:19.900

1

Node.js, 251 247 Bytes

r=require,g=r('request'),g('http://codegolf.stackexchange.com/q/'+process.argv[2],function(_,_,b){r('cheerio').load(b)('#question .post-text img').each(function(i,a){s=a.attribs.src,g(s).pipe(r('fs').createWriteStream(i+r('path').basename(s)))})})

Uses request to make HTTP GETs and cheerio to parse the HTML. Name collisions are resolved by prepending the index of the current image to the basename of the file's URL. Images are saved to the same directory as the current file.
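The collision rule described above can be sketched in a few lines of Python. The function name indexed_names is hypothetical, and unlike the Node.js code this sketch strips any query string before taking the basename; the index starts at 0, as in cheerio's each callback.

```python
import posixpath
from urllib.parse import urlsplit


def indexed_names(urls):
    """Prefix each URL's basename with its position in the list."""
    return [
        f"{i}{posixpath.basename(urlsplit(u).path)}"
        for i, u in enumerate(urls)
    ]


# Two images with the same name on different hosts no longer overwrite each other.
print(indexed_names(["http://a.example/x.png", "http://b.example/x.png"]))
# ['0x.png', '1x.png']
```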

c.P.u1

Posted 2015-01-12T15:19:26.883

Reputation: 1 049

1

Lua, 200 bytes

r=require'socket.http'.request r('http://codegolf.stackexchange.com/questions/'.. ...):gsub('post.text(.-)div',function(p)p:gsub('src="(.-)"',function(i)io.open(i:sub(-9),'wb'):write((r(i)))end)end)

Accepts the number as a command-line argument.

Assumes any src= attribute will be for an img tag, since these are the only tags with src attributes that Stack Exchange allows (right?).

Also note the .. ... (the concatenation operator followed by the varargs expression). I'm particularly proud of that one.

thenumbernine

Posted 2015-01-12T15:19:26.883

Reputation: 341