Can't make wget reject/exclude files from a list

The problem is this: I have a file with a list of URLs, say links.txt:

http://www.tipsfor.us/wp-content/uploads/2009/01/vim-editor-icon.png
http://wp.psyx.us/wp-content/uploads/2011/01/vi-vim-tutorial-1.gif
http://proft.me/static/img/vim/vi-vim-cheat-sheet.gif

What I'm trying to do is tell wget that I don't want it to fetch PNG files, like so:

$ wget -R png -i links.txt

But this has no effect, and wget still downloads the PNG files along with the others. Piping links.txt through grep is not an option, because in the actual file the links look like http://example.com/get/123987562, which then gets redirected to something like http://example.com/media/images/cool-pic.jpg

So the question is, how do I reject/exclude certain files with wget?

grimgav

Posted 2011-11-03T08:45:32.657

Is order important? Try wget -i links.txt -R png – Kusalananda – 2011-11-03T10:22:21.623

Nope. Order is not important. – grimgav – 2011-11-03T11:08:38.560

Answers

Wget, or at least the version I have, appears poorly equipped to do this: even with the --server-response option it still seems to download the file. If wget isn't critical, curl may be a better option.

The solution to this type of problem involves looking at the Content-Type returned by the server. For example:

curl -I http://www.tipsfor.us/wp-content/uploads/2009/01/vim-editor-icon.png

writes something like the following to stdout:

http://www.tipsfor.us/wp-content/uploads/2009/01/vim-editor-icon.png
HTTP/1.1 200 OK
Server: nginx admin
Date: Thu, 03 Nov 2011 09:22:55 GMT
Content-Type: image/png
Content-Length: 35765
Last-Modified: Wed, 13 Apr 2011 05:19:19 GMT
Connection: keep-alive
Vary: Accept-Encoding
Expires: Thu, 10 Nov 2011 09:22:55 GMT
Cache-Control: max-age=604800
X-Cache: HIT from Backend
Accept-Ranges: bytes
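
To act on only the Content-Type line of output like this, the headers can be piped through a small filter. The following is a rough sketch, not a tested tool: extract_content_type is a hypothetical helper name, and it assumes the headers arrive on stdin in the usual "Name: value" form. It keeps the last Content-Type it sees, so that if curl is told to follow redirects with -L, the final response is the one reported.

```shell
#!/bin/bash
# Sketch: pull the media type out of raw HTTP response headers on stdin.
# extract_content_type is a hypothetical helper, not part of curl or wget.
extract_content_type() {
    # Strip carriage returns, match the header name case-insensitively,
    # keep the last match, then drop parameters such as "; charset=UTF-8".
    tr -d '\r' |
        awk -F': *' 'tolower($1) == "content-type" { type = $2 } END { print type }' |
        cut -d';' -f1
}

# Usage (needs network access):
#   curl -sIL "$url" | extract_content_type
```

If no Content-Type header is present, the function prints an empty line, which a caller would have to treat as "unknown".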

Filtering the response headers with grep lets you test for acceptable MIME types. You can then generalize the approach to check the MIME types of a whole list of URLs. Tidied up and put into a shell script:

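#!/bin/bash
# in mimechecker.sh
# Usage: ./mimechecker.sh LINKFILE PATTERN
# Downloads each URL in LINKFILE whose Content-Type header matches PATTERN.

LINKFILE=$1
PATTERN=$2

# Fetch only the headers; download the URL if its Content-Type matches.
function mimefilter {
    local url=$1
    local pattern=$2
    # -s: silent, -I: HEAD request, -L: follow redirects to the final file.
    # tail keeps the Content-Type of the last response in a redirect chain.
    if curl -sIL "$url" | grep -i '^content-type:' | tail -n 1 | grep -Eq -- "$pattern"; then
        wget "$url"
    fi
}

while IFS= read -r line
do
    mimefilter "$line" "$PATTERN"
done < "$LINKFILE"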

You would call it like this, which downloads only the URLs whose type matches the pattern:

bash mimechecker.sh links.txt 'image/png'
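
The question asks for the opposite: skipping a type rather than selecting it. A minimal sketch of that inversion follows, under a few assumptions: should_reject and fetch_unless_type are hypothetical names, the servers answer HEAD requests sensibly, and curl's %{content_type} write-out variable is used to report the type of the final response after redirects.

```shell
#!/bin/bash
# Sketch only: download every URL in a list EXCEPT those whose final
# Content-Type matches a pattern. Both function names are hypothetical.

# Succeeds (exit 0) when the media type matches the reject pattern.
should_reject() {
    local content_type=$1 pattern=$2
    printf '%s\n' "$content_type" | grep -Eqi -- "$pattern"
}

fetch_unless_type() {
    local linkfile=$1 pattern=$2 url content_type
    while IFS= read -r url; do
        [ -n "$url" ] || continue   # ignore blank lines
        # -I: HEAD request, -L: follow redirects, -o /dev/null: discard headers,
        # -w: print only the Content-Type of the last response.
        content_type=$(curl -sIL -o /dev/null -w '%{content_type}' "$url")
        if should_reject "$content_type" "$pattern"; then
            echo "skipping $url ($content_type)" >&2
        else
            wget -- "$url"
        fi
    done < "$linkfile"
}

# Usage (needs network access):
#   fetch_unless_type links.txt 'image/png'
```

A URL whose type cannot be determined is downloaded rather than skipped; flipping that default is a design choice that depends on how much unwanted traffic matters.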

Andrew Walker

Posted 2011-11-03T08:45:32.657

Great idea, thank you for sharing and answering my question. That really helped. – grimgav – 2011-11-03T11:02:29.397