Single files
To accomplish "If file x exists, download it; otherwise download file y", you can do the following:
wget x || wget y
If x exists, it is downloaded and wget exits with status 0 (success), so the second part is skipped. If x does not exist, wget exits with a non-zero error code (probably 8) and the second part of the expression is evaluated, which downloads y.
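The short-circuit behaviour of || can be demonstrated without any network access; here false and true stand in for a failing and a succeeding wget call:

```shell
# `||` runs its right-hand side only if the left-hand side fails (non-zero exit)
false || echo "fallback runs"     # left side fails, so the fallback executes
true  || echo "never printed"     # left side succeeds, so the fallback is skipped
```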
Recursively
That obviously won't help you much for your recursive download, though. I would be surprised if wget had the facilities to accommodate masking with this level of sophistication. The man page doesn't appear to cover any form of fancy conditionals either. A slightly modified approach could work, though.
(It appears to be difficult to convince wget to produce a list of things it wants to download. My first idea was to create this list and filter it appropriately before downloading, much like @utkuerd suggests.)
A starting point would naturally be to download all the ogg files first, presumably by
wget -Dfoo.com -I /folder/ -r -l 1 -nc -A.ogg -i http://www.foo.com/folder/
The remaining mp3 files could then be downloaded by the same method, provided you have a suitable mask to supply as a --reject list. This list should contain the name of every mp3 file you don't want to download. I suggest you create this list as follows:
bl=($(find ./ -name '*.ogg' -exec basename -s .ogg {} \+ | sed 's/\(^.\+$\)/\1.mp3/'))
You now have a bash array of the mp3 files to block.
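For filenames containing spaces, the same list can be built with a read loop instead of relying on word splitting. This is a local sketch under made-up filenames, not part of the command above:

```shell
#!/bin/bash
# Build the blocklist one entry per .ogg file; a while/read loop keeps
# names with spaces intact, unlike unquoted $(...) word splitting.
demo=$(mktemp -d)
cd "$demo"
touch "first.ogg" "two words.ogg"      # stand-ins for the downloaded files

bl=()
while IFS= read -r f; do
    bl+=("$(basename -s .ogg "$f").mp3")
done < <(find . -name '*.ogg')

printf '%s\n' "${bl[@]}"
```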
To download only the unblocked mp3 files, you could use
IFS=','; wget -Dfoo.com -I /folder/ -r -l 1 -nc -A.mp3 -R"${bl[*]}" -i http://www.foo.com/folder/; unset IFS
The IFS variable must be modified so the list is comma-separated rather than space-separated.
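The joining step in isolation, with a throwaway array of made-up names:

```shell
# "${bl[*]}" joins array elements using the first character of IFS
bl=("foo.mp3" "bar.mp3" "baz.mp3")
IFS=','; rejects="${bl[*]}"; unset IFS
echo "$rejects"    # → foo.mp3,bar.mp3,baz.mp3
```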
Obviously, this will go badly to various degrees if the list of ogg files is longer than getconf ARG_MAX (it will break the wget command) or if the filenames contain whitespace (it will break the blocklist, potentially giving you an extra file and, less likely, a missing file). Both are fixable.
Note that superfluous commas in the reject list give interesting results.
Writeup of @Bob's excellent suggestion
(see comment below)
After getting the ogg files with
wget -Dfoo.com -I /folder/ -r -l 1 -nc -A.ogg -i http://www.foo.com/folder/
you could create dummy mp3 files like so
find ./ -name '*.ogg' | sed 's/ogg$/mp3/' | xargs -d '\n' touch
and get the remaining mp3 files with (exploiting -nc)
wget -Dfoo.com -I /folder/ -r -l 1 -nc -A.mp3 -i http://www.foo.com/folder/
The superfluous mp3 files can then be removed with something like
find ./ -name '*.mp3' -size 0 -exec rm '{}' \+
I tested that this works with spaces in the names.
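The dummy-file trick can be rehearsed locally without any downloading; the filenames below are made up, and a non-empty mp3 stands in for a file a real wget -nc -A.mp3 run would have fetched:

```shell
#!/bin/bash
# Local sketch of the dummy-file approach (no network; hypothetical names).
work=$(mktemp -d)
cd "$work"
touch "song one.ogg" "song two.ogg"               # pretend wget fetched these

# Step 1: create a 0-byte dummy mp3 next to every ogg
find . -name '*.ogg' | sed 's/ogg$/mp3/' | xargs -d '\n' touch

# (A real wget -nc -A.mp3 run would skip the dummies and only fetch
# mp3s lacking an ogg counterpart; simulate one such download here.)
printf 'audio data' > "song three.mp3"

# Step 2: remove the 0-byte dummies, keeping the real downloads
find . -name '*.mp3' -size 0 -exec rm '{}' \+

find . -name '*.mp3'      # only ./song three.mp3 remains
```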
I'm kinda thinking the "Windows batch file" way, but wouldn't it be possible to (instead of specifying a reject list, and since -nc is specified) download all ogg files, loop through them all touching an mp3 file with the same name (0 bytes), download all mp3s with -nc causing those that exist as an ogg and with the corresponding 0-byte mp3 to be skipped, then loop through the oggs to delete the mp3 versions of them (or just delete all 0-byte mp3s)? The reject list is probably better, though this would avoid ARG_MAX and whitespace issues entirely. – Bob – 2012-04-13T17:34:19.097
Most excellent, works like a charm! Thank you all very much.
Now I've figured that downloading with my above command can be very time consuming, especially if the files are sometimes one link deeper in the directory structure: first I need to download/parse everything to get to the .ogg files, then I need to do the same again for the remaining .mp3 files, since with -A.ogg the html files to parse were discarded...
Is there a way to not discard the .html files, to be able to parse them a second time offline? – Kai – 2012-04-17T12:10:18.023
To keep the html files I would now simply use the option -A ogg,htm,html in the first place. – Kai – 2012-05-08T11:52:56.783