Single files
To accomplish "If file x exists, download it; otherwise download file y", you can do the following:
wget x || wget y
If x exists, it is downloaded and wget exits with status 0 (success), so the second part is skipped. If x does not exist, wget exits with a non-zero error code (probably 8) and the second part of the expression is evaluated, which downloads y.
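The short-circuit behaviour of || can be demonstrated without any network access; here false and true stand in for a failing and a succeeding wget call:

```shell
# `||` runs its right-hand side only if the left-hand side fails (non-zero exit)
false || echo "fallback runs"     # left side fails, so the fallback executes
true  || echo "never printed"     # left side succeeds, so the fallback is skipped
```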
Recursively
That obviously won't help you much for your recursive download, though. I would be surprised if wget had the facilities to accommodate masking with this level of sophistication. The man page doesn't appear to cover any form of fancy conditionals either. A slightly modified approach could work, though.
(It appears to be difficult to convince wget to produce a list of things it wants to download. My first idea was to create this list and filter it appropriately before downloading, much like @utkuerd suggests.)
A starting point would naturally be to download all the ogg files first, presumably by
wget -Dfoo.com -I /folder/ -r -l 1 -nc -A.ogg -i http://www.foo.com/folder/
The remaining mp3 files could then be downloaded by the same method, provided you have a suitable mask to supply as a --reject list. This list should contain the name of every mp3 file you don't want to download. I suggest you create this list as follows:
bl=($(find ./ -name '*.ogg' -exec basename -s .ogg {} \+ | sed 's/\(^.\+$\)/\1.mp3/'))
You now have a bash array of the mp3 files to block.
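For filenames containing spaces, the same list can be built with a read loop instead of relying on word splitting. This is a local sketch under made-up filenames, not part of the command above:

```shell
#!/bin/bash
# Build the blocklist one entry per .ogg file; a while/read loop keeps
# names with spaces intact, unlike unquoted $(...) word splitting.
demo=$(mktemp -d)
cd "$demo"
touch "first.ogg" "two words.ogg"      # stand-ins for the downloaded files

bl=()
while IFS= read -r f; do
    bl+=("$(basename -s .ogg "$f").mp3")
done < <(find . -name '*.ogg')

printf '%s\n' "${bl[@]}"
```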
To download only the unblocked mp3 files, you could use
IFS=','; wget -Dfoo.com -I /folder/ -r -l 1 -nc -A.mp3 -R"${bl[*]}" -i http://www.foo.com/folder/; unset IFS
The IFS variable must be modified so the list is comma-separated rather than space-separated.
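The joining step in isolation, with a throwaway array of made-up names:

```shell
# "${bl[*]}" joins array elements using the first character of IFS
bl=("foo.mp3" "bar.mp3" "baz.mp3")
IFS=','; rejects="${bl[*]}"; unset IFS
echo "$rejects"    # → foo.mp3,bar.mp3,baz.mp3
```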
Obviously, this will go badly to various degrees if the list of ogg files is longer than getconf ARG_MAX (it will break the wget command) or if the filenames contain whitespace (it will break the blocklist, potentially giving you an extra file and, less likely, a missing file). Both are fixable.
Note that superfluous commas in the reject list give interesting results.
Writeup of @Bob's excellent suggestion
(see comment below)
After getting the ogg files with
wget -Dfoo.com -I /folder/ -r -l 1 -nc -A.ogg -i http://www.foo.com/folder/
you could create dummy mp3 files like so
find ./ -name '*.ogg' | sed 's/ogg$/mp3/' | xargs -d '\n' touch
and get the remaining mp3 files with (exploiting -nc)
wget -Dfoo.com -I /folder/ -r -l 1 -nc -A.mp3 -i http://www.foo.com/folder/
The superfluous mp3 files can then be removed with something like
find ./ -name '*.mp3' -size 0 -exec rm '{}' \+
I tested that this works with spaces in the names.
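The dummy-file trick can be rehearsed locally without any downloading; the filenames below are made up, and a non-empty mp3 stands in for a file a real wget -nc -A.mp3 run would have fetched:

```shell
#!/bin/bash
# Local sketch of the dummy-file approach (no network; hypothetical names).
work=$(mktemp -d)
cd "$work"
touch "song one.ogg" "song two.ogg"               # pretend wget fetched these

# Step 1: create a 0-byte dummy mp3 next to every ogg
find . -name '*.ogg' | sed 's/ogg$/mp3/' | xargs -d '\n' touch

# (A real wget -nc -A.mp3 run would skip the dummies and only fetch
# mp3s lacking an ogg counterpart; simulate one such download here.)
printf 'audio data' > "song three.mp3"

# Step 2: remove the 0-byte dummies, keeping the real downloads
find . -name '*.mp3' -size 0 -exec rm '{}' \+

find . -name '*.mp3'      # only ./song three.mp3 remains
```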
I'm kinda thinking the "Windows batch file" way, but wouldn't it be possible to (instead of specifying a reject list, and since -nc is specified) download all ogg files, loop through them all touching an mp3 file with the same name (0 bytes), download all mp3s with -nc causing those that exist as an ogg and with the corresponding 0-byte mp3 to be skipped, then loop through the oggs to delete the mp3 versions of them (or just delete all 0-byte mp3s)? The reject list is probably better, though this would avoid ARG_MAX and whitespace issues entirely. – Bob – 2012-04-13T17:34:19.097
Most excellent, works like a charm! Thank you all very much.
Now I've figured that downloading with my above command can be very time consuming, especially if the files are sometimes one link deeper in the directory structure: first I need to download/parse everything to get to the .ogg files, then I need to do the same again for the remaining .mp3 files, since with -A.ogg the html files to parse were discarded...
Is there a way to not discard the .html files, to be able to parse them a second time offline? – Kai – 2012-04-17T12:10:18.023
To keep the html files I would now simply use the option -A ogg,htm,html in the first place. – Kai – 2012-05-08T11:52:56.783