11

I'm trying to mirror a website with wget, but I don't want to download every file, so I'm using wget's --reject option to avoid saving them all. However, wget still downloads every file and only deletes it afterwards if it matches my reject option.

Is there some way to tell wget not to follow certain links if they match some shell wildcard? If wget can't do this, is there some other common Linux command that can?

Amandasaurus

6 Answers

10

You might also try HTTrack, which has, IMO, more flexible and intuitive include/exclude logic. Something like this:

httrack "https://example.com" -O ExampleMirrorDirectory \
"-*" \
"+https://example.com/images/*" \
"-*.swf"

The rules are applied in order, and later rules override earlier ones:

  1. Exclude everything
  2. But include https://example.com/images/*
  3. But exclude anything ending in .swf
lukecyca
6

Looks like this isn't possible in wget.

Amandasaurus
1

How are you running wget? Try using it this way:

wget -r --reject=gif,jpg,swf http://norc.aut.ac.ir/

This command will reject .gif, .jpg, and .swf files.

orezvani
  • The files that Rory McCann wants to reject are HTML files, but he wants to keep other HTML files, so this syntax doesn't apply to his question. – Royce Williams Jan 08 '12 at 16:28
1

One workaround would be to run wget through a proxy server. Set your proxy to disallow certain patterns. This would block wget from ever downloading them in the first place.
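
A minimal sketch of that idea using Squid, assuming a local Squid instance on its default port 3128 (the patterns here are placeholders, not from the question):

# /etc/squid/squid.conf (fragment): deny any URL matching these regexes
acl blocked url_regex -i \.swf$ /forum/
http_access deny blocked

Then point wget at the proxy; anything matching the ACL is refused before it's ever fetched:

http_proxy=http://localhost:3128 wget -r http://example.com/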

wget will download and then remove any file that matches the -R pattern. The option accepts wildcard patterns too, not just extensions or parts of filenames; however, that still doesn't stop wget from downloading first and deleting later.

httrack does have some nice features, but in my experience the way it saves a "file" can be a bit quirky. For example, if httrack comes across index.asp?Type=BASIC&PAGEID=2234234, it can save it, but you have to tell it to preserve the parts of the query string, e.g. with a build structure like

%h%p/%n%[TYPE:@TYPE=::]%[PAGEID:PAGEID=::].%t

The @ is a placeholder for a question mark; you can rename the files later, or perhaps escape the question mark instead. The problem is that the .%t appends '.html' to URIs that originally had no '.html', and if you take it off, the images httrack downloads will lack a file extension.
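
For illustration, a structure string like the one above would be passed with httrack's -N (user-defined structure) option; example.com here is just a placeholder:

httrack "http://example.com" -O ExampleMirrorDirectory -N "%h%p/%n%[TYPE:@TYPE=::]%[PAGEID:PAGEID=::].%t"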

You're better off using wget, IMHO.

cparod
1

Under the --reject section of 'man wget':

"Note that if any of the wildcard characters, *, ?, [ or ], appear in an element of acclist or rejlist, it will be treated as a pattern, rather than a suffix."

If you are doing this, you might want to give examples of the patterns you are using and what you think should match but doesn't. You say they are matching, but are you sure?

Also, make sure you put this list in quotes, so the shell doesn't expand those wildcards before passing the argument(s) to wget.
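
For example (the pattern here is just an illustration, not from the question):

wget -r --reject "*forum*,*.swf" http://example.com/

Without the quotes, the shell could expand the wildcards against filenames in the current directory before wget ever sees them.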

Even if your system doesn't have version 1.12, read the Types of Files section of the manual. According to the change log, the maintainer added some caveats:

* NEWS: Added documentation change re: --no-parents, and various
caveats on accept/reject lists behavior. Rearranged some items in
order of priority.
Kyle Brandt
  • The --reject options are in quotes. I can see that they are matching the correct files, because after the file is downloaded, wget removes it. I just want to stop it downloading the file in the first place. – Amandasaurus Oct 13 '09 at 11:53
  • Are these htm(l) files? According to the manual, these are downloaded no matter what. – Kyle Brandt Oct 13 '09 at 12:07
  • Yes, the files I want to reject are HTML files. I know they are downloaded no matter what. Is there some way to prevent that? – Amandasaurus Oct 13 '09 at 14:22
1

You could restrict the recursion depth with the -l NUMBER option, if that helps (it's not pattern-based exclusion, but it does limit what gets followed).

A level of 2 downloads index.html, the pages and images it links to, and the links on those pages.
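
For example (example.com is just a placeholder):

wget -r -l 2 http://example.com/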

PEra