How to download with wget without following links with parameters


I'm trying to download two sites for inclusion on a CD:

http://boinc.berkeley.edu/trac/wiki
http://www.boinc-wiki.info

The problem I'm having is that these are both wikis, so when downloading with e.g.:

wget -r -k -np -nv -R jpg,jpeg,gif,png,tif http://www.boinc-wiki.info/

I end up with far too many files, because wget also follows links like ...?action=edit and ...?action=diff&version=...

Does somebody know a way to get around this?

I just want the current pages, without images, and without diffs etc.

P.S.:

wget -r -k -np -nv -l 1 -R jpg,jpeg,png,gif,tif,pdf,ppt http://boinc.berkeley.edu/trac/wiki/TitleIndex

This worked for the Berkeley site, but boinc-wiki.info is still giving me trouble :/

P.P.S.:

I got what appears to be the most relevant pages with:

wget -r -k -nv -l 2 -R jpg,jpeg,png,gif,tif,pdf,ppt http://www.boinc-wiki.info

Tie-fighter

Posted 2010-06-29T21:03:42.180


No need to cross-post between Super User and Server Fault: http://serverfault.com/questions/156045/how-to-download-with-wget-without-following-links-with-parameters – Bryan – 2010-06-29T22:07:23.870

Where should I have posted it? – Tie-fighter – 2010-06-29T22:20:19.690

This is the right place. It's not a server question. – David Z – 2010-06-30T00:42:04.400

Still, I got the better answers at Server Fault ;) – Tie-fighter – 2010-06-30T00:56:56.853

Answers


A newer version of wget (1.14) solves this problem.

You have to use the new option --reject-regex=... to match against query strings.

Note that I couldn't find an updated manual covering these new options, so you have to rely on the built-in help: wget --help > help.txt
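
For example (an untested sketch; the regex is only an illustration, and wget >= 1.14 is required), the command from the question could become:

wget -r -k -np -nv --reject-regex '[?&]action=' http://www.boinc-wiki.info/

This should skip every link whose query string contains an action= parameter, such as the edit and diff pages.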

user3133076



wget --reject-regex '(.*)\?(.*)' http://example.com

(--reject-type is posix by default.) This works only with recent versions of wget (>= 1.14), though, according to the other comments.

Beware that it seems you can use --reject-regex only once per wget call; that is, if you want to match several patterns, you have to combine them with | in a single regex:

wget --reject-regex 'expr1|expr2|…' http://example.com
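
For instance (an illustration only; the pattern is made up for the wikis from the question):

wget --reject-regex 'action=edit|action=diff|action=history' http://www.boinc-wiki.info/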

Skippy le Grand Gourou


Regex alternation using the | ("pipe") symbol isn't working for me with GNU Wget 1.16. – sampablokuper – 2015-12-24T02:38:35.283

The version requirement could well be true. I had v1.12 and the option was not valid; after upgrading to v1.15 it was. – yunzen – 2014-04-04T12:41:14.037


wget -R "*?action=*"

This will exclude anything that contains ?action= in its name.

Daisetsu


3"Note, too, that query strings (strings at the end of a URL beginning with a question mark (‘?’) are not included as part of the filename for accept/reject rules, even though these will actually contribute to the name chosen for the local file. It is expected that a future version of Wget will provide an option to allow matching against query strings." – Tie-fighter – 2010-06-29T22:39:21.267

Hmm, I must have missed that. It looks like you can't do this with wget then if it doesn't even know that they are different files. I suggest a different program. – Daisetsu – 2010-07-01T16:41:56.247


I'd say that leeching a public wiki site is bad practice, because it puts additional load on it.

If a wiki is public and the site owners don't mind sharing the content, they usually provide a downloadable backend dump (database or whatever). You would then just download the data pack, set up a local instance of the same wiki engine, import the data into it, and have a local copy. After that, if you wish, you can do the leeching locally.
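
As a rough sketch (assuming the wiki runs MediaWiki and publishes an XML dump; the dump URL below is hypothetical):

# hypothetical dump location; check the wiki's download page for the real one
wget http://www.boinc-wiki.info/dump.xml.gz
gunzip dump.xml.gz
# MediaWiki's standard import script, run from the local wiki's installation directory
php maintenance/importDump.php < dump.xml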

vtest


There is -w seconds, e.g. -w 5: http://www.gnu.org/software/wget/manual/html_node/Download-Options.html#Download-Options – barlop – 2013-09-10T23:56:33.723
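
For example (a sketch only; the URL is a placeholder), a recursive download that waits between requests to reduce the load on the server:

wget -r -np -w 5 --random-wait http://www.boinc-wiki.info/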