Batch download pages from a wiki without special pages

2

From time to time I find some documentation on the web that I need for offline use on my notebook. Usually I fire up wget and get the whole site.

Many projects, however, are now switching to wikis, and that means I download every single old version and every "edit me" link, too.

Is there any tool, or any wget configuration, that would let me download, for example, only files without a query string, or only files matching a certain regexp?

Cheers,

By the way: wget has the very useful -k switch, which converts any in-site links to their local counterparts. That would be another requirement. Example: when fetching pages from http://example.com, all links to "/..." or "http://example.com/..." have to be converted to point at their downloaded counterparts.
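To illustrate, a minimal sketch of such a mirror (example.com is just the placeholder from above, and the depth limit is an arbitrary choice, not part of the question):

    # Recursively mirror the site, rewrite in-site links to the local copies (-k),
    # and also fetch page requisites such as images and CSS (-p) so pages render offline.
    wget -r -l 5 -k -p http://example.com/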

Boldewyn

Posted 2009-09-10T12:53:56.550

Reputation: 3 835

Answers

1

From the wget man page:

-R rejlist --reject rejlist

Specify comma-separated lists of file name suffixes or patterns to accept or reject. Note that if any of the wildcard characters, *, ?, [ or ], appear in an element of acclist or rejlist, it will be treated as a pattern, rather than a suffix.

This seems like exactly what you need.

Note: to reduce the load on the wiki server, you might want to look at the -w and --random-wait flags.
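Putting the pieces together, a possible invocation might look like this (a sketch only: the reject patterns are guesses at typical MediaWiki edit/history/old-revision URLs and will need adjusting for the wiki at hand, and how -R treats query strings can vary between wget versions):

    # Mirror the wiki, convert links for offline reading, skip edit/history/old-revision
    # URLs, and wait a polite, randomized interval between requests.
    wget -r -k -p \
         -R '*action=edit*,*action=history*,*oldid=*' \
         -w 2 --random-wait \
         http://wiki.example.com/

If your wget is recent enough, --reject-regex matches against the complete URL, query string included, which maps even more directly onto the original requirement.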

CarlF

Posted 2009-09-10T12:53:56.550

Reputation: 8 576

Cool, I just didn't see this option. Thanks. – Boldewyn – 2009-11-03T18:36:00.920

0

Most wikis frown on bulk crawling, and Wikipedia actively blocks it with its robots.txt. I would stick to http://en.wikipedia.org/wiki/Special:Export
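For a single page, the export interface can be fetched directly; a sketch (the page title "Wget" is just an example, and the result is XML-wrapped wikitext rather than rendered HTML):

    # Fetch one page's current wikitext via Special:Export and save it locally.
    wget 'http://en.wikipedia.org/wiki/Special:Export/Wget' -O Wget.xml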

user10547

Posted 2009-09-10T12:53:56.550

Reputation: 1 089

I know that it is quite stressful for the server, but that is one of the reasons I want to download only the necessary files. Anyway, some projects just don't offer their pages in any format other than wiki pages. – Boldewyn – 2009-09-15T20:41:16.087