There are a couple of relevant flags:
- -A acclist / --accept acclist: a comma-separated list of glob-style patterns, matched against filenames
- -I list / --include-directories=list: a comma-separated list of glob-style patterns, matched against directories
- --accept-regex urlregex: a regular expression, matched against the full URL
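To make the difference concrete, here is a minimal sketch (plain sh, with a made-up filename and accept list) of how a comma-separated -A glob list is applied: each pattern is tried against the filename component of the URL, not the full URL, using ordinary shell-style glob matching.

```shell
# Hypothetical filename and -A style accept list, for illustration only.
fname='wordbyword.jsp?chapter=1&verse=2'
acclist='*.pdf,*wordbyword*'

set -f                      # disable pathname expansion while splitting
old_ifs=$IFS; IFS=','
matched=no
for pat in $acclist; do     # split the accept list on commas
  case $fname in
    $pat) matched=yes ;;    # case does the same glob matching as -A
  esac
done
IFS=$old_ifs; set +f
echo "$matched"             # prints "yes": *wordbyword* matches
```

Here *.pdf fails but *wordbyword* succeeds, so the file would be kept.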
Generally you would also pass -r to recurse, and -l inf, since otherwise the maximum recursion depth is 5. If you want to be able to stop and restart the download, -nc ("no clobber") avoids redownloading files that already exist. For this, -E (--adjust-extension) is also useful: it adds the .html extension to HTML pages that lack it, and when the extension is present and -nc is specified, wget will still read URLs from the on-disk copy of the file rather than fetching it again.
Here's an example to download a word-by-word translation of the Qur'an:
wget -E -nc -l inf -nd -r --no-parent 'http://corpus.quran.com/wordbyword.jsp?chapter=1&verse=1' -A '*wordbyword*'
It starts at the first verse, and since each page links to the next verses, it eventually downloads all of them. The -A
option restricts us to just the pages we are interested in.
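If a glob is not selective enough, the same filter can be written with --accept-regex, which is tested against the full URL rather than just the filename. Since regexes are easy to get wrong, one way to preview a pattern before a long crawl is to run it over a list of candidate URLs with grep -E (wget's default --regex-type is posix, so grep -E is a reasonable stand-in; the second URL below is made up for illustration):

```shell
# Preview an --accept-regex pattern against sample URLs with grep -E.
urls='http://corpus.quran.com/wordbyword.jsp?chapter=1&verse=1
http://corpus.quran.com/translation.jsp?chapter=1&verse=1
http://corpus.quran.com/wordbyword.jsp?chapter=2&verse=5'
kept=$(printf '%s\n' "$urls" | grep -Ec 'wordbyword\.jsp\?chapter=[0-9]+')
echo "$kept"   # prints "2": two of the three sample URLs pass the filter
```

The regex matches anywhere in the URL, so there is no need for leading or trailing wildcards as with -A.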
I think more examples are needed, so please feel free to suggest them and I will try to update this.