Wget site mirror, links with rel="<content>" not followed



While creating a site mirror with wget 1.12 on Ubuntu, links that have a rel attribute set are not downloaded:

 <a href="link" rel="tag">text</a>

rel="tag" is a microformat: by adding rel="tag" to a hyperlink, a page indicates that the destination of that hyperlink is an author-designated "tag" (or keyword/subject) for the current page.

My WordPress theme uses this for links to tags, so 99% of the site is ignored.

Edit: it turns out all my permalinks use rel="bookmark" and are skipped as well.
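
To see how many links on a page are affected, a saved copy of the page can be scanned for anchors that carry a rel attribute. This is a rough sketch, not a real HTML parser: the function name list_rel_links is hypothetical, and the pattern assumes each <a> tag sits on a single line with a double-quoted rel value.

```shell
# list_rel_links FILE: print every <a ...> opening tag in FILE that has a rel="..." attribute.
# Heuristic only: assumes one-line tags and double-quoted attribute values.
list_rel_links() {
    grep -oiE '<a [^>]*rel="[^"]*"[^>]*>' "$1"
}

# Usage (index.html is a placeholder for a page saved from the site):
# list_rel_links index.html
```

Piping the result through `wc -l` gives a quick count of the links wget 1.12 would skip on that page.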

I'm using the following wget command (this ignores robots.txt and also follows nofollow links):

wget -mkp -e robots=off http://site
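
For readers less familiar with the short flags, the same command can be spelled out with wget's long options. This is only a sketch: the wrapper name mirror_site is hypothetical, and http://site is the placeholder URL from the question.

```shell
# mirror_site URL: hypothetical wrapper, equivalent to `wget -mkp -e robots=off URL`
mirror_site() {
    # --mirror           (-m) recursive download with timestamping and unlimited depth
    # --convert-links    (-k) rewrite links so the local copy can be browsed offline
    # --page-requisites  (-p) also fetch images, CSS and other files needed to render pages
    # -e robots=off      ignore robots.txt and rel="nofollow"
    wget --mirror --convert-links --page-requisites -e robots=off "$1"
}

# Usage (placeholder URL from the question):
# mirror_site http://site
```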

How do I make wget follow links with rel set?


Posted 2012-03-23T10:11:12.953


did you try it with --follow-tags=rel already? – JohannesM – 2012-03-23T10:16:37.977

@JohannesM The manual says: "If a user wants only a subset of those tags to be considered, however, he or she should be specify such tags in a comma-separated list with this option." Your suggestion would follow only rel tags, which don't exist on the page. --follow-tags does not add to the internal list of tags/attributes to follow; it replaces it. And no, --ignore-tags= doesn't work either. – svandragt – 2012-03-23T10:20:14.080



I compiled wget 1.13 from source, and that fixes the issue. I think the fix comes from this line, even though I'm not talking about CSS links: "Parsing links from CSS files, and from CSS content found in HTML style tags and attributes."

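# Download and unpack the wget 1.13 source
cd /tmp
wget ftp://ftp.gnu.org/gnu/wget/wget-1.13.tar.gz
tar -xzvf wget-1.13.tar.gz
cd wget-1.13
# Build as a normal user; only the install step needs root
./configure --with-ssl=openssl
make
sudo make install
# Copy the new binary to ~/bin and put ~/bin first in PATH so it is found before the system wget
mkdir -p ~/bin
cp /usr/local/bin/wget ~/bin
echo 'export PATH="$HOME/bin:$PATH"' >> ~/.bashrc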
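
After installing, it is worth confirming that the shell actually picks up the new binary. The sketch below assumes, as this answer suggests, that 1.13 is the first version with the fix, that `sort -V` from GNU coreutils is available, and that the first line of `wget --version` has the form "GNU Wget 1.13 built on ...". The helper name version_ge is hypothetical.

```shell
# version_ge A B: succeeds when version A is >= version B.
# Relies on `sort -V` (version sort) from GNU coreutils.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

if command -v wget >/dev/null 2>&1; then
    # Assumes the version number is the third word of the first output line
    installed=$(wget --version | head -n 1 | awk '{print $3}')
    echo "wget found at $(command -v wget), version $installed"
    if version_ge "$installed" 1.13; then
        echo "this wget is 1.13 or newer"
    else
        echo "this wget is older than 1.13; the shell is probably still using the old binary"
    fi
fi
```

Run it in a new shell (or after `hash -r`) so that the updated PATH and any cached command locations are taken into account.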

