wget: recursively retrieve URLs from a specific website


I'm trying to recursively retrieve all possible urls (internal page urls) from a website.

Can you please help me out with wget? Or is there a better alternative to achieve this? I do not want to download any content from the website, I just want to get the URLs of the same domain.

Thanks!

EDIT

I tried doing this with wget and then grepping the urllog.txt file. I'm not sure if this is the right way to do it, but it works!

$ wget -R.jpg,.jpeg,.gif,.png,.css -c -r http://www.example.com/ -o urllog.txt
$ grep -e " http" urllog.txt | awk '{print $3}'

abhiomkar

Posted 2011-08-29T10:40:11.677

Reputation: 171

This question isn't very clear. What do you mean by "all possible URLs"? Do you want to start with one website and then crawl to all its linked websites, recursively? If so, how do you want to achieve that without downloading the actual websites, which you need to parse for further links? – Kerrek SB – 2011-08-29T10:43:47.533

What did you try? wget -r is the recursive option. Did you try with that? What problem did you run into? – steenhulthin – 2011-08-29T10:47:29.510

Just use wget -r http://site.com. Another nice option is -p, which will also fetch all prerequisites for the page, even if they are external. – dma_k – 2011-08-29T10:47:59.913

@Kerrek all possible URLs - yes, URLs which are linked from internal pages (that is, which have the same domain). And that's a good point: wget could download only the HTML content, at least enough to find the linked URLs/pages, ignoring any other file types. – None – 2011-08-29T10:50:33.283

@abhiomkar: Well, yes, you wouldn't download all the pictures and flash animations, of course. The -r option already does exactly that. If you also want stylesheets that may be linked inside other style sheets, you have to work a bit harder, but for the semantic content only, -r is exactly the answer. Did you try any of this before asking the question, by the way? I think wget has pretty decent documentation... – Kerrek SB – 2011-08-29T10:55:12.500

Apparently, the recursive option works (I added a reject rule). Just wondering if there are any wget one-liners which do it faster. Thanks for your help, guys! Much appreciated. – None – 2011-08-29T11:36:42.297

Answers


You could also use something like Nutch. I've only ever used it to crawl internal links on a site and index them into Solr, but according to this post it can also do external links. Depending on what you want to do with the results, it may be a bit of overkill, though.
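If you go the Nutch route, the quick-start from the Nutch 1.x tutorial looked roughly like the sketch below. This is from memory, so treat the exact flags, the urls/ seed path and the dump directory as assumptions and check them against the docs for your Nutch version:

$ mkdir urls && echo "http://www.example.com/" > urls/seed.txt   # seed list (hypothetical path)
$ bin/nutch crawl urls -dir crawl -depth 3 -topN 50              # crawl up to 3 levels deep
$ bin/nutch readdb crawl/crawldb -dump crawldb_dump              # dump the discovered URLs as text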

Snipzwolf

Posted 2011-08-29T10:40:11.677

Reputation: 288