I'm trying to recursively retrieve all internal page URLs from a website.
Can you please help me out with wget? Or is there a better alternative to achieve this? I do not want to download any content from the website; I just want the URLs on the same domain.
Thanks!
EDIT
I tried doing this with wget, then grepped the urllog.txt file afterwards. Not sure if this is the right way to do it, but it works!
$ wget -R .jpg,.jpeg,.gif,.png,.css -c -r http://www.example.com/ -o urllog.txt
$ grep -e " http" urllog.txt | awk '{print $3}'
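For reference, here is what that pipeline relies on: wget's -o log contains lines of the form "--<date> <time>--  <url>", so the URL is the third whitespace-separated field. A minimal sketch against a canned log excerpt (the log lines below are assumed examples, not real wget output):

```shell
# Hypothetical excerpt of a wget -o log file; real request lines look
# roughly like "--2011-08-29 10:43:47--  http://www.example.com/"
printf '%s\n' \
  '--2011-08-29 10:43:47--  http://www.example.com/' \
  '--2011-08-29 10:43:48--  http://www.example.com/about.html' \
  'Resolving www.example.com... 93.184.216.34' \
  > urllog.txt

# Only the "--date time--  url" lines contain " http", and on those
# lines the URL is the third whitespace-separated field
grep -e ' http' urllog.txt | awk '{print $3}'
```

Lines such as "Resolving ..." fall through the grep, which is why the simple " http" filter is enough here.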
This question isn't very clear. What do you mean by "all possible URLs"? Do you want to start with one website and then crawl to all its linked websites, recursively? If so, how do you want to achieve that without downloading the actual websites, which you need to parse for further links? – Kerrek SB – 2011-08-29T10:43:47.533
What did you try? wget -r is the recursive option. Did you try with that? What problem did you run into? – steenhulthin – 2011-08-29T10:47:29.510
Just use wget -r http://site.com. Another nice option is -p, which will also fetch all prerequisites for the page, even if they are external. – dma_k – 2011-08-29T10:47:59.913
@Kerrek all possible URLs - yes, URLs which are linked to internal pages (that is, on the same domain). And that's a good point: wget could download only the HTML content, which is enough to find the linked URLs/pages, ignoring any other file types. – None – 2011-08-29T10:50:33.283
@abhiomkar: Well, yes, you wouldn't download all the pictures and flash animations of course. The -r option already does exactly that. If you also want stylesheets that may be linked inside other stylesheets, you have to work a bit harder, but for the semantic content only, -r is exactly the answer. Did you try any of this before asking the question, by the way? I think wget has pretty decent documentation... – Kerrek SB – 2011-08-29T10:55:12.500
Apparently, the recursive option works (I added a reject rule). Just wondering if there is a faster wget one-liner for this. Thanks for your help guys, much appreciated! – None – 2011-08-29T11:36:42.297
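On the same-domain part of the question: once you have a flat list of discovered URLs (for example, the output of the grep/awk pipeline in the EDIT), a grep on the domain prefix plus sort -u yields the unique internal set. A sketch on assumed sample data, with www.example.com as the stand-in domain and cdn.other.com as a hypothetical external host:

```shell
# Assumed flat list of discovered URLs, including one external host
# and one duplicate entry
printf '%s\n' \
  'http://www.example.com/' \
  'http://www.example.com/about.html' \
  'http://cdn.other.com/logo.png' \
  'http://www.example.com/about.html' \
  > urls.txt

# Keep same-domain URLs only, deduplicated
grep '^http://www\.example\.com' urls.txt | sort -u
```

The anchored pattern drops the external host, and sort -u collapses the duplicate, leaving only the two unique internal URLs.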
Dupe of http://stackoverflow.com/questions/2804467/spider-a-website-and-return-urls-only – giorgio79 – 2014-05-25T10:34:42.240