Downloading all URLs accessible under a given domain with wget without saving the actual pages?

Hi, I'm trying to determine all the valid URLs under a given domain without having to mirror the site locally.

People generally want to download all the pages, but I just want to get a list of the direct URLs under a given domain (e.g. www.example.com), which would be something like www.example.com/page1, www.example.com/page2, etc.

Is there a way to use wget to do this, or is there a better tool for the job?

fccoelho

Posted 2013-09-24T18:39:21.230

Reputation: 185

In order to determine the links on each page, you will need to see the page (i.e. download it) – Brian Adkins – 2013-09-24T18:59:41.680

@BrianAdkins: I am OK with downloading, but I want to keep only the URLs, not the pages' contents – fccoelho – 2013-09-24T19:05:16.377

There's a --spider option that downloads the page, but doesn't save it. – LawrenceC – 2013-09-25T22:43:40.190
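For what it's worth, that approach might look roughly like this (a sketch only: the domain and depth are placeholders, and the grep/awk step assumes wget's default log format, where each fetched URL appears on a line starting with "--"):

wget --spider -r -l 2 http://www.example.com 2>&1 |
  grep '^--' |
  awk '{ print $3 }' |
  sort -u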

Answers

2

Here is a crude script:

curl -s whaturl |
  grep -o "<a href=[^>]*>" |
  sed -r 's/<a href="([^"]*)".*>/\1/' |
  sort -u

The grep picks out all the hrefs. The sed extracts the URL part from each href. The sort -u filters out duplicate links.

It will also work with wget -O - in place of curl -s.
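For instance, the same pipeline with wget might look like this (a sketch; -q merely suppresses wget's progress output, and whaturl is the same placeholder as above):

wget -q -O - whaturl |
  grep -o "<a href=[^>]*>" |
  sed -r 's/<a href="([^"]*)".*>/\1/' |
  sort -u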

Example output:

$ curl -s http://stackexchange.com/users/148837/lesmana?tab=accounts | grep -o "<a href=[^>]*>" | sed -r 's/<a href="([^"]*)".*>/\1/' | sort -u
/
/about
/about/contact
/blogs
/leagues
/legal
/legal/privacy-policy
/newsletters
/questions
/sites
/users/148837/lesmana
/users/148837/lesmana?tab=activity
/users/148837/lesmana?tab=favorites
/users/148837/lesmana?tab=reputation
/users/148837/lesmana?tab=subscriptions
/users/148837/lesmana?tab=top
/users/login?returnurl=%2fusers%2f148837%2flesmana%3ftab%3daccounts
http://area51.stackexchange.com/users/16563/lesmana
http://askubuntu.com/users/1366/
http://blog.stackexchange.com
http://blog.stackoverflow.com/2009/06/attribution-required/
http://chat.stackexchange.com/
http://creativecommons.org/licenses/by-sa/3.0/
http://gaming.stackexchange.com/users/2790/
http://meta.stackoverflow.com
http://meta.stackoverflow.com/users/147747/
http://programmers.stackexchange.com/users/116/
http://serverfault.com/users/45166/
http://stackoverflow.com/users/360899/
http://superuser.com/users/39401/
http://twitter.com/stackexchange
http://unix.stackexchange.com/users/1170/
http://www.facebook.com/stackexchange
https://plus.google.com/+StackExchange

lesmana

Posted 2013-09-24T18:39:21.230

Reputation: 14 930

Very nice! I had ignored curl because it can't recurse. I found that httrack solves this problem adequately. – fccoelho – 2013-09-24T19:54:29.513

4

OK, I had to find my own answer:

The tool I used was httrack.

httrack -p0 -r2 -d www.example.com
  • the -p0 option tells it to just scan (not save pages);
  • the -r2 option sets the depth of the search (2 in this case);
  • the -d option tells it to stay on the same principal domain.

There is even a -%L option to write the scanned URLs to a specified file, but it doesn't seem to work. That's not a problem, though, because under the hts-cache directory you can find a TSV file named new.txt containing all the URLs visited, along with some additional information about each one. I could extract the URLs from it with the following Python code:

import csv

# hts-cache/new.txt is tab-separated with a header row, so DictReader exposes its "URL" column
with open("hts-cache/new.txt") as f:
    t = csv.DictReader(f, delimiter='\t')
    for row in t:
        print(row['URL'])
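Alternatively, a small shell sketch can pull out the same column without Python (this assumes new.txt has a tab-separated header row containing a column literally named URL, as the DictReader usage above implies):

# locate the "URL" column in the header row, then print that column for every data row
col=$(head -1 hts-cache/new.txt | tr '\t' '\n' | grep -nx 'URL' | cut -d: -f1)
tail -n +2 hts-cache/new.txt | cut -f "$col"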

fccoelho

Posted 2013-09-24T18:39:21.230

Reputation: 185