Website crawler/spider to get site map

0

I need to retrieve a whole website map, in a format like:

I need it to be link-based (no file or directory brute-forcing), like:

parse the homepage -> retrieve all links -> explore them -> retrieve their links, ...

I also need the ability to detect whether a page is a "template", so as not to retrieve all of its "child pages". For example, if the following links are found:

I need to retrieve http://example.org/product/viewproduct only once.
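To make that concrete, the de-duplication I have in mind is roughly the following (just a sketch, and it assumes the child pages differ only in their query string, which may not match the real site):

import urlparse

seen_templates = set()

def template_key(url):
    # Collapse child pages onto the "template" URL by dropping the
    # query string and fragment, e.g. .../viewproduct?id=1 and
    # .../viewproduct?id=2 both map to .../viewproduct
    parts = urlparse.urlsplit(url)
    return urlparse.urlunsplit((parts.scheme, parts.netloc, parts.path, '', ''))

def should_crawl(url):
    key = template_key(url)
    if key in seen_templates:
        return False  # a page of this template was already retrieved
    seen_templates.add(key)
    return True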

I've looked into HTTrack and wget (with its spider option), but nothing conclusive so far.

The tool should be downloadable, and I'd prefer that it runs on Linux. It can be written in any language.

Thanks

ack__

Posted 2012-09-03T14:23:27.997

Reputation: 89

Question was closed 2012-09-30T17:46:50.400

Answers

3

After a lot of research, no tool satisfied me, so I'm coding my own using Scrapy: http://scrapy.org/doc/
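For reference, a minimal sketch of the kind of link-following spider this boils down to (it assumes a Scrapy release newer than this post, since response.follow is a later addition; the spider name, example.org and the output field are placeholders):

import scrapy

class SiteMapSpider(scrapy.Spider):
    name = "sitemap"
    allowed_domains = ["example.org"]   # stay on the target site
    start_urls = ["http://example.org/"]

    def parse(self, response):
        # Record the page that was just fetched.
        yield {"url": response.url}
        # Follow every <a href> on the page; Scrapy filters duplicate
        # and off-domain requests by itself.
        for href in response.css("a::attr(href)").extract():
            yield response.follow(href, callback=self.parse)

Something like scrapy runspider sitemap_spider.py -o urls.json then dumps every URL the spider reached.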

ack__

Posted 2012-09-03T14:23:27.997

Reputation: 89

1

Here is an example of a crawler written in Python 2:

(Taken from http://theanti9.wordpress.com/2009/02/14/python-web-crawler-in-less-than-50-lines/)

That website also links to a GitHub project, http://github.com/theanti9/PyCrawler, which is a more robust version by the same author.

import re
import urllib2
import urlparse

# URLs still to visit and URLs already fetched.
tocrawl = set(["http://www.facebook.com/"])
crawled = set([])
keywordregex = re.compile(r'<meta\sname=["\']keywords["\']\scontent=["\'](.*?)["\']\s/>')
linkregex = re.compile(r'<a\s*href=[\'"](.*?)[\'"].*?>')

while True:
    try:
        crawling = tocrawl.pop()
        print crawling
    except KeyError:
        break  # nothing left to crawl
    url = urlparse.urlparse(crawling)
    try:
        response = urllib2.urlopen(crawling)
    except Exception:
        continue  # skip pages that cannot be fetched
    msg = response.read()

    # Print the page title, if there is one.
    startPos = msg.find('<title>')
    if startPos != -1:
        endPos = msg.find('</title>', startPos + 7)
        if endPos != -1:
            title = msg[startPos + 7:endPos]
            print title

    # Print the meta keywords, if present.
    keywordlist = keywordregex.findall(msg)
    if len(keywordlist) > 0:
        keywordlist = keywordlist[0].split(", ")
        print keywordlist

    # Extract the links, resolve relative ones and queue the unseen ones.
    links = linkregex.findall(msg)
    crawled.add(crawling)
    for link in links:
        if link.startswith('/'):
            link = 'http://' + url[1] + link
        elif link.startswith('#'):
            link = 'http://' + url[1] + url[2] + link
        elif not link.startswith('http'):
            link = 'http://' + url[1] + '/' + link
        if link not in crawled:
            tocrawl.add(link)

d4v3y0rk

Posted 2012-09-03T14:23:27.997

Reputation: 1 187

1

I personally use Kapow Katalyst, but I guess it's out of your budget. If not, it's probably the most intuitive software for creating spiders, and much more if you need it.

m4573r

Posted 2012-09-03T14:23:27.997

Reputation: 5 051

Thanks, I didn't know about this one. I'll take a look, although I don't have the budget for this at the moment. – ack__ – 2012-09-18T08:23:07.667

0

Technically speaking, there is no foolproof way of extracting the directory structure of a website.

This is because HTTP is not a network file system. The only thing you can do with HTTP is follow the links from the starting page. Furthermore, nothing requires the starting page to link only to its immediate subdirectories. A top-level index.html page may, for example, link directly to "foo/baz/blah.html", deep in some subdirectory.

Edit:

  • To generate basic site maps, there are online tools commonly known as sitemap generators. One such tool is web-site-map.com, which produces the sitemap in XML.

  • If you are comfortable with programming, you can write your own web spider with a specific set of rules for a particular site (see the sketch below).
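For example, the "rules" can be as simple as site-specific allow/deny patterns applied to every discovered link (a sketch only; the patterns and example.org below are placeholders, not taken from any particular site):

import re

# Hypothetical per-site rules: keep only on-site links and skip
# utility pages and noisy URL parameters.
ALLOW = [re.compile(r'^http://example\.org/')]
DENY = [re.compile(r'/(login|logout|print)\b'),
        re.compile(r'[?&](sort|sessionid)=')]

def keep(url):
    return (any(p.search(url) for p in ALLOW)
            and not any(p.search(url) for p in DENY))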

Ankit

Posted 2012-09-03T14:23:27.997

Reputation: 4 082

Indeed, I'm looking for a follow-link style spider. Sites that link beyond their immediate sub-directories are not a problem; the software can later trim the content found and organize it in a tree view. I don't want to rely on XML sitemaps, as they don't present all of a site's content. As for programming my own spider, that is much more complicated than it looks (see various threads on Stack Overflow) and takes a huge amount of time. – ack__ – 2012-09-04T10:39:50.517

0

(Win)HTTrack does a very decent job.

It allows you to download a website from the Internet to a local directory, recursively building all directories and getting HTML, images, and other files from the server to your computer.

Jan Doggen

Posted 2012-09-03T14:23:27.997

Reputation: 3 591