I need to retrieve a whole website map, in a format like:
- http://example.org/
- http://example.org/product/
- http://example.org/service/
- http://example.org/about/
- http://example.org/product/viewproduct/
I need it to be link-based (no file or directory brute-force), like:
parse the homepage -> retrieve all links -> explore them -> retrieve their links, and so on.
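To illustrate the kind of crawl I mean, here is a rough Python sketch (using the requests and BeautifulSoup libraries purely as an example, not a requirement; the seed URL is a placeholder):

```python
# Rough sketch of a link-based crawl: start at a seed URL, follow
# same-host links breadth-first, and print each page once.
# Assumes the third-party "requests" and "beautifulsoup4" packages.
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

SEED = "http://example.org/"  # placeholder start page

def crawl(seed):
    host = urlparse(seed).netloc
    seen = set()
    queue = deque([seed])
    while queue:
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        try:
            resp = requests.get(url, timeout=10)
        except requests.RequestException:
            continue  # skip unreachable pages
        if "text/html" not in resp.headers.get("Content-Type", ""):
            continue  # only parse HTML pages
        print(url)
        soup = BeautifulSoup(resp.text, "html.parser")
        for a in soup.find_all("a", href=True):
            link = urljoin(url, a["href"]).split("#")[0]  # drop fragments
            if urlparse(link).netloc == host and link not in seen:
                queue.append(link)

if __name__ == "__main__":
    crawl(SEED)
```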
I also need the ability to detect whether a page is a "template", so that it doesn't retrieve all of the "child" pages. For example, if the following links are found:
- http://example.org/product/viewproduct?id=1
- http://example.org/product/viewproduct?id=2
- http://example.org/product/viewproduct?id=3
I need to get http://example.org/product/viewproduct only once.
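One way I imagine this template detection could work is to key each page on its path alone and ignore the query string; a rough sketch using only the Python standard library:

```python
# Collapse URLs that differ only in their query string by keying on
# scheme + host + path and dropping params, query and fragment.
from urllib.parse import urlparse, urlunparse

def template_key(url):
    p = urlparse(url)
    return urlunparse((p.scheme, p.netloc, p.path, "", "", ""))

urls = [
    "http://example.org/product/viewproduct?id=1",
    "http://example.org/product/viewproduct?id=2",
    "http://example.org/product/viewproduct?id=3",
]
print(sorted({template_key(u) for u in urls}))
# -> ['http://example.org/product/viewproduct']
```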
I've looked into HTTrack and wget (with its spider option), but nothing conclusive so far.
The tool should be downloadable, and I'd prefer that it runs on Linux. It can be written in any language.
Thanks
Thanks, I didn't know about this one. I'll take a look, although I don't have budget for this at this time. – ack__ – 2012-09-18T08:23:07.667